An interview with Martin Deneroff, Chief Operating Officer.
Which criteria did the Emu team deem essential to delivering a highly scalable High Performance Computing / Big Data solution?
First and foremost, we needed a language that matches our architecture well, so that the compiler implementation would be straightforward. We also wanted it to be easy to understand and learn and, ideally, familiar to most programmers. Because our solution is a fine-grained, massively parallel implementation, we wanted a language that is inherently parallel, without requiring that parallelism be invoked through external libraries.
Would you tell me a bit about the aspects of Cilk’s history that make it a great fit for Emu’s Migratory Memory-side Processing?
Cilk is based on C, which makes it relatively easy to compile and familiar to most programmers. Cilk differs from C in that Cilk is inherently parallel, whereas C requires the use of a library like OpenMP. The parallel concepts in Cilk match well with the underlying concepts in our hardware for invoking parallelism: our hardware has a spawn instruction which corresponds almost exactly to a cilk_spawn. The hardware implements a shared memory paradigm, which is what Cilk naturally supports.
Cilk is somewhat restrictive in the kinds of parallel constructs it allows you to create, so that you can’t make foolish errors. The incidence of subtle programming errors like race conditions is much lower in Cilk programs than in programs using environments like OpenMP or MPI, and the simplified structures available in Cilk tend to make the programs easier to understand. While some developers see restrictions as a disadvantage, we see them as an advantage: the restrictions help engineers focus on algorithmic development rather than worrying about the intrinsic art of parallelization. While we recommend Cilk, we’ll certainly support those who are committed to riskier approaches based on their personal expertise in leveraging the underlying Emu advantages of migratory threads.
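To make the spawn correspondence described above concrete, here is a minimal sketch of the classic recursive Fibonacci in Cilk-style C. It is an illustration only, not Emu's code. The `__cilk` macro check and the `<cilk/cilk.h>` header are assumptions about an OpenCilk-style toolchain; on a plain C compiler the keywords are defined away, which leaves the serial version of the same program.

```c
#include <assert.h>

/* Assumption: a Cilk-capable compiler defines __cilk and ships <cilk/cilk.h>.
   Otherwise the keywords expand to nothing and the code runs serially. */
#if defined(__cilk)
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

/* Recursive Fibonacci: the first recursive call may run in parallel
   with the second; cilk_sync waits for the spawned child. */
long fib(int n) {
    assert(n >= 0);
    if (n < 2) {
        return n;
    }
    long x = cilk_spawn fib(n - 1);  /* child may run in parallel */
    long y = fib(n - 2);             /* parent continues meanwhile */
    cilk_sync;                       /* wait for the child before using x */
    return x + y;
}
```

Because a spawned child can only return to its parent, the parallel structure is visible in the call tree, which is the kind of restriction the answer above describes.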
Which languages did we investigate/analyze?
There were a number of other languages we could have selected. We looked at X10, Chapel, Habanero and UPC – they are all less familiar than Cilk and, in our estimation, are much harder to compile. We didn’t see particular advantages to choosing them for our Migratory Memory-side Processing architecture.
We also looked at OpenMP together with C. While we are in the process of adding support for this to our platform, we find it to be both less efficient and harder to program. That said, it is more widely adopted, and as such it’s important to have in our tool chest.
The parallelism of C plus OpenMP is done through a library, which makes it less feasible for the compiler to perform error checking. This opens the door to a variety of serious programming errors, including race conditions and non-determinism. Bottom line, Cilk looked like less work for both us and our customers than using C plus OpenMP.
Which Cilk enhancements are needed to make it straightforward to port OpenMP codes?
We see that it’s important to add Eurekas. Most parallel codes implement barriers which cause codes to wait until every thread completes. A Eureka kicks off a bunch of threads – when one of them finds the answer, that thread calls out “eureka” and everyone jumps to the barrier and finishes. We see this capability being added to Cilk in its next release.
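The Eureka idea described above can be approximated today with a shared flag that every worker polls. The sketch below is only an emulation under stated assumptions: the names `eureka_index`, `search_chunk`, and `eureka_search` are hypothetical, the `__cilk` check assumes an OpenCilk-style toolchain, and a built-in Eureka construct would presumably not need the explicit polling shown here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumption: a Cilk-capable compiler defines __cilk; otherwise run serially. */
#if defined(__cilk)
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

/* Shared "eureka" slot: -1 means no hit yet, otherwise the index of a hit.
   Hypothetical name; a single global keeps the sketch short but not reentrant. */
static atomic_long eureka_index;

/* Scan data[lo..hi) and stop early once any worker has called eureka. */
static void search_chunk(const int *data, size_t lo, size_t hi, int target) {
    for (size_t i = lo; i < hi; i++) {
        if (atomic_load(&eureka_index) >= 0) {
            return;  /* someone else already found an answer */
        }
        if (data[i] == target) {
            long expected = -1;
            /* Only the first finder records its index. */
            atomic_compare_exchange_strong(&eureka_index, &expected, (long)i);
            return;
        }
    }
}

/* Spawn one search per chunk; return the index of some match, or -1. */
long eureka_search(const int *data, size_t n, int target, size_t chunk) {
    assert(chunk > 0);
    atomic_store(&eureka_index, -1);
    for (size_t lo = 0; lo < n; lo += chunk) {
        size_t hi = (lo + chunk < n) ? lo + chunk : n;
        cilk_spawn search_chunk(data, lo, hi, target);
    }
    cilk_sync;  /* the barrier everyone jumps to */
    return atomic_load(&eureka_index);
}
```

When several elements match, a parallel run may return any one of them, so callers should check the returned element rather than expect the lowest index.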
We’re also looking to add co-routines through use of libraries. Using standard Cilk, when children finish they only return to the parent. OpenMP and Habanero have a notion of co-routines that run at the same level as the parent. This is sometimes useful, but frequently introduces bugs. It’s a religious position for those programmers who want complete control. As such, we want to support it for those who are confident in their abilities to handle the risks.
We’ll build the Cilk Race Detector right into the compiler. It can analyze your program and see if you’ve built it with a bug. Having a co-routine or other non-Cilk library operations makes it very difficult to utilize the Cilk Race Detector. That’s why we advise against giving up the great functionality it provides.
Are there other keywords or enhancements you consider desirable, and what would we use them for?
We will look to implement some vector capabilities that Intel introduced. There’s already Sparse Matrix Vector work under way for the Emu platform under Richard Vuduc at Georgia Tech.
We’re interested in reintroducing the Inlet. An Inlet sends data back to the parent before the child has completed. It creates a shared memory where a child can deposit a result and the parent can poll the data, analogous to Chapel’s “future.” The initial implementation can be improved upon.
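As a rough picture of the behavior described above, the sketch below has a child deposit a result into a shared slot that the parent polls before the child has returned. It follows the interview's description rather than the inlet syntax of Cilk 5, and the names `mailbox_t`, `child_sum`, and `parent_collect` are hypothetical. The `__cilk` check again assumes an OpenCilk-style toolchain, and the busy-wait loop is for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumption: a Cilk-capable compiler defines __cilk; otherwise run serially. */
#if defined(__cilk)
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

/* Shared slot: the child writes value, then raises ready. */
typedef struct {
    atomic_int ready;
    long value;
} mailbox_t;

/* Child: compute 1 + 2 + ... + n and deposit it before returning. */
static void child_sum(mailbox_t *box, int n) {
    long total = 0;
    for (int i = 1; i <= n; i++) {
        total += i;
    }
    box->value = total;
    /* Release ordering publishes value before ready becomes visible. */
    atomic_store_explicit(&box->ready, 1, memory_order_release);
    /* A longer-running child could continue with other work here. */
}

/* Parent: spawn the child, poll the slot, then sync as usual. */
long parent_collect(int n) {
    assert(n >= 0);
    mailbox_t box;
    atomic_init(&box.ready, 0);
    box.value = 0;
    cilk_spawn child_sum(&box, n);
    while (!atomic_load_explicit(&box.ready, memory_order_acquire)) {
        /* poll until the child has deposited its result */
    }
    long early = box.value;  /* read without waiting for the child to return */
    cilk_sync;
    return early;
}
```

In the serial fallback the child simply finishes before the parent polls, so the loop exits immediately.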
A paper on Cilk basics can be found here:
The Implementation of the Cilk 5 Multi-threaded Language: http://supertech.csail.mit.edu/papers/cilk5.pdf
Find out more about Cilk at http://cilk.mit.edu/