Data Intensive Big Data Analytics
Massive data warehouses and unstructured databases containing hundreds of terabytes or more of disparate information are far too large to fit in the main memory of conventional computers, so researchers and data scientists often embrace the compromise of Fast Data. Fast Data techniques reduce the size of a data set to allow more efficient computation. But for many applications, a “no compromise” approach to data intensive Big Data is desirable. That means being able to handle streaming data and changes to the data set while analysis is underway. And, unlike applications of the past, modern applications are dominated by data access and movement rather than raw FLOPS.
Emu is designed from the ground up to deal with data that has little or no locality: referencing data spread across many memories is not a problem. Now we can run more complex analytics on larger data sets.
Migratory Memory-Side Processing
Emu Technology has developed Migratory Memory-Side Processing, which is processing tightly coupled to a distributed shared memory, without needing buses or caches. We do this with:
Many lightweight cores tightly coupled to memory
- Minimizes latency and energy use
- Unnecessary data is not fetched to fill cache lines
- No cache coherency traffic is required
Executing thread moves to the data
- Network traffic is one way
- Moving a thread context moves less data than reading a data block from a remote memory
Programmed in a true parallel language, Cilk, instead of library calls
Today’s cutting edge problems require real-time processing of massive Big Data sets characterized by weak data locality (sparse data). These Exascale-like computing challenges are hampered by traditional architectures, which were designed in a different era, when data locality was strong and the limits of Moore’s Law and Dennard scaling had not yet been reached.
In Emu, reading a memory location on a different node causes the thread context to move to the node containing that data (the Locale of the reference), instead of sending a read across the network. This approach wins whenever more than one reference occurs at a locale. Processors never stall for long periods waiting for remote reads, and overall utilization is improved. The network is simplified because it no longer requires round trip read and response messages. Remote writes can be performed directly or via migrations, under programmer (compiler) control.
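The trade-off described above can be illustrated with a toy message count. This is a minimal sketch, not a model of the actual hardware: it assumes a remote read costs one request plus one response, a migration costs one one-way message, and it ignores message sizes entirely. The function names and the trace are hypothetical.

```python
def remote_read_messages(trace, home):
    """Thread stays on `home`; each off-node read is a request plus a response."""
    return sum(2 for node in trace if node != home)


def migration_messages(trace, home):
    """Thread context moves one way to whichever node holds the next reference."""
    current, moves = home, 0
    for node in trace:
        if node != current:
            moves += 1
            current = node
    return moves


# Hypothetical trace: the node holding each successive memory reference.
trace = [1, 1, 1, 2, 2, 0]
print(remote_read_messages(trace, home=0))  # 10
print(migration_messages(trace, home=0))    # 3
```

In this trace the thread touches node 1 three times and node 2 twice, so a single migration to each node replaces several round trips. Because a thread context is larger than a single read request, a byte-level comparison would depend on sizes that this sketch does not attempt to model.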
Conventional cache-based computers and GPUs rely on Strong Locality for performance
Strong Locality is the situation where multiple data accesses come from a single cache line of 64 to 1024 bytes. Modern codes, especially sparse matrix and graph codes, increasingly fail to exhibit this situation.
Weak Locality is the situation where multiple data accesses come from the same locale (an entire bank of 4 GB or more). Emu gains performance from Weak Locality and does not rely on data adjacency.
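To make the two definitions concrete, the sketch below measures how often consecutive accesses land in the same block, first with a block the size of a cache line and then with a block the size of a locale. The helper `reuse_fraction`, the access pattern, and the reading of "4 GB" as 4 × 1024³ bytes are all assumptions made for illustration.

```python
def reuse_fraction(addresses, block_bytes):
    """Fraction of accesses that fall in the same block as the previous access."""
    if len(addresses) < 2:
        return 0.0
    blocks = [addr // block_bytes for addr in addresses]
    hits = sum(1 for prev, cur in zip(blocks, blocks[1:]) if prev == cur)
    return hits / (len(blocks) - 1)


CACHE_LINE = 64        # bytes, low end of the cache line range quoted above
LOCALE = 4 * 1024**3   # bytes, assuming a 4 GB bank means 4 GiB

# Hypothetical sparse pattern: eight reads, each 4 KiB apart.
addresses = [i * 4096 for i in range(8)]
print(reuse_fraction(addresses, CACHE_LINE))  # 0.0
print(reuse_fraction(addresses, LOCALE))      # 1.0
```

The same access stream shows no Strong Locality (no two consecutive reads share a cache line) but full Weak Locality (every read stays within one locale).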
Stationary Cores (SCs) execute the Operating System and File Systems, and Call or Spawn Gossamer Threadlets to access shared memory and perform migratory processing.
Gossamer Cores (GCs) execute Threadlets at Gossamer Nodelets, perform computations, migrate to other Nodelets, spawn new Threadlets and call System Services on SCs.
- 32 nodes/motherboard
- Each motherboard and its nodes make up a Supergroup
- Supergroups are interconnected with a high radix RapidIO network, configurable to as many as 64k nodes
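The sizing figures above imply an upper bound on the number of Supergroups. The arithmetic is sketched below, assuming "64k" means 65,536 nodes; the flat node numbering used by `locate` is purely hypothetical and is not taken from Emu documentation.

```python
NODES_PER_SUPERGROUP = 32   # from the text: 32 nodes per motherboard
MAX_NODES = 64 * 1024       # assumption: "64k" read as 65,536


def max_supergroups():
    """Largest number of Supergroups in a fully configured system."""
    return MAX_NODES // NODES_PER_SUPERGROUP


def locate(node_id):
    """Hypothetical flat numbering: (Supergroup index, node index within it)."""
    if not 0 <= node_id < MAX_NODES:
        raise ValueError("node_id out of range")
    return divmod(node_id, NODES_PER_SUPERGROUP)


print(max_supergroups())  # 2048
print(locate(100))        # (3, 4)
```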