Md Abdul Motaleb Faysal

Ph.D. in CS, UNLV | Graduate Affiliate at Berkeley Lab (LBNL)

Fast Hash Accumulation

In this project, we demonstrated that an accelerator with fast content-addressable memory (CAM) for hash accumulation can address the performance limitations of traditional hash operations when processing sparse graphs, and can close the gap between expected and actual performance on HPC architectures.

We selected two applications, HipMCL and HyPC-Map, for compute kernel breakdown and performance analysis. We instrumented them on several supercomputing platforms and hardware architectures: the Cori Haswell and Cori KNL (Knights Landing) nodes at NERSC (National Energy Research Scientific Computing Center) and the QB2 node at LONI (Louisiana Optical Network Infrastructure). We observed that the software hash table, with its large volume of accumulation operations, was preventing both applications from reaching higher computational throughput and better memory bandwidth utilization. For example, in our roofline modeling on the Cori KNL node, a single software hash accumulation performed in main memory could take as many as ~150 processor cycles, whereas the same operation could be done in ~3 cycles using an accelerator with fast CAM.

Another group of researchers had proposed a design for an accelerator with fast hash accumulation for the SpGEMM compute kernel. We generalized the accelerator interface to handle hash accumulation operations in non-SpGEMM kernels. To do so, we first validated the results from the simulation software against results from the actual hardware. We then ran simulations with different architectural configurations (different limits and latencies at different levels of the cache hierarchy, memory bandwidth, processor clock speeds, and numbers of threads) and compared the empirical results against theoretical performance.

The outcomes of the empirical analysis supported our theoretical performance modeling and the proposed microarchitecture design. Using the accelerator, we observed a 3.28X to 5.6X speedup for hash accumulation on the social networks used in the study, while reducing the number of branch mispredictions by 59%, the CPI (cycles per instruction) by 21%, and the total number of instructions by 24%.
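To make the term concrete, here is a minimal sketch of what a software hash accumulation might look like: each (key, value) pair is either inserted into the table or added to the value already stored under that key. This is an illustration only, not the code used in HipMCL, HyPC-Map, or the accelerator study. The class name `HashAccumulator`, the linear-probing scheme, the power-of-two capacity, and the `probes` counter are all assumptions made for this example.

```python
class HashAccumulator:
    """Open-addressing hash table that sums values by key (illustrative only)."""

    def __init__(self, capacity=16):
        # Assumption: capacity is a power of two so the slot index is a cheap mask.
        if capacity <= 0 or capacity & (capacity - 1):
            raise ValueError("capacity must be a positive power of two")
        self.mask = capacity - 1
        self.keys = [None] * capacity
        self.vals = [0.0] * capacity
        self.size = 0
        self.probes = 0  # total number of slots inspected across all calls

    def accumulate(self, key, value):
        """Insert key with value, or add value to the entry already stored."""
        slot = hash(key) & self.mask
        for _ in range(len(self.keys)):
            self.probes += 1
            if self.keys[slot] is None:      # empty slot: first time we see key
                self.keys[slot] = key
                self.vals[slot] = value
                self.size += 1
                return
            if self.keys[slot] == key:       # hit: accumulate into existing entry
                self.vals[slot] += value
                return
            slot = (slot + 1) & self.mask    # collision: try the next slot
        raise RuntimeError("hash table is full")

    def items(self):
        """Return the accumulated (key -> sum) pairs as a dict."""
        return {k: v for k, v in zip(self.keys, self.vals) if k is not None}


# Keys 3, 11, and 19 all land in the same slot of an 8-entry table,
# so this small stream forces several collision probes.
acc = HashAccumulator(capacity=8)
for key, value in [(3, 1.0), (11, 2.0), (3, 0.5), (19, 4.0), (11, 1.0)]:
    acc.accumulate(key, value)

result = acc.items()
print(result, "probes:", acc.probes)
```

Every probe in the loop is a data-dependent compare and branch, and on a large table each one may touch a different cache line. A CAM lookup would replace this kind of probing; the sketch does not model the accelerator itself or its cycle counts.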