How to Write Fast Code

● Created scripts to measure the latency and throughput of SIMD addition and FMA (fused multiply-add) instructions.
● Designed a fast SIMD matrix-multiply kernel that computes the product as a sum of outer products.
● Implemented a cache-oblivious matrix transpose using Morton (Z-order) ordering.
● Designed a cache-aware matrix multiplication kernel, modeling the L1 and L2 caches as set-associative caches.
● Implemented the all-gather collective communication in MPI.
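
The outer-product formulation mentioned above can be sketched in plain C. This is a minimal scalar illustration, not the project's actual kernel: the function name matmul_outer and the row-major layout are assumptions made for the example. A SIMD version would typically vectorize the innermost loop, broadcasting one element of A and applying an FMA across a vector of B.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: C (m x n) += A (m x k) * B (k x n), all row-major.
   The product is accumulated as k rank-1 (outer product) updates:
   C += A[:, p] * B[p, :] for p = 0 .. k-1. */
void matmul_outer(size_t m, size_t n, size_t k,
                  const double *A, const double *B, double *C)
{
    assert(A && B && C);
    for (size_t p = 0; p < k; ++p) {              /* one outer product per p */
        for (size_t i = 0; i < m; ++i) {
            const double a_ip = A[i * k + p];     /* element of column p of A */
            for (size_t j = 0; j < n; ++j) {
                /* one multiply-add per element of row p of B */
                C[i * n + j] += a_ip * B[p * n + j];
            }
        }
    }
}
```

Because C is accumulated into, the caller is expected to zero it first when a plain product is wanted.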

Learning Outcomes:

  1. Designing Fast Kernels.
  2. SIMD programming.
  3. Compiler Optimization and Memory Hierarchy.
  4. Collective Communications (broadcast, all-gather, all-reduce, reduce-scatter) and their algorithms (minimum spanning tree and bucket).
  5. OpenMP Parallelism.
  6. Basics of Parallel Programming for Heterogeneous Architectures: CUDA, OpenCL.
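
As an illustration of the bucket algorithm listed under collective communications, the following is a rough single-process simulation of a ring all-gather. It makes no MPI calls; the function name ring_allgather_sim and the p x p buffer layout are assumptions chosen for the sketch. In a real MPI program each inner-loop iteration would correspond to one rank sending a block to its right neighbor while receiving one from its left.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simulation of a bucket (ring) all-gather among p ranks.
   buf is p x p, row-major: buf[r * p + b] is rank r's copy of block b.
   On entry, rank r holds only its own block, buf[r * p + r]. */
void ring_allgather_sim(size_t p, int *buf)
{
    assert(p > 0 && buf);
    /* p - 1 steps; in each step every rank forwards one block to the right. */
    for (size_t step = 0; step + 1 < p; ++step) {
        for (size_t r = 0; r < p; ++r) {
            size_t right = (r + 1) % p;
            /* Step 0 forwards the rank's own block; later steps forward
               the block received from the left in the previous step. */
            size_t b = (r + p - step) % p;
            buf[right * p + b] = buf[r * p + b];
        }
    }
}
```

After p - 1 steps every rank holds all p blocks, and each rank has sent and received exactly one block per step.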

Programming Language: C (with SIMD)