How to Write Fast Code

● Created scripts to measure the latency and throughput of SIMD addition and FMA (fused multiply-add) instructions.
● Designed a fast SIMD matrix-multiply kernel that computes the product as a sum of outer products.
● Implemented a cache-oblivious matrix transpose using Morton (Z-order) ordering.
● Designed a cache-aware matrix multiplication kernel, modeling the L1 and L2 caches as set-associative caches.
● Implemented the all-gather collective communication in MPI.
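
As an illustration of the cache-oblivious transpose mentioned above, here is a minimal sketch in plain C. It is not the project's actual code: the names `transpose`, `transpose_rec`, and the cutoff `BASE` are hypothetical, and the matrices are assumed to be row-major arrays of `double`. The recursion always halves the longer dimension, so for square power-of-two sizes the small tiles end up being visited in Z order without the code ever knowing the cache size.

```c
#include <assert.h>
#include <stddef.h>

/* Tile size at which recursion stops. Hypothetical value: any small
   constant works, because the recursion itself adapts to the caches. */
#define BASE 16

/* Transpose the rows x cols block of src whose top-left corner is (r0, c0).
   src is row-major with row stride lds; dst is row-major with row stride ldd. */
static void transpose_rec(const double *src, double *dst,
                          size_t r0, size_t c0, size_t rows, size_t cols,
                          size_t lds, size_t ldd)
{
    if (rows <= BASE && cols <= BASE) {
        /* Small tile: plain double loop. */
        for (size_t i = r0; i < r0 + rows; i++)
            for (size_t j = c0; j < c0 + cols; j++)
                dst[j * ldd + i] = src[i * lds + j];
        return;
    }
    if (rows >= cols) {
        /* Split the row range: top half first, then bottom half. */
        size_t h = rows / 2;
        transpose_rec(src, dst, r0, c0, h, cols, lds, ldd);
        transpose_rec(src, dst, r0 + h, c0, rows - h, cols, lds, ldd);
    } else {
        /* Split the column range: left half first, then right half. */
        size_t h = cols / 2;
        transpose_rec(src, dst, r0, c0, rows, h, lds, ldd);
        transpose_rec(src, dst, r0, c0 + h, rows, cols - h, lds, ldd);
    }
}

/* Out-of-place transpose: src is rows x cols, dst is cols x rows. */
void transpose(const double *src, double *dst, size_t rows, size_t cols)
{
    assert(src != dst); /* this sketch does not support in-place use */
    transpose_rec(src, dst, 0, 0, rows, cols, cols, rows);
}
```

Calling `transpose(a, b, 20, 35)` on a 20 x 35 matrix `a` fills the 35 x 20 matrix `b`. An explicit Morton-index version (interleaving the bits of the row and column indices) would produce the same traversal order for power-of-two sizes; the recursive form is used here only because it also handles rectangular shapes.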

Learning Outcomes:

Designing Fast Kernels.
SIMD programming.
Compiler Optimization and Memory Hierarchy.
Collective Communications (broadcast, all-gather, all-reduce, reduce-scatter) and Algorithms (minimum spanning tree and bucket).
OpenMP Parallelism.
Basics of Parallel Programming for Heterogeneous Architectures: CUDA, OpenCL.
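
To make the bucket algorithm from the collective-communication outcome concrete, here is a small single-process simulation of a bucket (ring) all-gather in plain C. It is only a sketch under stated assumptions: the function name `ring_allgather_sim` and the buffer layout are hypothetical, no MPI is used, and the `memcpy` stands in for the send to the right-hand neighbor that a real implementation would perform (for example with `MPI_Sendrecv`, or by calling `MPI_Allgather` directly).

```c
#include <assert.h>
#include <string.h>

/* Simulate a bucket (ring) all-gather among p ranks inside one process.
   bufs holds p receive buffers back to back; rank r owns the p*blk ints
   starting at bufs + r*p*blk. On entry only block r of rank r's buffer
   needs to be valid; on return every rank holds all p blocks. */
void ring_allgather_sim(int *bufs, int p, int blk)
{
    assert(p > 0 && blk > 0);
    for (int step = 0; step < p - 1; step++) {
        for (int r = 0; r < p; r++) {
            int right = (r + 1) % p;
            /* Block forwarded in this step: the rank's own block at
               step 0, afterwards the block received in the previous step. */
            int b = ((r - step) % p + p) % p;
            const int *from = bufs + (size_t)r * p * blk + (size_t)b * blk;
            int *to = bufs + (size_t)right * p * blk + (size_t)b * blk;
            memcpy(to, from, (size_t)blk * sizeof(int)); /* the "send" */
        }
    }
}
```

Each rank forwards exactly one block per step, so the gather completes in p - 1 steps with every link carrying the same amount of data. That balance is why the bucket algorithm tends to suit large messages, while the minimum-spanning-tree variants tend to suit short, latency-bound ones.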

Programming Language: C, SIMD


Vishnu M H
