Project poster

ichida-algo

This is the submission repository for StartHack 2024 (June). Our team chose Track 1 - HPC, which involved writing a massively parallel implementation of neural network inference on a small demo model.

Development involved:

Optimising for the compiler (cache locality, throughput)
SIMD programming, vector intrinsics and alignment in C
Multithreading and task distribution
x86_64 assembly, in-depth profiling and tuning
Programming in CUDA
MPI (Message-Passing Interface) for multi-GPU utilisation

We decided to go down this path because it sounded like some high risk, high reward excitement. Before starting out, we didn't know almost anything about low level optimisation & GPU programming, so it has been a lot of active learning on the job!

Installation

In order to correctly run the code, please ensure that you have an x86_64 CPU if you want to test the CPU implementation (as well as OpenMP for multithreading), and a CUDA compatible GPU to test the GPU implementation (as well as the appropriate version of the CUDA toolkit). Please ensure that if you are using multiple GPUs you have an MPI implementation installed (we have verified OpenMPI as working).

To compile and run on CPU (multithreaded):

make
Run with the provided script: ./speed_demo_cpu.sh ./weights_and_biases.txt ./tensors <iterations_per_input>

To compile and run with a non-MPI setup:

make build_gpu
Run with ./speed_gpu ./weights_and_biases.txt ./tensors <iterations_per_input>

To compile and run with a MPI (multi-gpu) setup:

make
Run with the provided script: ./speed_demo_gpu.sh ./weights_and_biases.txt ./tensors <iterations_per_input>

[!IMPORTANT] We've found that running MPI on a large number of devices incurs a significant ~6s overhead. In order to minimise the effect of this overhead when measuring, we recommend running a large number of inferences per input (500M to 1B per input on 8 GPUs). CUDA incurs a similar but less severe ~2s penalty in some cases.

Implementation details

The CPU matmul kernel is written using SIMD intrinsics, entirely in C! It makes heavy use of memory alignment, cache locality with a transposition step, and is quite fast. In fact, as far as we're aware, it beats the inline asm version provided by cblas by a noticeable margin for this usecase!
To skirt around the problem of each matrix calculation being quite small, we went with a monolithic kernel design, where inferences are run essentially per thread. It took some wrestling with the way CUDA works, but we also managed to get it running at a satisfing speed. This was an interesting first CUDA experience, and there really wasn't much material about this approach, but we are happy with how it turned out in the end (especially since one of us only owns a MacBook).
We divide work evenly among the available GPUs for a speedup using MPI. After some struggles with setup, we managed to get it working by testing on target hardware.

Results

Thes are the best runs that we have achieved (all categories were tested on 52 inputs):

Hardware Used	Parallelism	Best Run / Nr. of iterations	Throughput (time for 1B)
Ryzen 5600x	1 thread	6.658s / 100k per input	21 minutes 20.34 seconds
Ryzen 5600x	12 threads	11.631s / 1M per input	3 minutes 43.62 seconds
EPYC 7J13*	240 threads	112.124s / 100M per input	21.56 seconds
A100 80GB	1 GPU	103.833s / 100M per input	19.96 seconds
A100 80GB	8 GPUs	70.388s / 500M per input	2.71 seconds

* Dual socket system with 2x CPUs each at 64 cores / 128 threads

Team members

All team members are from RMIT.

Artemis Rosman - rozukke

Project management
CPU optimisation (AVX2/SIMD, kernel, memory, multithreading, testing/profiling & tuning)
GPU optimisation (monolithic kernel design & work division, small tweaks)
MPI optimisation (work division)
Code rewrites & cleanup, code review/maintenance
Communication & design
A lot of textbook reading

Dan Dang - nhatdongdang

Core implementation in C (neural activation, I/O operations, loading model)
Benchmark implementation
CPU optimisation (AVX2/SIMD, testing/profiling & tuning)
GPU optimisation (memory, monolithic kernel implementation, testing/profiling & tuning)
MPI optimisation
Code review & CI pipeline
A lot of textbook reading

Johnathan Chan - Jinxto

Core CUDA implementation & optimization (CUDA memory allocation, Kernel & blocks)
Core MPI implementation & optimisation (Dynamic GPU detection & distribution of workloads)
Builds & CMake setup (Conditional build for CPU & GPU)
Teamwork :D
A lot of textbook reading

Built With

c
cuda
mpi

Submitted to

CISSA x MAC StartHack 2024
- Winner Track 1 (GPU 1st)
- Winner Track 1 (CPU 1st)

Created by

I installed KUDA instead of CUDA now my pc has virus

Johnathan Chan
First year student at RMIT.
Responsibilities:
- nitpicking
- deranged ideas
- talking too much

Artemis Rosman
Dev and student mentor at RMIT.
bug driven development

Dan Dang
Second year student at RMIT