A from-scratch implementation of ResNet-18 inference using custom CUDA kernels, exploring GPU optimization techniques from naive direct convolution to tensor core-accelerated implicit GEMM.
This project implements ResNet-18 image classification entirely in CUDA C++, progressing through multiple optimization stages to understand GPU performance characteristics. The implementation handles the complete inference pipeline from image preprocessing to final classification.
Key Achievement: Custom kernels reach ~30% of PyTorch's highly optimized cuDNN throughput, with a 2.39x speedup over the direct-convolution baseline through tensor core utilization.
Clone:
```bash
git clone https://github.com/pamin1/CUDAResNet.git
cd CUDAResNet
git submodule update --init --recursive
```
Build:
```bash
mkdir build && cd build
cmake ..
make
```
Run:
```bash
./resnet
```
The same weights are used across all test groups, so inference is deterministic and accuracy testing would be redundant.
| Implementation | Latency (ms/img) | Throughput (img/s) | vs PyTorch CUDA | vs Direct Conv |
|---|---|---|---|---|
| PyTorch CUDA (cuDNN) | 4.281 | 233.58 | Baseline | 8.11x faster |
| Custom CUDA - Tensor Cores | 14.497 | 68.97 | 3.38x slower | 2.39x faster |
| PyTorch CPU | 18.550 | 53.92 | 4.33x slower | 1.87x faster |
| Custom CUDA - Shared Memory | 34.744 | 28.78 | 8.11x slower | Baseline |
| Custom CUDA - Naive | 39.208 | 25.51 | 9.16x slower | 1.13x slower |
*Benchmarked with 1000 samples after 50 warmup iterations.*
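For reference, a minimal sketch of this timing methodology, using CUDA events for wall-clock measurement around a placeholder `run_inference()` (the repo's actual entry point may differ):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Placeholder for the model's forward pass; substitute the real call.
void run_inference() { /* ... kernel launches ... */ }

int main() {
    const int warmup = 50, samples = 1000;

    // Warmup lets GPU clocks stabilize and amortizes one-time setup costs.
    for (int i = 0; i < warmup; ++i) run_inference();
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < samples; ++i) run_inference();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    printf("latency: %.3f ms/img, throughput: %.2f img/s\n",
           total_ms / samples, samples * 1000.0f / total_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```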
- Tensor Core Acceleration: Implicit GEMM algorithm using the WMMA API for FP16 matrix operations (see the sketch after this list)
- Full Inference Stack: Custom image preprocessing, model parsing (JSON + NPZ), and classification
- Profiler-Driven Development: Extensive use of Nsight Compute for bottleneck analysis
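As a rough illustration of the tensor core building block (not the repo's exact kernel), here is a minimal WMMA sketch that multiplies FP16 tiles into an FP32 accumulator. In the implicit GEMM path, the A operand would be the im2col view of the input activations formed on the fly; this simplified version assumes both operands are already materialized row-major matrices with dimensions divisible by 16 (requires sm_70+):

```cpp
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Each warp computes one 16x16 tile of C = A (MxK) * B (KxN).
__global__ void wmma_gemm_tile(const half* a, const half* b, float* c,
                               int M, int N, int K) {
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
    wmma::fill_fragment(acc, 0.0f);

    // March along K in 16-wide steps, accumulating on tensor cores.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, a + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, b + k * N + warpN * 16, N);
        wmma::mma_sync(acc, a_frag, b_frag, acc);
    }
    wmma::store_matrix_sync(c + warpM * 16 * N + warpN * 16, acc, N,
                            wmma::mem_row_major);
}
```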
Detailed documentation is available in the docs directory and in the accompanying Medium articles:
- Implementation Details (Medium Post): Architecture, data flow, and kernel design
- Tensor Core Optimizations (Medium Post): Overview of Tensor Cores, optimization, and profiling steps
- Experimenting with tile sizes (32×32, 64×64) for better occupancy and compute/memory usage (see the sketch after this list)
- Improved memory coalescing and prefetching strategies
- Profiling the PyTorch implementation to understand where the remaining runtime differences come from
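As a hedged sketch of the tile-size experiment, the shared-memory tile can be lifted into a template parameter so variants can be benchmarked from one kernel source (a generic tiled GEMM stand-in here, not the repo's convolution kernel; assumes `N` is a multiple of `TILE`):

```cpp
template <int TILE>
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N; t += TILE) {
        // Cooperative, coalesced loads of one tile of A and one of B.
        As[threadIdx.y][threadIdx.x] = A[row * N + t + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

// Usage: matmul_tiled<32><<<dim3(N / 32, N / 32), dim3(32, 32)>>>(A, B, C, N);
// Note: a 64x64 tile exceeds the 1024-thread block limit with one thread per
// output element, so it would need register blocking (each thread computing a
// 2x2 micro-tile). Larger tiles raise arithmetic intensity but cost more
// shared memory and registers, which can lower occupancy; profiling decides.
```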
- CUDA C++ with WMMA API for tensor cores
- OpenCV for image preprocessing
- nlohmann/json for model architecture parsing
- cnpy for NumPy .npz weight loading
- Nsight Compute for performance profiling
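To make the preprocessing step concrete, a hedged sketch assuming the standard ImageNet recipe (224×224 resize, per-channel mean/std normalization, HWC→CHW); the repo's actual pipeline may differ:

```cpp
#include <opencv2/opencv.hpp>
#include <string>
#include <vector>

// Standard ImageNet normalization constants (an assumption; check the repo).
static const float kMean[3] = {0.485f, 0.456f, 0.406f};
static const float kStd[3]  = {0.229f, 0.224f, 0.225f};

// Load an image and produce a 3x224x224 float tensor in CHW order,
// ready to copy to the device with cudaMemcpy.
std::vector<float> preprocess(const std::string& path) {
    cv::Mat img = cv::imread(path, cv::IMREAD_COLOR);  // BGR, 8-bit
    cv::cvtColor(img, img, cv::COLOR_BGR2RGB);
    cv::resize(img, img, cv::Size(224, 224));
    img.convertTo(img, CV_32FC3, 1.0 / 255.0);         // scale to [0, 1]

    std::vector<float> chw(3 * 224 * 224);
    for (int c = 0; c < 3; ++c)
        for (int y = 0; y < 224; ++y)
            for (int x = 0; x < 224; ++x)
                chw[(c * 224 + y) * 224 + x] =
                    (img.at<cv::Vec3f>(y, x)[c] - kMean[c]) / kStd[c];
    return chw;
}
```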
Inspired by PyTorch's cuDNN implementation and the original ResNet paper (He et al., 2015).