This assignment has three parts:
- Part A: Vector add & matrix multiply kernels (00 vs 01) and speedup analysis
- Part B: CPU vs GPU performance with and without Unified Memory; generate two charts
- Part C: 2D convolution (naive CUDA, tiled CUDA, cuDNN) that prints exactly three lines
Requirements:
- GPU node required
- CUDA toolchain available either on the host or via Singularity
- cuDNN available/linked for Part C
NYU HPC (Burst Node) module+container usage:
cd /scratch/[NetID]
git clone https://github.com/aaf091/cuda-lab4-hpml.git
cd cuda-lab4-hpml
/scratch/work/public/singularity/run-cuda-12.2.2.bash
Repository layout:
Part A/
vecadd00, vecadd01, matmult00, matmult01 sources (starter + your edits)
Makefile
Part B/
vecaddcpu, vecaddgpu00 (no UM), vecaddgpu01 (UM)
collect_partB_results.sh # optional helper
plot_partB.py / plot_from_cli_log.py # optional plotting helpers
partB_results_step2.csv / partB_results_step3.csv # produced during runs
Part C/
convolution.cu → builds ./convolution (or ./conv)
Makefile
# Part A
cd "Part A"
make clean
make vecadd00 vecadd01 matmult00 matmult01
# Part B
cd "../Part B"
make clean
make
# Part C
cd "../Part C"
make clean
makecd "Part A"
make clean
make
# vecadd (00 and 01 take the same "values per thread" argument)
./vecadd00 500
./vecadd00 1000
./vecadd00 2000
./vecadd01 500
./vecadd01 1000
./vecadd01 2000
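For orientation, here is a minimal sketch of a vector-add kernel that takes a "values per thread" argument, with two possible thread-to-element mappings to illustrate the kind of difference the 00/01 comparison probes. The kernel names, launch shape, and initialization below are illustrative assumptions, not the starter code.

```cuda
// Illustrative sketch (not the starter code): two ways a vector-add kernel can
// use a "values per thread" argument. vecadd00/vecadd01 differ in how threads
// are mapped to elements, which is what the runtime comparison above probes.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread owns a contiguous chunk of `valuesPerThread` elements.
__global__ void addContiguous(const float *a, const float *b, float *c,
                              int valuesPerThread) {
    int start = (blockIdx.x * blockDim.x + threadIdx.x) * valuesPerThread;
    for (int i = 0; i < valuesPerThread; ++i)
        c[start + i] = a[start + i] + b[start + i];
}

// Threads in a warp touch adjacent elements each iteration (coalesced access).
__global__ void addInterleaved(const float *a, const float *b, float *c,
                               int valuesPerThread) {
    int stride = gridDim.x * blockDim.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < valuesPerThread; ++i, idx += stride)
        c[idx] = a[idx] + b[idx];
}

int main(int argc, char **argv) {
    int valuesPerThread = (argc > 1) ? atoi(argv[1]) : 500;  // e.g. 500/1000/2000
    const int threads = 256, blocks = 32;                    // illustrative sizes
    const int n = blocks * threads * valuesPerThread;

    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    addInterleaved<<<blocks, threads>>>(a, b, c, valuesPerThread);
    cudaDeviceSynchronize();
    printf("c[0]=%.1f  c[n-1]=%.1f\n", c[0], c[n - 1]);

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```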
# matmult00 (FOOTPRINT_SIZE=16) → pass 16/32/64 for N = 256/512/1024
./matmult00 16
./matmult00 32
./matmult00 64
# matmult01 (FOOTPRINT_SIZE=32) → pass 8/16/32 for N = 256/512/1024
./matmult01 8
./matmult01 16
./matmult01 32
Optional modernization (silence warnings):
sed -i 's/cudaThreadSynchronize()/cudaDeviceSynchronize()/g' vecadd.cu matmult.cu
sed -i 's/cudaThreadExit()/cudaDeviceReset()/g' vecadd.cu
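The FOOTPRINT_SIZE comments above imply that the command-line argument is the number of tiles per dimension, i.e. N = argument × FOOTPRINT_SIZE. A minimal shared-memory tiled multiply built around that convention is sketched below; the kernel, host code, and initialization are illustrative assumptions rather than the starter code (matmult01 reportedly raises FOOTPRINT_SIZE to 32 and has each thread compute a 2×2 sub-tile for more data reuse).

```cuda
// Illustrative sketch (not the starter code): a shared-memory tiled multiply
// where each block computes one FOOTPRINT_SIZE x FOOTPRINT_SIZE output tile,
// so the CLI argument is tiles-per-dimension and N = arg * FOOTPRINT_SIZE
// (16*16=256 ... 64*16=1024).
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define FOOTPRINT_SIZE 16   // also the thread-block edge in this simple sketch

__global__ void matmulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[FOOTPRINT_SIZE][FOOTPRINT_SIZE];
    __shared__ float Bs[FOOTPRINT_SIZE][FOOTPRINT_SIZE];

    int row = blockIdx.y * FOOTPRINT_SIZE + threadIdx.y;
    int col = blockIdx.x * FOOTPRINT_SIZE + threadIdx.x;
    float acc = 0.0f;

    // March a pair of input tiles across the shared dimension.
    for (int t = 0; t < N / FOOTPRINT_SIZE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * FOOTPRINT_SIZE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * FOOTPRINT_SIZE + threadIdx.y) * N + col];
        __syncthreads();
        for (int k = 0; k < FOOTPRINT_SIZE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

int main(int argc, char **argv) {
    int tiles = (argc > 1) ? atoi(argv[1]) : 16;   // e.g. 16 -> N = 256
    int N = tiles * FOOTPRINT_SIZE;
    size_t bytes = (size_t)N * N * sizeof(float);

    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < N * N; ++i) { A[i] = 1.0f; B[i] = 1.0f; }

    dim3 block(FOOTPRINT_SIZE, FOOTPRINT_SIZE), grid(tiles, tiles);
    matmulTiled<<<grid, block>>>(A, B, C, N);
    cudaDeviceSynchronize();
    printf("C[0] = %.1f (expect %d)\n", C[0], N);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```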
Executables:
- vecaddcpu (CPU baseline)
- vecaddgpu00 = no Unified Memory (device malloc/memcpy)
- vecaddgpu01 = Unified Memory
- Scenarios (second CLI arg for GPU programs):
  1 = 1 block × 1 thread
  2 = 1 block × 256 threads
  3 = many blocks × 256 threads
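Below is a small sketch of how the two CLI arguments could be decoded into a launch configuration; the argument parsing and variable names are assumptions. Note that scenarios 1 and 2 launch far fewer threads than elements, so the corresponding kernel has to iterate (e.g. with a grid-stride loop) to cover all n elements.

```cuda
// Sketch: decode "<K in millions> <scenario>" into n and a launch configuration.
// Scenarios 1 and 2 under-subscribe the GPU, so the kernel must loop
// (grid-stride style) to touch every element.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    int millions = (argc > 1) ? atoi(argv[1]) : 1;   // first arg: K in millions
    int scenario = (argc > 2) ? atoi(argv[2]) : 3;   // second arg: 1, 2, or 3
    long long n = (long long)millions * 1000000LL;

    dim3 grid(1), block(1);
    if (scenario == 1)      { grid = dim3(1); block = dim3(1);   }  // 1 block x 1 thread
    else if (scenario == 2) { grid = dim3(1); block = dim3(256); }  // 1 block x 256 threads
    else { grid = dim3((unsigned)((n + 255) / 256)); block = dim3(256); }  // many blocks x 256

    printf("n=%lld elements, launching %u block(s) x %u thread(s)\n",
           n, grid.x, block.x);
    return 0;
}
```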
cd "Part B"
make clean
make
# CPU baseline (K in millions)
./vecaddcpu 1
./vecaddcpu 5
./vecaddcpu 10
./vecaddcpu 50
./vecaddcpu 100
# Step 2 — WITHOUT Unified Memory
./vecaddgpu00 1 1
./vecaddgpu00 5 1
./vecaddgpu00 10 1
./vecaddgpu00 50 1
./vecaddgpu00 100 1
./vecaddgpu00 1 2
./vecaddgpu00 5 2
./vecaddgpu00 10 2
./vecaddgpu00 50 2
./vecaddgpu00 100 2
./vecaddgpu00 1 3
./vecaddgpu00 5 3
./vecaddgpu00 10 3
./vecaddgpu00 50 3
./vecaddgpu00 100 3
# Step 3 — WITH Unified Memory
./vecaddgpu01 1 1
./vecaddgpu01 5 1
./vecaddgpu01 10 1
./vecaddgpu01 50 1
./vecaddgpu01 100 1
./vecaddgpu01 1 2
./vecaddgpu01 5 2
./vecaddgpu01 10 2
./vecaddgpu01 50 2
./vecaddgpu01 100 2
./vecaddgpu01 1 3
./vecaddgpu01 5 3
./vecaddgpu01 10 3
./vecaddgpu01 50 3
./vecaddgpu01 100 3
cd "Part C"
make clean
make
./convolution
Expected output (exactly three lines):
C1_<checksum>,<time_ms>
C2_<checksum>,<time_ms>
C3_<checksum>,<time_ms>
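For reference, here is a sketch of producing one line in that format using CUDA events. The stand-in kernel and the checksum definition (sum of the output elements) are illustrative assumptions; use whatever checksum the assignment spec prescribes.

```cuda
// Sketch of producing one output line in the expected format. The stand-in
// kernel and the checksum definition are illustrative assumptions.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.0f;            // stand-in for the real convolution
}

int main() {
    const int n = 1 << 20;
    float *out;
    cudaMallocManaged(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dummyKernel<<<(n + 255) / 256, 256>>>(out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaDeviceSynchronize();

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // kernel time in milliseconds

    double checksum = 0.0;
    for (int i = 0; i < n; ++i) checksum += out[i];

    // One such line per implementation (presumably C1 = naive, C2 = tiled, C3 = cuDNN).
    printf("C1_%.6f,%.3f\n", checksum, ms);

    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(out);
    return 0;
}
```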
Results and analysis:
- Part A
  - Runtimes for vecadd00 vs vecadd01 and matmult00 vs matmult01
  - Analysis
    • vecadd: The “01” variant is consistently slower, indicating the kernel is bandwidth-bound and the change reduced effective memory throughput (coalescing/ILP/occupancy did not improve).
    • matmult: The “01” kernel wins at larger sizes because assigning a 2×2 tile per thread increases arithmetic intensity and data reuse, which pays off as N grows.
- Part B
  - Two charts (without UM and with UM), each including the CPU baseline
  - Analysis
    • End-to-end time is dominated by allocation (~380–400 ms per run) and is almost independent of K; allocation accounts for ~85–90% of the total, while kernel execution is only ~55–68 ms. As a result, the kernel-only curves are nearly flat and the total-time curves barely change with K.
    • Changing the launch configuration (scenario scaling) has modest impact: scenario 3 (many blocks × 256 threads) is not consistently faster because the kernel is memory-bandwidth-bound and the extra parallelism does not reduce the fixed costs; measured differences are within a few milliseconds.
    • Unified Memory vs explicit device memory shows a negligible difference on this workload: UM allocation costs about the same as cudaMalloc, and with linear, single-touch access there is little page-migration overhead.
    • Takeaway: to see GPU speedups, reuse allocations (time only the compute region), consider pinned host memory plus async copies, and amortize setup across multiple iterations (see the timing sketch after this list).
- Part C
  - Program output: the three C1/C2/C3 checksum and time lines shown above
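As referenced in the Part B takeaway above, here is a minimal sketch of timing the compute region separately from allocation and copies with CUDA events, using pinned host memory and async copies. The kernel, sizes, and overall structure are illustrative assumptions rather than the starter programs.

```cuda
// Sketch of separating fixed costs (allocation, copies) from the kernel time
// with CUDA events, using pinned host memory and async copies.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

static float elapsedMs(cudaEvent_t s, cudaEvent_t e) {
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, s, e);
    return ms;
}

int main() {
    const int n = 10 * 1000000;                 // K = 10 million elements
    const size_t bytes = n * sizeof(float);

    cudaEvent_t t0, t1, t2, t3;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventCreate(&t2); cudaEventCreate(&t3);

    // Pinned host buffers allow truly asynchronous host<->device copies.
    float *ha, *hb, *hc;
    cudaMallocHost(&ha, bytes); cudaMallocHost(&hb, bytes); cudaMallocHost(&hc, bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    cudaEventRecord(t0);
    float *da, *db, *dc;                        // explicit device allocations
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaEventRecord(t1);

    cudaMemcpyAsync(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpyAsync(db, hb, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t2);

    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);   // the compute region
    cudaEventRecord(t3);

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost); // also synchronizes
    printf("alloc=%.2f ms  copy-in=%.2f ms  kernel=%.2f ms  (hc[0]=%.1f)\n",
           elapsedMs(t0, t1), elapsedMs(t1, t2), elapsedMs(t2, t3), hc[0]);

    // A Unified Memory variant would replace cudaMalloc/cudaMemcpy with one
    // cudaMallocManaged per buffer and drop the explicit copies.
    cudaFree(da); cudaFree(db); cudaFree(dc);
    cudaFreeHost(ha); cudaFreeHost(hb); cudaFreeHost(hc);
    return 0;
}
```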
Troubleshooting:
- Deprecated CUDA warnings
  Warnings like "cudaThreadSynchronize() is deprecated" are harmless. To modernize, replace in the sources:
  cudaThreadSynchronize() → cudaDeviceSynchronize()
  cudaThreadExit() → cudaDeviceReset()
  sed -i 's/cudaThreadSynchronize()/cudaDeviceSynchronize()/g' <list of files separated by spaces>
  sed -i 's/cudaThreadExit()/cudaDeviceReset()/g' <list of files separated by spaces>
- cuDNN linking errors (Part C)
  Ensure -L/usr/local/cuda/lib64 -lcudnn is on the link line and that LD_LIBRARY_PATH contains the CUDA libraries (a minimal link check is sketched after this list):
  export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
- Singularity doesn’t see your home files
  Start the shell with --home $HOME or ensure the home directory is bind-mounted by default.
- Keep commands with spaces in folder names quoted, e.g., cd "Part B".
- If your cluster forbids git on compute nodes, pull/push from login nodes only.
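For the cuDNN linking issue above, a minimal program that only creates a handle and prints the library version is enough to confirm the toolchain is set up; the file name and build line below are examples.

```cuda
// Minimal cuDNN link check (illustrative): if this compiles, links, and runs,
// the -lcudnn / LD_LIBRARY_PATH setup for Part C is working.
// Build (file name is an example):  nvcc check_cudnn.cu -o check_cudnn -lcudnn
#include <cstdio>
#include <cudnn.h>

int main() {
    cudnnHandle_t handle;
    cudnnStatus_t st = cudnnCreate(&handle);
    if (st != CUDNN_STATUS_SUCCESS) {
        printf("cudnnCreate failed: %s\n", cudnnGetErrorString(st));
        return 1;
    }
    printf("cuDNN version: %zu\n", (size_t)cudnnGetVersion());
    cudnnDestroy(handle);
    return 0;
}
```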
