Personal Notes: Basic NVIDIA GPU Architecture
within the context of the Hopper Architecture
Why Hopper? Because that’s what I currently have on my hands and yeah this is going to be NVIDIA-specific.
Starter: CPUs Vs GPUs
CPUs are for general-purpose computing. They typically expose 1 to 128 cores that can handle complex instructions, each optimized for latency and serial tasks. They have sophisticated features like branch prediction, speculative execution, and out-of-order execution. They are optimized for finishing a task as fast as possible, with deep & large cache hierarchies (L1, L2, L3) to reduce the time spent waiting for memory loads/stores. They also use high clock speeds and complex control logic. Because each CPU core is so general purpose and expected to handle everything, it is significantly more expensive than a GPU core in almost every aspect - silicon area, power consumption, design complexity, and of course money. A single CPU core can occupy 5~20 mm² of die area and use 1~5 watts of power, even in energy-efficient CPUs.
GPUs on the other hand are for massively parallel data processing. They generally have thousands of cores (the H100 has 16,896 FP32 cores, for example) that are a lot simpler and are optimized for running instructions in groups, which makes them most suitable for high-throughput parallel tasks. They achieve that in three ways: (1) being a lot simpler - they do away with complex control logic and instructions, which saves silicon area. (2) They implement wide SIMT/SIMD execution, where one instruction is sent to many cores at the same time and manipulates many data points in a single step. (3) They trade memory latency for memory bandwidth (i.e. they allow much more data to be transferred per second, but each piece of data may take longer to reach its destination). A single GPU core (which is really more of an ALU) can occupy 0.01~0.05 mm² of die area and use 10~100 mW of power. That said, the actual GPU chip can get quite big and eat up a lot of power given that it has thousands of cores.
A simple (but not fully accurate) way to think about the two is that a CPU is like a fast maneuverable 2-seater sports car that you can use to get from point A to point B very quickly, doing cool tricks and drifting along the way to bypass road obstacles. A GPU on the other hand is like a huge freight truck - slow to turn, can’t access all the roads, but it can transport a whole town of people with high top speeds in the straights.
NVIDIA GPUs - Hardware Model
The most basic unit of the GPU hardware is a CUDA core. It is similar to a CPU core but much simpler. A cluster of CUDA cores is bundled together into groups/units that share resources - NVIDIA materials usually call these processing blocks or SM sub-partitions, though the whitepaper doesn’t name them explicitly. On H100 these groups look like this:
In the above we see the exact core types - 1 Tensor core (not sure what that contains currently), 16 x INT32 cores, 16 x FP64 cores and 32 x FP32 cores. The resources that are shared are:
The Register File: A low-latency memory bank that stores the cores’/ALUs’ registers. It holds each thread’s temporary variables and state (much like registers in a CPU, but scaled for thousands of threads). For H100 we see that in a core group the register file is (16,384 x 32 bits) / (8 bits/byte) = 65,536 bytes, i.e. 64 KiB (so 256 KiB per SM across the four groups).
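These register-file limits can be queried at runtime. A minimal sketch using the CUDA runtime API (device 0 is assumed; the attribute enums are real CUDA runtime names):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int regsPerSM = 0, regsPerBlock = 0;
    cudaDeviceGetAttribute(&regsPerSM,
        cudaDevAttrMaxRegistersPerMultiprocessor, /*device=*/0);
    cudaDeviceGetAttribute(&regsPerBlock,
        cudaDevAttrMaxRegistersPerBlock, /*device=*/0);
    // Each register is 32 bits (4 bytes), so bytes = regs * 4.
    printf("Registers per SM:    %d (%d KiB)\n", regsPerSM, regsPerSM * 4 / 1024);
    printf("Registers per block: %d\n", regsPerBlock);
    return 0;
}
```

On an H100 I’d expect 65,536 registers per SM (256 KiB), matching the 4 x 16,384 per-group math above.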
Load/Store Units: These are responsible for executing memory instructions (like LD/ST) moving data between the register file and the L1 Cache, Shared Memory, or Global Memory (more on those later).
The Special Function Unit (aka SFU): A unit that handles special math operations like sin, cos, exp, sqrt, log, etc. - complex, relatively infrequent floating-point operations that would be inefficient to run on the general-purpose FP units.
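The SFUs are what the “fast math” intrinsics compile down to. A minimal sketch (the kernel name is mine; `__sinf`/`__expf` are real CUDA intrinsics that trade a little accuracy for SFU-backed throughput, whereas plain `sinf`/`expf` expand into longer FP32 instruction sequences):

```cuda
// Fast-math intrinsics (__sinf, __expf, __logf, ...) map to SFU instructions.
// Compiling with -use_fast_math rewrites sinf -> __sinf automatically.
__global__ void sfu_demo(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __sinf(in[i]) + __expf(in[i]);  // SFU-backed approximations
    }
}
```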
The Warp-Scheduler and the Dispatch Unit: As we’ll see later, NVIDIA GPUs chunk threads that are executing instructions in lock-step into groups of 32, called Warps. The Warp-Scheduler picks a Warp that’s ready for execution and, using the Dispatch Unit, issues instructions from that warp into the execution units/cores each cycle. It is also responsible for ensuring that data dependencies are respected (memory latency, barriers, etc.) and can stall warps that are waiting on memory or control resolution. In the schematic above, “32 threads/clk” signifies that this grouped unit can issue 32 thread instructions per cycle.
The L-0 Instruction Cache: A very small (assumed ~100s of bytes) ultra-low latency instruction cache private to the warp scheduler. It holds recently fetched instructions for the warps managed by the scheduler.
Each of these groups of cores + resources are then grouped into Streaming Multiprocessors (SMs). For the H100, an SM consists of 4 groups of cores that we saw above.
They also contain the following components:
One L1 Instruction Cache: If there is a cache miss in the L0 Cache the GPU will look here.
One Tensor Memory Accelerator (TMA): A Hopper addition - a dedicated unit for asynchronous bulk data movement (e.g. multi-dimensional tensor tiles) between global memory and shared memory, offloading address generation from the cores.
One L1 Data Cache / Shared Memory Unit: This is a 256KB SRAM chip that can be partitioned dynamically by the programmer or the CUDA runtime to act as an L1 Data Cache and Shared Memory for the warps in the SM.
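The carveout between L1 and shared memory can be hinted per kernel. A minimal sketch (the kernel is hypothetical; `cudaFuncSetAttribute` and the carveout attribute are real runtime API):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that uses dynamically sized shared memory.
__global__ void smem_kernel(float* out) {
    extern __shared__ float tile[];
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

int main() {
    // Hint: prefer ~50% of the unified L1/shared SRAM as shared memory.
    // The runtime rounds to the nearest supported carveout configuration.
    cudaFuncSetAttribute(smem_kernel,
        cudaFuncAttributePreferredSharedMemoryCarveout, 50);

    float* out;
    cudaMalloc(&out, sizeof(float));
    smem_kernel<<<1, 32, 32 * sizeof(float)>>>(out);  // 128 B dynamic smem
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```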
Four Tex Units: These are Texture Units - they fetch textures/images from memory and implement efficient interpolation, sampling, filtering, and format conversion. This may sound useful only for graphics workloads or AI use-cases that manipulate images, but they can be generally useful: they can act as an extra read-only cache path to global memory and may help offload generic LD/ST pipelines.
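Compute kernels can route loads through that read-only path explicitly. A minimal sketch (kernel name is mine; `__ldg` and the `const __restrict__` hint are real CUDA constructs):

```cuda
// Marking inputs const __restrict__ (or using __ldg) lets the compiler issue
// loads through the read-only/texture data path - useful for data that is
// read many times but never written by the kernel.
__global__ void scale(const float* __restrict__ in, float* out, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = k * __ldg(&in[i]);  // explicit read-only-cache load
}
```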
Finally all SMs are put together (on H100 there are 132 SMs) and share the following components:
L2 Cache: A second level of caching shared across all SMs. It is on the same chip making it still faster than the global memory. For H100 this is around 50MB.
Memory Controllers to Global GPU Memory (HBM3): The global memory is off-chip, thus the need for the memory controllers. It is the slowest part of the memory hierarchy within the GPU. For H100 the global memory is ~80GB with up to 3+ TB/s of bandwidth.
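The chip-level numbers above (SM count, L2 size, global memory) are all visible through the device properties API. A minimal sketch, assuming device 0:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    printf("SMs:           %d\n",     prop.multiProcessorCount);   // 132 on H100 SXM
    printf("L2 cache:      %d MiB\n", prop.l2CacheSize >> 20);     // ~50 MB on H100
    printf("Global memory: %zu GiB\n", prop.totalGlobalMem >> 30); // ~80 GB on H100
    return 0;
}
```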
GigaThread Engine with MIG Control: This is the chip-level hardware scheduler, responsible for distributing work (thread blocks) to SMs with the goal of maximizing SM occupancy and balance. The MIG (Multi-Instance GPU) Control part is a feature that allows a single GPU to be partitioned into multiple independent GPU instances. This provides isolation, and I think of it as the hypervisor logic for a GPU.
NVLink units: These are high-speed GPU interconnects (much faster than PCIe Gen5 - 900GB/s vs 64GB/s) which allow the GPU to connect to: (1) other GPUs, (2) a CPU, (3) an NVSwitch, which is essentially a centralized NVLink router. Besides being used for transferring raw data among GPUs, NVLink also supports Unified Memory, allowing the GPUs to share a single memory space and simplifying data management.
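The Unified Memory part is exposed through `cudaMallocManaged`. A minimal sketch (kernel name is mine): one pointer is valid on both host and device, and pages migrate on demand - over NVLink on NVLink-connected systems:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1024;
    int* data;
    // Managed allocation: accessible from both CPU and GPU code.
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = i;

    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();  // required before the host touches the data again

    printf("data[0] = %d, data[n-1] = %d\n", data[0], data[n - 1]);
    cudaFree(data);
    return 0;
}
```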
NVIDIA GPUs - Conceptual Programming Model
Let’s start with an example:
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...<snip>...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...<snip>...
}
CUDA C++ extends C++ by allowing the programmer to define C++ functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C++ functions. The generic declaration of a kernel launch looks like this:
kernel_name<<<num_blocks, threads_per_block>>>(...);
A CUDA thread is the smallest unit of execution for NVIDIA GPUs. A group of 32 threads that execute instructions in lock-step (more on SIMT later) is called a Warp. A group of Warps is called a Block. The collection of blocks launched by a kernel is called a Grid. In the original VecAdd example, the hierarchy looks like this:
Grid
└── Block[0]
    ├── Warp[0]
    │   └── Thread[0..31]
    ├── Warp[1]
    │   └── Thread[32..63]
    ...
    └── Warp[X]
        └── Thread[N-32..N-1]
Because we have one block and N threads per block.
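A thread can locate itself in this hierarchy from its built-in indices. A minimal sketch (kernel name is mine) that also shows the global-index formula needed once there is more than one block:

```cuda
// Each thread computes its global id, its warp index within the block,
// and its lane (position within the warp).
__global__ void whereami(int* warp_of, int* lane_of, int n) {
    int gid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    int warp = threadIdx.x / 32;   // warp index within the block
    int lane = threadIdx.x % 32;   // lane within the warp (0..31)
    if (gid < n) {
        warp_of[gid] = warp;
        lane_of[gid] = lane;
    }
}
// Launch: whereami<<<num_blocks, threads_per_block>>>(warp_of, lane_of, n);
```

The built-in `warpSize` variable can be used instead of the literal 32.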
Hardware <-> Software Interface & SIMT
With all of the above context we can now read the Hardware Implementation chapter of the CUDA Programming Guide - pasted below - which bridges the hardware with our conceptual programming model:
The NVIDIA GPU architecture is built around SMs. When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to SMs with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.
A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a large number of threads, it employs a unique architecture called SIMT (Single-Instruction, Multiple-Thread). The instructions are pipelined, leveraging instruction-level parallelism within a single thread, as well as extensive thread-level parallelism through simultaneous hardware multithreading. Unlike CPU cores, instructions are issued in order and there is no branch prediction or speculative execution.
When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps and each warp gets scheduled by a warp scheduler for execution. The way a block is partitioned into warps is always the same; each warp contains threads of consecutive, increasing thread IDs with the first warp containing thread 0.
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp executes each branch path taken, disabling threads that are not on that path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
The SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread. In contrast with SIMD vector machines, SIMT enables programmers to write thread-level parallel code for independent, scalar threads, as well as data-parallel code for coordinated threads. For the purposes of correctness, the programmer can essentially ignore the SIMT behavior; however, substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge. In practice, this is analogous to the role of cache lines in traditional code: Cache line size can be safely ignored when designing for correctness but must be considered in the code structure when designing for peak performance. Vector architectures, on the other hand, require the software to coalesce loads into vectors and manage divergence manually.
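To make the divergence penalty concrete, here is a minimal sketch (kernel names are mine). In the divergent kernel the branch condition differs between lanes of the same warp, so each warp executes both paths serially with half its lanes masked off; in the uniform kernel all 32 lanes of a warp agree, so no serialization occurs:

```cuda
__global__ void divergent(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) x[i] *= 2.0f;   // even lanes take this path...
    else            x[i] += 1.0f;   // ...odd lanes take this one
}

__global__ void uniform(float* x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) x[i] *= 2.0f;  // whole warps agree on the branch
    else                   x[i] += 1.0f;
}
```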
Personal Notes & Practical Highlights in the above
Even if your block size is not a multiple of 32 (e.g., 33 threads per block → 2 warps), the second warp will be mostly empty but still consumes a full warp’s scheduling slot, leading to underutilization.
SIMT model: Warp executes the same instruction on 32 threads. Divergence causes serialization which leads to bad performance.
Each SM has multiple SFUs, and they run slower than dedicated FP32/INT32 units so it is not ideal to saturate them in inner loops.
Registers from the register file are allocated at compile time via PTX/SASS (the equivalent of GPU assembly) and can affect occupancy (too many registers used = fewer active warps).
The LD/ST units are highly optimized for coalesced memory access (grouping thread accesses for efficiency). Memory divergence across threads in a warp can reduce throughput.
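A minimal sketch of the coalescing point (kernel names are mine): in the coalesced kernel the 32 threads of a warp read 32 consecutive floats, which the LD/ST units can merge into a few wide transactions; in the strided kernel each thread reads 32 floats apart, so the warp’s loads scatter across many cache lines and effective throughput drops:

```cuda
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // neighboring lanes read neighboring addresses
}

__global__ void copy_strided(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = (i * 32) % n;        // stride-32 access pattern (assumes n > 0)
    if (i < n) out[i] = in[j];
}
```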