LLM Labs

GPUs Part 3 - Going from here

2024-06-30T00:00:00+00:00

Hopefully, you have read part 1 and part 2 of the Learning about GPUs series. This part provides an index of useful resources for building a more advanced understanding of GPUs.

Learning about the fundamentals

[Book] Programming Massively Parallel Processors, A Hands-on Approach By David B. Kirk, Wen-mei W. Hwu
1. This is the best resource to learn about parallel programming and GPUs. The first 4 chapters explain the fundamentals of GPU hardware and its programming model
[YouTube playlist] 12 to 14 videos in COS 436
CUDA Mode
1. Very good resource for learning about GPUs/CUDA/Triton. They also have a very active Discord
CUDA C++ programming guide
1. Official guide from Nvidia which can be used as a reference
[YouTube playlist] CUDA teaching center
1. Short series to get started in CUDA and get a refresher on GPU hardware

Notable Talks

Notable blogs

What every developer should know about GPU computing
1. Gentle introduction to the GPU programming model
What Shapes Do Matrix Multiplications Like?
1. Puzzles to test your understanding of GPU hardware
Making Deep Learning Go Brrrr From First Principles
How is LLaMa.cpp possible?

Programming tutorials

Tiled matrix multiplication in CUDA
Matrix multiplication in pure CUDA: How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
GPU puzzles by Srush
Triton puzzles by Srush
LLM.c LLM training in raw C/CUDA

Citations

For attribution, please cite this as

@article{romit2024gpus3,
  title   = {GPUs Part 3},
  author  = {Jain, Romit},
  journal = {cmeraki.github.io},
  year    = {2024},
  month   = {June},
  url     = {https://cmeraki.github.io/gpu-part3.html}
}

GPUs Part 2 - Understanding the GPU programming model

2024-05-26T00:00:00+00:00

Written by Romit Jain

Part 1 in the series gives a basic understanding of the GPU hardware. This blog will describe the programming model that is used to run programs on GPUs.

Hardware to software mapping and programming model of the GPU

2 things to keep in mind before we start:

The physical concepts of hardware do not necessarily translate one-to-one to logical concepts in software.

In GPU programming, a kernel is a function that is written to be executed on the GPU. A program can have multiple kernels and they can be “launched” from the CPU.

Threads

Each kernel is executed by a thread in the GPU. And every thread executes the same kernel (assuming there is only a single kernel in the program). This makes it necessary to write kernels such that a single function can operate on all the data points. When a kernel is launched, multiple GPU threads are spawned that execute instructions written inside that kernel. The number of threads that are spawned at once is configurable.

Each thread has a small amount of memory associated with it, which is called local memory. Apart from that, threads can also access the shared memory, L2 cache, and global memory.

Physically, threads are assigned to cores. Cores execute software threads.

Blocks

Threads are logically organized into blocks. Every block has a pre-defined number of threads assigned to it. Just for logical purposes, threads can be arranged inside a block in either a 1D, 2D, or 3D array layout. Blocks can be thought of as an array of threads. It’s important to understand that this 1D, 2D, or 3D arrangement is purely logical and for the developer’s convenience only. This arrangement is provided so it’s easier to visualize input and output data. For example, if the kernel needs to operate on a 100x100 matrix, then conceptually a kernel with a block size of 100 by 100 threads can be launched. That will start a total of $10^4$ (100x100) threads which can be mapped to the matrix. The kernel can be written such that each thread operates on exactly one element of the matrix. (In practice, CUDA caps a block at 1024 threads, so a matrix of this size would be covered by several smaller blocks; the single-block picture is only for illustration.)

In the physical world, every block is assigned an SM (Streaming multiprocessor). Throughout its execution, the block will only be executed on the same SM. Since every block is assigned an SM, it also has access to the SM’s shared memory (refer to Part 1 of the series for more context). All the threads that are part of a single block can access and share this memory.

Grids

Similar to how threads are organized in blocks, blocks are themselves organized into a grid. That allows the GPU to launch multiple blocks at one time. A single GPU has multiple SMs, so multiple blocks can be launched at once so that all of the SMs and cores are utilized. Let’s assume that the program executes 25 blocks and the GPU has 10 SMs, with each SM running one block at a time. Then the program will execute 10 blocks in the first wave, 10 blocks in the second wave, and 5 blocks in the third wave. The first two waves will have 100% utilization but the last wave will have only 50% utilization.
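The wave arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model: it assumes one block per SM at a time (as in the example), whereas real GPUs can often keep several blocks resident on one SM. The helper name wave_utilization is made up for this illustration.

```python
import math

def wave_utilization(num_blocks, num_sms):
    """Fraction of SMs that are busy in each wave, assuming one block per SM at a time."""
    if num_blocks < 0 or num_sms <= 0:
        raise ValueError("num_blocks must be >= 0 and num_sms must be > 0")
    num_waves = math.ceil(num_blocks / num_sms)
    utilization = []
    for wave in range(num_waves):
        # The last wave may have fewer blocks than there are SMs
        blocks_in_wave = min(num_sms, num_blocks - wave * num_sms)
        utilization.append(blocks_in_wave / num_sms)
    return utilization

print(wave_utilization(25, 10))  # [1.0, 1.0, 0.5]
```

Picking a block count that is a multiple of the SM count avoids the partially filled last wave.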

Blocks inside a grid can be organized in the same way that threads are organized inside a block. A grid can have a 1D, 2D, or 3D array layout of the blocks. The arrangement of blocks and threads is just logical. A single program only executes a single grid at a time. The grid has access to the global memory or HBM of the GPU.

Figure 1: Grids/Blocks/Threads layout Source: Borrowed from this excellent blog.

During execution, a total of threads per block (b) * number of blocks (num) physical threads are spawned. Each physical thread is numbered from 0 to (b*num)-1. So, how is the 2D or 3D structure of logical thread blocks mapped to the physical threads? By unrolling.

A 2D array layout can be unrolled to 1D. If it’s row-major ordering, then a 2D matrix after unrolling will look like this:

Figure 2: Element A[2][3] in the 2D matrix will be A[5] in the flattened 1D array. This is how the mapping of 2D blocks of thread to the 1D thread array is accomplished.
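As a small sketch of row-major unrolling, assuming zero-based indices: element (row, col) of a matrix with num_cols columns lands at position row * num_cols + col in the flattened array. The helper names flat_index and unflatten are hypothetical and only used for this illustration.

```python
def flat_index(row, col, num_cols):
    """Zero-based position of element (row, col) in a row-major flattened array."""
    return row * num_cols + col

def unflatten(index, num_cols):
    """Inverse mapping: recover (row, col) from a flat zero-based index."""
    return divmod(index, num_cols)

matrix = [[10, 11, 12],
          [20, 21, 22]]
flat = [value for row in matrix for value in row]  # row-major unrolling

# Every 2D position maps to the same value in the flattened view
for r in range(2):
    for c in range(3):
        assert flat[flat_index(r, c, num_cols=3)] == matrix[r][c]

print(flat_index(1, 2, num_cols=3))  # 5
print(unflatten(5, num_cols=3))      # (1, 2)
```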

When blocks and threads are arranged in this 1D, 2D, or 3D layout, CUDA maps them to the x-axis, y-axis, and z-axis in its programming model. This will be useful in the next section.

A simple example in CUDA

CUDA is a programming extension of C/C++ that helps write heterogeneous programs (that run on CPU and GPU). These programs make it possible to define and launch kernels from the CPU. CUDA is very powerful and offers a lot of ways to optimize the kernels. It’s just a bit … too verbose. Let’s write a very naive implementation of matrix multiplication to understand how CUDA works. A few CUDA function calls will be used throughout the code. They should be self-explanatory, but in case they are not, just google the syntax. This is a relatively simple kernel, so it should be easy to follow along.

Here are the general steps of writing and launching a kernel from CUDA:

Allocate the memory for the data (both input and output) in CPU memory (also called the host). Allocate memory for the input (X), weight matrix (W), and output (O). Here B is the batch size, N is the number of rows (or the sequence length in transformers), D_in is the number of columns (or the embedding dimension), and D_out is the hidden dimension.

float *X = (float*)malloc(B*N*D_in*sizeof(float));      // Input data
float *W = (float*)malloc(D_in*D_out*sizeof(float));    // Weights
float *O = (float*)malloc(B*N*D_out*sizeof(float));     // Output data

Allocate the memory for the data on the GPU (also called the device)

float *d_X, *d_W, *d_O;

cudaMalloc((void**) &d_X, B*N*D_in*sizeof(float));      //cudaMalloc is a CUDA function and allocates memory on the GPU memory
cudaMalloc((void**) &d_W, D_in*D_out*sizeof(float));
cudaMalloc((void**) &d_O, B*N*D_out*sizeof(float));

Copy the relevant data from the CPU memory to the GPU memory. Let’s assume X and W are loaded with the relevant data. Next, transfer that data to the GPU. Just for convenience, I have prefixed the variable that will reside on GPU memory with d_. These variables are a copy of X and W but allocated in the GPU memory.

cudaMemcpy(d_X, X, B*N*D_in*sizeof(float), cudaMemcpyHostToDevice);     // cudaMemcpy is again a CUDA function
cudaMemcpy(d_W, W, D_in*D_out*sizeof(float), cudaMemcpyHostToDevice);

Launch the kernel. Assuming that the kernel is called matMul, grid defines how the blocks are arranged and blocks defines how threads are arranged in each block. For this example, the grid will be a 1D array with one block per batch element (B blocks). blocks will have the same layout as a single output matrix (N x D_out). This means that every block will process a single output matrix from the batch and every thread will process a single cell of the output matrix. Note that this simple layout only works while N*D_out stays within the per-block thread limit (1024 on current CUDA GPUs).

// Launch B blocks, each block processing a single batch
dim3 grid(B);
/*
Arrange the threads inside a block in the same dimension as the output
i.e N*D_out, so that logically each thread corresponds to a single element in the
output matrix. Hence, each thread is responsible for computing a single element of the output.
*/
dim3 blocks(D_out, N); //D_out is first instead of N, because the function dim3 takes input in x, y, z notation. x axis is the columnar axis and y axis is the row axis

matMul<<<grid, blocks>>>(
    d_X,
    d_W,
    d_O,
    B,
    N,
    D_in,
    D_out
);

In total B*N*D_out threads are spawned, arranged in B blocks.

Copy the relevant data (usually only the output) from the GPU memory to the CPU memory. Once the kernel execution is completed, the output is copied from the GPU memory back to the CPU memory so that it can be used for any downstream processing.

cudaMemcpy(O, d_O, B*N*D_out*sizeof(float), cudaMemcpyDeviceToHost);

These 5 steps are followed in almost all GPU programs. Let’s now dive deep into the actual kernel:

__global__ void matMul(
    float* X,
    float* W,
    float* OO,
    int B,
    int N,
    int D_in,
    int D_out
) {
    /*
    This kernel takes a batch of data: (B x N x Din)
    and a weight matrix: (Din X Dout)
    and produces: (B x N x Dout)
    */

    int batch = blockIdx.x;
    int row = threadIdx.y;
    int col = threadIdx.x;

    int out_offset = N*D_out*batch + row*D_out + col;

    if ((batch < B) && (col < D_out) && (row < N)) {
        float sum = 0.0f;
        for (int i = 0; i < D_in; i++) {
            sum += X[N * D_in * batch + row * D_in + i] * W[i * D_out + col];
        }
        OO[out_offset] = sum;
    }
}

Remember that physically there is no 2D or 3D arrangement of threads. That construct is just provided by CUDA to help developers map the problems appropriately. Physically it’s just a single 1D array of threads. Since B*N*D_out threads are spawned, it maps exactly with the 1D layout of the output matrix.

To figure out which data a particular thread should process, the kernel just needs to figure out which thread is executing it. Depending on the batch, row, and column, each thread will load different parts of the input and weight matrix. These are called offsets and there are 4 offsets calculated in the code:

batch: Figure out which matrix in the batch this kernel is processing. blockIdx.x gives the block ID in the x-axis of the grid layout. Since there is a 1D grid, this is the only direction available.
row: Figure out within a matrix, which row is the kernel processing. Rows are mapped to the y-axis of the block layout.
col: Figure out within a matrix, which column is the kernel processing. Columns are mapped to the x-axis of the block layout.
out_offset: Finally, map the thread ID to the exact cell in the output matrix:
1. Skipping batch matrices to arrive at the current matrix. To skip one single matrix, move ahead N*D_out number of elements in the flattened 1D array
2. Skipping row number of rows. In a 1D flattened layout, a row can be skipped by moving ahead D_out elements.
3. Finally, adding col to the summation of the above two to arrive at the element.
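To check the out_offset formula, here is a small Python sketch that evaluates it for every (batch, row, col) triple and confirms that each cell of the flattened output is hit exactly once. The sizes B = 2, N = 3, D_out = 4 are arbitrary values chosen for the illustration.

```python
def out_offset(batch, row, col, N, D_out):
    """Same arithmetic as in the CUDA kernel: skip whole matrices, then rows, then columns."""
    return N * D_out * batch + row * D_out + col

B, N, D_out = 2, 3, 4  # small illustrative sizes

offsets = [
    out_offset(batch, row, col, N, D_out)
    for batch in range(B)
    for row in range(N)
    for col in range(D_out)
]

# Every index of the flattened (B x N x D_out) output appears exactly once, in order
assert offsets == list(range(B * N * D_out))
print(out_offset(1, 2, 3, N, D_out))  # 23, the last element
```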

Hopefully, this figure makes the offset calculation clearer.

Figure 3: If the output data and threads have exactly the same length (which is true in this case), they can be mapped 1 to 1. B, N, and D_out are the batch size, number of rows, and number of columns in the output data respectively. b, n, and d are the indices of the current batch, row, and column respectively.

After calculating these offsets, the corresponding row from X and the corresponding column from W are loaded followed by a single vector multiplication in a for loop. It is similar to out_offset calculation and should be easy to follow.
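To make the per-thread work concrete, the kernel can be emulated on the CPU in plain Python. This is only a sketch: the loops in launch stand in for the threads that the GPU would run in parallel, and the names matmul_thread and launch are invented for this illustration.

```python
def matmul_thread(X, W, O, B, N, D_in, D_out, block_idx_x, thread_idx_x, thread_idx_y):
    """Work done by one CUDA thread of the matMul kernel, on flat Python lists."""
    batch, row, col = block_idx_x, thread_idx_y, thread_idx_x
    if batch < B and col < D_out and row < N:
        total = 0.0
        for i in range(D_in):
            # Walk along one row of X and one column of W
            total += X[N * D_in * batch + row * D_in + i] * W[i * D_out + col]
        O[N * D_out * batch + row * D_out + col] = total

def launch(X, W, B, N, D_in, D_out):
    """Sequential stand-in for matMul<<<grid(B), blocks(D_out, N)>>>."""
    O = [0.0] * (B * N * D_out)
    for block in range(B):            # blockIdx.x
        for ty in range(N):           # threadIdx.y
            for tx in range(D_out):   # threadIdx.x
                matmul_thread(X, W, O, B, N, D_in, D_out, block, tx, ty)
    return O

# One batch, X is 2x3, W is 3x2, both stored as flat row-major lists
X = [1, 2, 3,
     4, 5, 6]
W = [1, 0,
     0, 1,
     1, 1]
print(launch(X, W, B=1, N=2, D_in=3, D_out=2))  # [4.0, 5.0, 10.0, 11.0]
```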

The complete code is present here. Running the code requires nvcc (the compiler for CUDA programs), an NVIDIA GPU to run the program, the CUDA drivers, and the CUDA toolkit installed.

A simple example in Triton

CUDA is amazing and allows a lot of optimizations. But it is quite verbose. Plus, it might not be comfortable for those coming from the machine learning or data science domain. OpenAI released a package called Triton that provides a Python environment to write kernels and compile them for supported GPUs. Triton allows us to write very performant kernels in Python directly.

But instead of working with individual threads, Triton works with blocks. Instead of each kernel instance being executed by a single thread, in Triton each kernel instance (a “program”) works on a whole block of data. Triton abstracts out the thread computation completely.

In the above example of matrix multiplication, instead of computing a single element of the output in the kernel, Triton can compute values for small “blocks” of the output matrix at once.

Figure 4: (Left) CUDA execution model vs (Right) Triton execution model Source: Triton documentation

Let’s reimplement the matrix multiplication example using Triton. The steps for Triton are very simple.

Implement a “wrapper” function to call the kernel. Below, the Triton kernel matmul_kernel is called from this wrapper. Define the grid and the block sizes similar to how it is done in CUDA. There are some assert statements to catch invalid inputs before they are passed to the kernel. Triton implicitly converts all torch tensors into pointers. It just needs to be verified that all tensors passed to the kernel are already on the GPU (for example via x.to('cuda:0')).
1. Unlike the CUDA example, however, the grid has 3 axes in this implementation. The first axis corresponds to the batch size, the second axis corresponds to the number of blocks of BLOCK_SIZE_ROW rows needed to cover all the rows, and the third axis corresponds to the number of blocks of BLOCK_SIZE_COL columns needed to cover all the columns.
2. During execution, this means that each kernel instance will compute one BLOCK_SIZE_ROW x BLOCK_SIZE_COL sub-matrix of the output for one matrix in the batch.
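The grid shape described in these two points comes down to a ceiling division per axis. Below is a pure-Python sketch of that calculation; cdiv here is a stand-in for triton.cdiv, and grid_shape is a hypothetical helper, so nothing in this snippet needs a GPU.

```python
def cdiv(a, b):
    """Ceiling division: the number of size-b blocks needed to cover a elements."""
    return (a + b - 1) // b

def grid_shape(B, N, D_out, block_size_row, block_size_col):
    """One program per (batch, row block, column block) of the output."""
    return (B, cdiv(N, block_size_row), cdiv(D_out, block_size_col))

# 100 rows need 7 blocks of 16 rows; the last block is only partly filled,
# which is why the kernel masks its loads and stores.
print(grid_shape(B=4, N=100, D_out=30, block_size_row=16, block_size_col=16))  # (4, 7, 2)
```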

import torch
import triton
import triton.language as tl

def matmul(input: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """
    Implements matrix multiplication between two matrices. The input matrix is 3 dimension where
    first dimension is the batch size. The weight matrix will be multiplied with each of the batches
    of the input matrix.

    Args:
        input (torch.Tensor): Matrix with dimension (B x N x D_in)
        weight (torch.Tensor): Matrix with dimension (D_in x D_out)

    Returns:
        torch.Tensor: Output matrix with dimension (B x N x D_out)
    """
    assert input.is_cuda, 'Inputs are not on GPU, ensure the input matrix is loaded on the GPU'
    assert weight.is_cuda, 'Weights are not on GPU, ensure the weight matrix is loaded on the GPU'
    assert input.shape[-1] == weight.shape[-2], 'Input and weight matrix are not compatible'

    B, N, D_in = input.shape
    _, D_out = weight.shape

    output = torch.empty((B, N, D_out), device=input.device, dtype=input.dtype)

    BLOCK_SIZE_ROW, BLOCK_SIZE_COL = 16, 16
    # Grid is aligned with the output matrix
    grid = lambda meta: (B, triton.cdiv(N, meta["BLOCK_SIZE_ROW"]), triton.cdiv(D_out, meta["BLOCK_SIZE_COL"]))

    matmul_kernel[grid](
        input_ptr=input,
        input_batch_stride=input.stride(0),
        input_row_stride=input.stride(1),
        input_col_stride=input.stride(2),
        weight_ptr=weight,
        weight_row_stride=weight.stride(0),
        weight_col_stride=weight.stride(1),
        output_ptr=output,
        output_batch_stride=output.stride(0),
        output_row_stride=output.stride(1),
        output_col_stride=output.stride(2),
        num_rows=N,
        num_input_cols=D_in,
        num_output_cols=D_out,
        BLOCK_SIZE_ROW=BLOCK_SIZE_ROW,
        BLOCK_SIZE_COL=BLOCK_SIZE_COL
    )

    return output

That’s it. Tensor strides¹ are passed in because they give the step size needed to move to the next batch, row, or column in the 1D flattened view of the 3D tensor. This will come in handy in the actual kernel. Once the kernel’s execution is complete, the output will be available in the tensor passed (output).
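For a contiguous, row-major tensor the strides follow directly from the shape, which is what the stride values passed to the kernel are in that case. A minimal sketch of the computation in plain Python, with contiguous_strides as a hypothetical helper (strides are counted in elements, not bytes):

```python
def contiguous_strides(shape):
    """Strides (in elements) of a contiguous row-major tensor with the given shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)  # moving by one along this axis skips `step` elements
        step *= dim
    return tuple(reversed(strides))

# A (B=2, N=16, D_out=12) tensor: next batch is 192 elements away,
# next row is 12 elements away, next column is 1 element away.
print(contiguous_strides((2, 16, 12)))  # (192, 12, 1)
```

Passing strides instead of hard-coding N*D_out and D_out is what lets the same kernel handle tensors that are not contiguous, such as transposed views.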

The Triton kernel is decorated with @triton.jit so that Triton knows this function will be compiled and executed on the GPU.

@triton.jit
def matmul_kernel(
    input_ptr,
    input_batch_stride,
    input_row_stride,
    input_col_stride,
    weight_ptr,
    weight_row_stride,
    weight_col_stride,
    output_ptr,
    output_batch_stride,
    output_row_stride,
    output_col_stride,
    num_rows,
    num_input_cols: tl.constexpr,
    num_output_cols,
    BLOCK_SIZE_ROW: tl.constexpr,
    BLOCK_SIZE_COL: tl.constexpr,
):
    # Getting block indexes in all 3 dimensions
    batch_idx = tl.program_id(0)
    row_idx = tl.program_id(1)
    col_idx = tl.program_id(2)

    # Offsets for input data
    input_batch_offset = batch_idx * input_batch_stride                                 # Offsets to reach to the correct batch. Similar to CUDA, but instead strides are being used here

    input_row_offset = row_idx*BLOCK_SIZE_ROW + tl.arange(0, BLOCK_SIZE_ROW)
    input_row_mask = input_row_offset[:, None] < num_rows
    input_row_offset = input_row_offset[:, None] * input_row_stride # Selecting relevant rows from input

    input_col_offset = tl.arange(0, num_input_cols)
    input_col_mask = input_col_offset[None, :] < num_input_cols
    input_col_offset = input_col_offset[None, :] * input_col_stride # Selecting all columns from input

    input_data_ptr = input_ptr + input_batch_offset + input_row_offset + input_col_offset
    input_data = tl.load(input_data_ptr, mask=(input_row_mask & input_col_mask)) # BLOCK_SIZE_ROW x D_in

    # Offsets for weight data
    weight_row_offset = tl.arange(0, num_input_cols)
    weight_row_mask = weight_row_offset[:, None] < num_input_cols
    weight_row_offset = weight_row_offset[:, None] * weight_row_stride # Selecting all rows from weight

    weight_col_offset = col_idx*BLOCK_SIZE_COL + tl.arange(0, BLOCK_SIZE_COL)
    weight_col_mask = weight_col_offset[None, :] < num_output_cols
    weight_col_offset = weight_col_offset[None, :] * weight_col_stride # Selecting relevant columns from weight

    weight_data_ptr = weight_ptr + weight_row_offset + weight_col_offset
    weight_data = tl.load(weight_data_ptr, mask=(weight_row_mask & weight_col_mask)) # D_in x BLOCK_SIZE_COL

    # Computation
    result = tl.dot(input_data, weight_data) # Matmul of a small block, BLOCK_SIZE_ROW x BLOCK_SIZE_COL

    # Offsets for output data
    output_batch_offset = batch_idx * output_batch_stride                               # Offsets to reach to the correct batch. Similar to CUDA, but instead strides are being used here

    output_row_offset = row_idx*BLOCK_SIZE_ROW + tl.arange(0, BLOCK_SIZE_ROW)
    output_row_mask = output_row_offset[:, None] < num_rows
    output_row_offset = output_row_offset[:, None] * output_row_stride

    output_col_offset = col_idx*BLOCK_SIZE_COL + tl.arange(0, BLOCK_SIZE_COL)
    output_col_mask = output_col_offset[None, :] < num_output_cols
    output_col_offset = output_col_offset[None, :] * output_col_stride

    output_data_ptr = output_ptr + output_batch_offset + output_row_offset + output_col_offset
    tl.store(output_data_ptr, result, mask=(output_row_mask & output_col_mask))

Similar to CUDA, calculate the current index of the block. But keep in mind, unlike CUDA where a single element of the output matrix is processed, here a single block (which is a 2D arrangement of a few elements) is processed. tl.program_id function helps in getting the index position in every axis.

batch_idx gets the output matrix in the batch
row_idx gets the block number along the rows. Remember, this is not equal to the row number as in CUDA
col_idx gets the block number along the columns. Remember, this is not equal to the column number as in CUDA

Once these 3 numbers are calculated, a 2D representation is created of the data that needs to be processed by each block. Let’s take some dummy numbers to understand how that is achieved. Assume that B = 1, N = 16, and D_out = 12. Block size in both column and row dimensions is 4 (i.e. BLOCK_SIZE_ROW and BLOCK_SIZE_COL are both 4). So each block will be a 2D matrix of dimension (4 x 4).

Recall the grid definition from the wrapper:

grid = lambda meta: (B, triton.cdiv(N, meta["BLOCK_SIZE_ROW"]), triton.cdiv(D_out, meta["BLOCK_SIZE_COL"]))

Based on the assumptions, the grid configuration is (1, 4, 3). A total of 12 blocks will be launched. Now, what would it take to load the block with rows 8 to 11 and columns 4 to 7? Since program IDs are zero-based and there is only one matrix in the batch, simple arithmetic shows that this is the (0, 2, 1)th block, where the first dimension corresponds to the batch dimension, the second to the row dimension and the third to the column dimension. This would correspond to

tl.program_id(axis=0) == 0
tl.program_id(axis=1) == 2
tl.program_id(axis=2) == 1

Figure 5: (1 x 16 x 12) matrix is divided into blocks of size (4 x 4). The (0, 2, 1)th block is highlighted. The value at every place is the index of that position in the 1D flattened array.

For this (0, 2, 1)th block, how should the correct offsets be prepared? In the 1D representation of the matrix, the elements highlighted in green need to be loaded.

# Offsets for output data
output_batch_offset = batch_idx * output_batch_stride                           

output_row_offset = row_idx*BLOCK_SIZE_ROW + tl.arange(0, BLOCK_SIZE_ROW)       # This arrangement happens in 1D, tl.arange is like NumPy's arange
output_row_mask = output_row_offset[:, None] < num_rows                         # Think of masks as prevention against reading invalid data from memory
output_row_offset = output_row_offset[:, None] * output_row_stride              # [:, None] converts the 1D vector into a 2D column vector of shape (n, 1)

output_col_offset = col_idx*BLOCK_SIZE_COL + tl.arange(0, BLOCK_SIZE_COL)
output_col_mask = output_col_offset[None, :] < num_output_cols
output_col_offset = output_col_offset[None, :] * output_col_stride

Let’s decode what is happening here

output_row_offset = row_idx*BLOCK_SIZE_ROW + tl.arange(0, BLOCK_SIZE_ROW)
# row_idx = tl.program_id(1) = 2, BLOCK_SIZE_ROW = 4
# output_row_offset = 2*4 + (0, 1, 2, 3) = (8, 9, 10, 11)

If output_row_offset were added to output_ptr directly, the 8th, 9th, 10th, and 11th elements of the 1D flattened array would be loaded. But that is not desired. So how to get to the desired offsets?

output_row_offset = output_row_offset[:, None] * output_row_stride
# This multiplies each element by the output_row_stride which is equal to 12 (number of columns), the number of elements to skip in 1D array to reach the start of next row
# output_row_offset becomes (96, 108, 120, 132). It also gets transformed into a column vector of shape (4, 1)

A similar transformation is done for the columns:

output_col_offset = col_idx*BLOCK_SIZE_COL + tl.arange(0, BLOCK_SIZE_COL)
# col_idx = tl.program_id(2) = 1, BLOCK_SIZE_COL = 4
# output_col_offset = 1*4 + (0, 1, 2, 3) = (4, 5, 6, 7)

output_col_offset = output_col_offset[None, :] * output_col_stride
# This multiplies each element by the output_col_stride which is equal to 1, the number of elements to skip in 1D array to advance by one column.
# Since this is a row major ordering, columns are adjacent to each other.
# output_col_offset becomes (4, 5, 6, 7). It also gets transformed into a row vector of shape (1, 4)

Finally,

output_data_ptr = output_ptr + output_batch_offset + output_row_offset + output_col_offset       # Adds all the offsets to the pointer

First, add output_row_offset and output_col_offset. Since one of them is a row vector and the other is a column vector, on addition a 2D array is produced with all the desired indices of all the elements that need to be loaded. After that, add output_batch_offset to get to the correct matrix in the batch.

Figure 6: How 2D blocks are created from 2 1D offsets
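The broadcasted addition can be reproduced in plain Python for the block covering rows 8 to 11 and columns 4 to 7 of the (1 x 16 x 12) example. This is a sketch that assumes a contiguous output with strides (192, 12, 1) and batch index 0 (there is only one matrix in the batch); block_offsets is a hypothetical helper, not part of Triton.

```python
def block_offsets(batch_idx, row_idx, col_idx,
                  block_size_row, block_size_col,
                  batch_stride, row_stride, col_stride):
    """2D grid of flat offsets covered by one block, mimicking column + row vector broadcasting."""
    rows = [(row_idx * block_size_row + i) * row_stride for i in range(block_size_row)]
    cols = [(col_idx * block_size_col + j) * col_stride for j in range(block_size_col)]
    base = batch_idx * batch_stride
    # Adding a (4, 1) column vector to a (1, 4) row vector yields a (4, 4) matrix
    return [[base + r + c for c in cols] for r in rows]

offsets = block_offsets(0, 2, 1, 4, 4, batch_stride=192, row_stride=12, col_stride=1)
for row in offsets:
    print(row)
# [100, 101, 102, 103]
# [112, 113, 114, 115]
# [124, 125, 126, 127]
# [136, 137, 138, 139]
```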

This gives the appropriate offsets for the data this block is interested in computing. Similarly, the relevant data for the other two tensors can be computed. The core idea is understanding the block calculation and offset calculation. The rest of the code is more about syntax rather than any core logic.

The complete code is present here. Triton and PyTorch are needed to run this code.

How you can rewrite the complete architecture using optimized kernels

Congrats on making it this far. Now that you understand the basics of GPU hardware and its programming model, you can go ahead and implement any network from scratch, this time not relying on PyTorch for operations but writing your own kernels in CUDA or Triton.

If you want to implement a transformer encoder network, you would need to implement all the basic layers and operations in Triton or CUDA.

Matrix multiplication
Layernorm
Softmax
Addition
Concatenation

You can then wrap these kernels in the PyTorch module and load weights from HF to compare your implementation with other PyTorch/TF native implementations. If this sounds interesting, this is exactly what we did too. We implemented most of the operations used in Vision Transformer (ViT) including patching and addition operations in Triton and loaded weights from a checkpoint to run a forward pass. You can look at the code at ViT.triton and maybe implement your favorite model too using custom kernels!

Citations

For attribution, please cite this as

@article{romit2024gpus2,
  title   = {GPUs Part 2},
  author  = {Jain, Romit},
  journal = {cmeraki.github.io},
  year    = {2024},
  month   = {May},
  url     = {https://cmeraki.github.io/gpu-part2.html}
}

References

Tensor strides ↩

GPUs Part 1 - Understanding GPU internals

2024-05-25T00:00:00+00:00

Written by Romit Jain

LLMs are pretty big and can use a lot of computing power. This makes them slow in terms of latency and tougher to deploy than traditional ML models. Hence, there is some alpha in learning how to run them as fast as possible, because that is what the real bottleneck currently is. If you can reduce latency or increase throughput, that opens up a lot of doors for LLM applications.

To learn how to run these big models as fast as possible, understanding the hardware (both CPU and GPU) on which they run is crucial.

This blog and others in the series (part 2) will help you learn about the basic layout of GPU hardware, a mental model of how the GPU programming model works, and how to progress from there to become a kernel master. (If you are asking, what’s a kernel, read till the end of the series)

PS, there is just a deep satisfaction in knowing how things work on the hardware. It gives you a deeper understanding of the models and an immense appreciation of all the abstractions.

Hardware

What is so special about GPUs that makes them extremely efficient for certain applications, especially LLMs? Understanding the hardware of GPUs is essential to answer this question. In one line, “GPUs are optimized for throughput whereas CPUs are optimized for latency”. In more lines -

Why are GPUs faster for LLMs?

The fastest way to run LLMs currently is to run them on GPUs. But why are GPUs faster than CPUs for LLMs? One valid answer is that a GPU can process data in parallel because it operates in a SIMD-like fashion (NVIDIA calls its model SIMT). CPUs are mostly designed for sequential tasks. But even then, what lets a GPU process data in parallel? Mainly 2 things:

CPUs have a lot of space on their chip dedicated to cache and registers. GPUs make a design choice that reduces the size of the cache and increases the number of cores. This way they can fit more cores in the same chip area. Cores are essentially the processing units that process data.
CPUs have a lot of functionality in their cores. This functionality helps them handle a wide variety of tasks, which is why CPUs are so versatile. GPUs strip out much of this special functionality, which reduces the size of their cores. If the cores are smaller, GPUs can fit more cores in the same area.

Figure 1: The figure above illustrates the differences in cache and control logic sizes between GPUs and CPUs. The GPU features significantly reduced cache and control logic sizes, as well as smaller core sizes. These tradeoffs allow for a higher number of cores in the GPU. Source

The more cores a GPU has, the greater the potential for parallel execution, leading to improved performance. However, it’s not solely about the number of cores. Other factors contribute to the overall efficiency and performance of GPUs.

GPU hardware layout

Let’s now understand how these cores are organized and arranged on the hardware.

CUDA Cores

Here is where the magic actually happens. These are the processing units of the GPU and come in different flavors, eg: Tensor Cores, Single precision cores, Double precision cores, etc. All of these cores handle different kinds of operations. The GPU decides where to send the operation based on the data type and instruction. The amount of operations these bad boys can do per second is what gives rise to FLOP numbers. Each of these different flavors has different performance numbers in terms of FLOPs because all of them do different kinds of operations.

For example, H100 has 16896 FP32 CUDA cores that can each do 2 floating point operations per cycle. The clock speed of the GPU is 1593 MHz. Theoretically, if all the cores are processing data at all times, it can achieve $1.593 \times 10^9 \times 2 \times 16.896 \times 10^3 \approx 53.8 \times 10^{12}$ FLOPS, or about 54 teraFLOPs.

This is close but not the same as what is shown on the official specs of H100. (I am not able to figure out the reason for the difference. If you know, please drop me an email!)
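The back-of-the-envelope estimate can be written as a tiny Python function. This is a sketch of the arithmetic only: the core count is taken as 132 SMs x 128 FP32 cores per SM (the H100 SXM figures quoted in this post), the clock is the 1593 MHz value used above, and peak_flops is a name invented for this illustration.

```python
def peak_flops(num_cores, flops_per_core_per_cycle, clock_hz):
    """Theoretical peak, assuming every core is busy on every cycle."""
    return num_cores * flops_per_core_per_cycle * clock_hz

cores = 132 * 128  # SMs x FP32 CUDA cores per SM
tflops = peak_flops(cores, flops_per_core_per_cycle=2, clock_hz=1.593e9) / 1e12
print(round(tflops, 1))  # 53.8
```

The result scales linearly with the clock speed, so the estimate depends directly on which clock figure is plugged in.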

Streaming Multiprocessors (SMs)

All the cores in a GPU are organized into groups. Each of these groups is called a streaming multiprocessor (SM). Every SM has some memory associated with it. This memory can be shared amongst all the cores inside an SM but not by any other core outside this SM. This memory is called shared memory and is extremely fast in terms of data transfer speed or memory bandwidth. But this is also small in terms of capacity. So it’s essential to use this memory judiciously.

Why are GPUs divided like this? It’s to enable smaller groups of cores to share memory amongst themselves and work together. With every new generation of GPUs, typically SMs and cores per SMs go up in a GPU.

Let’s take some real numbers to understand the capacity. An H100 SXM GPU contains:

132 streaming multiprocessors (SM)
Each SM has 128 FP32 CUDA cores (so a total of 16896 (132 * 128) CUDA cores)
Each SM has 227 KB of shared memory
And this memory has a bandwidth of 33 TB/s

SMs are also grouped into TPCs (Texture/Processor Cluster). For reference, the above hardware has 2 SMs per single TPC. But that can be safely skipped for now.

Memory

There are three kinds of memory on the GPU

HBM/Global memory - This can be thought of as the equivalent of CPU memory. This is the slowest and largest memory available on the GPU.
1. For reference, H100 SXM has 80GB of HBM with 3 TB/s of bandwidth (i.e. it can transfer 3 TB per second either to or from HBM)
2. This is where the model is loaded when we do model.to(device='cuda:0')
L2 Cache - Faster than HBM but limited in size. This is shared among all the SMs.
1. For reference, H100 SXM has 50 MB (lol, in comparison to HBM) of L2 cache with 12 TB/s of bandwidth.
Shared memory - Fastest and smallest memory available on the GPU. Every SM has its shared memory and all the cores executing instructions in an SM have access to it.

Back to LLMs

Let me work through an example to drive home a point: for LLMs, one should probably not worry about teraFLOPs. This answers the question that we asked at the end of the section Why are GPUs faster for LLMs? Take the H100 SXM GPU, which can do 67 teraFLOPs (FP32) of computation. The memory bandwidth of its HBM is 3 TB/s, which means the GPU can transfer about 3 TB of data to the compute layer per second. With FP32 (4 bytes per number), that is about 750 billion numbers per second. In contrast, the compute layer can perform 67 trillion operations per second. Just to break even with the computation speed, we would need to either:

Transfer ~90x the data (67 trillion/750 billion) from memory to the compute layer per second
Or perform ~90 operations on every number that is transferred
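The break-even arithmetic above can be checked with a few lines of Python. This is only a sketch using the rounded H100 SXM figures quoted in this section; the function and constant names are mine.

```python
# Break-even arithmetic for the rounded H100 SXM figures quoted above.
FLOPS_FP32 = 67e12        # 67 teraFLOPs (FP32)
HBM_BANDWIDTH = 3e12      # 3 TB/s, expressed in bytes per second
BYTES_PER_FP32 = 4


def breakeven_ops_per_number(flops, bandwidth_bytes, bytes_per_number):
    """Operations needed per transferred number to keep the compute layer busy."""
    numbers_per_second = bandwidth_bytes / bytes_per_number
    return flops / numbers_per_second


numbers_per_second = HBM_BANDWIDTH / BYTES_PER_FP32
ratio = breakeven_ops_per_number(FLOPS_FP32, HBM_BANDWIDTH, BYTES_PER_FP32)

print(f"{numbers_per_second / 1e9:.0f} billion numbers per second")
print(f"~{ratio:.0f} operations needed per number")
```

Anything that performs fewer operations per number than this ratio leaves the compute units waiting on memory.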

So, it’s tough to keep up with the computing power of the GPU. The bottleneck comes in transferring the data. There are three good resources on this topic to understand it better:

NVIDIA docs
An article by Horace He here.
Another practical example is stated in the article: How is Llama.cpp possible?

Apart from the above, we also have warps in GPUs. Warps are a collection of 32 threads that are executed at once by the GPU. It’s slightly more complex to understand how warps work, so I will leave it out of the scope of this blog.

By now, you should be able to understand how GPU hardware is organized. There are a few other hardware concepts that I did not cover here, like the warp scheduler and register files, but they are not crucial to get started.

You are now all set to start with part 2 of this series.

Citations

For attribution, please cite this as

@article{romit2024gpus1,
  title   = {GPUs Part 1},
  author  = {Jain, Romit},
  journal = {cmeraki.github.io},
  year    = {2024},
  month   = {May},
  url     = {https://cmeraki.github.io/gpu-part1.html}
}

Throughput is all you need

2024-04-11T00:00:00+00:00

Written by Romit Jain

Throughput, why?

If we want to build efficient applications on top of current LLMs, there are currently two challenges:

Improving Inference latency: The speed with which the model returns the tokens per second
Improving Inference throughput: The total number of requests that the model can serve in parallel

Inferencing LLMs with lower latency comes down to working around the limitations of the GPU’s memory bandwidth ¹. FlashAttention, speculative decoding, and KV caching are ways in which one can improve the latency of the model.

Increasing inference throughput comes down to effectively managing the available VRAM of the GPU. Given a limited budget of GPU VRAM, there are various areas where improvements can be made:

Reducing the size of the model: By quantization (e.g. GPTQ) or knowledge distillation
Batching²: Batching more requests in the same amount of GPU VRAM
Separating prefill and decoding stages of generation³

One can refer to blog⁴ or ⁵ for an overview of the above concepts.

For this blog, let’s zoom into one specific aspect of improving throughput, i.e. batching. After the model is loaded in the GPU VRAM, whatever remaining memory is available to us is reserved for the KV cache and serving the requests. The only lever that we can control here apart from the model size is the KV cache. Efficiently managing this KV cache can help us dramatically increase throughput by enabling us to batch more requests. For certain use cases, it can increase the throughput by 20x compared to native HuggingFace implementation.

vLLM is one such library that helps us achieve very high throughput. vLLM deploys LLMs on GPUs and focuses on:

Allocating the KV cache in the most efficient way possible
This, in turn, allows us to increase the batch size and serve more requests per minute

In this blog, we will learn about the intuition behind vLLM, and its inner workings and also simulate it for a real-world application to understand the nuances and limitations of the library.

Setup

Taking real-world numbers around model sizes and GPU VRAM can help visualize and validate the workings of vLLM. Let us consider a case of deploying a Mistral 7B model on the highest-end consumer-grade GPU (Nvidia RTX 4090). If we choose to deploy the model at half-precision (FP 16, each parameter taking 2 bytes), the model would occupy ~14 GB of the 24 GB of VRAM available on a 4090 GPU. Assuming an overhead of 3 GB, the GPU would have 7 GB of VRAM left (24 - 14 - 3). This 7 GB of available VRAM will be reserved for the KV cache.

Figure 1: Memory layout of the GPU

In our scenario, we would assume 8k as the context length to serve the model. Whenever a request arrives, the model computes the attention scores for all the prompt tokens and then generates one token at a time using autoregressive decoding. While decoding, it requires some VRAM on the GPU to store the token. A single token would take 0.125MB of VRAM to be stored in the KV cache.

Token size calculation

For every token, we need to store its K and V vectors. We also need to store them for all the layers and all the KV heads.

The general formula is: 2*2*n*h*d bytes, where the first 2 is the size of an FP 16 value (2 bytes), the second 2 accounts for the K and V matrices, n is the number of layers, h is the number of KV heads, and d is the dimension of each head
For Mistral 7B, 2*2*32*8*128 = 131072 bytes = 0.125 MB

The KV cache for a single request on the complete context length of the model would be 1 GB (8k * 0.125 MB).
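The token size calculation can be written as a small helper. This is a sketch of the formula above; the function and argument names are my own, and the Mistral 7B values are the ones used in this blog.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes for one token: K and V, for every layer and every KV head."""
    return 2 * bytes_per_value * n_layers * n_kv_heads * head_dim


MB = 2**20
GB = 2**30

per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
per_request = per_token * 8192   # full 8k context length

print(per_token / MB, "MB per token")
print(per_request / GB, "GB per request at 8k tokens")
```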

A case for a single GPU serving a single request

If we decide to serve only a single request at a time with this GPU, we would be wasting a lot of resources. Given that 7 GB of VRAM is available for KV cache, the model can store cache for 56k tokens (7 GB/ 0.125 MB). Considering all of the VRAM to be reserved for a single request, the space for 48k tokens (56k-8k) would be wasted since the model has a context length of only 8k tokens. The throughput of the model would be very low (only a single request is being processed at a time) and it is not using all of the VRAM of the GPU available to it. It would be wasting 6 GB of memory for every request.

This is termed as external fragmentation. This is clearly not the best way to utilize the GPU for serving LLMs. Figure 2 shows the extreme version of external fragmentation.

Figure 2: Inside the KV cache: Single request
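As a quick check of the numbers for the single-request case, here is a minimal sketch. It assumes the 7 GB budget and the 0.125 MB per token calculated earlier; the variable names are only for illustration.

```python
MB = 2**20
GB = 2**30

kv_budget = 7 * GB               # VRAM left for the KV cache
per_token = 0.125 * MB           # KV cache size of one token
context_len = 8 * 1024           # 8k tokens

capacity_tokens = int(kv_budget // per_token)    # tokens that fit in the budget
wasted_tokens = capacity_tokens - context_len    # space one request can never use
wasted_gb = wasted_tokens * per_token / GB

print(capacity_tokens // 1024, "k tokens fit in the cache")
print(wasted_tokens // 1024, "k tokens of space unused,", wasted_gb, "GB")
```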

A case for a single GPU serving multiple requests

How can we improve upon this? Enter batching. In batching, we serve multiple requests at the same time, taking advantage of the parallelism of GPUs. Let’s consider a scenario where we are serving multiple requests at the same time, each with a context length of 8k. The GPU would need to pre-allocate space for 8k tokens for every request, i.e. 1 GB of VRAM per request to store the KV cache. Hence, it would be able to serve 7 requests concurrently (7 GB/ 1 GB). This would avoid external fragmentation in our scenario, but it could lead to another problem.

One thing to note here is that every request might not generate 8k tokens. Request 1 may end up generating 4k tokens, Request 2 may end up just generating 2k tokens, and so on. But since we had already reserved space for all the 8k tokens, we are wasting the memory and not utilizing the complete memory. This is called internal fragmentation.

There can be another scenario where after allocating the memory for all the requests, the available VRAM of the GPU is less than the memory required for a single request. In this scenario, the memory for the request will not be allocated and the remaining memory will be wasted. This is again a case of external fragmentation.

Figure 3: Inside the KV cache: Multiple requests
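To make the two kinds of waste concrete, here is a rough sketch of naive batching with full pre-allocation. The request lengths are made up for illustration, and the helper is not part of any library.

```python
KV_BUDGET_MB = 7 * 1024
MB_PER_TOKEN = 0.125
CONTEXT_LEN = 8192


def naive_batch_report(generated_tokens, budget_mb=KV_BUDGET_MB):
    """Pre-allocate a full context window per request and report the waste."""
    slot_mb = CONTEXT_LEN * MB_PER_TOKEN             # 1024 MB reserved per request
    max_slots = int(budget_mb // slot_mb)
    served = generated_tokens[:max_slots]            # the rest would have to wait
    used_mb = sum(t * MB_PER_TOKEN for t in served)
    reserved_mb = len(served) * slot_mb
    return {
        "max_concurrent": max_slots,
        "internal_waste_mb": reserved_mb - used_mb,              # reserved but never filled
        "external_waste_mb": budget_mb - max_slots * slot_mb,    # too small for another slot
    }


# Hypothetical number of tokens each request actually ends up using
lengths = [4096, 2048, 8192, 1024, 6144, 3072, 512]
report = naive_batch_report(lengths)
print(report)

# With a 7.5 GB budget, the extra 0.5 GB cannot hold an eighth request
print(naive_batch_report(lengths, budget_mb=7.5 * 1024))
```

In the first call more than half of the reserved memory is never used, even though no additional request can be admitted.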

A case for a single GPU serving multiple requests efficiently

So, is there any improvement possible over the naive batching method we discussed earlier? Yes, indeed there is a way. Enter vLLM.

Let’s assume that the complete memory of the GPU is broken down into small chunks of memory called blocks. Each block is equivalent to the memory required for 16 tokens (i.e. in our example, 0.125 MB * 16 = 2 MB). Once we allocate memory for a block, even partially, it won’t be available for any other allocation.

Since every request might not need 8k tokens, let’s assume that on average every request would require 5000 tokens. The GPU will allocate 313 blocks (5000/16, rounded up) of memory for the request. These blocks are not stored in a contiguous layout in the memory. Hence, we would need to maintain an address book that maps every request to its corresponding blocks. There’s another optimization in here. Since the blocks do not have to be contiguous, we don’t need to allocate all of the memory at once. We can allocate a new block only when the previous blocks are filled to capacity. This is the core of how vLLM allocates memory.

Figure 4: vLLM token to block mapping. Source ⁶

The above solves 2 problems:

The request only allocates memory required for its generation instead of pre-allocating for the complete context length of the model. The memory allocation happens at the block level, so technically memory is allocated for 16 tokens at a time. This reduces internal fragmentation significantly
1. If the request uses 1.5k tokens, we need to allocate memory only for 94 blocks i.e. 94 * 2 MB = 188 MB, instead of 1 GB for the complete 8k context length of the model
2. A single request’s tokens can be stored in multiple blocks
The complete memory is broken down into equally sized blocks, so even external fragmentation is minimized. The block size is chosen such that it fills the available GPU memory evenly.
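The on-demand, block-level allocation described above can be sketched with a toy allocator. This is not vLLM's actual implementation; the class and method names are hypothetical, and it only tracks block ids rather than real GPU memory.

```python
class BlockAllocator:
    """Toy paged allocator: hands out fixed-size blocks only when needed."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # ids of unused physical blocks
        self.block_table = {}                        # request id -> list of block ids
        self.num_tokens = {}                         # request id -> tokens stored so far

    def append_token(self, request_id):
        """Store one more token; take a new block only if the last one is full."""
        count = self.num_tokens.get(request_id, 0)
        if count % self.block_size == 0:             # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("KV cache is full")
            self.block_table.setdefault(request_id, []).append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1

    def free(self, request_id):
        """Return all blocks of a finished request to the pool."""
        self.free_blocks.extend(self.block_table.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


allocator = BlockAllocator(num_blocks=3584)          # 7 GB / 2 MB per block
for _ in range(1500):                                # a request that uses 1.5k tokens
    allocator.append_token("request-1")

blocks_used = len(allocator.block_table["request-1"])
print(blocks_used, "blocks,", blocks_used * 2, "MB")
```

The `block_table` dictionary plays the role of the address book: the blocks of a request can sit anywhere in memory, and they go back to the pool as soon as the request finishes.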

The approaches defined above help in utilizing the GPU VRAM efficiently. Given the block size of 2 MB, vLLM can store a total of ~3500 blocks in the available memory of 7 GB. If each request needs 313 blocks (5k tokens on average) during its lifetime, the GPU would have memory to serve 11 requests in parallel. By using the KV cache more effectively and allocating memory in blocks instead of complete context length, vLLM has increased the throughput from 7 to 11 in our example.

This is how vLLM helps in increasing the batch size and throughput of any model. For computing attention over tokens distributed in non-contiguous blocks, vLLM has introduced Paged Attention. Paged Attention is a set of optimized CUDA kernels that access tokens from different blocks and compute attention scores over them.

Inside the simulation

To understand the behavior of vLLM in production, let us simulate a real scenario of a chat application. This chat application uses an LLM and is being served by vLLM. For chat applications, we have another dimension where a single chat can have multiple turns of conversation alternating between user and assistant messages.

Figure 5: A multi-turn conversation. From the perspective of an LLM, all of these messages are a part of a single request. As the conversation progresses, every new message from the user gets appended to the same request and is sent to the LLM again

Our objective is to predict the behavior of vLLM and try to replicate it in the experiments. To start with, let’s consider some simulation parameters (similar to our example in the previous section):

Block size (number of tokens stored together in one block): 16
The average number of turns in each chat: 10
Average input token length at each turn in the chat: 150
Average output token length at each turn in the chat: 350
Average latency for each turn in the chat: 10s ⁷
Average number of tokens required for a single chat session: (150 + 350) * 10 = 5000
The average number of blocks required for a single chat session: 313 (5000/16, rounded up)

For serving an LLM, let’s take any flavor of the Mistral 7B model deployed at half precision. Taking the model parameters,

Model dimension: 128
Number of layers: 32
Number of KV heads: 8
Input sequence length: 8192

According to these parameters, we would require:

0.125 MB of memory per token in KV cache
2 MB of memory per block (assuming block size to be 16 tokens, 0.125 * 16 = 2 MB)

Assuming 7 GB of KV cache available for our use

We can store ~3500 blocks in GPU VRAM (7GB/2MB)
As calculated above, given an average of 313 blocks per chat session and 3500 blocks available, we can hold 11 (floor(3500/313)) conversations in a single GPU and serve them in parallel
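The same calculation, written out so that the simulation parameters can be changed easily. This is a sketch; the function name is my own, and it assumes the 7 GB budget and 0.125 MB per token from above.

```python
import math

MB_PER_TOKEN = 0.125
KV_BUDGET_MB = 7 * 1024


def chats_in_parallel(turns, input_tokens, output_tokens, block_size=16):
    """Estimate how many chat sessions fit in the KV cache at the same time."""
    tokens_per_chat = turns * (input_tokens + output_tokens)
    blocks_per_chat = math.ceil(tokens_per_chat / block_size)
    total_blocks = int(KV_BUDGET_MB // (block_size * MB_PER_TOKEN))
    return tokens_per_chat, blocks_per_chat, total_blocks, total_blocks // blocks_per_chat


tokens, blocks, total, chats = chats_in_parallel(turns=10, input_tokens=150, output_tokens=350)
naive_chats = int(KV_BUDGET_MB // (8192 * MB_PER_TOKEN))   # full 8k pre-allocation

print(tokens, "tokens per chat,", blocks, "blocks per chat")
print(total, "blocks available")
print(chats, "chats with block allocation vs", naive_chats, "with naive batching")
```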

Based on our calculation, an LLM served by vLLM can serve 11 requests in parallel for our setup. With naive batching, it would not have been able to serve more than 7 requests in parallel (as discussed in the previous sections). Let’s run an experiment to test the calculation. I send N requests at once to a model hosted using the vLLM backend. Note that these requests are long-running (each request has multiple turns).

Below you can find the results from the experiments, where you can see two things:

Scheduler State: Number of requests being served concurrently by vLLM
Cache Utilization: % of GPU memory being used. Note that this percentage is based on the KV cache space we calculated earlier (i.e. 7 GB is the total GPU memory for the KV cache in our setup. If the utilization is 50%, that would translate to 3.5 GB of KV cache being used)

N = 10, we can see that the GPU utilization never reached 100%.

N = 12, we can see that the GPU utilization reached 100%, and 1 of the requests is moved to a waiting queue for some time (where it is not processed). This indicates that the experimental results are close to the limit of 11 that we calculated.

N = 14, we can see that the GPU utilization hits 100% and then approximately 2 requests are moved to the waiting queue

We can notice two things here:

It takes some time for the GPU to reach 100% utilization. This is because currently we have deployed a chat application where each turn takes 10 seconds and we have a total of 10 turns. So, the KV cache keeps on getting larger and larger as time goes by. But once the chat conversation ends after 10 turns, we will notice a drop in the GPU utilization.
If we go above the calculated parallel limit of our chats, we will eventually see some requests being transferred to a waiting queue. That implies the GPU is completely utilized and it can not process all the requests in a single batch.
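A back-of-the-envelope model reproduces both observations. It is a deliberate simplification and not how the vLLM scheduler works: I assume all N chats start together, every turn adds exactly 500 tokens, and no chat is queued or evicted.

```python
import math

BLOCK_SIZE = 16
TOTAL_BLOCKS = 3584               # 7 GB / 2 MB per block
TOKENS_PER_TURN = 150 + 350
TURNS = 10


def utilization_by_turn(n_chats):
    """Fraction of the KV cache that would be needed after each turn."""
    result = []
    for turn in range(1, TURNS + 1):
        blocks_per_chat = math.ceil(turn * TOKENS_PER_TURN / BLOCK_SIZE)
        result.append(n_chats * blocks_per_chat / TOTAL_BLOCKS)
    return result


def first_overflow_turn(n_chats):
    """First turn at which demand exceeds the cache, or None if it never does."""
    for turn, utilization in enumerate(utilization_by_turn(n_chats), start=1):
        if utilization > 1.0:
            return turn
    return None


for n in (10, 12, 14):
    peak = max(utilization_by_turn(n))
    print(f"N={n}: peak demand {peak:.0%}, first overflow at turn {first_overflow_turn(n)}")
```

Under these assumptions N = 10 never fills the cache, while N = 12 and N = 14 run out of blocks in the last turns, which is when some requests would have to wait in the queue.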

The complete experiment can be rerun and you can find the code used to run the experiments here.

An overview of all the parameters we discussed is mentioned below for reference. You can make a copy of the following sheet and play with simulation parameters to understand the requirements. Yellow blocks can be updated, and green blocks are calculated ones.

Model Parameters	Value	Units
Model size	7.00	B
Model dim	128
Model layers	32
Model KV heads	8
Bytes per parameter	2
Input sequence length	8192

vLLM Parameters
Block size	16

GPU Parameters
Memory	24	GB
Utilization	100%
Buffer	3	GB

Simulation params
Total turns in a chat	10
Input tokens in a turn	150
Output tokens in a turn	350

Experimental results
Average latency per turn	10	s

Calculations
Memory per token	0.125	MB
Memory per block	2	MB
Memory remaining for KV cache	7	GB
Total token length of a chat	5000
Total blocks required for a chat	313
Blocks that can be stored in KV cache	3584
Total chats that can be served concurrently at full context length	11

Notes

vLLM does a few more things:

KV cache reuse: By reusing the KV cache across requests, a new request can skip computing the attention scores for the common tokens. This translates to lower latency. However, this is not a contribution of the vLLM paper. KV caching is a common technique used during LLM serving
1. Single prompt, multiple generations: vLLM can cache a common prompt or prefix and use that for multiple generations. This is similar to the above and helps in reducing latency
2. Parallel sampling and beam search: Following on from the above, vLLM also implements KV cache reuse for parallel sampling and beam search.
Pause the world: Whenever a new request comes in between the decoding stage of ongoing requests in the batch, vLLM pauses the generation of requests in the batch and computes the KV cache for the new request. Once the KV cache is computed, it adds it to the batch and continues decoding the new batch
1. This results in higher latency if too many requests are coming back to back
2. vLLM is working to update this behavior
Queue: vLLM also provides a FastAPI server on top of its backend. It implements a queue that holds the requests that vLLM cannot serve while the GPU memory is full

Citations

For attribution, please cite this as

@article{romit2024throughput,
  title   = {Throughput is all you need},
  author  = {Jain, Romit},
  journal = {cmeraki.github.io},
  year    = {2024},
  month   = {April},
  url     = {https://cmeraki.github.io/throughput-is-all-you-need.html}
}

References

These are some of the references that I have linked throughout the blog and some general recommended reading for getting a better understanding of the concepts we discussed in the blog.

¹ Making Deep Learning go Brrrr From First Principles
² How continuous batching enables 23x throughput in LLM inference while reducing p50 latency
³ Throughput is Not All You Need: Maximizing Goodput in LLM Serving using Prefill-Decode Disaggregation
⁴ LLM Inference Performance Engineering: Best Practices
⁵ Mastering LLM Techniques: Inference Optimization
⁶ vLLM
⁷ For a single token generation, the latency is usually bound by the memory bandwidth of the GPU. Considering Nvidia 4090 which has a memory bandwidth of 1008 GB/s and Mistral 7B which has 14 GB parameters, the ideal estimate of latency would be 72 tok/s (1008/14). In the real world, you can expect to get around 60 tok/s
1. For 600 tokens, the total time comes around to be 10s (600/60)
2. Refer to this blog for more explanation: Transformer Inference Arithmetic

Building TTS

2021-11-21T00:00:00+00:00

Indri TTS / ASR

Nov 21, 2024

Today, we are releasing the Indri TTS model series: 124M/350M param, multilingual, fully autoregressive TTS models that can produce hyper-realistic human voices. You can try out the models here: https://indrivoice.ai or download and use them on your machine from github / hf. Currently the models support English, Hindi and Kannada. New languages are easy to add using the scripts provided in the git repo.

Indri can generate hyper-realistic audio that is very hard to differentiate from real speech. It faithfully reproduces background noises, echoes, music and non-speech sounds along with speech. Here are a few examples of generations:

Data

We have used 20k hours of available English TTS data, along with 5k hours of data per language.

We collected videos with clean audio from sharing websites and passed them through whisper-v3-turbo to generate transcriptions. The transcribed chunks are limited to 15s in length. We also post-process the chunks and remove any silences longer than 250ms.

What to look for in data ?

Clearly spoken speech. You should be able to make out the words that are being spoken. E.g. podcasts, talks etc. make for great sources, whereas on-site news or action movie clips do not.
No background music, sounds etc. Although we can remove the background using separation, it leaves artifacts which the model learns to replicate.

Modelling

Audio Tokenizer

A lot about tokenizers has been covered in the previous blog. If you haven’t read it, go through the tokenizers blog to understand how to decide on an audio tokenizer.

Impact of tokenizer

Small context length : Using a tokenizer which has low frequency, results in small sequences. This makes them easier to model. E.g. Hubert is 50Hz, and encoded
Speed :
Final model size :

We use Mimi tokenizer (link), which produces 32 codebooks at 12.5Hz. We found 8 codebooks to be sufficient to faithfully reproduce audio under consideration.

Handling audio tokens

Transformers are good at modelling 1-D sequences. Audio tokenizers convert audio into n-codebooks at kHz, giving a 2D sequence of tokens.

We convert this to a 1D sequence by weaving codebooks together. Tokens of the n-th codebook are offset by (n-1) x tokens_per_codebook. Both semantic and audio tokens are woven together in a single sequence.

For n_codebooks = 2, tokens_per_codebook = 16 :

$\begin{bmatrix} 1 & 5 & 3 \\ 12 & 8 & 9 \\ \end{bmatrix}$ converts to $\begin{bmatrix} 1 & 5 & 3 & 12 + 16 & 8 + 16 & 9 + 16 \\ \end{bmatrix}$

This results in an audio vocab of size n_codebooks x tokens_per_codebook.
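The offsetting can be sketched in a few lines of Python. This follows the ordering of the example above (codebook by codebook) and is only an illustration; the function names are mine and this is not the code from the Indri repo.

```python
def flatten_codebooks(codes, tokens_per_codebook):
    """Turn per-codebook rows of token ids into one 1D sequence of offset ids."""
    flat = []
    for codebook_index, row in enumerate(codes):
        # codebook_index is 0-based, so this equals (n-1) x tokens_per_codebook
        offset = codebook_index * tokens_per_codebook
        flat.extend(token + offset for token in row)
    return flat


def split_token(token_id, tokens_per_codebook):
    """Recover (0-based codebook index, original token id) from an offset id."""
    return divmod(token_id, tokens_per_codebook)


codes = [[1, 5, 3],
         [12, 8, 9]]
flat = flatten_codebooks(codes, tokens_per_codebook=16)
print(flat)                      # [1, 5, 3, 28, 24, 25]
print(split_token(28, 16))       # (1, 12): second codebook, token 12
```

Because every codebook gets its own range of ids, the model can always tell which codebook a token belongs to from the id alone.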

We bring text and audio tokens into a common embedding space and train a small transformer (gpt2) over text+audio sequences.

Sequences

Indri is a multimodal decoder-only transformer (gpt2 arch) that consumes and generates both audio and text tokens as part of the same sequence. We convert different problems such as tts/asr/continuation into sequence-to-sequence problems, indicating tasks by special tokens.

TTS systems such as spear-tts use a tiered approach where they train two models :

text to semantic tokens : learns to read
semantic to acoustic tokens : learns to speak

This separates the speaker voice characteristics (e.g. pitch) from reading (e.g. speed, accent). But the first model has to complete its generation before the next model can start producing output. Hence streaming output can only start when all semantic tokens are ready.

We use a single model to generate both semantic and acoustic tokens. Hence we can stream output from the moment first audio has been generated.

Token Sequence

We use special tokens to indicate:

start of modality ,
a common stop token
speaker identifier
task ,