salykova

Matrix Core Programming on AMD CDNA3 and CDNA4 architecture

2025-09-30T23:35:01+00:00

TL;DR In this blog post, we walk through how to use Matrix Cores in HIP kernels, with a focus on low-precision data types such as FP16, FP8, and FP4, as well as the new family of Matrix Core instructions with exponent block scaling introduced in the AMD CDNA™4 architecture. Through code examples and illustrations, we provide the necessary knowledge to start programming Matrix Cores, covering modern low-precision floating-point types, the Matrix Core compiler intrinsics, and the data layouts required by the Matrix Core instructions. The blog post is also available on ROCm Blogs.

1. Matrix Cores

Matrix multiplication is an essential part of AI and HPC workloads. The AMD CDNA™ architecture features special-purpose hardware, the Matrix Cores, to accelerate matrix fused-multiply-add (MFMA) operations defined as D:=A*B+C. Please note that MFMA instructions are often used to update a matrix in-place (=accumulation) so that D=C and C:=A*B+C. The matrices A and B are called input matrices, while the matrix D is referred to as the output matrix or accumulator.

The performance gains from using Matrix Cores are especially significant in mixed-precision mode, where the input matrices use lower-precision data types instead of FP32. The output matrix, however, is stored in FP32 to minimize accuracy loss during accumulation. The tables below show the theoretical peak performance of Matrix Cores with different input data types on both AMD CDNA™3 and AMD CDNA™4 architectures. On the AMD Instinct™ MI325X, using FP16 input matrices delivers nearly an 8x performance increase compared to single-precision, with only minimal accuracy loss. Switching to FP8 further doubles the performance providing a 16x increase when compared to FP32. The AMD CDNA™4 architecture further improves Matrix Core performance, delivering up to 2x higher throughput for FP16 and FP8 compared to the AMD CDNA™3 architecture. In addition, AMD CDNA™4 introduces new low-precision data types such as FP6 and FP4, enabling up to 64x performance gain relative to FP32. Please refer to the AMD CDNA™3 and AMD CDNA™4 white papers for detailed architecture specifications.

Type	AMD Instinct™ MI325X (CDNA™3)	Speedup vs. FP32
Matrix FP64	163.4 TF	1x
Matrix FP32	163.4 TF	1x
Matrix FP16	1307.4 TF	~8x
Matrix FP8	2614.9 TF	~16x

Type	AMD Instinct™ MI355X (CDNA™4)	Speedup vs. FP32
Matrix FP64	78.6 TF	~0.5x
Matrix FP32	157.3 TF	1x
Matrix FP16	2.5 PF	~16x
Matrix FP8	5 PF	~32x
Matrix FP6	10 PF	~64x
Matrix FP4	10 PF	~64x

2. Low-Precision Floating-Point Types

A binary representation of a floating-point number consists of n bits, where m of n bits represent the mantissa, 1 bit determines the sign and n-m-1 bits represent the exponent. The following image illustrates the binary format of a floating-point number and how the exponent and mantissa are calculated based on its binary representation.

Figure 1: Binary representation of a floating-point number.

Floating-point types are characterized by the number of bits used for the exponent and for the mantissa. Increasing the exponent width extends the range of representable values, while increasing the mantissa width improves precision. Since all floating-point types include the sign bit, a shorthand notation typically specifies only the exponent and mantissa widths. For example, the E4M3 type is an 8-bit floating-point type with 4-bit exponent and 3-bit mantissa. Additionally, a floating-point type is specified by exponent bias - a number that is subtracted from the exponent during conversion from binary format to real value. Given the exponent width, mantissa width, and exponent bias, one can convert the binary representation of a floating-point type (except E8M0) into its real value using the following equation:

Figure 2: Conversion to real value from binary representation for floating-point numbers.

Please note that the equation takes different forms depending on whether the exponent is zero or not. Often, certain exponent and mantissa values are reserved for special values (e.g. NaN, Infinity), which limits the range of representable real numbers. For example, the FP16 type has 5-bit exponent with a nominal range of [0, 1, ... 2^5-1] = [0, 1, ... 31]. However, the exponent value E = 31 is reserved for NaN (if the mantissa M != 0) and infinity (if the mantissa M = 0). Therefore, the largest exponent value that can represent a real number is E = 30.

The following table summarizes low-precision types commonly used in modern AI/ML workloads:

Width	Shorthand	Exp. bias	Range	Zero	NaN	Infinity
16-Bit
	E5M10 (FP16)	15	±65504	S 00000 0000000000	S 11111 xxxxxxxxxx	S 11111 0000000000
	E8M7 (BF16)	127	±3.3895 * 10^38	S 00000000 0000000	S 11111111 xxxxxxx	S 11111111 0000000
8-Bit
	E4M3FN (FP8, OCP)	7	±448	S 0000 000	S 1111 111	n/a
	E4M3FNUZ (FP8)	8	±240	0 0000 000	1 0000 000	n/a
	E5M2 (BF8, OCP)	15	±57344	S 00000 00	S 11111 {01, 10 11}	S 11111 00
	E5M2FNUZ (BF8)	16	±57344	0 00000 00	S 00000 00	n/a
	E8M0	127	2^(±127)	n/a	11111111	n/a
6-Bit
	E2M3	1	±7.5	S 00 000	n/a	n/a
	E3M2 (BF6)	3	±28	S 000 00	n/a	n/a
4-Bit
	E2M1 (FP4)	1	±6	S 00 0	n/a	n/a

Please note that the E4M3 type has two variants: E4M3FN and E4M3FNUZ. Both E4M3FN and E4M3FNUZ use 4 bits for the exponent and 3 bits for the mantissa. They use different exponent biases and differ in the special values they can represent. Neither variant supports infinities, which is why their notations include FN (FiNite). However, E4M3FN supports +0, -0, +NaN and -Nan, while E4M3FNUZ supports only +0 and NaN, hence UZ (Unsigned Zero). The image below demonstrates how to convert a binary sequence into a real value, using E4M3FNUZ type as an example:

Figure 3: E4M3FNUZ encoding details.

FP8 types are divided into E4M3 and E5M2 formats. The E5M2 format is sometimes referred to as BF8, similar to BF16, where exponent width is larger compared to FP16. Similar to E4M3, E5M2 is further subdivided into two variants: E5M2 (OCP) and E5M2FNUZ. The AMD CDNA™3 architecture uses FNUZ variants for both E4M3 and E5M2, whereas the CDNA™4 architecture uses E4M3FN and E5M2 (OCP) variants. E4M3FN and E5M2 are standardized formats defined by the Open Compute Project (OCP). For detailed specifications, see the OCP Microscaling Formats (MX) Specification and the ONNX documentation. For visualization of FP8 values and their binary representations please refer to the FP8 Data table. Additionally, see the chapter “Low-precision floating-point types” in the AMD ROCm™ documentation for details on using low-precision types in HIP.

There is a special 8-bit format, E8M0, which is not used as a standard element data type but instead serves as a scale factor for microscaling types and block-scaled MFMA operations (discussed later in this article). Its value is calculated according to the equation below:

Figure 4: E8M0 encoding details.

The exponent value E = 255 is reserved for NaN values, limiting the range of representable real numbers to [2^-127 ... 2^127].

Similar to FP8, FP6 has two formats: E2M3 and E3M2. The latter, E3M2, is often referred to as BF6 due to its larger exponent width compared to E2M3.

3. Matrix fused-multiply-add (MFMA) Instructions

The AMD CDNA™3 and CDNA™4 architectures support a variety of MFMA operations, which are characterized by the matrix dimensions M, N, K and the data type of input/output matrices. The following table lists all available floating-point MFMA instructions for the AMD CDNA™3 and CDNA™4 architectures. As can be seen from the table, the AMD CDNA™4 architecture extends the set of available MFMA instructions by adding new FP16/BF16 instructions with larger matrix dimensions. Furthermore, it introduces FP6/FP4 data types and provides a completely new set of FP8/FP6/FP4 instructions where the types can be independently used for the matrices A and B. Finally, the AMD CDNA™4 architecture enables MFMA with block exponent scaling.

Type (C,D) ← (A,B)	MxNxK (CDNA™3)	MxNxK (CDNA™4)	Cycles	Note
FP64 ← FP64	16x16x4	16x16x4	64
FP32 ← FP32	32x32x2	32x32x2	64
	16x16x4	16x16x4	32
FP32 ← FP16 (BF16)	32x32x8	32x32x8	32	Both A and B are either FP16 or BF16
	16x16x16	16x16x16	16
	-	16x16x32	16
	-	32x32x16	32
FP32 ← FP8	16x16x32	16x16x32	16	FP8 (E4M3) or BF8 (E5M2) can be used independently for A and B
	32x32x16	32x32x16	32
FP32 ← FP8/FP6/FP4	-	16x16x128	16 or 32	FP4, FP6 or FP8 can be used independently for A and B. Larger cycle count if either matrix A or B is FP8
	-	32x32x64	32 or 64
FP32 ← MXFP8/MXFP6/MXFP4	-	16x16x128	16 or 32	FP4, FP6 or FP8 can be used independently for A and B. Larger cycle count if either matrix A or B is FP8
	-	32x32x64	32 or 64

Please note that the table lists only floating-point type MFMA instructions with batch size = 1. In addition to them, the AMD CDNA™3 and CDNA™4 architectures support batched MFMA operations, where multiple output matrices are computed in parallel. These instructions are not covered in this article. See the Chapter 7 “Matrix Arithmetic Instructions” in the AMD CDNA™3 and AMD CDNA™4 ISA reference guides for the full list of available MFMA instructions.

The table above specifies cycle count for each MFMA operation. Given a known cycle count, one can estimate theoretical peak performance in TFLOP/s of corresponding MFMA operation using the formula below:

2*M*N*K * num_matrix_cores * (max_engine_clock / cycle_count) / 10^6,

where

num_matrix_cores is total number of matrix cores in a GPU (specified in white paper)
max_engine_clock is max engine clock (peak) in MHz (specified in white paper)
cycle_count is cycle count of corresponding MFMA instruction
M, N, K are matrix dimensions

Using this formula and the MFMA instruction 32x32x8 FP16 as an example, we can estimate theoretical peak FP16 Matrix Core performance on the AMD Instinct™ MI325X:

2*32*32*8 * 1216 * (2100 / 32) / 10^6 = 1307.4 TFLOP/s.

4. Compiler Intrinsics

To use Matrix Core instructions in HIP kernels, LLVM provides built-in compiler intrinsic functions. The list of all available compiler intrinsics can be found in the LLVM Github repository. The syntax of the MFMA intrinsics has the following format:

d_reg = __builtin_amdgcn_mfma_ODType_MxNxKInDType(a_reg, b_reg, c_reg, cbsz, abid, blgp),

where

MxNxK specifies the shapes of the matrices A, B, C, D,
ODType is data type of the matrices C and D,
InDType is data type of the input matrices A and B,
a_reg is a scalar/vector containing a portion of the matrix A,
b_reg is a scalar/vector containing a portion of the matrix B,
c_reg is a vector containing a portion of the matrix C,
d_reg is a vector containing a portion of the matrix D,
cbsz, abid, blgp are broadcast flags. For the following discussion, these flags are irrelevant and are, therefore, set to 0 by default, unless specified otherwise. Please refer to the ISA reference guide for detailed information on the broadcast flags.

For example,

__builtin_amdgcn_mfma_f32_16x16x16f16 performs 16x16x16 MFMA, where both input matrices A and B have type FP16 and the output matrix has type FP32
__builtin_amdgcn_mfma_f32_32x32x16_fp8_fp8 performs 32x32x16 MFMA, where both input matrices A and B have type FP8(E4M3) and the output matrix is stored in FP32
__builtin_amdgcn_mfma_f32_32x32x16_fp8_bf8 performs 32x32x16 MFMA, where the matrix A has type FP8(E4M3) and the matrix B has type BF8(E5M2).

The MFMA instructions are wavefront-level (warp-level) instructions, where all work-items (threads) within a wavefront collectively perform a single MFMA operation and the operands A, B, C, D are distributed across work-items so that each work-item in the wavefront holds a portion of the operands. In order to use the MFMA instructions, it’s required to understand how the operands are distributed across threads within a wavefront. The ISA reference guide fully specifies the data layout for all available MFMA instructions. For illustrative purposes, the next chapter explains a subset of the MFMA instructions and the corresponding data layouts.

5. Examples

Important note: In the following discussion we assume the matrices are stored in row-major order. The wavefront size on the AMD CDNA™ architecture is 64. The shapes of the matrices A, B, C, D are MxK, KxN, MxN, and MxN, respectively. The first dimension denotes the number of rows and the second dimension the number of columns in a matrix. For example, the matrix A has M rows and K columns.

5.1. __builtin_amdgcn_mfma_f32_32x32x2f32

In this example we will multiply matrix A of size 32x2 with matrix B of size 2x32 using single wavefront (64 threads) and single MFMA instruction. The output matrix C has shape 32x32. The input and output matrices are FP32. Since threads within a wavefront collectively perform single MFMA instruction, the operands are distributed across the threads. Each thread stores

M * K / wavefront_size = 32 * 2 / 64 = 1 entries of the matrix A
K * N / wavefront_size = 2 * 32 / 64 = 1 entries of the matrix B
M * N / wavefront_size = 32 * 32 / 64 = 16 entries of the matrix C

The operands are distributed according to the scheme below. The matrix elements highlighted in light red are those stored by the thread with index 0 within the wavefront.

Figure 5: Data layout for `__builtin_amdgcn_mfma_f32_32x32x2f32`. The operands are stored in row-major order.

The code example below demonstrates how this operation can be implemented as a HIP kernel:

#include 

using fp32x16_t = __attribute__((vector_size(16 * sizeof(float)))) float;

__global__ void
mfma_fp32_32x32x2_fp32(const float* A, const float* B, float* C) {
    float a_reg;
    float b_reg;
    fp32x16_t c_reg {};

    const float* ldg_a_ptr = A + threadIdx.x / 32 + 2 * (threadIdx.x % 32);
    const float* ldg_b_ptr = B + threadIdx.x % 32 + (threadIdx.x / 32) * 32;

    a_reg = *ldg_a_ptr;
    b_reg = *ldg_b_ptr;

    c_reg = __builtin_amdgcn_mfma_f32_32x32x2f32(a_reg, b_reg, c_reg, 0, 0, 0);

    for (int i = 0; i < 4; i++) {
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + i * 32 * 8]          = c_reg[i * 4];
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + 32 * 1 + i * 32 * 8] = c_reg[i * 4 + 1];
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + 32 * 2 + i * 32 * 8] = c_reg[i * 4 + 2];
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + 32 * 3 + i * 32 * 8] = c_reg[i * 4 + 3];
    }
}

The GPU kernel can then be invoked on the host using a single wavefront:

mfma_fp32_32x32x2_fp32<<<1, 64>>>(A_device, B_device, C_device);

Please note that we use the vector data type fp32x16_t to store the entries of the matrix C in registers. Additionally, we zero-initialize c, since we compute C = A * B without accumulation.

5.2. __builtin_amdgcn_mfma_f32_16x16x16f16

This example demonstrates how to multiply matrix A of size 16x16 with matrix B of size 16x16 using single wavefront (64 threads) and single MFMA instruction. The output matrix C has shape 16x16. The input matrices are stored in FP16 and the output matrix stored in FP32. In this case, each thread stores 4 entries of the matrix A, 4 entries of the matrix B and 4 entries of the matrix C. The data layout for this instruction is shown below. For illustrative purposes, the elements stored by the first thread within the wavefront are highlighted in red.

Figure 6: Data layout for __builtin_amdgcn_mfma_f32_16x16x16f16. The operands are stored in row-major order.

Corresponding HIP kernel is implemented below:

#include 
#include 

using fp16_t = _Float16;
using fp16x4_t = __attribute__((vector_size(4 * sizeof(fp16_t)))) fp16_t;
using fp32x4_t = __attribute__((vector_size(4 * sizeof(float)))) float;

__global__ void
mfma_fp32_16x16x16_fp16(const fp16_t* A, const fp16_t* B, float* C) {

    fp16x4_t a_reg;
    fp16x4_t b_reg;
    fp32x4_t c_reg {};

    a_reg = *reinterpret_cast<const fp16x4_t*>(A + 4 * (threadIdx.x / 16) + 16 * (threadIdx.x % 16));

    for (int i = 0; i < 4; i++) {
        b_reg[i] = *(B + i * 16 + threadIdx.x % 16 + (threadIdx.x / 16) * 64);
    }

    c_reg = __builtin_amdgcn_mfma_f32_16x16x16f16(a_reg, b_reg, c_reg, 0, 0, 0);

    for (int i = 0; i < 4; i++) {
        *(C + i * 16 + threadIdx.x % 16 + (threadIdx.x / 16) * 64) = c_reg[i];
    }
}

Please note that both __half and _Float16 types can be used in device code. However, the host supports only _Float16 type for arithmetic operations. As in the previous example, we use vector data types to store the matrix elements in registers.

5.3. __builtin_amdgcn_mfma_f32_32x32x16_fp8_fp8

In this example we will multiply matrix A of size 32x16 with matrix B of size 16x32 using single wavefront (64 threads) and single MFMA instruction. The output matrix C has shape 32x32. The input matrices are stored in FP8 and the output matrix is stored in FP32. In this scenario, each thread stores 8 elements of the matrix A, 8 elements of the matrix B and 16 elements of the matrix C. The operands are distributed according to the scheme below. For illustrative purposes, the elements stored by the first thread within the wavefront are highlighted in red.

Figure 7: Data layout for __builtin_amdgcn_mfma_f32_32x32x16_fp8_fp8. The operands are stored in row-major order.

The code example below implements this operation as a HIP kernel:

#include 
#include 

using fp8_t = __hip_fp8_storage_t;
using fp8x8_t = __attribute__((vector_size(8 * sizeof(fp8_t)))) fp8_t;
using fp32x16_t = __attribute__((vector_size(16 * sizeof(float)))) float;

__global__ void
mfma_fp32_32x32x16_fp8_fp8(const fp8_t* A, const fp8_t* B, float* C) {
    fp8x8_t a_reg;
    fp8x8_t b_reg;
    fp32x16_t c_reg {};

    a_reg = *reinterpret_cast<const fp8x8_t*>(A + (threadIdx.x / 32) * 8 + (threadIdx.x % 32) * 16);

    for (int i = 0; i < 8; i++) {
        b_reg[i] = *(B + i * 32 + threadIdx.x % 32 + (threadIdx.x / 32) * 8 * 32);
    }

    c_reg = __builtin_amdgcn_mfma_f32_32x32x16_fp8_fp8((long)a_reg, (long)b_reg, c_reg, 0, 0, 0);

    for (int i = 0; i < 4; i++) {
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + i * 32 * 8]          = c_reg[i * 4];
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + 32 * 1 + i * 32 * 8] = c_reg[i * 4 + 1];
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + 32 * 2 + i * 32 * 8] = c_reg[i * 4 + 2];
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + 32 * 3 + i * 32 * 8] = c_reg[i * 4 + 3];
    }
}

To define FP8, we use __hip_fp8_storage_t type from hip_fp8.h. Note that the intrinsic function expects its first two operands to be of type long. To compile the code, the operands a and b are, therefore, converted to long.

5.4. __builtin_amdgcn_mfma_scale_f32_32x32x64_f8f8

Important note: the MFMA instruction discussed in this example is supported only on AMD CDNA™4 GPUs (gfx950). Please make sure to install AMD ROCm™ version 7.0 or later.

The AMD CDNA™4 architecture introduces a new family of MFMA instructions with block exponent scaling. The syntax of these instructions differs from the classic MFMA compiler intrinsics:

d_reg = __builtin_amdgcn_mfma_scale_f32_MxNxK_f8f6f4(a_reg, b_reg, c_reg, Atype, Btype, OPSEL_A, scale_a, OPSEL_B, scale_b)

MxNxK specifies shapes of the matrices A, B, C, D
a_reg is a vector containing elements of the matrix A,
b_reg is a vector containing elements of the matrix B,
c_reg is a vector containing elements of the matrix C,
d_reg is a vector containing elements of the matrix D,
Atype is an integer that specifies the data type of the matrix A. The following values are possible: 0 = E4M3 (fp8), 1 = E5M2(bf8), 2 = E2M3(fp6), 3 = E3M2(bf6), 4 = E2M1(fp4),
Btype is an integer that specifies the data type of the matrix B. The following values are possible: 0 = E4M3 (fp8), 1 = E5M2(bf8), 2 = E2M3(fp6), 3 = E3M2(bf6), 4 = E2M1(fp4),
OPSEL_A, OPSEL_B are OPSEL codes. These arguments are not relevant for the discussion and therefore will be set to 0,
scale_a, scale_b are scalars / vectors containing scale factors of type E8M0.

As an example, let’s take a closer look at the instruction __builtin_amdgcn_mfma_scale_f32_32x32x64_f8f6f4. The inputs to this instruction are

Matrix A of size 32x64
Matrix Ax of size 32x2
Matrix B of size 64x32
Matrix Bx of size 2x32

The output matrix C has shape 32x32. Specifically, this instruction performs the following operation using single wavefront (64 threads):

Figure 8: Block-scaled matrix multiplication via __builtin_amdgcn_mfma_scale_f32_32x32x64_f8f6f4.

During dot product operations, the scales Ax, Bx are applied after the normal dot product and prior to output/accumulation.

In this example, we will multiply two FP8 matrices using the __builtin_amdgcn_mfma_scale_f32_32x32x64_f8f6f4 intrinsic function. The input matrices A, B are stored in FP8 format, while the output matrix is stored in FP32. The scale matrices Ax, Bx contain elements of type E8M0. Each thread stores 32 entries from the matrix A, 1 entry from the matrix Ax, 32 entries from the matrix B, 1 entry from the matrix Bx and 16 entries from the matrix C. The operands are distributed according to the scheme below. Please note that this scheme is valid only if both input matrices A and B have FP8 type. For illustrative purposes, the matrix elements stored by the thread with threadIdx.x = 0 are highlighted in light red, while the elements stored by the thread with threadIdx.x = 32 within the wavefront are highlighted in light green.

Figure 9: Data layout for __builtin_amdgcn_mfma_scale_f32_32x32x64_f8f6f4 with FP8 input matrices. The operands are stored in row-major order.

The following code example shows how this operation can be implemented as a HIP kernel:

#include 
#include 

using fp8_t = __amd_fp8_storage_t;
using fp8x32_t = __attribute__((vector_size(32 * sizeof(fp8_t)))) fp8_t;
using fp32x16_t = __attribute__((vector_size(16 * sizeof(float)))) float;

__global__ void
mfma_fp32_32x32x64_fp8_fp8(const fp8_t* A, const fp8_t* B, float* C) {
    fp8x32_t a_reg;
    fp8x32_t b_reg;
    fp32x16_t c_reg {};

    const fp8_t* ldg_a = A + (threadIdx.x % 32) * 64 + (threadIdx.x / 32) * 16;
    for (int i=0; i < 2; i++) {
        for (int j=0; j < 16; j++) {
            a_reg[i*16 + j] = *(ldg_a + i * 32 + j);
        }
    }

    const fp8_t* ldg_b = B + threadIdx.x % 32 + 32 * 16 * (threadIdx.x / 32);

    for (int i=0; i<2; i++) {
        for (int j=0; j < 16; j++) {
            b_reg[i*16 + j] = *(ldg_b + 32 * j + i * 32 * 32);
        }
    }

    uint8_t scale_a = 127;
    uint8_t scale_b = 127;

    c_reg = __builtin_amdgcn_mfma_scale_f32_32x32x64_f8f6f4(a_reg, b_reg, c_reg, 0, 0, 0, scale_a, 0, scale_b);

    for (int i = 0; i < 4; i++) {
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + i * 32 * 8]          = c_reg[i * 4];
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + 32 * 1 + i * 32 * 8] = c_reg[i * 4 + 1];
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + 32 * 2 + i * 32 * 8] = c_reg[i * 4 + 2];
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + 32 * 3 + i * 32 * 8] = c_reg[i * 4 + 3];
    }
}

Please note that in this example we use __amd_fp8_storage_t type defined in hip_ext_ocp.h to represent FP8. This library provides extensions APIs for low-precision and micro-scaling formats, and compared to hip_fp8.h, exposes a wider capability set. gfx950 provides hardware acceleration for these APIs. Most of the APIs are 1 to 1 mapping of hardware instruction. Additionally, we use uint8_t type to represent E8M0 scale factors. Since scale_a and scale_b encode exponent values, the corresponding actual scale factors are 2^(scale_a - 127) and 2^(scale_b - 127). If scale_a = scale_b = 127, the actual scale factors are equal to 1 and no scaling is applied.

5.5. __builtin_amdgcn_mfma_scale_f32_32x32x64_f4f4

In our last example, we demonstrate how to multiply two FP4 matrices using the __builtin_amdgcn_mfma_scale_f32_32x32x64_f8f6f4 intrinsic function. As in the previous example, each thread stores 32 entries from the matrix A, 1 entry from the matrix Ax, 32 entries from the matrix B, 1 entry from the matrix Bx and 16 entries from the matrix C. The data layout for the output matrix remains the same as in the FP8 case. However, the data layout for the input matrices is different and depicted below. For illustrative purposes, the matrix elements stored by the thread with threadIdx.x = 0 are highlighted in light red, while the elements stored by the thread with threadIdx.x = 32 within the wavefront are highlighted in light green.

Figure 10: Data layout for __builtin_amdgcn_mfma_scale_f32_32x32x64_f8f6f4 with FP4 input matrices. The operands are stored in row-major order.

The code snippet below demonstrates how to implement this operation as a HIP kernel:

#include 
#include 

using fp4x2_t = __amd_fp4x2_storage_t;
using fp4x64_t  = fp4x2_t __attribute__((ext_vector_type(32)));
using fp32x16_t = __attribute__((vector_size(16 * sizeof(float)))) float;

__global__ void
mfma_fp32_32x32x64_fp4_fp4(const fp4x2_t* A, const fp4x2_t* B, float* C) {

    fp4x64_t a_reg {};
    fp4x64_t b_reg {};
    fp32x16_t c_reg {};

    const fp4x2_t* ldg_a = A + (threadIdx.x % 32) * 32 + (threadIdx.x / 32) * 16;

    for (int i = 0; i < 16; i++) {
        a_reg[i] = *(ldg_a + i);
    }

    const fp4x2_t* ldg_b = B + (threadIdx.x % 32) / 2 + 16 * 32 * (threadIdx.x / 32);
    int b_extract_idx = threadIdx.x % 2;

    for (int i = 0; i < 16; i++) {
        uint8_t tmp0 = __amd_extract_fp4(*(ldg_b + 16 * 2 * i), b_extract_idx);
        uint8_t tmp1 = __amd_extract_fp4(*(ldg_b + 16 * (2 * i + 1)), b_extract_idx);
        b_reg[i] = __amd_create_fp4x2(tmp0, tmp1);
    }

    uint8_t scale_a = 127;
    uint8_t scale_b = 127;

    c_reg = __builtin_amdgcn_mfma_scale_f32_32x32x64_f8f6f4(a_reg, b_reg, c_reg, 4, 4, 0, scale_a, 0, scale_b);

    for (int i = 0; i < 4; i++) {
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + i * 32 * 8]          = c_reg[i * 4];
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + 32 * 1 + i * 32 * 8] = c_reg[i * 4 + 1];
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + 32 * 2 + i * 32 * 8] = c_reg[i * 4 + 2];
        C[threadIdx.x % 32 + (threadIdx.x / 32) * 4 * 32 + 32 * 3 + i * 32 * 8] = c_reg[i * 4 + 3];
    }
}

Since memory addressing is not allowed at a granularity smaller than 8 bits, we use __amd_fp4x2_storage_t (an alias for uint8_t) to store the input matrices and enable pointer operations. Note that the FP4 elements that need to be loaded from the matrix B are not contiguous in memory. To extract a single FP4 element, we use the __amd_extract_fp4 function provided in hip_ext_ocp.h. This function returns one FP4 element (of type uint8_t) from a fp4x2 vector, based on the index passed as the second argument:

uint8_t __amd_extract_fp4(const __amd_fp4x2_storage_t x, const size_t index) {
    if (index == 0) return (x & 0xFu);
    return (x >> 4);
}

Two FP4 values are then combined into __amd_fp4x2_storage_t using __amd_create_fp4x2:

__amd_fp4x2_storage_t __amd_create_fp4x2(const uint8_t x, const uint8_t y) {
    __amd_fp4x2_storage_t ret = 0;
    ret = x | (y << 4);
    return ret;
}

The compiler intrinsic function __builtin_amdgcn_mfma_scale_f32_32x32x64_f8f6f4 requires its first two arguments to be 256 bits wide. Since 32 FP4 elements occupy only 128 bits, we define fp4x64_t, which is 256 bits wide. In this type, 128 bits contain data, while the remaining 128 bits are zero. This allows us to pass a_reg and b_reg to the intrinsic function and compile the code successfully.

Summary

In this article, we introduced Matrix Core instructions available on the AMD CDNA™3 and CDNA™4 architectures. We covered floating-point formats in detail, including modern low-precision element data types such as FP8, FP6, FP4, and the scale data type E8M0. We further explained how the floating-point types are represented as binary sequences and demonstrated, with concrete examples, how to convert their binary representations into real values. Next, we listed Matrix Core instructions supported by the modern CDNA™ architectures and discussed how to calculate the theoretical peak performance of Matrix Cores for specific MFMA instructions. To make the discussion more practical, we reviewed the compiler intrinsic functions that allow users to program Matrix Cores inside HIP kernels. Finally, we examined a subset of MFMA instructions in detail, providing code examples and illustrations to explain data layout and demonstrate how to implement simple mixed-precision MFMA operations in HIP. For additional information on Matrix Cores and low-precision data types, please refer to the following resources:

Advanced Matrix Multiplication Optimization on NVIDIA GPUs

2025-01-12T09:35:01+00:00

This project is inspired by the outstanding works of Andrej Karpathy, George Hotz, Scott Gray, Horace He, Philippe Tillet, Jeremy Howard, Lei Mao and the best CUDA hackers from the GPU MODE community (Discord server). A special thanks to Mark Saroufim and Andreas Köpf for running GPU MODE and all you’ve done for the community.

The code is available at sgemm.cu. This article complements my blog post, which covers the implementation of FP32 matrix multiplication that outperforms BLAS libraries on modern Intel and AMD CPUs. Today we’ll walk through a GPU implementation of SGEMM (Single-precision GEneral Matrix Multiply) operation defined as C := alpha*A*B + beta*C. The blog delves into benchmarking code on CUDA devices and explains the algorithm’s design along with optimization techniques. These include inlined PTX, asynchronous memory copies, double-buffering, avoiding shared memory bank conflicts, and efficient coalesced storage through shared memory. I’d also like to mention that the high-level algorithm design used in this project was developed by the excellent engineers at NVIDIA and has been extensively studied in prior works on cuBLAS and CUTLASS. My main contribution was translating it into efficient CUDA/PTX code. The goal of this project wasn’t to build an SGEMM that would magically outperform cuBLAS on all GPUs and all matrix sizes. This is especially pointless, given the open-sourced, lightweight CUTLASS library. Instead, the project primarily targets CUDA learners and aims to bridge the gap between the SGEMM implementations explained in books/blogs and those used in NVIDIA’s BLAS libraries. While the implementation is expected to deliver high performance on Ada/Ampere/Volta/Turing devices, it was specifically fine-tuned for and tested on a local NVIDIA RTX 3090 (=GA102 chip: RTX 3080, A10, A40, A6000). The achieved performance is shown below, comparing results with locked and unlocked GPU core frequencies against cuBLAS and Simon Boehm’s highly cited work (used in llamafile, aka tinyBLAS). I plan to continue publishing educational content on high-performance kernels used in AI/ML. Let me know what topics you’d like to see next! Projects currently in development: beating NVIDIA on Tensor Cores, Stream-K GEMM, FlashAttention, xLSTM. If you enjoy educational content like this and would like to see more, please share this article. Your feedback would be greatly appreciated!

P.S. Please feel free to get in touch if you are interested in collaborating. My contact information is available on the homepage.

1. Introduction

I clearly remember Andrej’s post on the current state of the existing cuda learning materials vs. cuda code used in high-performance libraries:

Indeed, when it comes to SGEMM implementations, there are some excellent educational blog posts, such as

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance (mentioned by Andrej)
CUDA Matrix Multiplication Optimization

that break down, step by step, how to optimize a CUDA matmul kernel. However, in terms of achieved performance, none of them come close to matching the speed of cuBLAS or CUTLASS, especially when using recent CUDA versions and if benchmarked properly. From my experiments, these implementations achieve 50-70% of cuBLAS’ performance at best. Additionally, I found the explanations in both blog posts a bit overcomplicated in the final optimization steps. Nevertheless, I still think these resources are great for anyone starting with CUDA programming since they provide good foundational knowledge.

On the other hand, I’ve seen some really fast SGEMM implementations with cuBLAS-level performance:

The problem is that they are undocumented, difficult to find and understand, especially for a CUDA beginner. A similar problem exists with CUTLASS. While it is highly performant, there is a lack of introductory or educational materials explaining how it is internally designed and implemented in efficient CUDA/PTX. Another notable project is MaxAs, an assembler for the Maxwell architecture developed over a decade ago by Scott Gray. This tool enables programming directly in SASS (the assembly language for NVIDIA GPUs), allowing direct communication with the hardware instead of relying on the hardware-agnostic CUDA/PTX. Using MaxAs, Scott wrote an SGEMM implementation that achieved around 98% of the GM204 chip’s theoretical maximum FLOPS, surpassing cuBLAS by an average of 5%. While the results are impressive, programming in SASS is inflexible and requires deep understanding of the underlying hardware. Furthermore, with significant advancements in the compiler since then, programming directly in SASS is only advantageous in exceptional cases (for example, if you build tinygrad). CUTLASS achieves performance on par with cuBLAS across various GPU architectures and matrix sizes using only CUDA/PTX code.

But can we actually exceed the cuBLAS barrier? In the following chapters, we will briefly review the high-level SGEMM design used in CUTLASS, and discuss how to translate this design into efficient CUDA/PTX. This guide assumes only a basic knowledge of the CUDA programming model and linear algebra. If you are new to CUDA programming, I strongly recommend starting with these short introductory articles:

Before we proceed with implementation, let’s talk about benchmarking code on NVIDIA GPUs - a topic often overlooked. Properly benchmarking code is just as important as the code itself, particularly when comparing different implementations.

2. How to Benchmark Code on CUDA Devices?

The most reliable way to measure kernel duration is by profiling with NVIDIA Nsight Compute and manually extracting performance data. To obtain deterministic and reproducible results, Nsight Compute automatically applies the following settings:

Clock Control: locks GPU clock frequencies to their base values
Cache Control: flushes all GPU caches before each replay pass
Persistence mode

Alternatively, you can apply these settings manually and measure kernel duration at runtime without relying on external profilers. On Ubuntu, you can retrieve the base core clock frequency using:

nvidia-smi base-clocks

For instance, on an RTX 3090, the base core clock frequency is 1395 MHz. Next, you’ll need the memory clock frequencies, which work in combination with the base core clock:

nvidia-smi -q -d supported_clocks

From the list of supported frequencies, choose the fastest memory clock compatible with the base core frequency. Memory clock speeds are generally more stable than core clock speeds. To lock the clock frequencies and enable persistence mode, run the following commands:

sudo nvidia-smi --persistence-mode=1
# NVIDIA RTX 3090
sudo nvidia-smi --lock-gpu-clocks=1395
sudo nvidia-smi --lock-memory-clocks=9501

To reset the core and memory clock frequencies, you can use:

sudo nvidia-smi --reset-gpu-clocks
sudo nvidia-smi --reset-memory-clocks
sudo nvidia-smi --persistence-mode=0

GPU clock frequencies may drop due to the GPU’s thermal state, but for high-performance applications, throttling is often caused by power limits. Faulty hardware can also lead to throttling. It’s a good idea to monitor the GPU’s state at least during a test run. Use the following command to keep track of power draw, clock speeds, and throttling reasons in real time:

watch -n 0.1 nvidia-smi --query-gpu=power.draw,clocks.sm,clocks.mem,clocks_throttle_reasons.active --format=csv

A sample output might look like this:

308.50 W, 1395 MHz, 9501 MHz, 0x0000000000000000

The bit mask 0x0000000000000000 indicates no throttling, and the clocks are running at their maximum speeds. A value of 0x0000000000000001 indicates an idle state. Any other values suggest throttling is occurring. For a full list of bit mask values and their meanings, refer to the NvmlClocksThrottleReasons documentation.

Once you’ve locked the clock frequencies, you can measure the kernel duration directly in CUDA using CUDA events. Here’s an example:

cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
float elapsed_time_ms = 0.0;

cudaEventRecord(start);
kernel<<<...>>>(...);
cudaEventRecord(stop);

cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms, start, stop);

For reliable measurements, multiple replay passes are typically used. In such cases, the GPU cache should be flushed before each kernel replay. This can be done using cudaMemsetAsync as shown in nvbench:

// Flush L2 cache
int dev_id{};
int m_l2_size{};
void* buffer;
checkCudaErrors(cudaGetDevice(&dev_id));
checkCudaErrors(cudaDeviceGetAttribute(&m_l2_size, cudaDevAttrL2CacheSize, dev_id));
if (m_l2_size > 0) {
    checkCudaErrors(cudaMalloc(&buffer, static_cast<std::size_t>(m_l2_size)));
    int* m_l2_buffer = reinterpret_cast<int*>(buffer);
    checkCudaErrors(cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size)));
    checkCudaErrors(cudaFree(m_l2_buffer));
}

Locking the clock frequencies to their base values is a reliable way to measure the speed of your kernel. However, in real-world scenarios, algorithms don’t typically run with locked clocks. To achieve optimal performance, your algorithm needs to be both fast and power-efficient. The less power your algorithm consumes, the higher the clock speeds your hardware can maintain. NVIDIA GPUs often reduce clock frequencies aggressively, well before hitting their power limits, which can significantly degrade application performance. To account for this, we benchmark our implementation under both locked and unlocked clock conditions, testing for both speed and power efficiency. In our benchmarks, we evaluate matrix sizes ranging from 1024 to 12,800 with a step size of 128. For each matrix size, we launch 1000*exp((-matsize+1024)/3100.0)) kernel replays and calculate the execution time as the average of the second half of the replays. For example, given matrix size problem m=n=k=4096, we run the sgemm 1000*exp((-4096 + 1024)/3100.0))=371 times and measure the average duration of the last 185 runs, ensuring the clocks have stabilized. This profiling strategy leads to consistent and reproducible results, even when GPU clocks are unlocked.

Avoid using WSL for performance measurements. To ensure accurate and reliable results, please use a native Linux environment.

3. Memory Layout

Without loss of generality in this implementation, we assume matrices are stored in row-major order. A matrix A with dimensions M x N is stored as contiguous array of length M*N. Elements A[row][col] are accessed via a 1D raw C pointer ptr[row*N + col] with 0<=col<=N-1 and 0<=row<=M-1. Matrix multiplication is denoted as $C=AB$, where the shapes of matrices $A, B, C$ are $M \times K, K \times N$ and $M \times N$, respectively.

To adapt this implementation for matrices stored in column-major order, simply swap the operands $A$ and $B$, because:

\[C^\text{T} = (A B)^\text{T} = B^\text{T} A^\text{T},\]

Here, $A, B, C$ are matrices stored in row-major order, while $A^\text{T}, B^\text{T}, C^\text{T}$ are the corresponding transposed matrices (i.e., stored in column-major order).

cuBLAS provides an API to calculate SGEMM:

cublasSgemm(m, n, k, A, lda, B, ldb, C, ldc); // simplified form

with m, n, k denote the matrix sizes $M, N, K$. The parameters lda, ldb, ldc are the leading dimensions of matrices $A, B, C$, respectively. The leading dimension is the length of the fastest-varying dimension when iterating over the matrix elements (i.e., the length of the first dimension). For matrices stored in row-major order, the leading dimension is usually the number of columns, so typically lda=k, ldb=n, ldc=n. However, this isn’t always the case. In scenarios where you need to compute a submatrix of a larger matrix, the leading dimension might be larger than the number of columns.

Matrices may be also padded with zeros to support vectorized memory loads or tensor cores. The vectorized load instructions allow to load multiple elements at once using just 1 instruction. Though the vectorized loads reduce total number of instructions and improve bandwidth utilization, they also impose alignment constraint on input data, so that the leading dimension must be divisible by 2 (for 64-bit loads) or 4 (for 128-bit loads). The figure below illustrate the case for 128-bit (=4 floats) loads.

Note how it’s impossible to load the elements of the first row without touching the elements of the next row if the leading dimension is not divisible by 4. Padding with zeros helps, but requires additional memory. Another solution would be to check at runtime if the leading dimension is divisible by 4. If it is - then use vectorized loads, if not - scalar loads. Additionally, zero padding was commonly used in the past to enable tensor core computations. For instance, in cuBLAS versions < 11, Tensor Core FP16 operations required m, n, k to be multiples of 8.

4. Parallel Thread Execution

The CUDA compilation trajectory of a .cu file looks as follows:

During Stage 1 CUDA code is compiled to PTX (parallel thread execution) instructions - intermediate high-level code, which can be considered as assembly for a virtual GPU architecture. Such a virtual GPU is defined entirely by the set of capabilities, or features, that it provides to the application. PTX doesn’t run on any real architecture, directly. It must be optimized and translated to native target-architecture instructions (Stage 2). NVIDIA provides a mechanism to insert PTX code into your CUDA program, so that you can mix CUDA/PTX in source code and still have benefits of code optimizations during the PTX generation. By rewriting parts of your code in PTX, you can 1) reduce total number of generated PTX instructions 2) exactly specify PTX instructions you need 3) tune the instructions through qualifiers 4) apply optimizations that are either lacking in the compiler or prohibited by C++ language extensions. Important! Using inline PTX Assembly will not make your code automatically faster than the one written in CUDA. It will only be faster if your hand-written PTX is better than the generated by the compiler.

In this implementation we will program some parts of the algorithm directly in PTX, so I highly recommend to check this short overview of inline ptx assembly if you have never used it before. The PTX instructions are well documented and can be found at PTX Instruction Set. We will now briefly review the PTX instructions used in this implementation.

4.1. Global Memory Loads

For global memory loads we will use ld.global.f32 instruction. Here, “ld” denotes “load” and “f32” - “32-bit float”. The following CUDA code

float reg; // single float register
float* gmem_ptr = data_in_global_memory; // pointer to global memory
reg = *gmem_ptr; // global memory -> register transfer

can be implemented in PTX as:

float reg; // single float register
float* gmem_ptr = data_in_global_memory; // pointer to global memory
asm volatile("ld.global.f32 %0, [%1];" : "=f"(reg) : "l"(gmem_ptr));

The f in "=f" denotes float datatype and the = modifier specifies that the register is written to. The l represents unsigned 64-bit integer. We also use volatile keyword to ensure that the instruction is not deleted or moved during generation of PTX.

4.2. Global Memory Stores

For global memory stores there is `st.global.f32 instruction:

float reg; // single float register
float* gmem_ptr = data_in_global_memory; // pointer to global memory
// *gmem_ptr = reg; can be implemented in PTX as:
asm volatile("st.global.f32 [%0], %1;" : : "l"(gmem_ptr), "f"(reg));

4.3. Global to Shared Memory Transfers

When you write something like this in CUDA:

__shared__ float smem_ptr[n]; // pointer to shared memory
float* gmem_ptr = data_in_global_memory; // pointer to global memory
*smem_ptr = *gmem_ptr; // global to shared memory transfer

a two-step process occurs. First, the data is fetched from global memory into registers and then that data is copied from registers into shared memory. Additionally the data is cached in all cache levels during the transfer.

Global to shared memory transfers

For this reason, a global to shared memory transfer in PTX consists of two data movement instructions ld.global and st.shared:

__shared__ float smem_ptr[n]; // pointer to shared memory
uint64_t smem_addr;
// convert generic address to shared address (store location for st.shared instruction)
asm volatile("cvta.to.shared.u64 %0, %1;" : "=l"(smem_addr) : "l"(smem_ptr));

float* gmem_ptr = data_in_global_memory; // pointer to global memory
float buffer;
// global memory -> register
asm volatile("ld.global.f32 %0, [%1];" : "=f"(buffer) : "l"(gmem_ptr));
// register -> shared memory
asm volatile("st.shared.f32 [%0], %1;" : : "l"(smem_addr), "f"(buffer));

Prior to Ampere architecture it was not possible to transfer data from global memory directly to shared memory mitigating storing in registers. Starting from Ampere architecture, there are asynchronous copy instructions that allow this. The usage of these instructions will be demonstrated later.

4.4. Vectorized Shared Memory Loads and Stores

In PTX you can also implement vectorized memory operations (loading/storing multiple elements with one instruction). Here, v4 denotes vector with four elements:

float reg0, reg1, reg2, reg3;
uint64_t addr;
...
// Shared memory 128-bit loads
asm volatile("ld.shared.v4.f32 {%0, %1, %2, %3}, [%4];"
             : "=f"(reg0), "=f"(reg1), "=f"(reg2), "=f"(reg3)
             : "l"(addr));
// Shared memory 128-bit stores
asm volatile("st.shared.v4.f32 [%0], {%1, %2, %3, %4};"
             :
             : "l"(addr), "f"(reg0), "f"(reg1), "f"(reg2), "f"(reg3));

4.5. Predicated Execution

In PTX conditional executions are implemented using optional guard predicates. The following CUDA code:

float reg;
float* ptr; //pointer to global memory
unsigned guard;
...
if (guard != 0) {
    *ptr = reg;
}

can be converted to PTX as:

float reg;
float* ptr;
unsigned guard;
...
asm volatile(".reg .pred p;\n\t" // declare predicate 'p'
             ".setp.ne.u32 p, %2, 0;\n\t" // set 'p' to true if (guard != 0); ne="not equal"
             "@p ld.global.f32 %0, [%1];\n\t" // execute instruction if 'p' is true
             : "=f"(reg)
             : "l"(ptr), "r"(guard));

We use guard predicates in combination with global load/store instructions to perform global memory access only if it is not out of bounds.

5. SGEMM Design

Let’s now break down the high-level design of the algorithm. The paper Strassen’s Algorithm Reloaded on GPUs contains, in my opinion, one of the best visualizations of the SGEMM design from the CUTLASS library. The SGEMM algorithm can be roughly divided into three main parts:

Transferring data from global to shared memory
Loading data from shared memory and performing arithmetic operations
Writing results back to global memory.

Each of these steps must be carefully optimized to achieve high overall performance. In the following sections, we’ll explore each step in detail and discuss efficient implementation strategies. It’s worth mentioning that the first step - “transferring data from global memory to shared memory” is the most challenging to grasp. However, once you understand this part, the remaining steps become much easier to follow.

5.1. Transferring data from global to shared memory

Source: Strassen’s Algorithm Reloaded on GPUs

To parallelize $C=AB$ on GPU, the matrix $C$ is partitioned into sub-matrices $\tilde{C}$ of size $m_S \times n_S$ and the sub-matrices are processed in parallel with one thread block computing one sub-matrix $\tilde{C}$ independently from other thread blocks. To compute $\tilde{C}$, we iterate over the dimension $K$. In each iteration, a submatrix $\tilde{A}$ of size $m_s \times k_s$ and a submatrix $\tilde{B}$ of size $k_s \times n_s$ are loaded from global into shared memory (see the figure above). These submatrices are then multiplied, and the result is used to update $\tilde{C}$ as $\tilde{C} += \tilde{A} \tilde{B}$. The sub-matrices $\tilde{A}, \tilde{B}, \tilde{C}$ are often called blocks or tiles. In total there are $K / k_s$ iterations (assuming the simplest case, where $K$ is divisible by $k_s$). The limited shared memory capacity is the reason why the dimension $K$ is divided into smaller $k_s$ blocks. Full $m_s \times K, K \times n_s$ blocks simply wouldn’t fit available shared memory. For now, don’t be distracted by why the matrices are loaded into shared memory and how exactly the matrices $\tilde{A}, \tilde{B}$ are multiplied, we will discuss it in the next chapter. Let’s focus on the efficient data movement from global to shared memory as our first step towards fast SGEMM.

The pseudo code of the algorithm, from the perspective of a thread block, is as follows:

// The shapes of block_a, block_b, block_c are (ms x ks), (ks x ns), (ms x ns)
// Each thread block computes one block of C:
block_c = 0
__shared__ float block_a[block_a_size]
__shared__ float block_b[block_b_size]
for (i=0; i<K/ks; i++) {
    block_a = load ith block of matrix A // from global into shared memory
    block_b = load ith block of matrix B // from global into shared memory
    block_c += block_a * block_b // compute matrix product and update block_c
}
store(block_c) // store to global memory

Data transfers from global memory to shared memory have significantly higher latency compared to arithmetic operations. During this time, threads are forced to stall, idly waiting for the data needed to compute block_a * block_b. One way to mitigate this latency is by overlapping data transfers with computations, leveraging instruction-level parallelism (ILP). In GEMM implementations, a technique known as double buffering is commonly used to achieve this overlap:

block_c = 0
// Shared Memory Double buffering
__shared__ float block_a[2][block_a_size] // 2x shared memory usage
__shared__ float block_b[2][block_b_size] // 2x shared memory usage
block_a[0] = load first block of matrix A
block_b[0] = load first block of matrix B

for (i=0; i<(K/ks-1); i++) {
    idx = i%2
    prefetch_idx = (i+1)%2
    // prefetch next blocks
    block_a[prefetch_idx] = load next block of matrix A
    block_a[prefetch_idx] = load next block of matrix B
    // use blocks loaded in previous iteration to calculate matrix product
    block_c += block_a[idx] * block_b[idx]
}
// final update of the accumulator using last blocks
block_c += block_a[prefetch_idx] * block_b[prefetch_idx]

store_to_global_memory(block_c)

Note that block_c += block_a[idx] * block_b[idx] doesn’t depend on blocks[prefetch_idx] allowing the arithmetic instructions to be issued in parallel with the data movement instructions. However, this comes at the cost of doubled shared memory usage, as we need to store two blocks instead of one. The good news is that modern GPUs have sufficient shared memory to support double-buffering.

We’ve already introduced several parameters such as block sizes $m_s, k_s, n_s$ and number of threads per thread block. The choice of these parameters highly depends on the shapes of the operands $A, B, C$, as well as the underlying GPU architecture. For example, cuBLAS implements multiple SGEMM kernels optimized for various matrix shapes and GPU architectures. At runtime, it selects the most appropriate kernel using a heuristic approach. The block sizes $m_s, k_s, n_s$ affect not only how the data will be fetched from global memory, but also how the work in all subsequent steps (shared memory loads, arithmetic operations, global memory stores) is organized among the threads to achieve the best possible performance. The choice of the block sizes and the number of threads per thread block also impact shared memory / register usage, which can result in decreased performance if not taken into account. As you might expect, identifying optimal parameter values requires excellent understanding of hardware and extensive experimentation. Fortunately, SGEMM is a well-studied problem and we can use the results from previous studies of cuBLAS and CUTLASS. For large square matrices (M=N=K > 1024) the combinations of $m_S \times n_S$ such as $128 \times 256$, $128 \times 128$ and $256 \times 128$ lead to optimal performance. From my tests, the configuration $m_s \times n_s \times k_s = 128 \times 128 \times 8$ with 256 threads per thread block achieved the highest TFLOP/S on my local RTX 3090 for matrix size problems 1024 <= M=N=K <= 2500. Therefore, we will start with implementation of a 128x128x8 SGEMM kernel. Now that we know the block dimensions and the number of threads per thread block, let’s discuss how to efficiently organize data loading from global memory and storage into shared memory.

First, we need to load 128x8 submatrix $\tilde{A}$ using 256 threads. This results in each thread loading 128*8/256 = 4 float elements from global memory. There are several different ways to organize the loading of the block. For global memory reads/stores you always want your accesses to be contiguous or coalesced, so that 32 threads in a wrap access 32 consecutive floats in memory. If a memory access is coalesced the minimum number of memory transactions will be used. However, it is not possible in case of the $\tilde{A}$ block: each row of the block contains only 8 consecutive elements. Nevertheless, even in such cases, consecutive threads in a wrap accessing consecutive elements in memory is preferable and usually results in better performance. The figure below shows how the loading of the block $\tilde{A}$ is implemented. Here, different colors represent different threads, whereas only first 16 threads are shown for simplicity. Four consecutive rows are loaded by 8 consecutive threads: the rows 1-4 are loaded by threads 0-7, the rows 5-8 are loaded by threads 8-15, the rows 9-12 are loaded by threads 16-23 and so on, with the last rows 125-128 are loaded by threads 248-255. We also transpose the block $\tilde{A}$ while storing in shared memory for better memory access pattern during the next computation step. Note how each thread stores 4 consecutive elements in shared memory. This allows us to use PTX vectorized stores st.shared.v4.f32.

Storing to shared memory using this naive scheme would result in shared memory bank conflicts. From the CUDA programming guide:

To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module. However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.

Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks. Imagine a float32 array of size 8x32 stored in row-major order as shown below.

In this context, colors and their shades represent memory banks: each row corresponds to 32 distinct memory banks, while each column represents a single memory bank. Here are two important notes about shared memory bank conflicts:

The determination of bank conflicts is made per memory transaction (or using modern CUDA language - per wave), not per request, not per warp, not per instruction.
Two requests to the same bank and the same 32-bit location in that bank do not create a bank conflict (illustrated in the CUDA programming guide).

When you store (or load) 4 bytes(= 1 float) per thread, which is 4*32=128 bytes per warp, a CUDA device issues a single memory transaction (warp-wide) so that the shared memory access must be conflict-free across the whole wrap(=32 threads). In our case, we store 16 bytes(= 4 floats) per thread using the vector instructions. Warp-wide that will be a total of 512 bytes per request. The GPU splits the request into 4 memory transactions (threads 0-7 make up a transaction, threads 8-15 a transaction and so on), each of which is 128 byte wide. If we would store according to our scheme, then each thread within threads 0-7 would store to the same four columns (red color shades) or with other words to the same four memory banks causing bank conflicts. The same applies for other memory transactions i.e. threads 8-15, threads 16-23 and so on. One possible way to completely avoid bank conflicts would be to pad the leading dimension with 16 bytes (=4 floats) as shown below.

Now, if we store the data according to our scheme, each thread within threads 0-7 would accesses distinct memory banks, resulting in 32 memory banks being accessed per memory transaction. The same applies for the remaining memory transactions i.e. t8-t15, t16-t23 and so on. This is the reason why the leading dimension is 132 and not 128 in the implementation:

const int smem_a_ld = 132; // 128 + 4

To implement double-buffering and store two $\tilde{A}$ blocks, theoretically, we would need shared memory of size 2*132*8*4 bytes. However, we increase the size to the nearest power of 2 = 2*256*8*4 to enable fast switching. Compare the following code with the pseudocode presented at the beginning of the chapter:

// Double-buffering (blocks_b is omitted for simplicity)
__shared__ float __align__(2*256*8*sizeof(float)) blocks_a[2*256*8]
uint64_t lds_a_addr;
uint64_t sts_a_addr;
float* lds_a_ptr = blocks_a; // lds = load shared
float* sts_a_ptr = blocks_a; // sts = store shared
lds_a_addr = convert_to_addr(lds_a_ptr); // convert pointer to address for PTX load/store instructions
sts_a_addr = convert_to_addr(sts_a_ptr); // convert pointer to address for PTX load/store instructions

// store first block to first half of shared memory
sts_ptx(sts_a_addr);
// switch address to second half of shared memory
sts_a_addr ^= 8192;

for (int i=0; i<(K/ks-1); i++) {
    ...
    // store next block to second(first) half of shared memory
    sts_ptx(sts_a_addr);
    ...
    // load block from first(second) half of shared memory to compute c+=block_a*block_b
    lds_ptx(lds_a_addr);
    ...
    // swap the addresses for next iteration: lds_a_addr = sts_a_addr, sts_a_addr = lds_a_addr
    lds_a_addr ^= 8192;
    sts_a_addr ^= 8192;
    ...
}
...

First, we require blocks_a to be 2*256*8*4=2^14=16384-byte aligned. This implies the address of the first element of blocks_a to be divisible by 16384 or with other words the last 14 bits of the address are zero:

As each block size is 8192=2^13 bytes, switching between the blocks can now be implemented with just a single XOR instruction ^= 8192. The only drawback of this method is the unused shared memory (in this case 2*8*128*4 bytes). However, this can be ignored considering maximum amount of shared memory per thread block on modern GPUs.

Loading and storing a 8 x 128 submatrix $\tilde{B}$ is much simpler to manage due to its shape. Since the sub-matrix must not be transposed, the loading and storing schemes are identical:

We use 32 consecutive threads to load 32 consecutive elements, with each thread loading 4 elements, spaced apart by a stride of 32. Note that since we store data in 32 distinct shared memory banks, no padding is required, and bank conflicts are avoided. Furthermore, the block size 128*8 is naturally a power of two, eliminating the need for additional padding and allowing block switching with a single XOR ^=4096 instruction.

5.2. Shared Memory Loads and Arithmetic Operations

With blocks $\tilde{A}$ and $\tilde{B}$ now residing in shared memory, let’s discuss how to efficiently load from shared memory and compute block $\tilde{C}$. To do this, we’ll dive one level deeper into our parallelization strategy and describe the algorithm from a warp’s perspective:

Launched thread block consists of 256 threads, which corresponds to 256/32=8 warps. The block $\tilde{C}$, with dimensions $128 \times 128$, is, therefore, divided into 8 regions $\tilde{C}_W$ labeled $W1, …, W8$ in the figure. Each region $\tilde{C}_W$ has dimensions $m_W \times n_W = 32 \times 64$ and is computed by a single warp: $W1$ is computed by threads t0-t31, $W2$ is computed by threads t32-t63, and so on, with $W8$ computed by threads t224-t255. The figure above uses $W8$ as an example to demonstrate how a single $\tilde{C}_W$ region is computed. We iterate over the dimension $K$ and in each iteration we

load fragment_a (=column of size $m_W \times 1$) from $\tilde{A}$ into registers
load fragment_b (=row of size $1 \times n_W$) from $\tilde{B}$ into registers
multiply the fragments and update $\tilde{C}_W$

As $k_S = 8$, there will be in total 8 iterations. This explanation is from the perspective of a warp. Now, let’s delve one final level deeper and examine how the work within a warp is distributed among its 32 threads.

Each thread in a wrap computes four 4x4 sub-matrices (=accumulators) within $\tilde{C}_W$ or if concatenated - 8x8 accumulator. To do this, each thread loads 8 elements from fragment_a, 8 elements from fragment_b (as illustrated for thread t0 in the figure), multiplies them and updates the accumulator using fused multiply-add (FMA) instructions. Since block_a was transposed in the previous step, the elements in fragment_a are stored contiguously in memory, allowing faster access through vectorized loads. The threads are arranged in a way that avoids bank conflicts and works around NVIDIA’s shared memory broadcast limitation. This limitation occurs when 4 floats loaded using 16-byte vector instruction must be broadcast to more than 4 consecutive threads within a warp.

Bringing everything together, the entire SGEMM algorithm can be visualized as follows:

As you might expect, the accumulators are frequently updated during the computation and need to be stored in the fastest memory - the register files. Each thread allocates float accumulator[8][8], so that the entire block $\tilde{C}$ of size $128 \times 128$ is stored in registers by the 256 threads. This works because 256=16*16, and the combined arrangement (16*8)x(16*8)=128x128 matches the size of $\tilde{C}$. Just as we used double buffering to load the blocks $\tilde{A}$ and $\tilde{B}$ (from global memory to shared memory), we now also double buffer the fragments to minimize memory transfer latencies when moving data from shared memory to registers. The pseudocode for the algorithm can be written as follows:

// Pseudocode
__shared__ float block_a[2][block_a_size]
__shared__ float block_b[2][block_b_size]
float fragment_a[2][8]
float fragment_b[2][8]
float accumulator[8][8]

block_a[0] = load first block of matrix A
block_b[0] = load first block of matrix B
fragment_a[0] = load first fragment from block_a[0]
fragment_b[0] = load first fragment from block_b[0]

for (block_k=0; block_k<(K/ks-1); block_k++) {
    block_idx = block_k % 2
    block_prefetch_idx = (block_k+1) % 2
    // prefetch next blocks (Shared Memory Double buffering)
    block_a[block_prefetch_idx] = load next block of matrix A
    block_a[block_prefetch_idx] = load next block of matrix B
    for (int warp_k=0; warp_k<8; warp_k++) {
        frag_idx = warp_k % 2
        frag_prefetch_idx = (warp_k + 1) % 2
        // prefetch next fragments (Register Double buffering)
        fragment_a[frag_prefetch_idx] = load next fragment from block_a[block_idx]
        fragment_b[frag_prefetch_idx] = load next fragment from block_b[block_idx]
        // use fragments loaded in previous iteration to calculate matrix product
        for (int i = 0; i < 8; i++) {
            for (int j = 0; j < 8; j++) {
                accumulator[i][j] += fragment_a[frag_idx][i] * fragment_b[frag_idx][j];
            }
        }
    }
    fragment_a[0] = load first fragment from block_a[block_prefetch_idx]
    fragment_b[0] = load first fragment from block_b[block_prefetch_idx]
}

// final update of the accumulator using last blocks
for (int warp_k=0; warp_k<8; warp_k++) {
    frag_idx = warp_k % 2
    frag_prefetch_idx = (warp_k + 1) % 2
    // prefetch next fragments (Register Double buffering)
    fragment_a[frag_prefetch_idx] = load next fragment from block_a[block_prefetch_idx]
    fragment_b[frag_prefetch_idx] = load next fragment from block_b[block_prefetch_idx]
    // use fragments loaded in previous iteration to calculate matrix product
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            accumulator[i][j] += fragment_a[frag_idx][i] * fragment_b[frag_idx][j];
        }
    }
}

// After completing the matrix multiplication C=A*B, we perform one final update to the accumulator
// to compute  C=alpha*A*B before storing the result back to global memory:
for (int i=0; i<8; i++) {
    for (int j=0; j<8; j++) {
        accumulator[i][j] *= alpha;
    }
}

store_to_global_memory(accumulator)

5.3. Coalesced Global Memory Stores Through Shared Memory

Just as with global memory reads, we want our global memory writes to be coalesced. However, directly storing the accumulators to global memory based on our current mapping

would result in random memory accesses, significantly hurting performance. To fix this, we use shared memory as a buffer to rearrange the accumulators, enabling coalesced global memory writes. At this stage, the accumulators have already been computed, so we no longer need shared memory for computation. Transferring data from registers to shared memory is fast. The overhead of these additional transfers from registers to shared memory is negligible compared to the performance gains achieved through coalesced writes. We write the accumulator’s elements to shared memory row by row according to the following scheme:

The first row, containing 32 elements, is copied to the first 32 consecutive memory addresses in shared memory. Similarly, the second row is copied to the next 32 consecutive memory addresses, and so on with all 16 rows have been copied to shared memory. Next, we iterate through the rows in shared memory, and in each iteration, we store a row (containing 32 elements) to global memory using coalesced writes:

The process is then repeated for the other three 4x4 accumulators of the threads.

To compute $C := \alpha AB + \beta C$, we make a slight adjustment to the process of storing the data to global memory. After copying the accumulator from registers to shared memory, we check if beta != 0.0. If true, we load (using coalesced loads) the corresponding element from global memory into a register, multiply it by beta and add the result to the accumulator stored in shared memory. Finally, we store the updated accumulator alpha*A*B+beta*C from shared memory to global memory using coalesced writes.

6. Performance Analysis

So far, we have discussed the design of the 128x128x8 SGEMM kernel. Its implementation is available at 128x128x8.cuh and closely follows the pseudo-code outlined earlier. Let’s now benchmark this kernel to evaluate its performance. First, we conduct a benchmark with locked clock frequencies:

The benchmark results show that the implementation outperforms cuBLAS when clock speeds remain constant. However, performance alone is not enough; we also need to consider power consumption. To evaluate both metrics, we run the benchmark with unlocked clock frequencies:

This reveals the effect of throttling due to reaching power limits. While the 128x128x8 kernel is, on average, 3–4% faster than cuBLAS, it consumes 12% more power. The increased power consumption causes the GPU to operate near the power limit for matrix sizes m=n=k>4000, resulting in reduced clock speeds and overall performance degradation. This the reason why optimizing both running time and power consumption is required for achieving a balanced and efficient implementation.

We can slightly improve the running time of the kernel by utilizing vectorized global texture loads. The new kernel is available at 128x128x8_texld. Since the vectorized load instructions impose alignment constraints on the input data, we first verify the memory alignment and ensure the leading dimensions of matrices A and B are divisible by 4:

bool is_aligned = (((unsigned)lda & 3u) == 0) && (((unsigned)ldb & 3u) == 0)
                    && (((unsigned long)A & 15u) == 0) && (((unsigned long)B & 15u) == 0);

If the input data is aligned, we can use the vectorized load instructions. First we need to create texture objects, texture descriptors, and resource descriptors. These are configured to handle float data type with four 32-bit channels (x, y, z, w). The texture objects are then bound to the operands A, B, and passed to the kernel instead of raw pointers A, B.

cudaResourceDesc resDesc;
cudaTextureDesc texDesc;
cudaTextureObject_t tex_a = 0;
cudaTextureObject_t tex_b = 0;
...
if (is_aligned) {
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.desc.f = cudaChannelFormatKindFloat;
    resDesc.res.linear.desc.x = 32;
    resDesc.res.linear.desc.y = 32;
    resDesc.res.linear.desc.z = 32;
    resDesc.res.linear.desc.w = 32;
    resDesc.res.linear.devPtr = A;
    resDesc.res.linear.sizeInBytes = m * lda * sizeof(float);
    cudaCreateTextureObject(&tex_a, &resDesc, &texDesc, NULL);
    resDesc.res.linear.devPtr = B;
    resDesc.res.linear.sizeInBytes = k * ldb * sizeof(float);
    cudaCreateTextureObject(&tex_b, &resDesc, &texDesc, NULL);
    sgemm_texld_128x128x8<<<grid, threads>>>(m,
                                             n,
                                             k,
                                             *alpha,
                                             tex_a,
                                             lda,
                                             tex_b,
                                             ldb,
                                             *beta,
                                             C,
                                             ldc);
    cudaDestroyTextureObject(tex_a);
    cudaDestroyTextureObject(tex_b);
}

Within the kernel, we load data through texture objects using the tex1Dfetch function, which compiles to a single PTX instruction:

float4 texld_a_buffer;
texld_a_buffer = tex1Dfetch<float4>(tex_a, texld_a_offset);

We use global texture loads over normal vectorized global loads (ld.global.v4.f32) because texture loads handle out-of-bounds reads gracefully by returning zeros, avoiding the need for predicated execution. This simplification leads to more efficient code:

Lastly, we developed an 128x256x8 SGEMM kernel leveraging bigger block size $n_S=256$ and asynchronous copy instructions (cp.async.ca.shared.global) which are supported starting with the Ampere architecture. The main advantage of these instructions is that one can overlay computation with memory transfers and avoid pipeline stalls. Additionally, they allow to copy data from global memory directly into shared memory bypassing registers:

By simply replacing the normal global load instructions with cp.async in the 128x128x8 kernel results in degraded performance - possibly due to higher latencies of the cp.async instructions or suboptimal compiler optimizations. However, combining increased block size with cp.async yields superior results in both speed and power efficiency:

Our final implementation combines the 128x128x8 and 128x256x8 kernels. For smaller matrices m=n < 2500, we use the 128x128x8 kernel, otherwise, the 128x256x8 kernel.

Advanced Matrix Multiplication Optimization on NVIDIA GPUs

2025-01-12T09:35:01+00:00

This project is inspired by the outstanding works of Andrej Karpathy, George Hotz, Scott Gray, Horace He, Philippe Tillet, Jeremy Howard, Lei Mao and the best CUDA hackers from the GPU MODE community (Discord server). A special thanks to Mark Saroufim and Andreas Köpf for running GPU MODE and all you’ve done for the community.

P.S. Please feel free to get in touch if you are interested in collaborating. My contact information is available on the homepage.

1. Introduction

I clearly remember Andrej’s post on the current state of the existing cuda learning materials vs. cuda code used in high-performance libraries:

Indeed, when it comes to SGEMM implementations, there are some excellent educational blog posts, such as

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance (mentioned by Andrej)
CUDA Matrix Multiplication Optimization

On the other hand, I’ve seen some really fast SGEMM implementations with cuBLAS-level performance:

2. How to Benchmark Code on CUDA Devices?

Clock Control: locks GPU clock frequencies to their base values
Cache Control: flushes all GPU caches before each replay pass
Persistence mode

Alternatively, you can apply these settings manually and measure kernel duration at runtime without relying on external profilers. On Ubuntu, you can retrieve the base core clock frequency using:

nvidia-smi base-clocks

For instance, on an RTX 3090, the base core clock frequency is 1395 MHz. Next, you’ll need the memory clock frequencies, which work in combination with the base core clock:

nvidia-smi -q -d supported_clocks

sudo nvidia-smi --persistence-mode=1
# NVIDIA RTX 3090
sudo nvidia-smi --lock-gpu-clocks=1395
sudo nvidia-smi --lock-memory-clocks=9501

To reset the core and memory clock frequencies, you can use:

sudo nvidia-smi --reset-gpu-clocks
sudo nvidia-smi --reset-memory-clocks
sudo nvidia-smi --persistence-mode=0

watch -n 0.1 nvidia-smi --query-gpu=power.draw,clocks.sm,clocks.mem,clocks_throttle_reasons.active --format=csv

A sample output might look like this:

308.50 W, 1395 MHz, 9501 MHz, 0x0000000000000000

Once you’ve locked the clock frequencies, you can measure the kernel duration directly in CUDA using CUDA events. Here’s an example:

cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
float elapsed_time_ms = 0.0;

cudaEventRecord(start);
kernel<<<...>>>(...);
cudaEventRecord(stop);

cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms, start, stop);

// Flush L2 cache
int dev_id{};
int m_l2_size{};
void* buffer;
checkCudaErrors(cudaGetDevice(&dev_id));
checkCudaErrors(cudaDeviceGetAttribute(&m_l2_size, cudaDevAttrL2CacheSize, dev_id));
if (m_l2_size > 0) {
    checkCudaErrors(cudaMalloc(&buffer, static_cast<std::size_t>(m_l2_size)));
    int* m_l2_buffer = reinterpret_cast<int*>(buffer);
    checkCudaErrors(cudaMemsetAsync(m_l2_buffer, 0, static_cast<std::size_t>(m_l2_size)));
    checkCudaErrors(cudaFree(m_l2_buffer));
}

Avoid using WSL for performance measurements. To ensure accurate and reliable results, please use a native Linux environment.

3. Memory Layout

To adapt this implementation for matrices stored in column-major order, simply swap the operands $A$ and $B$, because:

\[C^\text{T} = (A B)^\text{T} = B^\text{T} A^\text{T},\]

Here, $A, B, C$ are matrices stored in row-major order, while $A^\text{T}, B^\text{T}, C^\text{T}$ are the corresponding transposed matrices (i.e., stored in column-major order).

cuBLAS provides an API to calculate SGEMM:

cublasSgemm(m, n, k, A, lda, B, ldb, C, ldc); // simplified form

4. Parallel Thread Execution

The CUDA compilation trajectory of a .cu file looks as follows:

4.1. Global Memory Loads

For global memory loads we will use ld.global.f32 instruction. Here, “ld” denotes “load” and “f32” - “32-bit float”. The following CUDA code

float reg; // single float register
float* gmem_ptr = data_in_global_memory; // pointer to global memory
reg = *gmem_ptr; // global memory -> register transfer

can be implemented in PTX as:

float reg; // single float register
float* gmem_ptr = data_in_global_memory; // pointer to global memory
asm volatile("ld.global.f32 %0, [%1];" : "=f"(reg) : "l"(gmem_ptr));

4.2. Global Memory Stores

For global memory stores there is `st.global.f32 instruction:

float reg; // single float register
float* gmem_ptr = data_in_global_memory; // pointer to global memory
// *gmem_ptr = reg; can be implemented in PTX as:
asm volatile("st.global.f32 [%0], %1;" : : "l"(gmem_ptr), "f"(reg));

4.3. Global to Shared Memory Transfers

When you write something like this in CUDA:

__shared__ float smem_ptr[n]; // pointer to shared memory
float* gmem_ptr = data_in_global_memory; // pointer to global memory
*smem_ptr = *gmem_ptr; // global to shared memory transfer

Global to shared memory transfers

For this reason, a global to shared memory transfer in PTX consists of two data movement instructions ld.global and st.shared:

__shared__ float smem_ptr[n]; // pointer to shared memory
uint64_t smem_addr;
// convert generic address to shared address (store location for st.shared instruction)
asm volatile("cvta.to.shared.u64 %0, %1;" : "=l"(smem_addr) : "l"(smem_ptr));

float* gmem_ptr = data_in_global_memory; // pointer to global memory
float buffer;
// global memory -> register
asm volatile("ld.global.f32 %0, [%1];" : "=f"(buffer) : "l"(gmem_ptr));
// register -> shared memory
asm volatile("st.shared.f32 [%0], %1;" : : "l"(smem_addr), "f"(buffer));

Prior to Ampere architecture it was not possible to transfer data from global memory directly to shared memory mitigating storing in registers. Starting from Ampere architecture, there are asynchronous copy instructions that allow this. The usage of these instructions will be demonstrated later.

4.4. Vectorized Shared Memory Loads and Stores

In PTX you can also implement vectorized memory operations (loading/storing multiple elements with one instruction). Here, v4 denotes vector with four elements:

float reg0, reg1, reg2, reg3;
uint64_t addr;
...
// Shared memory 128-bit loads
asm volatile("ld.shared.v4.f32 {%0, %1, %2, %3}, [%4];"
             : "=f"(reg0), "=f"(reg1), "=f"(reg2), "=f"(reg3)
             : "l"(addr));
// Shared memory 128-bit stores
asm volatile("st.shared.v4.f32 [%0], {%1, %2, %3, %4};"
             :
             : "l"(addr), "f"(reg0), "f"(reg1), "f"(reg2), "f"(reg3));

4.5. Predicated Execution

In PTX conditional executions are implemented using optional guard predicates. The following CUDA code:

float reg;
float* ptr; //pointer to global memory
unsigned guard;
...
if (guard != 0) {
    *ptr = reg;
}

can be converted to PTX as:

float reg;
float* ptr;
unsigned guard;
...
asm volatile(".reg .pred p;\n\t" // declare predicate 'p'
             ".setp.ne.u32 p, %2, 0;\n\t" // set 'p' to true if (guard != 0); ne="not equal"
             "@p ld.global.f32 %0, [%1];\n\t" // execute instruction if 'p' is true
             : "=f"(reg)
             : "l"(ptr), "r"(guard));

We use guard predicates in combination with global load/store instructions to perform global memory access only if it is not out of bounds.

5. SGEMM Design

Transferring data from global to shared memory
Loading data from shared memory and performing arithmetic operations
Writing results back to global memory.

5.1. Transferring data from global to shared memory

Source: Strassen’s Algorithm Reloaded on GPUs

The pseudo code of the algorithm, from the perspective of a thread block, is as follows:

// The shapes of block_a, block_b, block_c are (ms x ks), (ks x ns), (ms x ns)
// Each thread block computes one block of C:
block_c = 0
__shared__ float block_a[block_a_size]
__shared__ float block_b[block_b_size]
for (i=0; i<K/ks; i++) {
    block_a = load ith block of matrix A // from global into shared memory
    block_b = load ith block of matrix B // from global into shared memory
    block_c += block_a * block_b // compute matrix product and update block_c
}
store(block_c) // store to global memory

block_c = 0
// Shared Memory Double buffering
__shared__ float block_a[2][block_a_size] // 2x shared memory usage
__shared__ float block_b[2][block_b_size] // 2x shared memory usage
block_a[0] = load first block of matrix A
block_b[0] = load first block of matrix B

for (i=0; i<(K/ks-1); i++) {
    idx = i%2
    prefetch_idx = (i+1)%2
    // prefetch next blocks
    block_a[prefetch_idx] = load next block of matrix A
    block_a[prefetch_idx] = load next block of matrix B
    // use blocks loaded in previous iteration to calculate matrix product
    block_c += block_a[idx] * block_b[idx]
}
// final update of the accumulator using last blocks
block_c += block_a[prefetch_idx] * block_b[prefetch_idx]

store_to_global_memory(block_c)

Storing to shared memory using this naive scheme would result in shared memory bank conflicts. From the CUDA programming guide:

To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module. However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.

Shared memory has 32 banks that are organized such that successive 32-bit words map to successive banks. Imagine a float32 array of size 8x32 stored in row-major order as shown below.

The determination of bank conflicts is made per memory transaction (or using modern CUDA language - per wave), not per request, not per warp, not per instruction.
Two requests to the same bank and the same 32-bit location in that bank do not create a bank conflict (illustrated in the CUDA programming guide).

const int smem_a_ld = 132; // 128 + 4

// Double-buffering (blocks_b is omitted for simplicity)
__shared__ float __align__(2*256*8*sizeof(float)) blocks_a[2*256*8]
uint64_t lds_a_addr;
uint64_t sts_a_addr;
float* lds_a_ptr = blocks_a; // lds = load shared
float* sts_a_ptr = blocks_a; // sts = store shared
lds_a_addr = convert_to_addr(lds_a_ptr); // convert pointer to address for PTX load/store instructions
sts_a_addr = convert_to_addr(sts_a_ptr); // convert pointer to address for PTX load/store instructions

// store first block to first half of shared memory
sts_ptx(sts_a_addr);
// switch address to second half of shared memory
sts_a_addr ^= 8192;

for (int i=0; i<(K/ks-1); i++) {
    ...
    // store next block to second(first) half of shared memory
    sts_ptx(sts_a_addr);
    ...
    // load block from first(second) half of shared memory to compute c+=block_a*block_b
    lds_ptx(lds_a_addr);
    ...
    // swap the addresses for next iteration: lds_a_addr = sts_a_addr, sts_a_addr = lds_a_addr
    lds_a_addr ^= 8192;
    sts_a_addr ^= 8192;
    ...
}
...

Loading and storing a 8 x 128 submatrix $\tilde{B}$ is much simpler to manage due to its shape. Since the sub-matrix must not be transposed, the loading and storing schemes are identical:

5.2. Shared Memory Loads and Arithmetic Operations

load fragment_a (=column of size $m_W \times 1$) from $\tilde{A}$ into registers
load fragment_b (=row of size $1 \times n_W$) from $\tilde{B}$ into registers
multiply the fragments and update $\tilde{C}_W$

Bringing everything together, the entire SGEMM algorithm can be visualized as follows:

// Pseudocode
__shared__ float block_a[2][block_a_size]
__shared__ float block_b[2][block_b_size]
float fragment_a[2][8]
float fragment_b[2][8]
float accumulator[8][8]

block_a[0] = load first block of matrix A
block_b[0] = load first block of matrix B
fragment_a[0] = load first fragment from block_a[0]
fragment_b[0] = load first fragment from block_b[0]

for (block_k=0; block_k<(K/ks-1); block_k++) {
    block_idx = block_k % 2
    block_prefetch_idx = (block_k+1) % 2
    // prefetch next blocks (Shared Memory Double buffering)
    block_a[block_prefetch_idx] = load next block of matrix A
    block_a[block_prefetch_idx] = load next block of matrix B
    for (int warp_k=0; warp_k<8; warp_k++) {
        frag_idx = warp_k % 2
        frag_prefetch_idx = (warp_k + 1) % 2
        // prefetch next fragments (Register Double buffering)
        fragment_a[frag_prefetch_idx] = load next fragment from block_a[block_idx]
        fragment_b[frag_prefetch_idx] = load next fragment from block_b[block_idx]
        // use fragments loaded in previous iteration to calculate matrix product
        for (int i = 0; i < 8; i++) {
            for (int j = 0; j < 8; j++) {
                accumulator[i][j] += fragment_a[frag_idx][i] * fragment_b[frag_idx][j];
            }
        }
    }
    fragment_a[0] = load first fragment from block_a[block_prefetch_idx]
    fragment_b[0] = load first fragment from block_b[block_prefetch_idx]
}

// final update of the accumulator using last blocks
for (int warp_k=0; warp_k<8; warp_k++) {
    frag_idx = warp_k % 2
    frag_prefetch_idx = (warp_k + 1) % 2
    // prefetch next fragments (Register Double buffering)
    fragment_a[frag_prefetch_idx] = load next fragment from block_a[block_prefetch_idx]
    fragment_b[frag_prefetch_idx] = load next fragment from block_b[block_prefetch_idx]
    // use fragments loaded in previous iteration to calculate matrix product
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++) {
            accumulator[i][j] += fragment_a[frag_idx][i] * fragment_b[frag_idx][j];
        }
    }
}

// After completing the matrix multiplication C=A*B, we perform one final update to the accumulator
// to compute  C=alpha*A*B before storing the result back to global memory:
for (int i=0; i<8; i++) {
    for (int j=0; j<8; j++) {
        accumulator[i][j] *= alpha;
    }
}

store_to_global_memory(accumulator)

5.3. Coalesced Global Memory Stores Through Shared Memory

Just as with global memory reads, we want our global memory writes to be coalesced. However, directly storing the accumulators to global memory based on our current mapping

The process is then repeated for the other three 4x4 accumulators of the threads.

6. Performance Analysis

bool is_aligned = (((unsigned)lda & 3u) == 0) && (((unsigned)ldb & 3u) == 0)
                    && (((unsigned long)A & 15u) == 0) && (((unsigned long)B & 15u) == 0);

cudaResourceDesc resDesc;
cudaTextureDesc texDesc;
cudaTextureObject_t tex_a = 0;
cudaTextureObject_t tex_b = 0;
...
if (is_aligned) {
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 0;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.desc.f = cudaChannelFormatKindFloat;
    resDesc.res.linear.desc.x = 32;
    resDesc.res.linear.desc.y = 32;
    resDesc.res.linear.desc.z = 32;
    resDesc.res.linear.desc.w = 32;
    resDesc.res.linear.devPtr = A;
    resDesc.res.linear.sizeInBytes = m * lda * sizeof(float);
    cudaCreateTextureObject(&tex_a, &resDesc, &texDesc, NULL);
    resDesc.res.linear.devPtr = B;
    resDesc.res.linear.sizeInBytes = k * ldb * sizeof(float);
    cudaCreateTextureObject(&tex_b, &resDesc, &texDesc, NULL);
    sgemm_texld_128x128x8<<<grid, threads>>>(m,
                                             n,
                                             k,
                                             *alpha,
                                             tex_a,
                                             lda,
                                             tex_b,
                                             ldb,
                                             *beta,
                                             C,
                                             ldc);
    cudaDestroyTextureObject(tex_a);
    cudaDestroyTextureObject(tex_b);
}

Within the kernel, we load data through texture objects using the tex1Dfetch function, which compiles to a single PTX instruction:

float4 texld_a_buffer;
texld_a_buffer = tex1Dfetch<float4>(tex_a, texld_a_offset);

Our final implementation combines the 128x128x8 and 128x256x8 kernels. For smaller matrices m=n < 2500, we use the 128x128x8 kernel, otherwise, the 128x256x8 kernel.

Advanced Matrix Multiplication Optimization on Modern Multi-Core Processors

2025-01-12T09:00:01+00:00

TL;DR The code is available at sgemm.c. This blog post walks through optimizing multi-threaded FP32 matrix multiplication on modern processors using FMA3 and AVX2 vector instructions. The implementation delivers strong performance on a variety of x86-64 CPUs, both in single-threaded and multithreaded scenarios. However, to reach peak performance, you’ll need to fine-tune hyperparameters - such as the number of threads, kernel size, and tile sizes. Additionally, on AVX-512 CPUs, the BLAS libraries might be notably faster due to AVX-512 instructions, which were intentionally omitted here to support a broader range of processors. Performance results for Intel Core Ultra 265 and AMD Ryzen 7 9700X are shown below.

P.S. Please feel free to get in touch if you are interested in collaborating. My contact information is available on the homepage.

1. Introduction

Matrix multiplication is an essential part of nearly all modern neural networks. Despite using matmul daily in PyTorch, NumPy, or JAX, I’ve never really thought about how it is designed and implemented internally to take full advantage of hardware capabilities. NumPy, for instance, relies on external BLAS (Basic Linear Algebra Subprograms) libraries. These libraries contain high-performance, optimized implementations of common linear algebra operations, such as the dot product, matrix multiplication, vector addition, and scalar multiplication. Examples of BLAS libraries include:

Intel MKL - optimized for Intel CPUs
Accelerate - optimized for Apple CPUs
BLIS - open-source, multi-vendor, BLAS-like Library Instantiation Software
GotoBLAS - open-source, multi-vendor
OpenBLAS - open-source, multi-vendor, fork of GotoBLAS

A closer look at the OpenBLAS code reveals a mix of C and low-level assembly. In fact, OpenBLAS, GotoBLAS, and BLIS are written in C/FORTRAN/Assembly and contain matmul implementations manually optimized for different CPU microarchitectures. My goal was to implement the matrix multiplication in pure C (without low-level assembly code) so that it works for any matrix size, runs on all modern x86-64 processors, and competes with existing BLAS libraries. At the sime time I wanted to keep the code simple and easy to extend. After some research, I found a few great step-by-step tutorials on implementing fast matrix multiplication from scratch, covering both theory and practice:

Fast Multidimensional Matrix Multiplication on CPU from Scratch by Simon Boehm.
Matrix Multiplication by Sergey Slotin.
Geohot’s stream Can you multiply a matrix?

I highly recommend these clear and well-explained tutorials with alternative implementations. They helped me better understand the topic and, in some sense, motivated me to write my own implementation. The reason is that all three solutions above work only for specific matrix sizes and do not achieve performance of the BLAS libraries. Unsatisfied with these results, I kept researching and came across two fascinating papers: “Anatomy of High-Performance Matrix Multiplication” and “Anatomy of High-Performance Many-Threaded Matrix Multiplication”. The first introduces GotoBLAS, a high-performance BLAS implementation by Kazushige Goto. The second reviews the matmul design used in the BLIS library (an extended version of GotoBLAS) and explores different parallelization strategies. Due to its superior high-level design, I had a feeling that the matmul implementation from the BLIS library can outperform existing BLAS implementations even if written in pure C and not manually finetuned using inline assembly. In the next chapters we’ll step-by-step implement the algorithm from scratch and compare against OpenBLAS. Before diving into optimizations, let’s first go over how to install OpenBLAS and properly benchmark the code on a CPU.

2. How to Install and Benchmark OpenBLAS

I benchmarked the code on the following machine:

CPU: AMD Ryzen 7 9700X
RAM: 32GB DDR5 6000 MHz CL36
OpenBLAS 0.3.26
Compiler: GCC 13.3
Compiler flags: -O3 -march=native -mno-avx512f -fopenmp
OS: Ubuntu 24.04.1 LTS

Important! To obtain reproducible and accurate results, minimize the number of active tasks, particularly when benchmarking multi-threaded code. Windows systems generally deliver lower performance compared to Linux due to higher number of active background tasks.

To benchmark OpenBLAS, start by installing it according to the installation guide. During installation, make sure to set an appropriate TARGET and disable AVX512 instructions for a fair comparison. For Zen4/5 processors compile OpenBLAS with:

make TARGET=ZEN

Otherwise, OpenBLAS defaults to AVX512 instructions. After installation, you can run FP32 matmul using the OpenBLAS API:

#include 
cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, 1, A, m, B, k, 0, C, m);

The benchmark evaluates the custom implementation and the OpenBLAS API on square matrices, ranging from m=n=k=200 to m=n=k=10000 in steps of 200. To obtain consistent and accurate results, matrix multiplication is repeated n_iter times, and performance is measured as median execution time.

To multiply two float32 matrices - $A$ of size $M \times K$ and $B$ of size $K \times N$, for each element of the resulting matrix $C$ of size $M \times N$, we need to compute the dot product between a row of $A$ and a column of $B$. This requires $K$ (additions) + $K$ (multiplications) = $2K$ Floating Point Operations (FLOP) per element of $C$ or $2MNK$ FLOP in total. A metric often used to evaluate matmul performance is called FLOP per second or FLOP/s or FLOPS, and it can be derived from the execution time as FLOPS=FLOP/exec_time=(2*m*n*k)/exec_time.

3. Theoretical Limit

The image below shows a simplified model of the computer’s memory hierarchy (for now, ignore the layers between the registers and the main memory(=RAM); we will discuss them later).

To perform arithmetic operations on data stored in RAM (off-chip memory, slow and large capacity), the data must be first transferred to CPU and placed in CPU registers (on-chip memory, fast and small capacity). Modern x86-64 CPUs support SIMD (Single Instruction Multiple Data) extensions, which allow multiple pieces of data to be processed in parallel. There are various SIMD extensions, but the ones relevant to our discussion are Advanced Vector Extensions (AVX2) and Fused Multiply-Add (FMA). Both AVX2 and FMA operate on data stored in special 256-bit YMM registers. Each YMM register can hold 8 packed single-precision (32-bit) floats. The FMA2 instructions perform element-wise multiply-add operation on data stored in the YMM registers. The corresponding assembly instruction is called VFMADD231PS (PS stands for PackedSingle) and takes three vector registers (YMM1, YMM2, YMM3) as input to compute YMM1 = YMM2 * YMM3 + YMM1.

According to the intel intrinsics guide or https://uops.info/table.html, for my CPU the throughput (TP) of the fused-multiply-add instruction is 0.5 cycles/instruction or with other words 2 instructions/cycle:

Theoretically, Ryzen 9700X can perform 32 FLOP per cycle: 8 (floats in YMM register) * 2 (add + mul) * 2 (1/TP). Therefore, the theoretical peak FLOPS in single-threaded mode can be roughly estimated as CPU_CLOCK_SPEED * 32 or n_cores * CPU_CLOCK_SPEED * 32 in multi-threaded mode. For example, assuming a sustainable clock speed of 4.7 GHz for an 8-core 9700X processor, the theoretical peak FLOPS in a multi-threaded setting would be 1203 FLOPS.

4. Naive Implementation

In this tutorial we assume that matrices are stored in column-major order: e.g. matrix A of shape MxN is stored as contiguous array of length M*N and an element A[row][col] is accessed via C raw pointer ptr[col*M + row], where 0 <= col <= N-1 and 0 <= row <= M-1.

The simplest implementation of $C=AB$ can be described as follows:

void matmul_naive(float* A, float* B, float* C, const int M, const int N, const int K) {
  for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
      for (int p = 0; p < K; p++) {
        C[j * M + i] += A[p * M + i] * B[j * K + p];
      }
    }
  }
}

Here, we iterate over all rows (the outermost loop) and all columns (the second loop) of C and for each element of C we calculate the dot product (the innermost loop) between the corresponding row of matrix A and column of matrix B. It’s always good to start with a simple and robust algorithm that can later be used to test optimized implementations.

5. Kernel

The key idea of high-performance matrix multiplication on CPU is to develop a function that efficiently computes a sub-matrix of $C$. Then, by iterating over $C$ and applying this function to all non-overlapping sub-matrices, we can significantly speed up the entire matrix multiplication operation. For this, we, first, partition the matrix $C$ of shape $M \times N$ into smaller non-overlapping sub-matrices of shape $m_R \times n_R$, with $n_R \ll N$ and $m_R \ll M$. To calculate $C=AB$, we iterate over $C$ and compute each of its non-overlapping $m_R \times n_R$ sub-matrices as shown below:

The function that computes an $m_R \times n_R$ sub-matrix $\bar{C}$ of $C$ is called a kernel (aka. micro-kernel using BLIS notation). This function is the core of high-performance matrix multiplication. When we say a matrix multiplication algorithm is optimized for a specific CPU architecture, it usually refers to kernel optimization. For example, OpenBLAS contains kernels optimized for different CPU microarchitectures.

Let’s take a closer look at the kernel. To compute an $m_R \times n_R$ sub-matrix $\bar{C}$ of $C$, we need to multiply corresponding $m_R \times K$ sub-matrix $\bar{A}$ of $A$ with $K \times n_R$ sub-matrix $\bar{B}$ of $B$ as shown in the figure below:

If we were to do this in a naive manner using the dot product, we would need to fetch $2K$ elements from RAM to calculate a single element of $\bar{C}$ or $2K m_R n_R$ elements in total to compute $\bar{C}$. There is, however, an alternative strategy that can reduce the number of fetched elements.

First, we initialize the matrix $\bar{C}$ with zeros and store it in registers. Since both $n_R$ and $m_R$ are small, the entire matrix fits within the registers. Here, the subscript $R$ in $n_R$ and $m_R$ denotes “register”. Next, we iterate over the dimension $K$, and in each iteration, we:

load 1 column of $\bar{A}$ and 1 row of $\bar{B}$ from RAM into the registers. Again, note that both the row and column vectors are limited in size and can be stored in the registers.
compute the outer product between the two vectors and add the result of the outer product to the matrix $\bar{C}$.

After $K$ iterations, the computation of the matrix $\bar{C}$ is completed and it can be stored into RAM. $\bar{C}$ is often referred to as the accumulator, because it accumulates the outer products along the dimension $K$. A single accumulation step of the outer product between two vectors is also known as rank-1 update.

Outer product between a column vector and a row vector.

In total, we fetch $(m_R + n_R)K$ elements from RAM into registers. Compared to the naive approach, this reduces the number of memory accesses by a factor of

\[\frac{2m_Rn_RK}{(m_R + n_R)K} = \frac{2m_Rn_R}{m_R + n_R}\]

This factor is maximized when both $m_R$, $n_R$ are large and equal. However, the values of $m_R$ and $n_R$ are typically constrained by the available register memory.

Now, let’s discuss in detail how the outer product and accumulation can be efficiently implemented using SIMD FMA instructions. Unfortunately, there are no SIMD instructions that compute the outer product in a single step. Therefore, we need to decompose the outer product into simpler operations. The figure below illustrates the process:

Here, we compute the outer product between a column vector of size $m_R$ and a row vector of size $n_R$ to update an accumulator $\bar{C}$ of size $m_R \times n_R$. The accumulator is stored in the YMM registers, with each column of the accumulator spanning one or multiple YMM registers. The column vector is also stored in the YMM registers (highlighted as yellow). Since each YMM register holds 8 floats, the dimension $m_R$ must be divisible by 8. The accumulator is updated column by column. During the first iteration we broadcast the first element of the row vector to a vector of size $m_R$ and place it in the YMM registers (highlighted as green). Then, we element-wise multiply the column vector with the broadcasted vector and accumulate the result to the first column of the accumulator $\bar{C}$ using FMA instruction. We repeat this process for the remaining elements of the row vector to update the corresponding columns of the accumulator. After $n_R$ iterations, the rank-1 update of the accumulator is completed.

The last thing we need to discuss before implementing the kernel in C is how to choose the kernel size i.e. $m_R$ and $n_R$. CPUs with AVX support have 16 YMM registers. From our previous discussion, we know that we need $(m_R/8) \cdot n_R$ registers to store the accumulator $\bar{C}$, $m_R/8$ registers to store the column vector and 1 register (because we can reuse the same register for all FMA operations) for the broadcasted vector. We want $m_R$ and $n_R$ to be as large as possible while satisfying the following conditions:

$\Big(\cfrac{m_R}{8} \cdot n_R + \cfrac{m_R}{8} + 1\Big) <= 16$
$m_R$ is a multiple of 8

In theory we want $m_R = n_R$ to minimize the number of fetched elements. However, in practice, a non-square kernel with $m_R = 16, n_R = 6$ showed the best performance on my CPU. Therefore, we will implement this kernel in the next section. Feel free to experiment with other kernel sizes, such as $8 \times 8, 8 \times 12$, $8 \times 13$, $8 \times 14$, $32 \times 2$ and compare their performance on your CPU.

Let’s implement the algorithm discussed above using the $16 \times 6$ kernel. The code of this implementation can be found at matmul_kernel.c. To use SIMD instructions in C we first need to include the immintin.h library:

#include 

The implementation of the algorithm is straightforward: we iterate over matrix $C$ and apply the kernel function to each of it’s non-overlapped $16 \times 6$ sub-matrices $\bar{C}$.

void matmul_kernel(float* A, float* B, float* C, const int M, const int N, const int K) {
  for (int i = 0; i < M; i += 16) {
    for (int j = 0; j < N; j += 6) {
        kernel_16x6(&A[i], &B[j * K], &C[j * M + i], M, N, K);
    }
  }
}

The kernel function is declared as follows:

void kernel_16x6(float* A_start, float* B_start, float* C_start, int M, int N, int K);

The function takes as input pointers to the starting positions of $\bar{A}, \bar{B}$, and $\bar{C}$ along with the matrix problem size. It then computes $16 \times 6$ sub-matrix $\bar{C}$ of $C$ according to $\bar{C} = \bar{A} \bar{B}$.

Inside the kernel function, first, we declare the variables stored in the YMM registers:

__m256 C_accum[6][2] = {}; // zero-initialized
__m256 b_packFloat8;
__m256 a0_packFloat8;
__m256 a1_packFloat8;

A variable of type __m256 is a 256-bit vector that represents the contents of a YMM register, which holds eight 32-bit floating-point values. C_accum is the accumulator stored in the YMM registers. The variable b_packFloat8 contains a broadcasted element from a row vector of $\bar{B}$, while a0_packFloat8 and a1_packFloat8 represent a column vector of $\bar{A}$. Since the column vector contains 16 floats, it requires two YMM registers for storage.

SIMD intrinsics are well documented and can be found in the Intel Intrinsics Guide. For example, _mm256_loadu_ps

The kernel iterates over the dimension $K$ and in each iteration performs a rank-1 update of the accumulator:

for (int p = 0; p < K; p++) {
  // Load column vector of size 16
  // {
  a0_packFloat8 = _mm256_loadu_ps(&A_start[p * M]);
  a1_packFloat8 = _mm256_loadu_ps(&A_start[p * M + 8]);
  // }
  // Broadcast scalar element to vector of size 8
  // {
  b_packFloat8 = _mm256_broadcast_ss(&B_start[p]);
  // }
  // Update the first column of the accumulator
  // {
  C_accum[0][0] = _mm256_fmadd_ps(a0_packFloat8, b_packFloat8, C_accum[0][0]);
  C_accum[0][1] = _mm256_fmadd_ps(a1_packFloat8, b_packFloat8, C_accum[0][1]);
  // }
  ...
  ...
  ...
  b_packFloat8 = _mm256_broadcast_ss(&B_start[5 * K + p]);
  // update the last column of the accumulator
  // {
  C_accum[5][0] = _mm256_fmadd_ps(a0_packFloat8, b_packFloat8, C_accum[5][0]);
  C_accum[5][1] = _mm256_fmadd_ps(a1_packFloat8, b_packFloat8, C_accum[5][1]);
  // }
}

After $K$ rank-1 updates, the computation of the accumulator is complete, and the result can be stored in RAM:

// Store the accumulator column by column:
for (int j = 0; j < 6; j++) {
  _mm256_storeu_ps(&C_start[j * M], C_accum[j][0]);
  _mm256_storeu_ps(&C_start[j * M + 8], C_accum[j][1]);
}

Let’s take a look at the generated assembly code to see if it actually contains SIMD FMA instructions and uses the YMM registers:

gcc -O3 -mno-avx512f -march=native matmul_kernel.c -S

// matmul_kernel.s
...
vfmadd231ps	%ymm14, %ymm1, %ymm13
vfmadd231ps	%ymm14, %ymm0, %ymm12
vmovaps	%ymm13, 32(%rsp)
vmovaps	%ymm12, 64(%rsp)
vbroadcastss	(%rax,%r9), %ymm14
vfmadd231ps	%ymm14, %ymm1, %ymm10
vfmadd231ps	%ymm14, %ymm0, %ymm11
vmovaps	%ymm10, 96(%rsp)
vmovaps	%ymm11, 128(%rsp)
vbroadcastss	(%rax,%r9,2), %ymm14
addq	$4, %rax
vfmadd231ps	%ymm14, %ymm1, %ymm2
vfmadd231ps	%ymm14, %ymm0, %ymm3
...

6. Padding

You may have noticed that the current implementation only works for matrix sizes where $M$ and $N$ are multiples of $m_R$ and $n_R$, respectively. Specifically, the kernel assumes that matrix $\bar{C}$ has dimensions $m_R \times n_R$, matrix $\bar{A}$ is $m_R \times K$ and matrix $\bar{B}$ is $K \times n_R$. Our goal is to generalize the kernel so that it can handle matrices $\bar{C}, \bar{A}, \bar{B}$ with dimensions $m \times n, m \times K, K \times n$, even when $m \neq m_R$ and $n \neq n_R$, as shown below:

First, when storing the accumulator, we need to ensure that elements are only stored within the matrix boundaries. If the number of overlapping columns, $n$, is smaller than $n_R$, the process is straightforward - we simply iterate over $n$ columns instead of $n_R$:

// n - number of overlapped columns within C boundary

// "j < n" instead "j < 6", since n can be less than 6.
for (int j = 0; j < n; j++) {
  _mm256_storeu_ps(&C_start[j * M], C_accum[j][0]);
  _mm256_storeu_ps(&C_start[j * M + 8], C_accum[j][1]);
}

The case where the number of overlapped rows $m$ differs from $m_R$ is a bit trickier because _mm256_storeu_ps stores 8 elements at once. Fortunately, immintrin.h library contains _mm256_maskstore_ps function, which stores packed floats according to mask values. The function takes three arguments as input:

float *a
__m256i mask
__m256 b

__m256i is a vector datatype that holds eight 32-bit integers. Each integer in mask corresponds to a data element in b. The most significant bit (MSB) of each integer in mask represents the mask bit. If the mask bit is zero, the corresponding value in b is not stored in the memory location pointed to by a. For example, the MSB of unsigned integer 2147483648 (binary format 10000000 00000000 00000000 00000000) is 1, so the corresponding data element in b will be stored. On the other hand, the MSB of unsigned integer 2147483647 (binary format 01111111 11111111 11111111 11111111) is 0, meaning the corresponding data element in b will not be stored.

If $m \neq m_R$ , we generate integer masks by left-shifting unsigned integer 65535 (=00000000 00000000 11111111 111111111 in binary format) depending on the number of overlapped rows $m$. In the code snippet below the function _mm256_setr_epi32() creates a __m256i vector from eight 32-bit integers.

__m256i masks[2];
if (m != 16) {
  const uint32_t bit_mask = 65535;
  masks[0] = _mm256_setr_epi32(bit_mask << (m + 15),
                               bit_mask << (m + 14),
                               bit_mask << (m + 13),
                               bit_mask << (m + 12),
                               bit_mask << (m + 11),
                               bit_mask << (m + 10),
                               bit_mask << (m + 9),
                               bit_mask << (m + 8));
  masks[1] = _mm256_setr_epi32(bit_mask << (m + 7),
                               bit_mask << (m + 6),
                               bit_mask << (m + 5),
                               bit_mask << (m + 4),
                               bit_mask << (m + 3),
                               bit_mask << (m + 2),
                               bit_mask << (m + 1),
                               bit_mask << m);
  for (int j = 0; j < n; j++) {
    _mm256_maskstore_ps(&C_start[j * M], masks[0], C_accum[j][0]);
    _mm256_maskstore_ps(&C_start[j * M + 8], masks[1], C_accum[j][1]);
  }
}

The compiler auto-vectorizes the sequential bit-shifting operations using a combination of vpaddd and vpsllvd instructions, making the mask computation very efficient. There is, however, an alternative method to compute the masks, as will be shown later.

When loading elements from matrices $\bar{A}$ and $\bar{B}$ inside the kernel, we need to check that the loads are within the matrix boundaries. One way to do this is by using _mm256_maskload_ps when loading elements from the matrix $\bar{A}$ and looping over $n$ elements instead of $n_R$ when loading elements from the matrix $\bar{B}$. However, this method would significantly degrade the kernel’s performance. The additional instructions required to compute the loading masks introduce overhead, and since $n$ is not a compile-time constant, the compiler cannot unroll the loop efficiently. Instead, if $m \neq m_R$, we copy the matrix $\bar{A}$ into a buffer, pad it with zeros and pass the padded matrix of size $m_R \times K$ to the kernel. We do the same for the matrix $\bar{B}$ if $n \neq n_R$. The implementation straightforwardly follows the description:

#define BLOCK_A_MAXSIZE 500000
#define BLOCK_B_MAXSIZE 200000

static float blockA_buffer[BLOCK_A_MAXSIZE] __attribute__((aligned(64)));
static float blockB_buffer[BLOCK_B_MAXSIZE] __attribute__((aligned(64)));

void matmul_pack(float* A, float* B, float* C, const int M, const int N, const int K) {
    for (int i = 0; i < M; i += 16) {
        const int m = min(16, M - i);
        float* blockA = &A[i];
        int blockA_ld = M;
        if (m != 16) {
            pack_blockA(&A[i], blockA_buffer, m, M, K);
            blockA = blockA_buffer;
            blockA_ld = 16;
        }
        for (int j = 0; j < N; j += 6) {
            const int n = min(6, N - j);
            float* blockB = &B[j * K];
            if (n != 6) {
                pack_blockB(&B[j * K], blockB_buffer, n, N, K);
                blockB = blockB_buffer;
            }
            kernel_16x6(blockA, blockB, &C[j * M + i], m, n, M, K, blockA_ld);
        }
    }
}

For further implementations details, please check matmul_pad.h

7. Cache Blocking

Let’s revisit the computer’s memory hierarchy. Previously, we focused on the main memory (DRAM) and the CPU registers, but we skipped an important intermediary: the CPU cache system.

Unlike DRAM, the CPU cache is an on-chip memory designed to store frequently and/or recently accessed data from the main memory. This helps minimize data transfers between the main memory and CPU registers. Although the cache is much faster than DRAM, it has a limited storage capacity. To optimize data access, modern desktop CPUs use a multi-level cache hierarchy. This typically includes L1, L2, and L3 caches, each offering progressively larger storage but with increasing access times. L1 cache is the fastest and closest to the CPU core.

Intel Core i9-13900K labelled die shot. Source: How are Microchips Made?

To improve access speed, CPUs transfer data between main memory and cache in fixed-size chunks called cache lines or cache blocks. When a cache line is loaded from main memory, it is stored as a cache entry. For example, in AMD Ryzen Zen CPUs, the cache line size is 64 bytes. The cache takes advantage of data locality - how programs typically access memory. When a single floating-point number is requested from a continuous array in memory, the cache doesn’t just fetch that one value; it also preloads the next floating-point numbers and stores them in the cache. This is why reading data sequentially from an array is much more efficient than randomly accessing scattered memory locations. When the CPU needs to read or write to a memory location, it first checks if the data is already in the cache. This leads to two possible scenarios:

Cache Hit - If the requested memory location is found in the cache, the CPU can access it instantly, avoiding the need to fetch data from the much slower DRAM.
Cache Miss - If the requested data is not in the cache, the CPU retrieves it from the main memory and stores it in the cache for future access.

Since the cache has limited space, it must decide which data to replace when new information needs to be stored. This decision is governed by a cache replacement policy. Some of the most common policies include:

LRU (Least Recently Used): Replaces the cache entry that has gone unused the longest.
LFU (Least Frequently Used): Evicts the entry that has been accessed the least often.
LFRU (Least Frequently Recently Used): A hybrid approach that considers both recent and overall access frequency.

Similar to registers, once data is loaded into the cache, we want to reuse the data as much as possible to reduce main memory accesses. Given the cache’s limited capacity, storing entire input matrices $C, B, A$ in the cache isn’t feasible. Instead, we divide them into smaller blocks, load these blocks into the cache, and reuse them for rank-1 updates. This technique is often referred to as tiling or cache blocking, allowing us to handle matrices of arbitrary size effectively.

The single-threaded matrix multiplication with cache blocking can be visualized as shown in the image borrowed from the official BLIS repository:

Let’s step through the diagram and discuss it. In the outer-most loop (5th loop) we iterate over dimension $N$, dividing matrix $C$ into blocks $C_j$ of size $M \times n_c$ and matrix $B$ into blocks $B_j$ of size $K \times n_c$. The subscript $c$ in $n_c$ stands for cache. In the 4th loop we iterate over dimension $K$ and divide matrix $A$ into $A_j$ of size $M \times k_c$ and $B_j$ into $B_p$ of size $k_c \times n_c$. Notice $B_p$ has fixed, limited size and can now be loaded into the cache. $B_p$ is packed into $\tilde{B}_p$, padded with zeros, if necessary, and loaded into the L3 cache. I In the 3rd loop we iterate over dimension $M$ and divide $C_j$ into $C_i$ (there is a typo in the diagram) of size $m_c \times n_c$ and $A_p$ into $A_j$ of size $m_c \times k_c$. Matrix $A_j$ is now restricted in size and can be loaded entirely into the L2 cache. $A_j$ is packed into $\tilde{A}_j$ and padded with zeros if needed. Note how we reuse the same $\tilde{B}_p$ block from the L3 cache for different $A_j$ blocks. Both $m_c$ and $n_c$ are chosen to be a multiple of $m_R$ and $n_R$ respectively.

In the last two loops we simply iterate over cached blocks and divide them into $m_R \times k_c$ and $k_c \times n_R$ panels. These panels are then passed to the kernel to perform rank-1 updates on the $m_R \times n_R$ sub-matrix of $C$, similarly to what we have already done in the previous chapter. Each panel of $\tilde{B}_p$ is loaded into the L1 cache and reused for multiple panels of $\tilde{A}_j$. Keep in mind that $\tilde{A}_j$ and $\tilde{B}_p$ are packed differently. During rank-1 updates we sequentially read a panel of $\tilde{A}_j$ column by column and a panel of $\tilde{B}_p$ row by row. Thus, each panel inside $\tilde{A}_j$ is stored in column-major order, while each panel inside $\tilde{B}_p$ is stored in row-major order.

Different CPU models have different cache sizes. To achieve peak performance, it’s crucial to optimize three key parameters: cache sizes for L1, L2, and L3 cashes (represented by $k_c$, $m_c$, and $n_c$ respectively). Theoretically, these parameters should be chosen so that:

$k_c \times n_c$ fills the entire L3 cache.
$m_c \times k_c$ fills the entire L2 cache.
$k_c \times n_R$ fills the entire L1 cache.

While these values provide a good starting point, using larger values often leads to better performance in practice. Unfortunately (or fortunately), we cannot manually place data into the cache or control which cache levels store the data; the CPU manages this automatically using cache replacement policies. Therefore, cache blocking and cache reuse must be implemented at the algorithm level through, for example, well-designed loops and strategic data access patterns.

The implementation matmul_cache.h straightforwardly follows the algorithm depicted in the diagram:

void matmul_cache(float* A, float* B, float* C, const int M, const int N, const int K) {
  for (int j = 0; j < N; j += NC) {
    const int nc = min(NC, N - j);
    for (int p = 0; p < K; p += KC) {
      const int kc = min(KC, K - p);
      pack_blockB(&B[j * K + p], blockB_packed, nc, kc, K);
      for (int i = 0; i < M; i += MC) {
        const int mc = min(MC, M - i);
        pack_blockA(&A[p * M + i], blockA_packed, mc, kc, M);
        for (int jr = 0; jr < nc; jr += NR) {
          for (int ir = 0; ir < mc; ir += MR) {
            const int mr = min(MR, mc - ir);
            const int nr = min(NR, nc - jr);
            kernel_16x6(&blockA_packed[ir * kc], &blockB_packed[jr * kc], &C[(j + jr) * M + (i + ir)], mr, nr, kc, M);
          }
        }
      }
    }
  }
}

8. Kernel Micro-Optimizations

Instead of using arrays of __m256 to define the accumulator $\bar{C}$ and the masks

__m256 C_buffer[6][2];
__m256i masks[2];

we explicitly unroll them

    __m256 C00 = _mm256_setzero_ps();
    __m256 C10 = _mm256_setzero_ps();
    __m256 C01 = _mm256_setzero_ps();
    __m256 C11 = _mm256_setzero_ps();
    __m256 C02 = _mm256_setzero_ps();
    __m256 C12 = _mm256_setzero_ps();
    __m256 C03 = _mm256_setzero_ps();
    __m256 C13 = _mm256_setzero_ps();
    __m256 C04 = _mm256_setzero_ps();
    __m256 C14 = _mm256_setzero_ps();
    __m256 C05 = _mm256_setzero_ps();
    __m256 C15 = _mm256_setzero_ps();
    __m256i packed_mask0;
    __m256i packed_mask1;

By doing this, GCC can better optimize the code avoiding register spilling. Additionally, we use vector instructions to calculate the masks as follows:

static int8_t mask[32]
    __attribute__((aligned(64))) = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                                    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0};
packed_mask0 = _mm256_cvtepi8_epi32(_mm_loadu_si64(&mask[16 - mr]));
packed_mask1 = _mm256_cvtepi8_epi32(_mm_loadu_si64(&mask[16 - mr + 8]));

The corresponding implementation can be found at matmul_micro.h

9. Multithreading

There are indeed many loops that can be potentially parallelized. To achieve high-performance, we want to parallelize both packing and arithmetic operations. Let’s start with the arithmetic operations. The 5th, 4th, 3rd loops around the micro-kernel iterate over matrix dimensions in chunks of cache block sizes $n_c$, $k_c$, $m_c$. To efficiently parallelize the loops and keep all threads busy, we want number of iterations (=matrix dimension / cache block size) to be at least = number of threads (generally, the more the better). In other words, the input matrix dimension should be at least = number of threads * cache block size. As we discussed earlier, we also want cache blocks to fully occupy the corresponding cache levels. On modern CPUs, the second requirement results in cache block sizes of thousand(s) of elements. For example, on my Ryzen 9700X, cache block sizes of $n_c=1535$, $m_c=1024$ attain the best performance in the single-threaded scenario. Given the number of available cores on Ryzen 9700X, we need input matrices with dimensions of at least $\max(m_c, n_c) \times \text{number of cores} = 1535 \times 8 = 12280$ to be able to distribute the work over all cores.

In contrast, the last two loops iterate over cache blocks, dividing them into $m_R, n_R$ blocks. Since $n_R, m_R$ are typically very small (<20), these loops are ideal candidates for parallelization. Moreover, we can choose $m_c, n_c$ to be multiples of number of cores so that the work is evenly distributed across all cores.

On my machine, parallelizing the second and first inner loops jointly with collapse(2) results in the best performance:

#pragma omp parallel for collapse(2) num_threads(NTHREADS)
  for (int jr = 0; jr < nc; jr += NR)

Advanced Matrix Multiplication Optimization on Modern Multi-Core Processors

2024-08-01T09:00:01+00:00

P.S. Please feel free to get in touch if you are interested in collaborating. My contact information is available on the homepage.

1. Introduction

Intel MKL - optimized for Intel CPUs
Accelerate - optimized for Apple CPUs
BLIS - open-source, multi-vendor, BLAS-like Library Instantiation Software
GotoBLAS - open-source, multi-vendor
OpenBLAS - open-source, multi-vendor, fork of GotoBLAS

Fast Multidimensional Matrix Multiplication on CPU from Scratch by Simon Boehm.
Matrix Multiplication by Sergey Slotin.
Geohot’s stream Can you multiply a matrix?

2. How to Install and Benchmark OpenBLAS

I benchmarked the code on the following machine:

CPU: AMD Ryzen 7 9700X
RAM: 32GB DDR5 6000 MHz CL36
OpenBLAS 0.3.26
Compiler: GCC 13.3
Compiler flags: -O3 -march=native -mno-avx512f -fopenmp
OS: Ubuntu 24.04.1 LTS

make TARGET=ZEN

Otherwise, OpenBLAS defaults to AVX512 instructions. After installation, you can run FP32 matmul using the OpenBLAS API:

#include 
cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k, 1, A, m, B, k, 0, C, m);

3. Theoretical Limit

The image below shows a simplified model of the computer’s memory hierarchy (for now, ignore the layers between the registers and the main memory(=RAM); we will discuss them later).

4. Naive Implementation

The simplest implementation of $C=AB$ can be described as follows:

void matmul_naive(float* A, float* B, float* C, const int M, const int N, const int K) {
  for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
      for (int p = 0; p < K; p++) {
        C[j * M + i] += A[p * M + i] * B[j * K + p];
      }
    }
  }
}

5. Kernel

load 1 column of $\bar{A}$ and 1 row of $\bar{B}$ from RAM into the registers. Again, note that both the row and column vectors are limited in size and can be stored in the registers.
compute the outer product between the two vectors and add the result of the outer product to the matrix $\bar{C}$.

Outer product between a column vector and a row vector.

In total, we fetch $(m_R + n_R)K$ elements from RAM into registers. Compared to the naive approach, this reduces the number of memory accesses by a factor of

\[\frac{2m_Rn_RK}{(m_R + n_R)K} = \frac{2m_Rn_R}{m_R + n_R}\]

This factor is maximized when both $m_R$, $n_R$ are large and equal. However, the values of $m_R$ and $n_R$ are typically constrained by the available register memory.

$\Big(\cfrac{m_R}{8} \cdot n_R + \cfrac{m_R}{8} + 1\Big) <= 16$
$m_R$ is a multiple of 8

#include 

The implementation of the algorithm is straightforward: we iterate over matrix $C$ and apply the kernel function to each of it’s non-overlapped $16 \times 6$ sub-matrices $\bar{C}$.

void matmul_kernel(float* A, float* B, float* C, const int M, const int N, const int K) {
  for (int i = 0; i < M; i += 16) {
    for (int j = 0; j < N; j += 6) {
        kernel_16x6(&A[i], &B[j * K], &C[j * M + i], M, N, K);
    }
  }
}

The kernel function is declared as follows:

void kernel_16x6(float* A_start, float* B_start, float* C_start, int M, int N, int K);

Inside the kernel function, first, we declare the variables stored in the YMM registers:

__m256 C_accum[6][2] = {}; // zero-initialized
__m256 b_packFloat8;
__m256 a0_packFloat8;
__m256 a1_packFloat8;

SIMD intrinsics are well documented and can be found in the Intel Intrinsics Guide. For example, _mm256_loadu_ps

The kernel iterates over the dimension $K$ and in each iteration performs a rank-1 update of the accumulator:

for (int p = 0; p < K; p++) {
  // Load column vector of size 16
  // {
  a0_packFloat8 = _mm256_loadu_ps(&A_start[p * M]);
  a1_packFloat8 = _mm256_loadu_ps(&A_start[p * M + 8]);
  // }
  // Broadcast scalar element to vector of size 8
  // {
  b_packFloat8 = _mm256_broadcast_ss(&B_start[p]);
  // }
  // Update the first column of the accumulator
  // {
  C_accum[0][0] = _mm256_fmadd_ps(a0_packFloat8, b_packFloat8, C_accum[0][0]);
  C_accum[0][1] = _mm256_fmadd_ps(a1_packFloat8, b_packFloat8, C_accum[0][1]);
  // }
  ...
  ...
  ...
  b_packFloat8 = _mm256_broadcast_ss(&B_start[5 * K + p]);
  // update the last column of the accumulator
  // {
  C_accum[5][0] = _mm256_fmadd_ps(a0_packFloat8, b_packFloat8, C_accum[5][0]);
  C_accum[5][1] = _mm256_fmadd_ps(a1_packFloat8, b_packFloat8, C_accum[5][1]);
  // }
}

After $K$ rank-1 updates, the computation of the accumulator is complete, and the result can be stored in RAM:

// Store the accumulator column by column:
for (int j = 0; j < 6; j++) {
  _mm256_storeu_ps(&C_start[j * M], C_accum[j][0]);
  _mm256_storeu_ps(&C_start[j * M + 8], C_accum[j][1]);
}

Let’s take a look at the generated assembly code to see if it actually contains SIMD FMA instructions and uses the YMM registers:

gcc -O3 -mno-avx512f -march=native matmul_kernel.c -S

// matmul_kernel.s
...
vfmadd231ps	%ymm14, %ymm1, %ymm13
vfmadd231ps	%ymm14, %ymm0, %ymm12
vmovaps	%ymm13, 32(%rsp)
vmovaps	%ymm12, 64(%rsp)
vbroadcastss	(%rax,%r9), %ymm14
vfmadd231ps	%ymm14, %ymm1, %ymm10
vfmadd231ps	%ymm14, %ymm0, %ymm11
vmovaps	%ymm10, 96(%rsp)
vmovaps	%ymm11, 128(%rsp)
vbroadcastss	(%rax,%r9,2), %ymm14
addq	$4, %rax
vfmadd231ps	%ymm14, %ymm1, %ymm2
vfmadd231ps	%ymm14, %ymm0, %ymm3
...

6. Padding

// n - number of overlapped columns within C boundary

// "j < n" instead "j < 6", since n can be less than 6.
for (int j = 0; j < n; j++) {
  _mm256_storeu_ps(&C_start[j * M], C_accum[j][0]);
  _mm256_storeu_ps(&C_start[j * M + 8], C_accum[j][1]);
}

float *a
__m256i mask
__m256 b

__m256i masks[2];
if (m != 16) {
  const uint32_t bit_mask = 65535;
  masks[0] = _mm256_setr_epi32(bit_mask << (m + 15),
                               bit_mask << (m + 14),
                               bit_mask << (m + 13),
                               bit_mask << (m + 12),
                               bit_mask << (m + 11),
                               bit_mask << (m + 10),
                               bit_mask << (m + 9),
                               bit_mask << (m + 8));
  masks[1] = _mm256_setr_epi32(bit_mask << (m + 7),
                               bit_mask << (m + 6),
                               bit_mask << (m + 5),
                               bit_mask << (m + 4),
                               bit_mask << (m + 3),
                               bit_mask << (m + 2),
                               bit_mask << (m + 1),
                               bit_mask << m);
  for (int j = 0; j < n; j++) {
    _mm256_maskstore_ps(&C_start[j * M], masks[0], C_accum[j][0]);
    _mm256_maskstore_ps(&C_start[j * M + 8], masks[1], C_accum[j][1]);
  }
}

#define BLOCK_A_MAXSIZE 500000
#define BLOCK_B_MAXSIZE 200000

static float blockA_buffer[BLOCK_A_MAXSIZE] __attribute__((aligned(64)));
static float blockB_buffer[BLOCK_B_MAXSIZE] __attribute__((aligned(64)));

void matmul_pack(float* A, float* B, float* C, const int M, const int N, const int K) {
    for (int i = 0; i < M; i += 16) {
        const int m = min(16, M - i);
        float* blockA = &A[i];
        int blockA_ld = M;
        if (m != 16) {
            pack_blockA(&A[i], blockA_buffer, m, M, K);
            blockA = blockA_buffer;
            blockA_ld = 16;
        }
        for (int j = 0; j < N; j += 6) {
            const int n = min(6, N - j);
            float* blockB = &B[j * K];
            if (n != 6) {
                pack_blockB(&B[j * K], blockB_buffer, n, N, K);
                blockB = blockB_buffer;
            }
            kernel_16x6(blockA, blockB, &C[j * M + i], m, n, M, K, blockA_ld);
        }
    }
}

For further implementations details, please check matmul_pad.h

7. Cache Blocking

Let’s revisit the computer’s memory hierarchy. Previously, we focused on the main memory (DRAM) and the CPU registers, but we skipped an important intermediary: the CPU cache system.

Intel Core i9-13900K labelled die shot. Source: How are Microchips Made?

Cache Hit - If the requested memory location is found in the cache, the CPU can access it instantly, avoiding the need to fetch data from the much slower DRAM.
Cache Miss - If the requested data is not in the cache, the CPU retrieves it from the main memory and stores it in the cache for future access.

LRU (Least Recently Used): Replaces the cache entry that has gone unused the longest.
LFU (Least Frequently Used): Evicts the entry that has been accessed the least often.
LFRU (Least Frequently Recently Used): A hybrid approach that considers both recent and overall access frequency.

The single-threaded matrix multiplication with cache blocking can be visualized as shown in the image borrowed from the official BLIS repository:

$k_c \times n_c$ fills the entire L3 cache.
$m_c \times k_c$ fills the entire L2 cache.
$k_c \times n_R$ fills the entire L1 cache.

The implementation matmul_cache.h straightforwardly follows the algorithm depicted in the diagram:

void matmul_cache(float* A, float* B, float* C, const int M, const int N, const int K) {
  for (int j = 0; j < N; j += NC) {
    const int nc = min(NC, N - j);
    for (int p = 0; p < K; p += KC) {
      const int kc = min(KC, K - p);
      pack_blockB(&B[j * K + p], blockB_packed, nc, kc, K);
      for (int i = 0; i < M; i += MC) {
        const int mc = min(MC, M - i);
        pack_blockA(&A[p * M + i], blockA_packed, mc, kc, M);
        for (int jr = 0; jr < nc; jr += NR) {
          for (int ir = 0; ir < mc; ir += MR) {
            const int mr = min(MR, mc - ir);
            const int nr = min(NR, nc - jr);
            kernel_16x6(&blockA_packed[ir * kc], &blockB_packed[jr * kc], &C[(j + jr) * M + (i + ir)], mr, nr, kc, M);
          }
        }
      }
    }
  }
}

8. Kernel Micro-Optimizations

Instead of using arrays of __m256 to define the accumulator $\bar{C}$ and the masks

__m256 C_buffer[6][2];
__m256i masks[2];

we explicitly unroll them

    __m256 C00 = _mm256_setzero_ps();
    __m256 C10 = _mm256_setzero_ps();
    __m256 C01 = _mm256_setzero_ps();
    __m256 C11 = _mm256_setzero_ps();
    __m256 C02 = _mm256_setzero_ps();
    __m256 C12 = _mm256_setzero_ps();
    __m256 C03 = _mm256_setzero_ps();
    __m256 C13 = _mm256_setzero_ps();
    __m256 C04 = _mm256_setzero_ps();
    __m256 C14 = _mm256_setzero_ps();
    __m256 C05 = _mm256_setzero_ps();
    __m256 C15 = _mm256_setzero_ps();
    __m256i packed_mask0;
    __m256i packed_mask1;

By doing this, GCC can better optimize the code avoiding register spilling. Additionally, we use vector instructions to calculate the masks as follows:

static int8_t mask[32]
    __attribute__((aligned(64))) = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
                                    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0};
packed_mask0 = _mm256_cvtepi8_epi32(_mm_loadu_si64(&mask[16 - mr]));
packed_mask1 = _mm256_cvtepi8_epi32(_mm_loadu_si64(&mask[16 - mr + 8]));

The corresponding implementation can be found at matmul_micro.h

9. Multithreading

On my machine, parallelizing the second and first inner loops jointly with collapse(2) results in the best performance:

#pragma omp parallel for collapse(2) num_threads(NTHREADS)
  for (int jr = 0; jr < nc; jr += NR)