Personal Notes: Compute-Bound Example and Divergent Kernels
Higher Arithmetic Density
In my previous post I showcased how GPUs are not good for memory-bound problems, so naturally I wanted to see how good they are at compute-bound problems. To achieve this I took the vector addition code from the previous post and added a few heavier math operations to it in a tight loop. The CPU code looks like this:
void computeIntensiveCPU(float *input, float *output, int n) {
    for (int i = 0; i < n; i++) {
        float result = input[i];
        for (int j = 0; j < 1000; j++) {
            result = sinf(result) * cosf(result) + sqrtf(fabsf(result));
            result = powf(result, 0.9f) + logf(fabsf(result) + 1.0f);
            result = expf(result * 0.001f) - tanhf(result * 0.01f);
        }
        output[i] = result;
    }
}
And the equivalent CUDA kernel looks like this:
__global__ void computeIntensiveKernel(float *input, float *output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float result = input[idx];
        // Perform many expensive operations to make it compute-bound
        for (int i = 0; i < 1000; i++) {
            result = sinf(result) * cosf(result) + sqrtf(fabsf(result));
            result = powf(result, 0.9f) + logf(fabsf(result) + 1.0f);
            result = expf(result * 0.001f) - tanhf(result * 0.01f);
        }
        output[idx] = result;
    }
}
Warp Divergence Performance
While I was at it, I decided to measure the performance degradation caused by warp divergence (i.e. two or more threads within the same warp being forced by control flow to execute different code paths). Here’s a version of computeIntensiveKernel that does just that:
__global__ void warpDivergenceKernel(float *input, float *output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float result = input[idx];
        if (threadIdx.x % 2 == 0) {
            // Even threads
            for (int i = 0; i < 1000; i++) {
                result = sinf(result) * cosf(result) + sqrtf(fabsf(result));
                result = powf(result, 0.9f) + logf(fabsf(result) + 1.0f);
                result = expf(result * 0.001f) - tanhf(result * 0.01f);
            }
        } else {
            // Odd threads
            for (int i = 0; i < 1000; i++) {
                result = sinf(result) * cosf(result) + sqrtf(fabsf(result));
                result = powf(result, 0.9f) + logf(fabsf(result) + 1.0f);
                result = expf(result * 0.001f) - tanhf(result * 0.01f);
            }
        }
        output[idx] = result;
    }
}
The if-clause sending even and odd threads down different code paths creates the divergence: each path performs the same amount of work as computeIntensiveKernel, but the warp has to execute both paths sequentially, which induces the overhead.
Let it rip!
The full code for this experiment can be found in this gist. Running it gives us the following output:
Problem size: 100000 elements
===============================================
1. CPU Compute-Intensive Benchmark
CPU time: 11.3368 seconds
2. GPU Compute-Intensive Kernel - Block Size Effects (including memory transfers)
Block size: 31, Kernel Time: 0.0008 s, H2D Time: 0.0001 s, D2H Time: 0.0003 s, Total GPU Time: 0.0012 s, Speedup: 9265.0x ✗
└─ Wasted threads per partial warp: 1/32 (96.9% efficiency)
Block size: 32, Kernel Time: 0.0007 s, H2D Time: 0.0001 s, D2H Time: 0.0001 s, Total GPU Time: 0.0009 s, Speedup: 13283.1x ✓
Block size: 63, Kernel Time: 0.0008 s, H2D Time: 0.0001 s, D2H Time: 0.0001 s, Total GPU Time: 0.0010 s, Speedup: 11684.9x ✗
└─ Wasted threads per partial warp: 1/32 (96.9% efficiency)
Block size: 64, Kernel Time: 0.0007 s, H2D Time: 0.0001 s, D2H Time: 0.0001 s, Total GPU Time: 0.0009 s, Speedup: 13112.5x ✓
Block size: 127, Kernel Time: 0.0007 s, H2D Time: 0.0001 s, D2H Time: 0.0001 s, Total GPU Time: 0.0009 s, Speedup: 13083.9x ✗
└─ Wasted threads per partial warp: 1/32 (96.9% efficiency)
Block size: 128, Kernel Time: 0.0007 s, H2D Time: 0.0001 s, D2H Time: 0.0001 s, Total GPU Time: 0.0009 s, Speedup: 13132.9x ✓
Block size: 255, Kernel Time: 0.0007 s, H2D Time: 0.0001 s, D2H Time: 0.0001 s, Total GPU Time: 0.0008 s, Speedup: 13575.8x ✗
└─ Wasted threads per partial warp: 1/32 (96.9% efficiency)
Block size: 256, Kernel Time: 0.0007 s, H2D Time: 0.0001 s, D2H Time: 0.0001 s, Total GPU Time: 0.0008 s, Speedup: 13593.0x ✓
Block size: 511, Kernel Time: 0.0008 s, H2D Time: 0.0001 s, D2H Time: 0.0001 s, Total GPU Time: 0.0010 s, Speedup: 11495.7x ✗
└─ Wasted threads per partial warp: 1/32 (96.9% efficiency)
Block size: 512, Kernel Time: 0.0008 s, H2D Time: 0.0001 s, D2H Time: 0.0001 s, Total GPU Time: 0.0010 s, Speedup: 11539.1x ✓
3. Warp Divergence Effects
Testing kernel with branching (warp divergence)...
Regular kernel (256 threads): 0.0007 s
Divergent kernel (256 threads): 0.0014 s
Divergence overhead: 99.6%
As you’ll notice, the problem size is pretty small and I also haven’t used multi-threading for the CPU code. That’s on purpose: in my previous post, the single-threaded CPU code pretty much crushed both the multi-threaded CPU and the GPU code. Now the story is very different.
Analyzing The Results
As expected, the GPU is dramatically faster at this workload, even with the host-device transfer times included (see the H2D/D2H columns).
Even wasting a single thread per warp (the odd block sizes above) produces a small but measurable performance penalty.
And though my example code is an extreme case, the performance degradation due to warp divergence is real: the divergent kernel takes roughly twice as long as the regular one.