
batched-bench : add "separate text gen" mode #17103

Merged
ggerganov merged 1 commit into master from gg/batched-bench-separate-tg on Nov 10, 2025
Conversation


ggerganov commented Nov 8, 2025

Adding -tgs to llama-batched-bench makes it decode the sequences separately, one by one:

# no -tgs
0123 0123 0123 ...

# -tgs
0 0 0 ... 1 1 1 ... 2 2 2 ... 3 3 3 ...

This is useful for benchmarking the performance of the unified KV cache where it's important to detect and skip masked regions in the KQ mask.

Example with the Metal backend:

# unified KV cache with up to 4 sequences, running one by one
llama-batched-bench -m ../models/gemma-3-4b-it/ggml-model-f16.gguf -c 33792 -npp 8192 -ntg 32 -npl 1,2,4 -kvu -tgs

# the cache looks like this
#
#                        prompt processing ends here v
# 000...[8192 tokens]...000111...111222...222333...333000...[32 tokens]...000111...111222...222333...333
#                              text generation starts ^

With the -INF block optimizations in the FA kernels:

main: n_kv_max = 33792, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    3.101 |  2641.88 |    0.478 |    66.95 |    3.579 |  2298.00 |
|  8192 |     32 |    2 |  16448 |    6.091 |  2689.76 |    0.971 |    65.90 |    7.062 |  2328.95 |
|  8192 |     32 |    4 |  32896 |   12.373 |  2648.43 |    1.965 |    65.15 |   14.337 |  2294.45 |

Disabling the -INF block optimizations in the FA kernels:

patch
diff --git a/ggml/src/ggml-metal/ggml-metal.metal b/ggml/src/ggml-metal/ggml-metal.metal
index cea535ade..6c249fb56 100644
--- a/ggml/src/ggml-metal/ggml-metal.metal
+++ b/ggml/src/ggml-metal/ggml-metal.metal
@@ -4633,7 +4633,7 @@ kernel void kernel_flash_attn_ext_blk(
     const int32_t nblk0 = ((args.ne30 + C - 1)/C);
 
     if (tiisg == 0) {
-        dst[((i3*args.ne32 + i2)*nblk1 + i1)*nblk0 + i0] = res;
+        dst[((i3*args.ne32 + i2)*nblk1 + i1)*nblk0 + i0] = 1;
     }
 }
 
@@ -5660,7 +5660,7 @@ void kernel_flash_attn_ext_vec_impl(
             }
 
             // skip -INF blocks
-            if (simd_max(sm[tiisg]) == -INFINITY) {
+            if (false) {
                 continue;
             }
 
main: n_kv_max = 33792, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    3.528 |  2321.91 |    0.500 |    64.05 |    4.028 |  2041.84 |
|  8192 |     32 |    2 |  16448 |    7.393 |  2216.28 |    1.027 |    62.30 |    8.420 |  1953.47 |
|  8192 |     32 |    4 |  32896 |   16.157 |  2028.06 |    2.159 |    59.30 |   18.316 |  1796.04 |

Observe that both PP and TG performance are worse, and the effect is amplified with more sequences in the cache.
