
batched-bench : add "separate text gen" mode #17103

Merged
ggerganov merged 1 commit into master from gg/batched-bench-separate-tg on Nov 10, 2025
Conversation


ggerganov commented Nov 8, 2025

Adding -tgs to llama-batched-bench makes it decode the sequences separately, one by one:

# no -tgs
0123 0123 0123 ...

# -tgs
0 0 0 ... 1 1 1 ... 2 2 2 ... 3 3 3 ...

This is useful for benchmarking the performance of the unified KV cache where it's important to detect and skip masked regions in the KQ mask.

Example with the Metal backend:

# unified KV cache with up to 4 sequences, running one by one
llama-batched-bench -m ../models/gemma-3-4b-it/ggml-model-f16.gguf -c 33792 -npp 8192 -ntg 32 -npl 1,2,4 -kvu -tgs

# the cache looks like this
#
#                        prompt processing ends here v
# 000...[8192 tokens]...000111...111222...222333...333000...[32 tokens]...000111...111222...222333...333
#                              text generation starts ^

With the -INF block optimizations in the FA kernels:

main: n_kv_max = 33792, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    3.101 |  2641.88 |    0.478 |    66.95 |    3.579 |  2298.00 |
|  8192 |     32 |    2 |  16448 |    6.091 |  2689.76 |    0.971 |    65.90 |    7.062 |  2328.95 |
|  8192 |     32 |    4 |  32896 |   12.373 |  2648.43 |    1.965 |    65.15 |   14.337 |  2294.45 |

Disabling the -INF block optimizations in the FA kernels:

patch
diff --git a/ggml/src/ggml-metal/ggml-metal.metal b/ggml/src/ggml-metal/ggml-metal.metal
index cea535ade..6c249fb56 100644
--- a/ggml/src/ggml-metal/ggml-metal.metal
+++ b/ggml/src/ggml-metal/ggml-metal.metal
@@ -4633,7 +4633,7 @@ kernel void kernel_flash_attn_ext_blk(
     const int32_t nblk0 = ((args.ne30 + C - 1)/C);
 
     if (tiisg == 0) {
-        dst[((i3*args.ne32 + i2)*nblk1 + i1)*nblk0 + i0] = res;
+        dst[((i3*args.ne32 + i2)*nblk1 + i1)*nblk0 + i0] = 1;
     }
 }
 
@@ -5660,7 +5660,7 @@ void kernel_flash_attn_ext_vec_impl(
             }
 
             // skip -INF blocks
-            if (simd_max(sm[tiisg]) == -INFINITY) {
+            if (false) {
                 continue;
             }
 
main: n_kv_max = 33792, n_batch = 2048, n_ubatch = 512, flash_attn = -1, is_pp_shared = 0, is_tg_separate = 1, n_gpu_layers = -1, n_threads = 16, n_threads_batch = 16

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|  8192 |     32 |    1 |   8224 |    3.528 |  2321.91 |    0.500 |    64.05 |    4.028 |  2041.84 |
|  8192 |     32 |    2 |  16448 |    7.393 |  2216.28 |    1.027 |    62.30 |    8.420 |  1953.47 |
|  8192 |     32 |    4 |  32896 |   16.157 |  2028.06 |    2.159 |    59.30 |   18.316 |  1796.04 |

Observe that both PP and TG performance are worse, and the effect is amplified with more sequences in the cache.
