
graph : fix KQ mask, lora, cvec reuse checks #19644

Merged
ggerganov merged 4 commits into master from gg/graph-fix-kq-mask-reuse
Feb 16, 2026

Conversation

@ggerganov
Member

@ggerganov ggerganov commented Feb 15, 2026

cont #14482

Graph reuse was never triggered for parallel decoding with a non-unified KV cache, due to an incorrect check of the KQ mask shape.

Also fix the checks for reusing lora adapters and control vectors.

Before:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |     32 |    2 |   1088 |    0.611 |  1675.23 |    0.708 |    90.38 |    1.319 |   824.63 |
|   512 |     32 |    3 |   1632 |    0.971 |  1581.80 |    0.916 |   104.85 |    1.887 |   865.01 |
|   512 |     32 |    4 |   2176 |    1.209 |  1693.47 |    1.081 |   118.40 |    2.290 |   950.04 |
|   512 |     32 |    5 |   2720 |    1.520 |  1683.99 |    1.278 |   125.22 |    2.798 |   972.13 |
|   512 |     32 |    6 |   3264 |    1.807 |  1700.43 |    1.379 |   139.25 |    3.185 |  1024.66 |
|   512 |     32 |    7 |   3808 |    2.166 |  1654.66 |    1.541 |   145.33 |    3.707 |  1027.16 |
|   512 |     32 |    8 |   4352 |    2.408 |  1701.34 |    1.676 |   152.77 |    4.083 |  1065.81 |
0.22.545.489 I llama_perf_context_print:        load time =    2462.16 ms
0.22.545.490 I llama_perf_context_print: prompt eval time =   19704.66 ms / 19568 tokens (    1.01 ms per token,   993.06 tokens per second)
0.22.545.491 I llama_perf_context_print:        eval time =     486.25 ms /    32 runs   (   15.20 ms per token,    65.81 tokens per second)
0.22.545.491 I llama_perf_context_print:       total time =   22545.24 ms / 19600 tokens
0.22.545.492 I llama_perf_context_print:    graphs reused =         31

After:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |     32 |    2 |   1088 |    0.611 |  1677.03 |    0.651 |    98.30 |    1.262 |   862.35 |
|   512 |     32 |    3 |   1632 |    0.991 |  1549.98 |    0.859 |   111.77 |    1.850 |   882.21 |
|   512 |     32 |    4 |   2176 |    1.209 |  1694.18 |    1.021 |   125.32 |    2.230 |   975.68 |
|   512 |     32 |    5 |   2720 |    1.519 |  1685.52 |    1.250 |   128.04 |    2.768 |   982.52 |
|   512 |     32 |    6 |   3264 |    1.813 |  1694.48 |    1.326 |   144.79 |    3.139 |  1039.83 |
|   512 |     32 |    7 |   3808 |    2.176 |  1647.06 |    1.478 |   151.55 |    3.654 |  1042.14 |
|   512 |     32 |    8 |   4352 |    2.406 |  1702.61 |    1.617 |   158.28 |    4.023 |  1081.75 |
0.22.415.925 I llama_perf_context_print:        load time =    2659.43 ms
0.22.415.926 I llama_perf_context_print: prompt eval time =   19428.53 ms / 19568 tokens (    0.99 ms per token,  1007.18 tokens per second)
0.22.415.927 I llama_perf_context_print:        eval time =     479.63 ms /    32 runs   (   14.99 ms per token,    66.72 tokens per second)
0.22.415.928 I llama_perf_context_print:       total time =   22415.75 ms / 19600 tokens
0.22.415.928 I llama_perf_context_print:    graphs reused =        256

@ggerganov ggerganov requested a review from CISC as a code owner February 15, 2026 12:07
@ggerganov ggerganov changed the title graph : fix KQ mask reuse check graph : fix KQ mask, lora, cvec reuse checks Feb 15, 2026
@ggerganov ggerganov merged commit d5dfc33 into master Feb 16, 2026
77 of 78 checks passed
@ggerganov ggerganov deleted the gg/graph-fix-kq-mask-reuse branch February 16, 2026 07:21
michaelneale added a commit to michaelneale/llama.cpp that referenced this pull request Feb 17, 2026
* upstream/master: (88 commits)
  ci : bump komac version (ggml-org#19682)
  build : link ws2_32 as PUBLIC on Windows (ggml-org#19666)
  build : cleanup library linking logic (ggml-org#19665)
  convert : add JoyAI-LLM-Flash (ggml-org#19651)
  perplexity: add proper batching (ggml-org#19661)
  common : inline functions (ggml-org#18639)
  ggml : make `ggml_is_view` as API (ggml-org#19539)
  model: Add support for Tiny Aya Models (ggml-org#19611)
  build : rework llama_option_depr to handle LLAMA_CURL (ggml-org#19658)
  Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm versions (ggml-org#19591)
  models : deduplicate delta-net graphs for Qwen family (ggml-org#19597)
  graph : fix KQ mask, lora, cvec reuse checks (ggml-org#19644)
  ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel  (ggml-org#19132)
  sync : ggml
  ggml : bump version to 0.9.7 (ggml/1425)
  ggml : bump version to 0.9.6 (ggml/1423)
  cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (ggml-org#19624)
  docs: update s390x build docs (ggml-org#19643)
  build : remove LLAMA_HTTPLIB option (ggml-org#19623)
  cmake : check if KleidiAI API has been fetched (ggml-org#19640)
  ...
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* graph : fix KQ mask reuse condition

* cont : dedup KQ mask build and can_reuse

* cont : fix build

* graph : fix adapter check for reuse
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026
