
graph : fix KQ mask, lora, cvec reuse checks #19644

Merged
ggerganov merged 4 commits into master from gg/graph-fix-kq-mask-reuse
Feb 16, 2026

Conversation

@ggerganov
Member

@ggerganov ggerganov commented Feb 15, 2026

cont #14482

Graph reuse was never triggered for parallel decoding with a non-unified KV cache, due to an incorrect check of the KQ mask shape.

Also fix the checks for reusing lora adapters and control vectors.

Before:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |     32 |    2 |   1088 |    0.611 |  1675.23 |    0.708 |    90.38 |    1.319 |   824.63 |
|   512 |     32 |    3 |   1632 |    0.971 |  1581.80 |    0.916 |   104.85 |    1.887 |   865.01 |
|   512 |     32 |    4 |   2176 |    1.209 |  1693.47 |    1.081 |   118.40 |    2.290 |   950.04 |
|   512 |     32 |    5 |   2720 |    1.520 |  1683.99 |    1.278 |   125.22 |    2.798 |   972.13 |
|   512 |     32 |    6 |   3264 |    1.807 |  1700.43 |    1.379 |   139.25 |    3.185 |  1024.66 |
|   512 |     32 |    7 |   3808 |    2.166 |  1654.66 |    1.541 |   145.33 |    3.707 |  1027.16 |
|   512 |     32 |    8 |   4352 |    2.408 |  1701.34 |    1.676 |   152.77 |    4.083 |  1065.81 |
0.22.545.489 I llama_perf_context_print:        load time =    2462.16 ms
0.22.545.490 I llama_perf_context_print: prompt eval time =   19704.66 ms / 19568 tokens (    1.01 ms per token,   993.06 tokens per second)
0.22.545.491 I llama_perf_context_print:        eval time =     486.25 ms /    32 runs   (   15.20 ms per token,    65.81 tokens per second)
0.22.545.491 I llama_perf_context_print:       total time =   22545.24 ms / 19600 tokens
0.22.545.492 I llama_perf_context_print:    graphs reused =         31

After:

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|   512 |     32 |    2 |   1088 |    0.611 |  1677.03 |    0.651 |    98.30 |    1.262 |   862.35 |
|   512 |     32 |    3 |   1632 |    0.991 |  1549.98 |    0.859 |   111.77 |    1.850 |   882.21 |
|   512 |     32 |    4 |   2176 |    1.209 |  1694.18 |    1.021 |   125.32 |    2.230 |   975.68 |
|   512 |     32 |    5 |   2720 |    1.519 |  1685.52 |    1.250 |   128.04 |    2.768 |   982.52 |
|   512 |     32 |    6 |   3264 |    1.813 |  1694.48 |    1.326 |   144.79 |    3.139 |  1039.83 |
|   512 |     32 |    7 |   3808 |    2.176 |  1647.06 |    1.478 |   151.55 |    3.654 |  1042.14 |
|   512 |     32 |    8 |   4352 |    2.406 |  1702.61 |    1.617 |   158.28 |    4.023 |  1081.75 |
0.22.415.925 I llama_perf_context_print:        load time =    2659.43 ms
0.22.415.926 I llama_perf_context_print: prompt eval time =   19428.53 ms / 19568 tokens (    0.99 ms per token,  1007.18 tokens per second)
0.22.415.927 I llama_perf_context_print:        eval time =     479.63 ms /    32 runs   (   14.99 ms per token,    66.72 tokens per second)
0.22.415.928 I llama_perf_context_print:       total time =   22415.75 ms / 19600 tokens
0.22.415.928 I llama_perf_context_print:    graphs reused =        256

@ggerganov ggerganov requested a review from CISC as a code owner February 15, 2026 12:07
@ggerganov ggerganov changed the title graph : fix KQ mask reuse check graph : fix KQ mask, lora, cvec reuse checks Feb 15, 2026
@ggerganov ggerganov merged commit d5dfc33 into master Feb 16, 2026
77 of 78 checks passed
@ggerganov ggerganov deleted the gg/graph-fix-kq-mask-reuse branch February 16, 2026 07:21
michaelneale added a commit to michaelneale/llama.cpp that referenced this pull request Feb 17, 2026
* upstream/master: (88 commits)
  ci : bump komac version (ggml-org#19682)
  build : link ws2_32 as PUBLIC on Windows (ggml-org#19666)
  build : cleanup library linking logic (ggml-org#19665)
  convert : add JoyAI-LLM-Flash (ggml-org#19651)
  perplexity: add proper batching (ggml-org#19661)
  common : inline functions (ggml-org#18639)
  ggml : make `ggml_is_view` as API (ggml-org#19539)
  model: Add support for Tiny Aya Models (ggml-org#19611)
  build : rework llama_option_depr to handle LLAMA_CURL (ggml-org#19658)
  Adjust workaround for ROCWMMA_FATTN/GFX9 to only newer ROCm versions (ggml-org#19591)
  models : deduplicate delta-net graphs for Qwen family (ggml-org#19597)
  graph : fix KQ mask, lora, cvec reuse checks (ggml-org#19644)
  ggml: aarch64: Implement SVE in Gemm q4_k 8x8 q8_k Kernel  (ggml-org#19132)
  sync : ggml
  ggml : bump version to 0.9.7 (ggml/1425)
  ggml : bump version to 0.9.6 (ggml/1423)
  cuda: optimize iq2xxs/iq2xs/iq3xxs dequantization (ggml-org#19624)
  docs: update s390x build docs (ggml-org#19643)
  build : remove LLAMA_HTTPLIB option (ggml-org#19623)
  cmake : check if KleidiAI API has been fetched (ggml-org#19640)
  ...
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* graph : fix KQ mask reuse condition

* cont : dedup KQ mask build and can_reuse

* cont : fix build

* graph : fix adapter check for reuse
bartowski1182 pushed a commit to bartowski1182/llama.cpp that referenced this pull request Mar 2, 2026
ArberSephirotheca pushed a commit to ArberSephirotheca/llama.cpp that referenced this pull request Mar 3, 2026
