Conversation
This should be ready for review. Currently, there is some small gain for Metal. It would be interesting to try to reuse the Metal command buffers to speed this up even further on the backend side.
```cpp
res &= self_kq_mask->ne[0] == mctx->get_n_kv();
res &= self_kq_mask->ne[1] == GGML_PAD(params.ubatch.n_tokens, GGML_KQ_MASK_PAD);
```
```cpp
res &= mctx->get_supports_set_rows(); // TODO: tmp
```
If update() is implemented for the recurrent cache, I think it could work even without adapting it to ggml_set_rows, because the head offset tends to be the same for similar consecutive ubatches in find_slot.
That might not work as well once multiple recurrent state cells per sequence are implemented (because they won't get re-used as much), but at that point it should be possible to use ggml_set_rows.
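The row-scatter semantics being discussed can be sketched as a plain standalone helper (hypothetical, not the actual `ggml_set_rows()` signature): each source row is written to an arbitrary destination row index, rather than to a single contiguous region at a head offset.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the semantics behind a row-scatter operator:
// dst holds rows of row_size floats; idx[i] names the destination row
// for source row i. This is what lets non-contiguous cell placements
// (e.g. multiple recurrent state cells per sequence) be written in one op.
static void set_rows(std::vector<float> & dst, size_t row_size,
                     const std::vector<float> & src,
                     const std::vector<size_t> & idx) {
    for (size_t i = 0; i < idx.size(); ++i) {
        for (size_t j = 0; j < row_size; ++j) {
            dst[idx[i]*row_size + j] = src[i*row_size + j];
        }
    }
}
```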
Yes, it has to work. As long as the check for the head is correctly added to the update() of the respective inputs it should be good.
An update on this. Edit: prototyped this in #14570. Does not seem worth pursuing as the gains are microscopic.
I tested this on CUDA on various models I had lying around and I see a perf regression on a larger model; not sure if I'm doing something wrong. I cherry-picked this PR on top of #14551 and compared with commit1 = 14551 and commit2 = 14551 + this PR. Also interesting is a 2x(?) speedup on qwen2vl 3B.
To be sure, I ran it again
But before benchmarking, I think you should make sure that the generated results when graph reuse is enabled are coherent. And also, from the results that you posted it seems that there is a lot of variability on your system.
@ggerganov I re-ran with r=100, and tg 64 and 128. I see quite a bit of variability at tg128, but tg64 is pretty tight (<1% variability).
```cpp
// xxxxx-----
// xxxxx-----
// To visualize the mask, see https://github.com/ggml-org/llama.cpp/pull/12615
// TODO: optimize this section
```
Note for later: there is an opportunity to optimize the KQ mask initialization here. For large n_kv, this section becomes measurable.
```cpp
// reset the previous graph result to make sure that it won't be reused
// TODO: change the mctx->apply() to return information if a graph reserve is needed
//       reset the graph result only if the memory module did reset the scheduler
gf_res_prev->reset();
```
I am not sure what would happen if e.g. in a call to kv_self_update the scheduler is reset. Is this detected to prevent reusing the graph?
I added a reset of the previous graph result every time the memory module makes an update to guarantee that if a scheduler reset occurs, we would not attempt to reuse the previous graph.
This is too conservative because sometimes a memory module update does not involve a scheduler reset (for example, a KV cache buffer copy from one stream to another). I will follow up with an update to avoid this overhead per the TODO.
In acaf4b7 I made a relatively significant change. @slaren Might be worth taking another look just in case. Even if not perfect, the results should be correct and graph reuse with split KV cache should now be functional. I will be improving the implementation in a follow-up PR.
llama-lookahead has been broken since PR ggml-org#14482 (July 2025), which changed seq_id validation from the LLAMA_MAX_SEQ constant to the context-specific n_seq_max.

Two lookahead-specific issues:

1. n_seq_max: Lookahead needs W + G + 1 = 31 sequences for parallel Jacobi decoding, but params.n_parallel defaulted to 1. Fix: Set params.n_parallel = W + G + 1 before context creation.
2. KV unified: Batch splitting with coupled sequences requires unified KV cache mode, but lookahead didn't enable it. Fix: Set params.kv_unified = true.

Bug timeline:
- Nov 2023: lookahead.cpp created, worked with LLAMA_MAX_SEQ constant
- July 2025: PR ggml-org#14482 changed to n_seq_max validation, broke lookahead

Note: This PR depends on ggml-org#18729 for the batch init fix (params.n_ctx -> llama_n_ctx). Both PRs are needed for lookahead to fully work.

Tested with Qwen2.5-Coder-0.5B: lookahead generates output with n_accept > 0.

Bug history researched with Claude.
Lookahead decoding requires:
- W + G + 1 = 31 sequences for parallel Jacobi decoding
- Unified KV cache for coupled sequences in batch splitting

These requirements were broken after PR ggml-org#14482 changed validation logic.

Consolidates fix from PR ggml-org#18730 per maintainer request.

Commit message drafted with Claude.
* lookup, lookahead: fix crash when n_ctx not specified

Since PR #16653 (Dec 15, 2025), the default n_ctx is 0 to enable automatic GPU memory fitting. This causes llama-lookup and llama-lookahead to crash when run without an explicit -c flag:

    GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded")

Root cause: Both examples use params.n_ctx directly for batch initialization, but params.n_ctx remains 0 even after the context is properly initialized to n_ctx_train internally.

Bug history:
- Nov 2023: lookahead.cpp created (PR #4207) with params.n_ctx pattern
- Dec 2023: lookup.cpp created (PR #4484) with same pattern
- Nov 2024: default n_ctx changed to 4096 (PR #10136) - bug dormant
- Dec 2025: default n_ctx changed to 0 (PR #16653) - bug activated

The bug was dormant for 2+ years because params.n_ctx defaulted to 512, then 4096. PR #16653 changed it to 0 for GPU auto-fitting, triggering the crash.

Fix: Use llama_n_ctx(ctx) to get the actual runtime context size, matching the pattern already used elsewhere in lookup.cpp (line 72) and in speculative.cpp/speculative-simple.cpp.

Tested: llama-lookup now works without -c flag (12.5% acceptance on Gemma-3-1B).

Note: llama-lookahead has a separate pre-existing issue with sequence initialization (n_seq_max=1 vs W+G+1 needed) that is unrelated to this fix.

* lookahead: fix n_seq_max and kv_unified configuration

Lookahead decoding requires:
- W + G + 1 = 31 sequences for parallel Jacobi decoding
- Unified KV cache for coupled sequences in batch splitting

These requirements were broken after PR #14482 changed validation logic.

Consolidates fix from PR #18730 per maintainer request.

Commit message drafted with Claude.
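The shape of the fix described above can be sketched with a small mock (hypothetical struct mirroring the relevant `common_params` fields, not the real API): the sequence count and unified-KV flag must be set before the context is created, so that n_seq_max is large enough for all W + G + 1 lookahead sequences.

```cpp
#include <cassert>

// Hypothetical mock of the lookahead configuration fix. Field names mirror
// the common_params fields named in the commit message, but this is a
// standalone sketch, not the real llama.cpp API.
struct mock_params {
    int  n_parallel = 1;      // old default: only 1 sequence
    bool kv_unified = false;  // old default: no unified KV cache
};

// W = lookahead window, G = number of verification n-grams
static mock_params configure_lookahead(int W, int G) {
    mock_params p;
    p.n_parallel = W + G + 1; // sequences for parallel Jacobi decoding
    p.kv_unified = true;      // coupled sequences need unified KV
    return p;
}
```

With the default lookahead parameters W = 15 and G = 15 this yields the 31 sequences mentioned in the commit message.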
* llama : reuse compute graphs
* llama-bench : add graph reuse parameter
* cont : remove the parameter and the sched resets
* graph : rename update() to can_reuse()
* params : remove is_same()
* graph : set res->params in llm_graph_context constructor
* graph : avoid set_max_nodes in llm_graph_result
* kv-cache : reuse llama_context's graph result instance
* context : reset the previous graph result upon memory updates
* batch : llama_ubatch now carries its data instead of pointing to balloc
* merge : fix build
* graph : fix can_reuse() checks when flash-attention is disabled
* graph : move llm_graph_result impl in source file + debug env
target #14285
Reuse computation graphs from the previous ubatch when possible. Works with any batch size and any model.
Note

To enable this functionality, there is a temporary requirement for `LLAMA_SET_ROWS=1` to be set in your environment. In the future, this will become the default.

This functionality requires the `ggml_set_rows()` operator to be supported (see #14285). In order to be able to reuse a compute graph, its topology (shapes, strides, parameters, etc.) has to be entirely defined by the set of input tensors (e.g. `inp_embd`, `inp_pos`, `inp_attn`, etc.).

This PR adds logic to update a previous `llm_graph_result` by verifying that the new `llm_graph_params` would result in the same tensor shapes. For this to work, we should no longer preemptively reset the scheduler after processing a batch, so that all buffers from the previous graph remain allocated and ready for reuse in case the new `ubatch` is compatible. See the new `llm_graph_result::update()` method:

https://github.com/ggml-org/llama.cpp/blob/fc4fdf623c098d2b4d7699fdb7f2ea5ae1f63b57/src/llama-graph.h#L506-L525
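The reuse criterion can be sketched minimally as follows (illustrative field names, not the actual `llm_graph_params` layout): a previous graph result is kept only when the new parameters would produce exactly the same tensor shapes, otherwise the graph is rebuilt.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the graph-reuse check. Real llm_graph_params
// compares many more properties (strides, flags, memory context, etc.);
// these three fields just illustrate the idea.
struct mock_graph_params {
    int32_t n_tokens;
    int32_t n_kv;
    int32_t n_outputs;

    bool same_shapes(const mock_graph_params & other) const {
        return n_tokens  == other.n_tokens &&
               n_kv      == other.n_kv     &&
               n_outputs == other.n_outputs;
    }
};

// mirrors the idea behind llm_graph_result::update(): reuse the previous
// graph (and only refresh its inputs) when the shapes match
static bool try_reuse(const mock_graph_params & prev,
                      const mock_graph_params & next) {
    return prev.same_shapes(next);
}
```

During token generation consecutive ubatches typically have identical shapes, which is why this check succeeds often enough to pay off.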
The other change that is needed is a way to swap the `llama_memory_context` of all graph inputs, so that the new call to `llm_graph_result_i::set_inputs()` uses the correct context from the current `ubatch`. This is performed by calling the `llm_graph_input_i::update()` method of all input tensors.

To enable this feature, define the `LLAMA_SET_ROWS` environment variable.

To debug, define `LLAMA_GRAPH_RESULT_DEBUG=2` and add `-lv 1` to the CLI args.

Tests
```shell
LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf -p "I believe the meaning of life is" -n 32 --top-k 1 -fa

LLAMA_SET_ROWS=1 ./bin/llama-parallel -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -np 8 -ns 128 -s 1 -c 4096 -fa -n 128
```

Benchmark on M2 Ultra:
TODO
- `is_same` methods?

Next PRs
- Remove the `llama_graph_result_i` interface - does not seem to have any purpose (graph : refactor context to not pass gf explicitly #14629)
- Stop passing `ggml_cgraph * gf` everywhere. Simply move it to `llm_graph_context` (graph : refactor context to not pass gf explicitly #14629)