Conversation
This should be ready for review. Currently, there is some small gain for Metal. It would be interesting to try to reuse the Metal command buffers to speed this up even further on the backend side.
```cpp
res &= self_kq_mask->ne[0] == mctx->get_n_kv();
res &= self_kq_mask->ne[1] == GGML_PAD(params.ubatch.n_tokens, GGML_KQ_MASK_PAD);
```
```cpp
res &= mctx->get_supports_set_rows(); // TODO: tmp
```
If update() is implemented for the recurrent cache, I think it could work even without adapting it to ggml_set_rows, because the head offset tends to be the same for similar consecutive ubatches in find_slot.
That might not work as well once multiple recurrent state cells per sequence are implemented (because they won't get re-used as much), but at that point it should be possible to use ggml_set_rows.
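The row-scatter semantics being discussed can be sketched as a plain standalone helper (hypothetical, not the actual `ggml_set_rows()` signature): each source row is written to an arbitrary destination row index, rather than to a single contiguous region at a head offset.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the semantics behind a row-scatter operator:
// dst holds rows of row_size floats; idx[i] names the destination row
// for source row i. This is what lets non-contiguous cell placements
// (e.g. multiple recurrent state cells per sequence) be written in one op.
static void set_rows(std::vector<float> & dst, size_t row_size,
                     const std::vector<float> & src,
                     const std::vector<size_t> & idx) {
    for (size_t i = 0; i < idx.size(); ++i) {
        for (size_t j = 0; j < row_size; ++j) {
            dst[idx[i]*row_size + j] = src[i*row_size + j];
        }
    }
}
```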
Yes, it has to work. As long as the check for the head is correctly added to the update() of the respective inputs it should be good.
An update on this. Edit: prototyped this in #14570. Does not seem worth pursuing as the gains are microscopic.
I tested this on CUDA on various models I had lying around and I see a perf regression on a larger model; not sure if I'm doing something wrong. I cherry-picked this PR on top of #14551 and compared with commit1 = 14551 and commit2 = 14551 + this PR. Also interesting is a 2x(?) speedup on qwen2vl 3B.
To be sure, I ran it again
But before benchmarking, I think you should make sure that the generated results when graph reuse is enabled are coherent. And also, from the results that you posted it seems that there is a lot of variability on your system.
@ggerganov I re-ran with r=100, and tg 64 and 128. I see quite a bit of variability at tg128, but tg64 is pretty tight (<1% variability).
```cpp
// xxxxx-----
// xxxxx-----
// To visualize the mask, see https://github.com/ggml-org/llama.cpp/pull/12615
// TODO: optimize this section
```
Note for later: there is an opportunity to optimize the KQ mask initialization here. For large n_kv, this section becomes measurable.
```cpp
// reset the previous graph result to make sure that it won't be reused
// TODO: change the mctx->apply() to return information if a graph reserve is needed
//       reset the graph result only if the memory module did reset the scheduler
gf_res_prev->reset();
```
I am not sure what would happen if e.g. in a call to kv_self_update the scheduler is reset. Is this detected to prevent reusing the graph?
I added a reset of the previous graph result every time the memory module makes an update to guarantee that if a scheduler reset occurs, we would not attempt to reuse the previous graph.
This is too conservative because sometimes a memory module update does not involve a scheduler reset (for example, a KV cache buffer copy from one stream to another). I will follow up with an update to avoid this overhead per the TODO.
In acaf4b7 I made a relatively significant change. @slaren Might be worth taking another look just in case. Even if not perfect, the results should be correct and graph reuse with split KV cache should now be functional. I will be improving the implementation in a follow-up PR.
llama-lookahead has been broken since PR ggml-org#14482 (July 2025), which changed seq_id validation from the LLAMA_MAX_SEQ constant to the context-specific n_seq_max.

Two lookahead-specific issues:

1. n_seq_max: Lookahead needs W + G + 1 = 31 sequences for parallel Jacobi decoding, but params.n_parallel defaulted to 1. Fix: Set params.n_parallel = W + G + 1 before context creation.
2. KV unified: Batch splitting with coupled sequences requires unified KV cache mode, but lookahead didn't enable it. Fix: Set params.kv_unified = true.

Bug timeline:
- Nov 2023: lookahead.cpp created, worked with LLAMA_MAX_SEQ constant
- July 2025: PR ggml-org#14482 changed to n_seq_max validation, broke lookahead

Note: This PR depends on ggml-org#18729 for the batch init fix (params.n_ctx -> llama_n_ctx). Both PRs are needed for lookahead to fully work.

Tested with Qwen2.5-Coder-0.5B: lookahead generates output with n_accept > 0.

Bug history researched with Claude.
Lookahead decoding requires:
- W + G + 1 = 31 sequences for parallel Jacobi decoding
- Unified KV cache for coupled sequences in batch splitting

These requirements were broken after PR ggml-org#14482 changed validation logic.

Consolidates fix from PR ggml-org#18730 per maintainer request.

Commit message drafted with Claude.
* lookup, lookahead: fix crash when n_ctx not specified

Since PR #16653 (Dec 15, 2025), the default n_ctx is 0 to enable automatic GPU memory fitting. This causes llama-lookup and llama-lookahead to crash when run without an explicit -c flag:

    GGML_ASSERT(batch.seq_id[batch.n_tokens] && "llama_batch size exceeded")

Root cause: Both examples use params.n_ctx directly for batch initialization, but params.n_ctx remains 0 even after the context is properly initialized to n_ctx_train internally.

Bug history:
- Nov 2023: lookahead.cpp created (PR #4207) with params.n_ctx pattern
- Dec 2023: lookup.cpp created (PR #4484) with same pattern
- Nov 2024: default n_ctx changed to 4096 (PR #10136) - bug dormant
- Dec 2025: default n_ctx changed to 0 (PR #16653) - bug activated

The bug was dormant for 2+ years because params.n_ctx defaulted to 512, then 4096. PR #16653 changed it to 0 for GPU auto-fitting, triggering the crash.

Fix: Use llama_n_ctx(ctx) to get the actual runtime context size, matching the pattern already used elsewhere in lookup.cpp (line 72) and in speculative.cpp/speculative-simple.cpp.

Tested: llama-lookup now works without -c flag (12.5% acceptance on Gemma-3-1B).

Note: llama-lookahead has a separate pre-existing issue with sequence initialization (n_seq_max=1 vs W+G+1 needed) that is unrelated to this fix.

* lookahead: fix n_seq_max and kv_unified configuration

Lookahead decoding requires:
- W + G + 1 = 31 sequences for parallel Jacobi decoding
- Unified KV cache for coupled sequences in batch splitting

These requirements were broken after PR #14482 changed validation logic.

Consolidates fix from PR #18730 per maintainer request.

Commit message drafted with Claude.
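The shape of the fix described above can be sketched with a small mock (hypothetical struct mirroring the relevant `common_params` fields, not the real API): the sequence count and unified-KV flag must be set before the context is created, so that n_seq_max is large enough for all W + G + 1 lookahead sequences.

```cpp
#include <cassert>

// Hypothetical mock of the lookahead configuration fix. Field names mirror
// the common_params fields named in the commit message, but this is a
// standalone sketch, not the real llama.cpp API.
struct mock_params {
    int  n_parallel = 1;      // old default: only 1 sequence
    bool kv_unified = false;  // old default: no unified KV cache
};

// W = lookahead window, G = number of verification n-grams
static mock_params configure_lookahead(int W, int G) {
    mock_params p;
    p.n_parallel = W + G + 1; // sequences for parallel Jacobi decoding
    p.kv_unified = true;      // coupled sequences need unified KV
    return p;
}
```

With the default lookahead parameters W = 15 and G = 15 this yields the 31 sequences mentioned in the commit message.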
* llama : reuse compute graphs
* llama-bench : add graph reuse parameter
* cont : remove the parameter and the sched resets
* graph : rename update() to can_reuse()
* params : remove is_same()
* graph : set res->params in llm_graph_context constructor
* graph : avoid set_max_nodes in llm_graph_result
* kv-cache : reuse llama_context's graph result instance
* context : reset the previous graph result upon memory updates
* batch : llama_ubatch now carries its data instead of pointing to balloc
* merge : fix build
* graph : fix can_reuse() checks when flash-attention is disabled
* graph : move llm_graph_result impl in source file + debug env
target #14285
Reuse computation graphs from the previous ubatch when possible. Works with any batch size and any model.
Note

To enable this functionality, there is a temporary requirement for `LLAMA_SET_ROWS=1` to be set in your environment. In the future, this will become the default.

This functionality requires the `ggml_set_rows()` operator to be supported (see #14285). In order to be able to reuse a compute graph, its topology (shapes, strides, parameters, etc.) has to be entirely defined by the set of input tensors (e.g. `inp_embd`, `inp_pos`, `inp_attn`, etc.).

This PR adds logic to update a previous `llm_graph_result` by verifying that the new `llm_graph_params` would result in the same tensor shapes. For this to work, we should no longer preemptively reset the scheduler after processing a batch, so that all buffers from the previous graph remain allocated and ready for reuse in case the new `ubatch` is compatible. See the new `llm_graph_result::update()` method:

https://github.com/ggml-org/llama.cpp/blob/fc4fdf623c098d2b4d7699fdb7f2ea5ae1f63b57/src/llama-graph.h#L506-L525
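The reuse criterion can be sketched minimally as follows (illustrative field names, not the actual `llm_graph_params` layout): a previous graph result is kept only when the new parameters would produce exactly the same tensor shapes, otherwise the graph is rebuilt.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the graph-reuse check. Real llm_graph_params
// compares many more properties (strides, flags, memory context, etc.);
// these three fields just illustrate the idea.
struct mock_graph_params {
    int32_t n_tokens;
    int32_t n_kv;
    int32_t n_outputs;

    bool same_shapes(const mock_graph_params & other) const {
        return n_tokens  == other.n_tokens &&
               n_kv      == other.n_kv     &&
               n_outputs == other.n_outputs;
    }
};

// mirrors the idea behind llm_graph_result::update(): reuse the previous
// graph (and only refresh its inputs) when the shapes match
static bool try_reuse(const mock_graph_params & prev,
                      const mock_graph_params & next) {
    return prev.same_shapes(next);
}
```

During token generation consecutive ubatches typically have identical shapes, which is why this check succeeds often enough to pay off.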
The other change that is needed is a way to swap the `llama_memory_context` of all graph inputs, so that the new call to `llm_graph_result_i::set_inputs()` uses the correct context from the current `ubatch`. This is performed by calling the `llm_graph_input_i::update()` method of all input tensors.

To enable this feature, define the `LLAMA_SET_ROWS` environment variable.

To debug, define `LLAMA_GRAPH_RESULT_DEBUG=2` and add `-lv 1` to the CLI args.

Tests
```shell
LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf -p "I believe the meaning of life is" -n 32 --top-k 1 -fa

LLAMA_SET_ROWS=1 ./bin/llama-parallel -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -np 8 -ns 128 -s 1 -c 4096 -fa -n 128
```

Benchmark on M2 Ultra:
TODO
- `is_same` methods?

Next PRs
- Remove the `llama_graph_result_i` interface - does not seem to have any purpose (graph : refactor context to not pass gf explicitly #14629)
- Stop passing `ggml_cgraph * gf` everywhere. Simply move it to `llm_graph_context` (graph : refactor context to not pass gf explicitly #14629)