Name and Version
version: 9272 (1d7ab2b)
built with GNU 13.3.0 for Linux x86_64
Operating systems
Linux
Which llama.cpp modules do you know to be affected?
llama-server, libllama (core library)
Command line
./build/bin/llama-server -m /path/to/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
--cpu-moe -fa on -c 16384 -b 4096 -ub 4096 \
-ctk q8_0 -ctv q8_0 --kv-unified -np 1
Problem description & steps to reproduce
llama_context::reserve calls graph_reserve(n_tokens, n_seqs, n_tokens, ...). The third argument is n_outputs: the number of token rows that go through lm_head to produce logits, controlled per batch by the caller via batch.logits[i]. Passing n_outputs = n_tokens here sizes the compute buffer for the worst case, where every token in the ubatch produces a logits row.
In chat / completion workloads the caller only flags one token per sequence in batch.logits[], so the executed graph runs at n_outputs = n_seqs, often 1. #23433 (@am17an) made the runtime cheap via inp_out_ids, but the reservation still sizes for the worst case. On large-vocabulary models the over-reservation is significant.
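To make the gap between the two cases concrete, here is a minimal sketch of how n_outputs follows from the per-token flags. `toy_batch`, `count_outputs`, and the two `make_*` helpers are hypothetical stand-ins written for this illustration; they are not the real llama_batch or any llama.cpp function, and the "every token flagged" case is a simplification of what a perplexity-style caller does.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the relevant part of a batch: one flag per token.
struct toy_batch {
    std::vector<int8_t> logits; // non-zero => logits requested for this token
};

// Number of token rows that would reach lm_head for this batch.
inline int32_t count_outputs(const toy_batch & batch) {
    int32_t n = 0;
    for (int8_t flag : batch.logits) {
        n += (flag != 0);
    }
    return n;
}

// Chat-style prompt processing: only the last token of the sequence is flagged.
inline toy_batch make_chat_batch(int32_t n_tokens) {
    assert(n_tokens >= 0);
    toy_batch b;
    b.logits.assign(n_tokens, 0);
    if (n_tokens > 0) {
        b.logits.back() = 1;
    }
    return b;
}

// Worst case (simplified perplexity-style caller): every token is flagged.
inline toy_batch make_all_logits_batch(int32_t n_tokens) {
    assert(n_tokens >= 0);
    toy_batch b;
    b.logits.assign(n_tokens, 1);
    return b;
}
```

With a 4096-token ubatch, `count_outputs(make_chat_batch(4096))` gives 1, while `count_outputs(make_all_logits_batch(4096))` gives 4096, which is the case the reservation currently sizes for.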
A/B with and without a local hack capping n_outputs to n_seqs in graph_reserve, on Qwen3.6-35B-A3B-UD-Q4_K_XL (n_vocab=211072), RTX 3080 10 GB. Peak VRAM sampled at 10 Hz during prompt processing (PP) of a 10001-token prompt:
| run | idle MiB | peak MiB | PP throughput |
|---|---|---|---|
| worst case | 6597 | 6671 | 2737.65 t/s |
| capped | 3637 | 3711 | 2761.82 t/s |
The cap frees 2960 MiB at both idle and peak; PP throughput is unchanged within noise.
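For a sense of scale, the following back-of-the-envelope sketch computes the size of a single logits tensor. It assumes the tensor is F32 with shape [n_vocab, n_outputs] and that the reservation uses n_tokens = 4096 (matching -ub 4096 in the command line above); both are assumptions, and the helper names are hypothetical. It does not model the actual allocator or any other tensor in the graph, so it is not meant to reproduce the measured 2960 MiB exactly.

```cpp
#include <cassert>
#include <cstdint>

// Bytes for one F32 tensor of shape [n_vocab, n_outputs] (assumed logits layout).
inline uint64_t logits_bytes(uint64_t n_vocab, uint64_t n_outputs) {
    assert(n_vocab > 0);
    return n_vocab * n_outputs * sizeof(float);
}

// Convert a byte count to MiB.
inline double to_mib(uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

Under these assumptions, `to_mib(logits_bytes(211072, 4096))` is 3298 MiB, while `to_mib(logits_bytes(211072, 1))` is under 1 MiB. The estimate is of the same order as the measured saving, which is consistent with the logits rows dominating the over-reservation on this model.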
Capping at n_seqs only works for callers that do not want per-token logits. Callers like examples/perplexity that legitimately request many logits per call would break. The llama_decode API lets batch.logits[i] be set on any subset, so graph_reserve cannot prove n_outputs will be small without an explicit signal from the caller. Such a signal would also need to cover MTP / speculative-decoding verify, where the target context produces n_max+1 logits per sequence rather than just 1.
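One possible shape for such a signal, sketched purely for discussion: the caller optionally declares the maximum number of logits rows it will request per sequence, and the reservation falls back to the current worst case when no hint is given. `reserve_n_outputs` and its `max_outputs_per_seq` parameter are hypothetical; nothing like this exists in llama.cpp today, and where the hint would live (context params, a setter, or something else) is left open.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper: pick the n_outputs value to pass to graph_reserve.
//   n_tokens            - ubatch size used for the reservation
//   n_seqs              - number of sequences used for the reservation
//   max_outputs_per_seq - optional caller hint; empty means "no promise"
inline uint32_t reserve_n_outputs(uint32_t n_tokens,
                                  uint32_t n_seqs,
                                  std::optional<uint32_t> max_outputs_per_seq) {
    assert(n_seqs > 0 && n_seqs <= n_tokens);
    if (!max_outputs_per_seq) {
        // No hint: keep today's behavior and size for every token.
        return n_tokens;
    }
    // With a hint: n_seqs * hint rows, never more than the ubatch itself.
    const uint64_t wanted = static_cast<uint64_t>(n_seqs) * *max_outputs_per_seq;
    return static_cast<uint32_t>(std::min<uint64_t>(wanted, n_tokens));
}
```

In this sketch a chat server with -np 1 would pass a hint of 1 and reserve a single row, a speculative-decoding target with n_max = 8 would pass 9, and a perplexity-style caller would pass no hint and keep the current worst-case reservation. A batch that exceeds the declared hint at decode time would still need to be handled, for example by rejecting it or re-reserving; that part is not covered here.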
Related: inp_out_ids skip on lm_head
First Bad Commit
No response
Relevant log output
No response