Misc. bug: Could save ~3 GB VRAM in graph_reserve when caller doesn't need logits (big-vocab models at large ub) #23527

@drauh

Name and Version

version: 9272 (1d7ab2b)
built with GNU 13.3.0 for Linux x86_64

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server, libllama (core library)

Command line

./build/bin/llama-server -m /path/to/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
    --cpu-moe -fa on -c 16384 -b 4096 -ub 4096 \
    -ctk q8_0 -ctv q8_0 --kv-unified -np 1

Problem description & steps to reproduce

llama_context::reserve calls graph_reserve(n_tokens, n_seqs, n_tokens, ...). The third argument is n_outputs: the number of token rows that go through lm_head to produce logits, which the caller controls per batch via batch.logits[i]. Passing n_outputs = n_tokens here sizes the buffer for the worst case, where every token in the ubatch produces a logits row.

In chat / completion workloads the caller only flags one token per sequence in batch.logits[], so the executed graph runs at n_outputs = n_seqs, often 1. #23433 (@am17an) made the runtime cheap via inp_out_ids, but the reservation still sizes for the worst case. On big-vocab models this is significant.
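To make the gap concrete, here is a minimal sketch of how the executed n_outputs follows from the per-token flags. mock_batch is a simplified stand-in for llama_batch (an assumption for illustration; the real struct has more fields), and count_outputs and make_prompt_batch are hypothetical helpers, not llama.cpp functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for the output flags carried by llama_batch
// (assumption: only the per-token logits flags matter for this sketch).
struct mock_batch {
    std::vector<int8_t> logits; // logits[i] != 0 -> token i produces a logits row
};

// Hypothetical helper: number of rows that would pass through lm_head.
inline int32_t count_outputs(const mock_batch & batch) {
    int32_t n_outputs = 0;
    for (int8_t flag : batch.logits) {
        n_outputs += (flag != 0) ? 1 : 0;
    }
    return n_outputs;
}

// Hypothetical helper: a prompt-processing batch for a single sequence,
// with only the last token flagged, as a chat / completion caller would do.
inline mock_batch make_prompt_batch(int32_t n_tokens) {
    assert(n_tokens >= 0);
    mock_batch batch;
    batch.logits.assign(static_cast<std::size_t>(n_tokens), int8_t{0});
    if (n_tokens > 0) {
        batch.logits.back() = 1;
    }
    return batch;
}
```

With a 4096-token ubatch built this way, count_outputs returns 1, while the reservation is sized as if it returned 4096.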

A/B comparison with and without a local hack that caps n_outputs to n_seqs in graph_reserve, on Qwen3.6-35B-A3B-UD-Q4_K_XL (n_vocab=211072), RTX 3080 10 GB. VRAM was sampled at 10 Hz during prompt processing (PP) of a 10001-token prompt:

| run        | idle (MiB) | peak (MiB) | PP throughput (t/s) |
|------------|------------|------------|---------------------|
| worst case | 6597       | 6671       | 2737.65             |
| capped     | 3637       | 3711       | 2761.82             |

2960 MiB freed at idle and peak, PP within noise.
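As a rough sanity check on that figure, the sketch below computes the size of the logits tensor alone for both reservations. It assumes F32 logits (my assumption, not stated above) and uses the n_vocab and ubatch size from this report. logits_bytes and saved_mib are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a logits tensor of n_outputs rows by n_vocab columns,
// assuming F32 elements (assumption for this estimate).
constexpr uint64_t logits_bytes(uint64_t n_vocab, uint64_t n_outputs) {
    return n_vocab * n_outputs * sizeof(float);
}

// Difference between two reservations, in MiB.
inline double saved_mib(uint64_t worst_bytes, uint64_t capped_bytes) {
    assert(worst_bytes >= capped_bytes);
    return static_cast<double>(worst_bytes - capped_bytes) / (1024.0 * 1024.0);
}

// Values from this report: n_vocab = 211072, -ub 4096, -np 1.
constexpr uint64_t k_worst  = logits_bytes(211072, 4096); // n_outputs = n_tokens
constexpr uint64_t k_capped = logits_bytes(211072, 1);    // n_outputs = n_seqs

static_assert(k_worst  == 3458203648ULL, "211072 * 4096 * 4 bytes = 3298 MiB");
static_assert(k_capped == 844288ULL,     "211072 * 1 * 4 bytes, under 1 MiB");
```

saved_mib(k_worst, k_capped) comes out at roughly 3297 MiB, the same order as the measured 2960 MiB. The two need not match: the compute buffer is shared across graph nodes, so removing one large tensor does not necessarily shrink the buffer by that tensor's full size.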

Capping at n_seqs only works for callers that do not want per-token logits. Callers like examples/perplexity that legitimately request many logits per call would break. The llama_decode API lets batch.logits[i] be set on any subset, so graph_reserve cannot prove n_outputs will be small without an explicit signal from the caller. Such a signal would also need to cover MTP / speculative-decoding verify, where the target context produces n_max+1 logits per sequence rather than just 1.
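One possible shape for such a signal, as a sketch only: reserve_hint and reserve_n_outputs below are hypothetical names, not existing llama.cpp API. The caller states how many logits rows per sequence it will ever request, and the reservation falls back to the current worst case when no hint is given.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// HYPOTHETICAL caller-provided hint; no such parameter exists in llama.cpp today.
struct reserve_hint {
    int32_t n_outputs_per_seq; // <= 0 means "no hint": reserve for the worst case
};

// HYPOTHETICAL: pick the n_outputs value to pass as the third argument of graph_reserve.
inline int32_t reserve_n_outputs(int32_t n_tokens, int32_t n_seqs, reserve_hint hint) {
    assert(n_tokens > 0 && n_seqs > 0);
    if (hint.n_outputs_per_seq <= 0) {
        return n_tokens; // current behavior: every token may produce logits
    }
    // Never reserve more rows than the ubatch can hold.
    const int64_t wanted = static_cast<int64_t>(n_seqs) * hint.n_outputs_per_seq;
    return static_cast<int32_t>(std::min<int64_t>(wanted, n_tokens));
}
```

Under this sketch a chat caller would pass 1, a speculative-decoding target context would pass n_max + 1, and a caller like examples/perplexity would pass no hint and keep the worst-case reservation.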

Related:

First Bad Commit

No response

Relevant log output

No response

Labels: bug, enhancement, performance