Misc. bug: Could save ~3 GB VRAM in `graph_reserve` when caller doesn't need logits (big-vocab models at large ub) #23527

### Name and Version

version: 9272 (1d7ab2b94)
built with GNU 13.3.0 for Linux x86_64

### Operating systems

Linux

### Which llama.cpp modules do you know to be affected?

llama-server, libllama (core library)

### Command line

```shell
./build/bin/llama-server -m /path/to/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
    --cpu-moe -fa on -c 16384 -b 4096 -ub 4096 \
    -ctk q8_0 -ctv q8_0 --kv-unified -np 1
```

### Problem description & steps to reproduce

`llama_context::reserve` calls `graph_reserve(n_tokens, n_seqs, n_tokens, ...)`. The third argument is `n_outputs`: the number of token rows that go through `lm_head` to produce logits, controlled per-batch by the caller via `batch.logits[i]`. Passing `n_outputs = n_tokens` here sizes the buffer for the worst case, where every token in the ubatch produces a logits row.

In chat / completion workloads the caller only flags one token per sequence in `batch.logits[]`, so the executed graph runs at `n_outputs = n_seqs`, often 1. #23433 (@am17an) made the runtime cheap via `inp_out_ids`, but the reservation still sizes for the worst case. On big-vocab models this is significant.
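To make the gap between the reserved and the executed `n_outputs` concrete, here is a minimal sketch that counts flagged rows. It uses a plain `std::vector<int8_t>` as a stand-in for the `batch.logits` flags; `count_outputs`, `last_token_only` and `all_tokens` are hypothetical helper names for illustration, not llama.cpp functions.

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in for the per-token output flags of a batch
// (the real field is `batch.logits`); this is NOT the llama.cpp type.
using logits_flags = std::vector<int8_t>;

// Number of rows that would go through lm_head: one per flagged token.
inline int32_t count_outputs(const logits_flags & flags) {
    int32_t n = 0;
    for (int8_t f : flags) {
        n += (f != 0) ? 1 : 0;
    }
    return n;
}

// Chat-style prompt processing: only the last token of the ubatch is flagged.
inline logits_flags last_token_only(int32_t n_tokens) {
    logits_flags flags(n_tokens, 0);
    if (n_tokens > 0) {
        flags.back() = 1;
    }
    return flags;
}

// Worst case assumed by the reservation: every token is flagged.
inline logits_flags all_tokens(int32_t n_tokens) {
    return logits_flags(n_tokens, 1);
}
```

With `-ub 4096` and a single sequence, `count_outputs(last_token_only(4096))` is 1, while the reservation is sized as if it were `count_outputs(all_tokens(4096))`, i.e. 4096.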

A/B comparison with and without a local hack that caps `n_outputs` to `n_seqs` in `graph_reserve`, on Qwen3.6-35B-A3B-UD-Q4_K_XL (n_vocab=211072), RTX 3080 10 GB. VRAM usage was sampled at 10 Hz during prompt processing (PP) of a 10001-token prompt:

| run        | idle MiB | peak MiB | PP throughput |
|------------|---------:|---------:|--------------:|
| worst case |     6597 |     6671 |  2737.65 t/s  |
| capped     |     3637 |     3711 |  2761.82 t/s  |

The cap frees 2960 MiB at both idle and peak, and PP throughput is within noise.
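As a rough sanity check on the numbers above, the size of the logits tensor alone can be estimated from `n_outputs * n_vocab * element size`. The sketch below assumes 4-byte (F32) logits, which is an assumption on my part; `logits_bytes` and `to_mib` are hypothetical helpers, not llama.cpp functions.

```cpp
#include <cstdint>

// Bytes for an [n_outputs x n_vocab] logits tensor.
// ASSUMPTION: 4-byte (F32) elements unless stated otherwise.
constexpr uint64_t logits_bytes(uint64_t n_outputs, uint64_t n_vocab, uint64_t elem_size = 4) {
    return n_outputs * n_vocab * elem_size;
}

// Convert bytes to MiB.
constexpr double to_mib(uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// Worst case from this report: ub = 4096, n_vocab = 211072.
static_assert(logits_bytes(4096, 211072) == 3458203648ULL, "4096 rows of F32 logits");

// Capped case: a single output row.
static_assert(logits_bytes(1, 211072) == 844288ULL, "one row of F32 logits");
```

Under that assumption the worst-case tensor is 3298 MiB and a single row is under 1 MiB. This is the same order of magnitude as the measured 2960 MiB; an exact match is not expected, since the reservation covers the whole graph rather than this one tensor.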

Capping at `n_seqs` only works for callers that do not want per-token logits. Callers like `examples/perplexity` that legitimately request many logits per call would break. The `llama_decode` API lets `batch.logits[i]` be set on any subset, so `graph_reserve` cannot prove `n_outputs` will be small without an explicit signal from the caller. Such a signal would also need to cover MTP / speculative-decoding verify, where the target context produces `n_max+1` logits per sequence rather than just 1.
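One possible shape for such a signal is sketched below, purely as an illustration. Nothing here exists in llama.cpp today: `reserve_n_outputs` and its `max_outputs_per_seq` parameter are hypothetical names, and the negative "unknown" value that falls back to the current worst-case behaviour is my own assumption about how a default could stay backward compatible.

```cpp
#include <algorithm>
#include <cstdint>

// HYPOTHETICAL helper, not part of the llama.cpp API.
//
// max_outputs_per_seq: upper bound on the logits rows the caller will request
// per sequence in one ubatch. A negative value means "unknown" and keeps the
// current worst-case reservation (n_outputs = n_tokens).
inline int32_t reserve_n_outputs(int32_t n_tokens, int32_t n_seqs, int32_t max_outputs_per_seq) {
    if (max_outputs_per_seq < 0) {
        return n_tokens; // no signal from the caller: reserve for the worst case
    }
    // A ubatch can never produce more output rows than it has tokens.
    const int64_t want = static_cast<int64_t>(n_seqs) * max_outputs_per_seq;
    return static_cast<int32_t>(std::min<int64_t>(want, n_tokens));
}
```

Under this sketch, a chat server with `-np 1` would pass 1 and reserve a single row at `ub=4096`, a caller that needs per-token logits would pass a negative value and get today's behaviour, and an MTP / speculative-decoding verify path would pass `n_max + 1`.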

Related:

- #23433 — runtime counterpart, `inp_out_ids` skip on `lm_head`
- #23244 — OOM with MTP on ROCm at ub=4096 (possibly same root cause)


### First Bad Commit

_No response_

### Relevant log output

_No response_
