speculative : fix n_outputs_max and remove draft-simple auto-enable by ggerganov · Pull Request #23988 · ggml-org/llama.cpp

ggerganov · 2026-06-01T18:36:49Z

Overview

Add common_speculative_n_max() helper function to calculate the max number of draft tokens based on speculative parameters
Use this helper in the server to replace n_outputs_max calculation logic
Remove the auto-enable of draft-simple speculative type when a draft model path is specified (users must now explicitly enable it)
Draft context always has n_parallel outputs (simplified)
Log n_outputs_max during llama context initialization
Re-enable server.yml tests in PRs
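
The sizing idea in the list above can be sketched as follows. This is a minimal illustration, not the code from this PR: the struct `spec_config_sketch`, its fields, and all three function names are hypothetical, and the rule that the target context needs one output per drafted token plus one sampled token per slot is an assumption. The only points taken from the description are that a helper derives the maximum number of draft tokens from the speculative parameters, and that the draft context always has `n_parallel` outputs.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the speculative parameters; the real struct in
// common/ is not reproduced here and may look quite different.
struct spec_config_sketch {
    bool    enabled; // whether this speculative type was explicitly enabled
    int32_t n_max;   // max draft tokens this type may propose per step
};

// Sketch of a "max draft tokens" helper: the largest n_max among the
// enabled speculative types, or 0 when nothing is enabled.
inline int32_t speculative_n_max_sketch(const std::vector<spec_config_sketch> & configs) {
    int32_t n_max = 0;
    for (const auto & c : configs) {
        if (c.enabled) {
            n_max = std::max(n_max, c.n_max);
        }
    }
    return n_max;
}

// Assumed sizing rule for the target context: every parallel slot may need
// outputs for all drafted tokens plus the one regularly sampled token.
inline uint32_t target_n_outputs_max_sketch(uint32_t n_parallel, int32_t n_draft_max) {
    const uint32_t n_draft = (uint32_t) std::max<int32_t>(n_draft_max, 0);
    return n_parallel * (1u + n_draft);
}

// Per the PR description, the draft context always has n_parallel outputs.
inline uint32_t draft_n_outputs_max_sketch(uint32_t n_parallel) {
    return n_parallel;
}
```

Under these assumptions, a server with 4 parallel slots and one enabled speculative type with `n_max = 16` would size the target context for 4 * (1 + 16) = 68 outputs and the draft context for 4. With no speculative type enabled, the helper returns 0 and the target context falls back to one output per slot.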

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES. llama.cpp + pi + Qwen3.6-27B

Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function in common/speculative. Assisted-by: llama.cpp:local pi

CISC · 2026-06-01T21:09:48Z

https://github.com/ggml-org/llama.cpp/actions/runs/26776939227/job/78930896518

ServeurpersoCom · 2026-06-01T21:35:12Z

https://github.com/ggml-org/llama.cpp/actions/runs/26776939227/job/78930896518

I launched this on my fork https://github.com/ServeurpersoCom/llama.cpp/actions/runs/26783430730

--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -619,7 +619,9 @@ void llama_context::sched_reserve() {
         //
         // auto * gf = graph_reserve(n_tokens, 1, n_tokens, mctx.get());
         //
-        auto * gf = graph_reserve(n_tokens, n_seqs, n_outputs_pp, mctx.get(), model.hparams.no_alloc);
+        // this final reservation must use the worst case n_tokens outputs so the warmup decode
+        // and any later batch reuse the same graph layout without triggering a reallocation
+        auto * gf = graph_reserve(n_tokens, n_seqs, n_tokens, mctx.get(), model.hparams.no_alloc);
         if (!gf) {
             throw std::runtime_error("failed to allocate compute pp buffers");
         }

pwilkin · 2026-06-01T21:54:41Z

That defeats the VRAM savings though, doesn't it? I think the proper way to do this is instead for warmup to restrict itself to n_outputs_pp.

ServeurpersoCom · 2026-06-02T06:38:20Z

Patch A: Re-enables sched_need_reserve = true in set_warmup, so entering warmup re-reserves the graph with all experts active and fixes the reallocation abort on MoE models. One line, green CI and persistent VRAM preserved, but a 7248 MiB transient spike during warmup.
https://github.com/ServeurpersoCom/llama.cpp/actions/runs/26799798196

Patch B: Adds A plus an n_tokens override in sched_reserve, so the warmup re-reserve uses the actual batch size (2 tokens) instead of the full n_ubatch worst case. Same fix, same persistent VRAM, but the transient spike drops to 18 MiB.
https://github.com/ServeurpersoCom/llama.cpp/actions/runs/26802937595
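
The difference between the two patches can be modelled with a toy context. This is only a sketch of the control flow described above, not the actual `llama_context` code: `toy_context`, its bookkeeping fields, and the `n_tokens_override` parameter are hypothetical, and only the names `set_warmup`, `sched_reserve`, `sched_need_reserve` and `n_ubatch` come from the comments.

```cpp
#include <algorithm>
#include <cstdint>

// Toy model only: NOT the real llama_context. Names other than set_warmup,
// sched_reserve, sched_need_reserve and n_ubatch are hypothetical.
struct toy_context {
    uint32_t n_ubatch;                      // worst-case micro-batch size
    bool     warmup               = false;  // current warmup state
    bool     sched_need_reserve   = true;   // a (re-)reserve is pending
    uint32_t reserved_n_tokens    = 0;      // token count used by the last reserve
    bool     reserved_all_experts = false;  // last reserve assumed all experts active
    uint32_t n_reserves           = 0;      // how many reserves actually ran

    explicit toy_context(uint32_t n_ubatch_) : n_ubatch(n_ubatch_) {}

    // Patch A idea: changing the warmup state invalidates the reservation,
    // so the next reserve is done for the graph that will actually run.
    void set_warmup(bool value) {
        warmup             = value;
        sched_need_reserve = true;
    }

    // Patch B idea: allow the caller to reserve for the actual batch size.
    // n_tokens_override == 0 means "use the n_ubatch worst case".
    void sched_reserve(uint32_t n_tokens_override = 0) {
        if (!sched_need_reserve) {
            return; // reservation is still valid, nothing to do
        }
        reserved_n_tokens    = n_tokens_override > 0
                             ? std::min(n_tokens_override, n_ubatch)
                             : n_ubatch;
        reserved_all_experts = warmup;
        sched_need_reserve   = false;
        n_reserves++;
    }
};
```

In this model, constructing `toy_context ctx(512)` and calling `ctx.sched_reserve()` reserves for 512 tokens with the standard graph. Calling `ctx.set_warmup(true)` followed by `ctx.sched_reserve(2)` then re-reserves for 2 tokens with all experts active, which corresponds to the small transient spike reported for Patch B. Calling `ctx.sched_reserve()` without the override after `set_warmup(true)` corresponds to Patch A, where the re-reserve uses the full `n_ubatch` worst case.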

ggerganov · 2026-06-02T06:57:44Z

#24009 should do the job. The warmup mechanism on master is basically incompatible with the reserve logic: during warmup we technically use a different (much larger) graph, while we reserve using the standard (small) graph.

git merge's 3-way resolver did not flag two semantic duplicates in tools/server/server-context.cpp because the merge base did not contain either symbol. The duplicate bodies are byte-identical, so removing the second copy of each pair is semantically equivalent.

Removed:
- `const bool near_prompt_end` declaration at line 3653 (upstream side, e2ef8fe, "server: fix checkpoints creation" PR ggml-org#22929).
- `static uint32_t server_n_outputs_max(...)` body at lines 219-232 (upstream side, de6f727, "llama: limit max outputs of llama_context" PR ggml-org#23861; one line modified by 5dcb711, "speculative: fix n_outputs_max and remove draft-simple auto-enable" PR ggml-org#23988).

Kept:
- The cache-side copies (72cfbcd), which match the cache-optimization chain the Stage 11 work is built on.

…gml-org#23988)

* speculative : add common_speculative_n_max helper function

  Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function in common/speculative. Assisted-by: llama.cpp:local pi

* cont : draft context always has n_parallel outputs
* llama : log n_outputs_max
* speculative : remove draft-simple auto-enable
* ci : enable server tests on PRs

(cherry picked from commit 5dcb711)

Integration glue so the upstream MTP lineage (ggml-org#23198..ggml-org#23398) builds on this fork without disturbing TurboQuant+ or the custom kernels:

- llama_kv_cache ctor: thread the new `hparams` param and `layer_share_cb` through all call sites (iswa, memory-hybrid, dsa, model.cpp); keep the fork's turbo auto-asymmetric K upgrade, n_layer_kv() sizing (+3 rotation tensors), and per-side LLAMA_ATTN_ROT_* policy (default OFF), now nested under the new `if (other) { share } else { ... }` KV-sharing branch.
- hparams: carry n_layer_all/n_layer_nextn + n_layer()/n_layer_kv() from the refactor while keeping the fork's n_layer_kv_from_start; restore the swa_layers->is_swa_impl / recurrent_layer_arr->is_recr_impl / nextn_predict_layers->n_layer_nextn renames across fork models.
- add n_outputs_max to cparams / common_params / llama_context_params and wire it through; restore deepstack_mapping_arr.
- server: keep the ggml-org#23398 ctx_other (MTP draft KV-sharing) wiring; drop the ggml-org#23988 --fit VRAM pre-estimation block (depends on upstream helpers not on this fork; MTP does not need it).
- drop upstream-only models pulled in by the refactor (deepseek32, mellum, talkie); keep non-MTP fork models on their own source + mechanical refactor.

Builds clean on Metal; turbo quant unit test passes (turbo2/3/4 round-trip). Kernels (ggml-cuda / ggml-metal) untouched.

ggerganov added 4 commits June 1, 2026 21:16

speculative : add common_speculative_n_max helper function

8c41b75

Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function in common/speculative. Assisted-by: llama.cpp:local pi

cont : draft context always has n_parallel outputs

a808e89

llama : log n_outputs_max

016191d

speculative : remove draft-simple auto-enable

6476b67

ggerganov added the refactoring Refactoring label Jun 1, 2026

github-actions Bot added examples server labels Jun 1, 2026

ci : enable server tests on PRs

2f6f998

github-actions Bot added the devops improvements to build systems and github actions label Jun 1, 2026

ServeurpersoCom mentioned this pull request Jun 1, 2026

server: fix n_outputs_max sizing for implicit draft-simple #23987

Closed

ggerganov marked this pull request as ready for review June 1, 2026 19:03

ggerganov requested review from a team as code owners June 1, 2026 19:03

ggerganov merged commit 5dcb711 into master Jun 1, 2026
14 of 25 checks passed

ggerganov deleted the gg/spec-fix-n-max branch June 1, 2026 19:27

porkloin mentioned this pull request Jun 3, 2026

Eval bug: Vulkan: performance drop in recent builds #24066

Open

TheTom mentioned this pull request Jun 8, 2026

Gemma 4 MTP: bring in the upstream MTP lineage (qwen35 post-norm + gemma4) on TurboQuant+ TheTom/llama-cpp-turboquant#172

Merged

ggerganov merged 5 commits into master from gg/spec-fix-n-max
