speculative : fix n_outputs_max and remove draft-simple auto-enable #23988
Extract the speculative max-draft-size logic from server_n_outputs_max into a reusable common_speculative_n_max() function in common/speculative. Assisted-by: llama.cpp:local pi
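As a rough illustration of the extraction described above, the sketch below separates "what is the largest draft any enabled method can produce" from "how many outputs the context must reserve". It is not the code from this PR: `spec_impl_sketch`, `speculative_n_max_sketch`, and `n_outputs_max_sketch` are hypothetical names, and the sizing rule (one output per draft token plus one, per parallel sequence) is an assumption about how such a limit is typically derived, not something the PR text confirms.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// HYPOTHETICAL stand-in for one speculative implementation's settings.
// The real parameter types in common/speculative are not shown in this PR text.
struct spec_impl_sketch {
    bool    enabled; // whether this drafting method is active
    int32_t n_max;   // maximum number of draft tokens it may propose
};

// HYPOTHETICAL analogue of common_speculative_n_max(): the largest draft size
// over all enabled implementations, or 0 when none is enabled.
static int32_t speculative_n_max_sketch(const std::vector<spec_impl_sketch> & impls) {
    int32_t n_max = 0;
    for (const auto & impl : impls) {
        if (impl.enabled) {
            n_max = std::max(n_max, impl.n_max);
        }
    }
    return n_max;
}

// ASSUMPTION: verifying a draft needs one output per draft token plus one for
// the token sampled after it, for each of the n_parallel sequences.
static uint32_t n_outputs_max_sketch(uint32_t n_parallel, int32_t n_draft_max) {
    const uint32_t n_draft = n_draft_max > 0 ? static_cast<uint32_t>(n_draft_max) : 0u;
    return n_parallel * (n_draft + 1u);
}
```

Under these assumptions, a server with 4 parallel slots and a largest enabled draft size of 8 would reserve 4 * (8 + 1) = 36 outputs; keeping the draft-size query in a shared helper lets other callers reuse it without depending on server code.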
I launched this on my fork: https://github.com/ServeurpersoCom/llama.cpp/actions/runs/26783430730
That defeats the VRAM savings though, doesn't it? I think the proper way to do this is instead for warmup to restrict itself to |
Patch A: Re-enables `sched_need_reserve = true` in `set_warmup`, so entering warmup re-reserves the graph with all experts active and fixes the reallocation abort on MoE models. It is a one-line change with green CI and persistent VRAM usage preserved, at the cost of a 7248 MiB transient spike during warmup.
The #24009 should do the job. The warmup mechanism on |
git merge's 3-way resolver did not flag two semantic duplicates in
tools/server/server-context.cpp because the merge base did not contain
either symbol. The duplicate bodies are byte-identical, so removing
the second copy of each pair is semantically equivalent.
Removed:
- `const bool near_prompt_end` declaration at line 3653 (upstream
side, e2ef8fe, "server: fix checkpoints creation" PR ggml-org#22929).
- `static uint32_t server_n_outputs_max(...)` body at lines 219-232
(upstream side, de6f727, "llama: limit max outputs of
llama_context" PR ggml-org#23861; one line modified by 5dcb711,
"speculative: fix n_outputs_max and remove draft-simple auto-enable"
PR ggml-org#23988).
Kept:
- The cache-side copies (72cfbcd), which match the
cache-optimization chain the Stage 11 work is built on.
…gml-org#23988)

* speculative : add common_speculative_n_max helper function

  Extract the speculative max-draft-size logic from server_n_outputs_max
  into a reusable common_speculative_n_max() function in common/speculative.

  Assisted-by: llama.cpp:local pi

* cont : draft context always has n_parallel outputs
* llama : log n_outputs_max
* speculative : remove draft-simple auto-enable
* ci : enable server tests on PRs

(cherry picked from commit 5dcb711)
Integration glue so the upstream MTP lineage (ggml-org#23198..ggml-org#23398)
builds on this fork without disturbing TurboQuant+ or the custom kernels:
- llama_kv_cache ctor: thread the new `hparams` param and `layer_share_cb`
  through all call sites (iswa, memory-hybrid, dsa, model.cpp); keep the
  fork's turbo auto-asymmetric K upgrade, n_layer_kv() sizing (+3 rotation
  tensors), and per-side LLAMA_ATTN_ROT_* policy (default OFF), now nested
  under the new `if (other) { share } else { ... }` KV-sharing branch.
- hparams: carry n_layer_all/n_layer_nextn + n_layer()/n_layer_kv() from
  the refactor while keeping the fork's n_layer_kv_from_start; restore the
  swa_layers->is_swa_impl / recurrent_layer_arr->is_recr_impl /
  nextn_predict_layers->n_layer_nextn renames across fork models.
- add n_outputs_max to cparams / common_params / llama_context_params and
  wire it through; restore deepstack_mapping_arr.
- server: keep the ggml-org#23398 ctx_other (MTP draft KV-sharing) wiring;
  drop the ggml-org#23988 --fit VRAM pre-estimation block (depends on
  upstream helpers not on this fork; MTP does not need it).
- drop upstream-only models pulled in by the refactor (deepseek32, mellum,
  talkie); keep non-MTP fork models on their own source + mechanical
  refactor.
Builds clean on Metal; turbo quant unit test passes (turbo2/3/4
round-trip). Kernels (ggml-cuda / ggml-metal) untouched.
Overview
cont #23861
- Add `common_speculative_n_max()` helper function to calculate the max number of draft tokens based on speculative parameters
- Fix `n_outputs_max` calculation logic
- Draft context always has `n_parallel` outputs (simplified)
- Log `n_outputs_max` during llama context initialization
- Enable `server.yml` tests in PRs
Requirements