B1 mtp qwen rebase #11
Merged
Currently, speculative checkpointing has to restart from a checkpoint whenever some draft tokens are rejected, which wastes work because the target has to be run again. This PR adds the ability to roll back up to `draft_max` tokens by storing the GDN intermediates.
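A minimal sketch of the rollback idea, not the code in this PR: `RecurrentState` and `IntermediateStore` are hypothetical names, and the real GDN intermediates live in backend buffers, not in a `std::vector`. The assumption is only that one state is kept per verified draft position, so a partial rejection can pick the state after the last accepted token without re-running the target.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the recurrent (GDN-style) state of one sequence.
struct RecurrentState {
    std::vector<float> data;
};

// Keeps the state before the draft batch plus one state per draft token,
// i.e. at most draft_max + 1 entries.
class IntermediateStore {
public:
    explicit IntermediateStore(std::size_t draft_max) : draft_max_(draft_max) {}

    // Start a verification batch with the state before any draft token.
    void begin(const RecurrentState & base) {
        states_.clear();
        states_.push_back(base);
    }

    // Record the state after the target has processed one more draft token.
    void push(const RecurrentState & s) {
        assert(states_.size() <= draft_max_ && "more draft tokens than draft_max");
        states_.push_back(s);
    }

    // State to resume from when only the first n_accepted draft tokens were accepted.
    const RecurrentState & rollback(std::size_t n_accepted) const {
        assert(n_accepted < states_.size());
        return states_[n_accepted];
    }

private:
    std::size_t draft_max_;
    std::vector<RecurrentState> states_;
};
```

With `draft_max = 2`, calling `begin` once and `push` twice leaves three entries; `rollback(1)` then returns the state after the first draft token, which is where decoding resumes if the second one is rejected.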
Recovery snapshot from agent transcript replay after accidental `git checkout` of working-tree-only NextN changes.

Build is clean, but NextN inference is broken: target argmax produces garbage tokens when `cparams.embeddings_pre_norm=true` is paired with the server-side `nextn_prefill_all_outputs` + `do_checkpoint=false` overrides.

Diagnostic findings (recorded in `.scratch/diag-logs/`):
- baseline (SPEC=off): tps=21, coherent
- nextn pre-fix (all NextN flags on): tps=4, garbage
- step A (prime loop off): tps=4, garbage (prime not the cause)
- step B (server overrides off): tps=12, ~coherent (main cause)

Safety net before next iteration; no fix applied yet.
Two server-context fixes required to make NextN speculative decoding produce coherent target output and a small acceptance speedup on macOS Metal. Pairs with cherry-pick 8ce2b9e (upstream Metal GDN keep_intermediates=true), which is the actual root cause for the garbage logits.

* NextN draft must NOT flip cparams.embeddings=true on the target context. Doing so reroutes the target graph to emit pooled/embedding outputs in place of vocab logits and breaks sampling for every generated token. NextN has its own pre-norm channel (llama_set_embeddings_pre_norm + llama_get_embeddings_pre_norm_ith); only Gemma 4 MTP needs the embeddings flag.
* Skip override_arch when --model-draft points at a standalone *_NEXTN_ONLY.gguf whose general.architecture is already qwen35_mtp / qwen35moe_mtp. Avoids the double-mmap of the target file when target and draft are different paths.

Also adds scripts/extract-qwen36-nextn-gguf.py to produce a self-contained NextN draft GGUF from a combined *_MTP.gguf for the separate-draft path.

Verified on Qwen3.6-27B-UD-Q4_K_XL_MTP + Q4_K_XL_NEXTN_ONLY draft on Apple Silicon: 24.4 tok/s with acc=87.5% (DM=2) vs 20.85 tok/s no-spec baseline.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ocessing

This update introduces an asynchronous worker thread to enhance the NextN speculative decoding process. The worker overlaps draft computation with server-side token processing, improving efficiency.

Key changes include:
- Added a worker thread managed by a mutex and condition variable for handling draft requests.
- Implemented a pipeline mechanism that allows the system to return results from previous drafts while processing new requests.
- Introduced environment variable control for enabling/disabling the pipeline.

This enhancement aims to optimize performance and reduce latency in generating coherent outputs during NextN processing.
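The worker-plus-pipeline pattern described in this commit can be sketched roughly as below. This is an illustration under assumptions, not the server code: `DraftWorker`, `submit`, `flush` and the `Tokens` alias are hypothetical names, the draft model is replaced by a plain callback, and the environment-variable switch is left out because its name is not given here.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <optional>
#include <thread>
#include <utility>
#include <vector>

using Tokens = std::vector<int>;

// One background thread computes drafts; the caller collects the previous
// draft when it submits the next request, so the two sides overlap.
class DraftWorker {
public:
    explicit DraftWorker(std::function<Tokens(int)> draft_fn)
        : draft_fn_(std::move(draft_fn)) {
        assert(draft_fn_ != nullptr);
        thread_ = std::thread([this] { run(); });
    }

    ~DraftWorker() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        if (thread_.joinable()) thread_.join();
    }

    // Queue a new request; returns the result of the previous one, if any.
    std::optional<Tokens> submit(int last_token) {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return !busy_; });  // wait for in-flight draft
        std::optional<Tokens> prev = std::move(result_);
        result_.reset();
        request_ = last_token;
        busy_ = true;
        lock.unlock();
        cv_.notify_all();
        return prev;
    }

    // Block until the in-flight request finishes and return its result.
    std::optional<Tokens> flush() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return !busy_; });
        std::optional<Tokens> out = std::move(result_);
        result_.reset();
        return out;
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mu_);
        for (;;) {
            cv_.wait(lock, [this] { return stop_ || busy_; });
            if (stop_) return;
            const int req = request_;
            lock.unlock();
            Tokens out = draft_fn_(req);  // draft runs without holding the lock
            lock.lock();
            result_ = std::move(out);
            busy_ = false;
            cv_.notify_all();
        }
    }

    std::function<Tokens(int)> draft_fn_;
    std::mutex mu_;
    std::condition_variable cv_;
    std::optional<Tokens> result_;
    int request_ = 0;
    bool busy_ = false;
    bool stop_ = false;
    std::thread thread_;
};
```

The first `submit` returns an empty optional because nothing has been drafted yet; each later call returns the draft for the previous request, and `flush` drains the last one. Because the result always lags one request behind, the caller has to check that a returned draft still matches the tokens it actually sampled.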
… GGUF)
Previously, NextN draft contexts loaded a second llama_model from the same
combined *_MTP.gguf with override_arch=qwen35*_mtp. On Apple Silicon (mmap=true)
each llama_model creates its own MTLBuffer covering the full file, so the 22 GB
Qwen3.6-35B-A3B target was mapped twice (~44 GB) and OOMed on M4 Max (38 GB
unified memory): kIOGPUCommandBufferCallbackErrorOutOfMemory.
The target model already loads the NextN-layer tensors into its own layer table
(see LLM_ARCH_QWEN35{,MOE} loaders, `layer.nextn.*` on `i >= n_layer -
nextn_predict_layers`). The draft context can reuse them directly:
- Add cparams.nextn_draft + llama_context_params.nextn_draft (default false).
- Add LLM_GRAPH_TYPE_NEXTN; llama_context routes decode/reserve through this
gtype when cparams.nextn_draft=true.
- In llama_model::build_graph, dispatch QWEN35 / QWEN35MOE + gtype=NEXTN to
the existing llm_build_qwen35*_nextn builders (graphs unchanged otherwise;
swap build_attn_inp_kv() → build_inp_mem_hybrid()->get_attn() because the
target's memory is hybrid attn+GDN, not pure KV — pure-KV cast was UB).
- llama_context ctor temporarily flips hparams.kv_only_nextn=true around
create_memory() so the draft's KV cache only allocates cells for the
NextN layer; the target context (constructed earlier) keeps its own KV
layout via the per-memory hparams copy.
- llama_context::graph_params hands a tweaked hparams_eff to the graph
builder for draft contexts so has_kv routes correctly.
- llm_graph_input_mem_hybrid::set_input: guard the recurrent-state s_copy
backend buffer; NextN graphs never reference it, so the scheduler doesn't
allocate one.
- server-context.cpp: when target has NextN tensors and --model-draft points
at the same file, set speculative.model_dft = model and
cparams_dft.nextn_draft = true (no llama_model_load_from_file). The legacy
standalone NEXTN_ONLY GGUF path is preserved for users shipping the draft
head as a separate artifact.
- common_speculative_are_compatible_nextn accepts model_tgt == model_dft for
the shared-model path.
- Public API: llama_model_has_nextn_layer / llama_model_n_nextn_predict_layers.
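The temporary flip of hparams.kv_only_nextn around create_memory() in the list above is a scoped override. A minimal sketch of that pattern follows, assuming a reduced `HParams` struct and a hypothetical `layers_with_kv_cells` helper standing in for create_memory(); the real target memory is hybrid attn+GDN, so counting every layer as a KV layer is a simplification made only for illustration.

```cpp
#include <cassert>
#include <utility>

// Sets a value for the lifetime of the guard and restores the old one afterwards.
template <typename T>
class ScopedOverride {
public:
    ScopedOverride(T & target, T value) : target_(target), saved_(target) {
        target_ = std::move(value);
    }
    ~ScopedOverride() { target_ = std::move(saved_); }

    ScopedOverride(const ScopedOverride &) = delete;
    ScopedOverride & operator=(const ScopedOverride &) = delete;

private:
    T & target_;
    T saved_;
};

// Hypothetical reduced hparams; only the flag discussed above.
struct HParams {
    bool kv_only_nextn = false;
};

// Hypothetical stand-in for create_memory(): how many layers get KV cells.
// Simplification: treats every layer as a KV layer when the flag is off.
int layers_with_kv_cells(const HParams & hp, int n_layer, int n_nextn) {
    assert(n_nextn <= n_layer);
    return hp.kv_only_nextn ? n_nextn : n_layer;
}
```

Wrapping the call in a block with `ScopedOverride<bool> guard(hp.kv_only_nextn, true);` makes the draft allocate cells for the NextN layer only, and the flag reverts to its previous value when the guard goes out of scope, including on early return or exception.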
Benchmarks (Apple M4 Max, Metal, prompt ~50 tokens, --draft-max=2 --draft-min=1,
ctx=8192, median of 3 runs; full table in NEXTN.md §7):
qwen-35B-A3B MoE f16-nextn long=512: 83.63 tps (+20.7% vs f16-base 69.30)
qwen-35B-A3B MoE turbo3-nextn long=512: 78.41 tps (+26.5% vs turbo3-base 61.97)
35B-A3B MoE no longer OOMs (one MTL0_Mapped buffer = 21784 MiB instead of two).
27B dense remains draft-compute-bound (NextN layer = full transformer block,
t_draft ≈ 2.6× t_verify on dense); async pipeline can't fully overlap, so
NextN is at parity on f16 and ~-12% on turbo3 (turbo3 dequant inside NextN
attention adds ~7% draft compute). Documented as known limitation in NEXTN.md §7.
Co-authored-by: Cursor <cursoragent@cursor.com>
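The percentages and the mapping footprint quoted in the commit above can be re-derived with a few lines of arithmetic. The helper names below are hypothetical and the numbers are the ones from the message; nothing here measures anything.

```cpp
#include <cassert>

// Relative throughput change of `value` over `base`, in percent.
double percent_gain(double base, double value) {
    assert(base > 0.0);
    return (value / base - 1.0) * 100.0;
}

// Total mapped size in MiB when each of n_models maps the whole file.
long long mapped_mib(long long file_mib, int n_models) {
    assert(n_models >= 0);
    return file_mib * n_models;
}
```

`percent_gain(69.30, 83.63)` and `percent_gain(61.97, 78.41)` come out at about 20.7 and 26.5, matching the two MoE rows. `mapped_mib(21784, 2)` is 43568 MiB for the old two-model path against 21784 MiB for the shared model, which is consistent with the OOM on a machine with 38 GB of unified memory.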
…model

This commit introduces a new script, `sanity-27b-turbo3-base.sh`, which performs a cold-start sanity check for the 27B turbo3-base model. The script measures throughput against a historical baseline of ~18.4 TPS to identify potential thermal or code regressions. Key features include server initialization, health checks, and performance measurement over multiple runs with a predefined prompt. The script aims to ensure the model's operational integrity and performance consistency.
…ma 4 MTP

- Updated benchmark logs for Qwen 3.6 NextN, showing improved throughput of +24-36% on MoE targets and +5-7% on dense models.
- Revised performance notes in NEXTN.md to reflect shared-model draft path optimizations, eliminating the need for a second mmap.
- Enhanced README.md with new Qwen 3.6 NextN features and usage instructions, including shared model configurations.
- Adjusted MTP.md to include updated matrix benchmarks and observations for Gemma 4, highlighting performance gains and acceptance rates.
- Improved clarity in documentation regarding the integration of TurboQuant KV with NextN for optimal performance.