B1 mtp qwen rebase #11

Merged
Ooooze merged 11 commits into feature/turboquant-kv-cache from b1-mtp-qwen-rebase
May 12, 2026

Ooooze commented May 12, 2026

Overview

Additional information

Requirements

am17an and others added 11 commits May 8, 2026 14:30
Currently, speculative checkpointing has to restart from a checkpoint
when some draft tokens are rejected, which wastes work because the
target has to run again. This PR adds the ability to roll back up to
`draft_max` tokens by storing the GDN intermediates.
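
The rollback idea can be illustrated with a toy recurrent state. This is only a sketch under stated assumptions: `ToyRecurrentState`, `begin_batch`, `step`, `rollback`, and `replay` are hypothetical names, and the scalar recurrence merely stands in for the real GDN state, which the PR does not show here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a recurrent (GDN-like) layer state.
// Hypothetical illustration, not the PR's data structures.
struct ToyRecurrentState {
    float s;                           // current state
    std::vector<float> intermediates;  // [0] = state before the batch, [k] = state after k tokens

    // Start a verification batch and remember the pre-batch state.
    void begin_batch() {
        intermediates.clear();
        intermediates.push_back(s);
    }

    // Consume one token and store the resulting intermediate state.
    void step(float x) {
        s = 0.5f * s + x;  // arbitrary toy recurrence
        intermediates.push_back(s);
    }

    // Keep only the first n_accepted tokens of the batch, without re-running them.
    void rollback(std::size_t n_accepted) {
        assert(n_accepted < intermediates.size());
        s = intermediates[n_accepted];
        intermediates.resize(n_accepted + 1);
    }
};

// Reference path: recompute the state from s0 over the first n tokens.
inline float replay(float s0, const std::vector<float> & xs, std::size_t n) {
    float s = s0;
    for (std::size_t i = 0; i < n; ++i) {
        s = 0.5f * s + xs[i];
    }
    return s;
}
```

Restoring a stored intermediate gives the same state as replaying the accepted tokens from the checkpoint, at the cost of keeping up to `draft_max` extra states in memory.
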
Recovery snapshot from agent transcript replay after accidental
`git checkout` of working-tree-only NextN changes. Build is clean,
but NextN inference is broken: target argmax produces garbage tokens
when `cparams.embeddings_pre_norm=true` is paired with the server-side
`nextn_prefill_all_outputs` + `do_checkpoint=false` overrides.

Diagnostic findings (recorded in `.scratch/diag-logs/`):
- baseline (SPEC=off):                tps=21, coherent
- nextn pre-fix (all NextN flags on): tps=4,  garbage
- step A (prime loop off):            tps=4,  garbage     (prime not the cause)
- step B (server overrides off):      tps=12, ~coherent   (main cause)

Safety net before next iteration; no fix applied yet.
Two server-context fixes are required to make NextN speculative decoding
produce coherent target output and a small acceptance-driven speedup on macOS
Metal. They pair with cherry-pick 8ce2b9e (upstream Metal GDN
keep_intermediates=true), which fixes the actual root cause of the
garbage logits.

* NextN draft must NOT flip cparams.embeddings=true on the target
  context. Doing so reroutes the target graph to emit pooled/embedding
  outputs in place of vocab logits and breaks sampling for every
  generated token. NextN has its own pre-norm channel
  (llama_set_embeddings_pre_norm + llama_get_embeddings_pre_norm_ith);
  only Gemma 4 MTP needs the embeddings flag.

* Skip override_arch when --model-draft points at a standalone
  *_NEXTN_ONLY.gguf whose general.architecture is already
  qwen35_mtp / qwen35moe_mtp. Avoids the double-mmap of the target
  file when target and draft are different paths.
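
A minimal sketch of the second rule, assuming the base architecture strings are `qwen35` and `qwen35moe` (inferred from the enum names, not confirmed by this PR). `pick_override_arch` is a hypothetical helper, not a function from the server code.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: return the override_arch value to apply to a draft GGUF,
// or an empty string when no override should be applied.
inline std::string pick_override_arch(const std::string & general_architecture) {
    // Standalone NEXTN_ONLY files already carry an MTP architecture: skip the override.
    if (general_architecture == "qwen35_mtp" || general_architecture == "qwen35moe_mtp") {
        return "";
    }
    // Combined files carry the base architecture (assumed names) and need the override.
    if (general_architecture == "qwen35") {
        return "qwen35_mtp";
    }
    if (general_architecture == "qwen35moe") {
        return "qwen35moe_mtp";
    }
    return "";  // unrelated architecture: leave untouched
}
```
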

Also adds scripts/extract-qwen36-nextn-gguf.py to produce a self-
contained NextN draft GGUF from a combined *_MTP.gguf for the
separate-draft path.

Verified on Qwen3.6-27B-UD-Q4_K_XL_MTP + Q4_K_XL_NEXTN_ONLY draft on
Apple Silicon: 24.4 tok/s with acc=87.5% (DM=2) vs 20.85 tok/s
no-spec baseline.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ocessing

This update adds an asynchronous worker thread to NextN speculative decoding. The worker overlaps draft computation with server-side token processing. Key changes:

- Added a worker thread, synchronized with a mutex and a condition variable, that handles draft requests.
- Implemented a pipeline that returns the result of the previous draft while a new request is being processed.
- Added an environment variable that enables or disables the pipeline.

The goal is to reduce latency during NextN processing without affecting output coherence.
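
The worker pattern described above can be sketched with a single-slot request queue. This is an illustration under stated assumptions: `DraftWorker`, `submit`, and `wait_result` are hypothetical names, requests and results are plain integers, and the real server code may be structured differently.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical single-slot draft worker; not the PR's implementation.
class DraftWorker {
public:
    explicit DraftWorker(std::function<int(int)> draft_fn)
        : draft_fn_(std::move(draft_fn)), thread_([this] { run(); }) {}

    ~DraftWorker() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        thread_.join();
    }

    // Hand one request to the worker and return immediately.
    // Only one request may be in flight at a time.
    void submit(int request) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            assert(!busy_);
            request_ = request;
            has_request_ = true;
            busy_ = true;
        }
        cv_.notify_all();
    }

    // Block until the in-flight request (if any) has finished, then return its result.
    int wait_result() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return !busy_; });
        return result_;
    }

private:
    void run() {
        std::unique_lock<std::mutex> lock(mu_);
        for (;;) {
            cv_.wait(lock, [this] { return stop_ || has_request_; });
            if (stop_) {
                return;
            }
            const int request = request_;
            has_request_ = false;
            lock.unlock();
            const int result = draft_fn_(request);  // heavy work runs outside the lock
            lock.lock();
            result_ = result;
            busy_ = false;
            cv_.notify_all();
        }
    }

    std::function<int(int)> draft_fn_;
    std::mutex mu_;
    std::condition_variable cv_;
    int request_ = 0;
    int result_ = 0;
    bool has_request_ = false;
    bool busy_ = false;
    bool stop_ = false;
    std::thread thread_;  // declared last so every other member exists before the thread starts
};
```

The caller submits a draft request, continues with its own token processing, and calls `wait_result()` only when it needs the draft, which is where the overlap comes from.
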
… GGUF)

Previously, NextN draft contexts loaded a second llama_model from the same
combined *_MTP.gguf with override_arch=qwen35*_mtp. On Apple Silicon (mmap=true)
each llama_model creates its own MTLBuffer covering the full file, so the 22 GB
Qwen3.6-35B-A3B target was mapped twice (~44 GB) and OOMed on M4 Max (38 GB
unified memory): kIOGPUCommandBufferCallbackErrorOutOfMemory.

The target model already loads the NextN-layer tensors into its own layer table
(see LLM_ARCH_QWEN35{,MOE} loaders, `layer.nextn.*` on `i >= n_layer -
nextn_predict_layers`). The draft context can reuse them directly:

  - Add cparams.nextn_draft + llama_context_params.nextn_draft (default false).
  - Add LLM_GRAPH_TYPE_NEXTN; llama_context routes decode/reserve through this
    gtype when cparams.nextn_draft=true.
  - In llama_model::build_graph, dispatch QWEN35 / QWEN35MOE + gtype=NEXTN to
    the existing llm_build_qwen35*_nextn builders (graphs unchanged otherwise;
    swap build_attn_inp_kv() → build_inp_mem_hybrid()->get_attn() because the
    target's memory is hybrid attn+GDN, not pure KV — pure-KV cast was UB).
  - llama_context ctor temporarily flips hparams.kv_only_nextn=true around
    create_memory() so the draft's KV cache only allocates cells for the
    NextN layer; the target context (constructed earlier) keeps its own KV
    layout via the per-memory hparams copy.
  - llama_context::graph_params hands a tweaked hparams_eff to the graph
    builder for draft contexts so has_kv routes correctly.
  - llm_graph_input_mem_hybrid::set_input: guard the recurrent-state s_copy
    backend buffer; NextN graphs never reference it, so the scheduler doesn't
    allocate one.
  - server-context.cpp: when target has NextN tensors and --model-draft points
    at the same file, set speculative.model_dft = model and
    cparams_dft.nextn_draft = true (no llama_model_load_from_file). The legacy
    standalone NEXTN_ONLY GGUF path is preserved for users shipping the draft
    head as a separate artifact.
  - common_speculative_are_compatible_nextn accepts model_tgt == model_dft for
    the shared-model path.
  - Public API: llama_model_has_nextn_layer / llama_model_n_nextn_predict_layers.
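
The server-side choice between the shared-model path and the legacy separate-file path can be sketched as follows. All types and functions here are hypothetical mocks (`ModelInfo`, `has_nextn_layer`, `is_nextn_layer`, `pick_draft_source`); they do not call the real llama.cpp API, and comparing paths by string equality is a simplification.

```cpp
#include <cassert>
#include <string>

// Mock of the facts the server needs about the loaded target model (hypothetical).
struct ModelInfo {
    std::string path;
    int n_layer;
    int nextn_predict_layers;
};

inline bool has_nextn_layer(const ModelInfo & m) {
    return m.nextn_predict_layers > 0;
}

// NextN tensors live on the trailing layers: i >= n_layer - nextn_predict_layers.
inline bool is_nextn_layer(const ModelInfo & m, int i) {
    return i >= m.n_layer - m.nextn_predict_layers && i < m.n_layer;
}

enum class DraftSource {
    SharedTargetModel,  // reuse the target model, set nextn_draft = true, no second load
    SeparateFile        // load the draft GGUF as its own model (legacy path)
};

inline DraftSource pick_draft_source(const ModelInfo & target, const std::string & draft_path) {
    if (has_nextn_layer(target) && draft_path == target.path) {
        return DraftSource::SharedTargetModel;
    }
    return DraftSource::SeparateFile;
}
```

With the shared path the file is mapped once, which is what avoids the second full-size buffer described above.
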

Benchmarks (Apple M4 Max, Metal, prompt ~50 tokens, --draft-max=2 --draft-min=1,
ctx=8192, median of 3 runs; full table in NEXTN.md §7):

  qwen-35B-A3B MoE f16-nextn      long=512:  83.63 tps (+20.7% vs f16-base 69.30)
  qwen-35B-A3B MoE turbo3-nextn   long=512:  78.41 tps (+26.5% vs turbo3-base 61.97)
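
The percentages in the two rows above can be reproduced from the reported tokens-per-second figures. The helpers below are a hypothetical sketch (`median` and `rel_change_pct` are not part of the PR's benchmark scripts) showing a median over runs and the relative change against a baseline.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a set of runs (takes a copy so the caller's order is preserved).
inline double median(std::vector<double> runs) {
    assert(!runs.empty());
    std::sort(runs.begin(), runs.end());
    const std::size_t n = runs.size();
    return (n % 2 == 1) ? runs[n / 2] : 0.5 * (runs[n / 2 - 1] + runs[n / 2]);
}

// Relative throughput change in percent: positive means faster than the baseline.
inline double rel_change_pct(double tps_new, double tps_base) {
    assert(tps_base > 0.0);
    return (tps_new / tps_base - 1.0) * 100.0;
}
```

For example, `rel_change_pct(83.63, 69.30)` is about 20.7 and `rel_change_pct(78.41, 61.97)` is about 26.5, matching the figures quoted in the table.
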

35B-A3B MoE no longer OOMs (one MTL0_Mapped buffer = 21784 MiB instead of two).
27B dense remains draft-compute-bound (NextN layer = full transformer block,
t_draft ≈ 2.6× t_verify on dense); the async pipeline can't fully overlap, so
NextN is roughly at parity on f16 and about 12% slower on turbo3 (turbo3 dequant
inside NextN attention adds ~7% draft compute). Documented as a known limitation
in NEXTN.md §7.

Co-authored-by: Cursor <cursoragent@cursor.com>
…model

This commit introduces a new script, `sanity-27b-turbo3-base.sh`, which performs a cold-start sanity check for the 27B turbo3-base model. The script measures throughput against a historical baseline of ~18.4 TPS to identify potential thermal or code regressions. Key features include server initialization, health checks, and performance measurement over multiple runs with a predefined prompt. The script aims to ensure the model's operational integrity and performance consistency.
Brings in Gemma 4 + TurboQuant KV cache fixes:
- fix/turbo-rope-shift-gemma4 (PR #10)
- fix/iswa-get-can-shift-gemma4 (PR #9)
- fix/mtp-assistant-tensor-prefix (PR #7)
…ma 4 MTP

- Updated benchmark logs for Qwen 3.6 NextN, showing improved throughput of +24-36% on MoE targets and +5-7% on dense models.
- Revised performance notes in NEXTN.md to reflect shared-model draft path optimizations, eliminating the need for a second mmap.
- Enhanced README.md with new Qwen 3.6 NextN features and usage instructions, including shared model configurations.
- Adjusted MTP.md to include updated matrix benchmarks and observations for Gemma 4, highlighting performance gains and acceptance rates.
- Improved clarity in documentation regarding the integration of TurboQuant KV with NextN for optimal performance.

Ooooze merged commit 514e600 into feature/turboquant-kv-cache May 12, 2026
27 of 57 checks passed
2 participants