B1 mtp qwen rebase by Ooooze · Pull Request #11 · AtomicBot-ai/atomic-llama-cpp-turboquant

Ooooze · 2026-05-12T20:32:45Z

Overview

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure:

Currently, speculative checkpointing has to restart from a checkpoint whenever some draft tokens are rejected, which wastes work because the target must be run again. This PR adds the ability to roll back up to `draft_max` tokens by storing the GDN intermediates.
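The rollback idea can be illustrated with a toy model. This is a minimal sketch, not the PR's implementation: the `RecurrentState` class, its scalar update rule, and the `intermediates` list are hypothetical stand-ins for a GDN layer's state and its saved per-token intermediates.

```python
from dataclasses import dataclass, field


@dataclass
class RecurrentState:
    """Toy stand-in for a recurrent (GDN-style) state; a real state is a tensor."""
    value: float = 0.0
    # Hypothetical storage: state after 0, 1, 2, ... drafted tokens.
    intermediates: list = field(default_factory=list)

    def step(self, token: int) -> None:
        # Toy update rule, chosen only so that the state depends on history.
        self.value = 0.5 * self.value + token

    def verify_draft(self, draft_tokens: list) -> None:
        """Advance over all drafted tokens, keeping the state after each one."""
        self.intermediates = [self.value]  # state before any draft token
        for tok in draft_tokens:
            self.step(tok)
            self.intermediates.append(self.value)

    def rollback(self, n_accepted: int) -> None:
        """Restore the state as it was after `n_accepted` drafted tokens."""
        self.value = self.intermediates[n_accepted]
        self.intermediates = []


state = RecurrentState()
state.step(1)                   # committed prefix
state.verify_draft([2, 3, 4])   # three drafted tokens verified in one pass
state.rollback(1)               # only the first drafted token was accepted
```

Because every intermediate state is kept, a partial rejection is handled by indexing into the saved list; no re-run from an earlier checkpoint is needed in this toy setting.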

Recovery snapshot from agent transcript replay after accidental `git checkout` of working-tree-only NextN changes. Build is clean, but NextN inference is broken: target argmax produces garbage tokens when `cparams.embeddings_pre_norm=true` is paired with the server-side `nextn_prefill_all_outputs` + `do_checkpoint=false` overrides.

Diagnostic findings (recorded in `.scratch/diag-logs/`):

- baseline (SPEC=off): tps=21, coherent
- nextn pre-fix (all NextN flags on): tps=4, garbage
- step A (prime loop off): tps=4, garbage (prime not the cause)
- step B (server overrides off): tps=12, ~coherent (main cause)

Safety net before next iteration; no fix applied yet.

Two server-context fixes are required to make NextN speculative decoding produce coherent target output and a small acceptance speedup on macOS Metal. Pairs with cherry-pick 8ce2b9e (upstream Metal GDN keep_intermediates=true), which addresses the actual root cause of the garbage logits.

* NextN draft must NOT flip cparams.embeddings=true on the target context. Doing so reroutes the target graph to emit pooled/embedding outputs in place of vocab logits and breaks sampling for every generated token. NextN has its own pre-norm channel (llama_set_embeddings_pre_norm + llama_get_embeddings_pre_norm_ith); only Gemma 4 MTP needs the embeddings flag.
* Skip override_arch when --model-draft points at a standalone *_NEXTN_ONLY.gguf whose general.architecture is already qwen35_mtp / qwen35moe_mtp. Avoids the double-mmap of the target file when target and draft are different paths.

Also adds scripts/extract-qwen36-nextn-gguf.py to produce a self-contained NextN draft GGUF from a combined *_MTP.gguf for the separate-draft path.

Verified on Qwen3.6-27B-UD-Q4_K_XL_MTP + Q4_K_XL_NEXTN_ONLY draft on Apple Silicon: 24.4 tok/s with acc=87.5% (DM=2) vs 20.85 tok/s no-spec baseline.

Co-authored-by: Cursor <cursoragent@cursor.com>
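The override_arch rule from the second fix can be sketched as a small decision function. This is illustrative only: the function name `pick_override_arch` is hypothetical, and the base architecture strings `qwen35` / `qwen35moe` are an assumption inferred from the `*_mtp` names quoted above; the real logic lives in the server's C++ code.

```python
from typing import Optional

# Draft architectures named in the commit message.
NEXTN_DRAFT_ARCHS = {"qwen35_mtp", "qwen35moe_mtp"}

# Assumed mapping from a combined-file architecture to its NextN draft variant.
BASE_TO_DRAFT_ARCH = {"qwen35": "qwen35_mtp", "qwen35moe": "qwen35moe_mtp"}


def pick_override_arch(draft_general_architecture: str) -> Optional[str]:
    """Return the architecture override to apply to the draft model, or None.

    A standalone *_NEXTN_ONLY.gguf already declares a *_mtp architecture,
    so no override is applied for it.
    """
    if draft_general_architecture in NEXTN_DRAFT_ARCHS:
        return None
    return BASE_TO_DRAFT_ARCH.get(draft_general_architecture)
```

Under these assumptions, `pick_override_arch("qwen35moe_mtp")` yields no override, while a combined file declaring `qwen35moe` would be overridden to `qwen35moe_mtp`.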

…ocessing

This update introduces an asynchronous worker thread to enhance the NextN speculative decoding process. The worker overlaps draft computation with server-side token processing, improving efficiency.

Key changes include:

- Added a worker thread managed by a mutex and condition variable for handling draft requests.
- Implemented a pipeline mechanism that allows the system to return results from previous drafts while processing new requests.
- Introduced environment variable control for enabling/disabling the pipeline.

This enhancement aims to optimize performance and reduce latency in generating coherent outputs during NextN processing.
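The worker pattern described here (one thread, a mutex, a condition variable, and an on/off switch) might look roughly like the following. It is a sketch under stated assumptions, not the PR's C++ code: `DraftWorker`, `pipeline_enabled`, and the environment variable name `EXAMPLE_NEXTN_PIPELINE` are all hypothetical, and at most one request is assumed to be in flight at a time.

```python
import os
import threading


def pipeline_enabled() -> bool:
    # Hypothetical variable name; the PR's actual variable is not given here.
    return os.environ.get("EXAMPLE_NEXTN_PIPELINE", "1") != "0"


class DraftWorker:
    """Runs draft_fn on a background thread so the caller can overlap other work."""

    def __init__(self, draft_fn):
        self._draft_fn = draft_fn
        self._cond = threading.Condition()  # mutex + condition variable
        self._request = None
        self._result = None
        self._has_request = False
        self._has_result = False
        self._stop = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while True:
            with self._cond:
                self._cond.wait_for(lambda: self._has_request or self._stop)
                if self._stop:
                    return
                request = self._request
                self._has_request = False
            result = self._draft_fn(request)  # computed outside the lock
            with self._cond:
                self._result = result
                self._has_result = True
                self._cond.notify_all()

    def submit(self, request):
        """Hand a draft request to the worker and return immediately."""
        with self._cond:
            self._request = request
            self._has_request = True
            self._has_result = False
            self._cond.notify_all()

    def collect(self):
        """Block until the most recently submitted draft is ready."""
        with self._cond:
            self._cond.wait_for(lambda: self._has_result)
            self._has_result = False
            return self._result

    def close(self):
        with self._cond:
            self._stop = True
            self._cond.notify_all()
        self._thread.join()
```

A caller would `submit()` the next draft request, do its own token processing, and only then `collect()`, so the draft computation overlaps with the caller's work rather than preceding it.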

… GGUF)

Previously, NextN draft contexts loaded a second llama_model from the same combined *_MTP.gguf with override_arch=qwen35*_mtp. On Apple Silicon (mmap=true) each llama_model creates its own MTLBuffer covering the full file, so the 22 GB Qwen3.6-35B-A3B target was mapped twice (~44 GB) and OOMed on M4 Max (38 GB unified memory): kIOGPUCommandBufferCallbackErrorOutOfMemory.

The target model already loads the NextN-layer tensors into its own layer table (see LLM_ARCH_QWEN35{,MOE} loaders, `layer.nextn.*` on `i >= n_layer - nextn_predict_layers`). The draft context can reuse them directly:

- Add cparams.nextn_draft + llama_context_params.nextn_draft (default false).
- Add LLM_GRAPH_TYPE_NEXTN; llama_context routes decode/reserve through this gtype when cparams.nextn_draft=true.
- In llama_model::build_graph, dispatch QWEN35 / QWEN35MOE + gtype=NEXTN to the existing llm_build_qwen35*_nextn builders. The graphs are otherwise unchanged; build_attn_inp_kv() is swapped for build_inp_mem_hybrid()->get_attn() because the target's memory is hybrid attn+GDN, not pure KV (the pure-KV cast was UB).
- llama_context ctor temporarily flips hparams.kv_only_nextn=true around create_memory() so the draft's KV cache only allocates cells for the NextN layer; the target context (constructed earlier) keeps its own KV layout via the per-memory hparams copy.
- llama_context::graph_params hands a tweaked hparams_eff to the graph builder for draft contexts so has_kv routes correctly.
- llm_graph_input_mem_hybrid::set_input: guard the recurrent-state s_copy backend buffer; NextN graphs never reference it, so the scheduler doesn't allocate one.
- server-context.cpp: when target has NextN tensors and --model-draft points at the same file, set speculative.model_dft = model and cparams_dft.nextn_draft = true (no llama_model_load_from_file). The legacy standalone NEXTN_ONLY GGUF path is preserved for users shipping the draft head as a separate artifact.
- common_speculative_are_compatible_nextn accepts model_tgt == model_dft for the shared-model path.
- Public API: llama_model_has_nextn_layer / llama_model_n_nextn_predict_layers.

Benchmarks (Apple M4 Max, Metal, prompt ~50 tokens, --draft-max=2 --draft-min=1, ctx=8192, median of 3 runs; full table in NEXTN.md §7):

- qwen-35B-A3B MoE f16-nextn long=512: 83.63 tps (+20.7% vs f16-base 69.30)
- qwen-35B-A3B MoE turbo3-nextn long=512: 78.41 tps (+26.5% vs turbo3-base 61.97)

35B-A3B MoE no longer OOMs (one MTL0_Mapped buffer = 21784 MiB instead of two). 27B dense remains draft-compute-bound (NextN layer = full transformer block, t_draft ≈ 2.6× t_verify on dense); the async pipeline can't fully overlap, so NextN is at parity on f16 and ~12% slower on turbo3 (turbo3 dequant inside NextN attention adds ~7% draft compute). Documented as a known limitation in NEXTN.md §7.

Co-authored-by: Cursor <cursoragent@cursor.com>
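The quoted percentages and the memory argument can be re-derived from the raw figures above. The helper name `speedup_pct` is introduced here for illustration; the memory comparison treats "38 GB" as 38 GiB, which is the more generous reading for the double-mapped case.

```python
def speedup_pct(spec_tps: float, base_tps: float) -> float:
    """Relative throughput gain of the speculative run over its baseline, in percent."""
    return (spec_tps / base_tps - 1.0) * 100.0


# Throughput figures quoted in the benchmark list.
f16_gain = speedup_pct(83.63, 69.30)
turbo3_gain = speedup_pct(78.41, 61.97)

# Mapped-buffer arithmetic: one 21784 MiB mapping versus two.
MAPPED_MIB = 21784
UNIFIED_MEMORY_MIB = 38 * 1024  # assumption: 38 GB read as 38 GiB

fits_once = MAPPED_MIB < UNIFIED_MEMORY_MIB
fits_twice = 2 * MAPPED_MIB < UNIFIED_MEMORY_MIB

print(f"f16: +{f16_gain:.1f}%, turbo3: +{turbo3_gain:.1f}%")
print(f"single mapping fits: {fits_once}, double mapping fits: {fits_twice}")
```

The single mapping fits within unified memory while the doubled mapping does not, which matches the OOM described for the two-model path.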

…model

This commit introduces a new script, `sanity-27b-turbo3-base.sh`, which performs a cold-start sanity check for the 27B turbo3-base model. The script measures throughput against a historical baseline of ~18.4 TPS to identify potential thermal or code regressions. Key features include server initialization, health checks, and performance measurement over multiple runs with a predefined prompt. The script aims to ensure the model's operational integrity and performance consistency.
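The comparison step of such a check could be sketched as follows. This is not the contents of `sanity-27b-turbo3-base.sh`: the function name, the use of the median across runs, and the 10% tolerance are assumptions made for illustration; only the ~18.4 TPS baseline comes from the commit message.

```python
from statistics import median

BASELINE_TPS = 18.4  # historical baseline quoted in the commit message


def check_throughput(run_tps, baseline=BASELINE_TPS, tolerance=0.10):
    """Compare the median of several runs against the baseline.

    Returns (ok, observed), where ok is False if the median falls more than
    `tolerance` (an assumed threshold) below the baseline.
    """
    observed = median(run_tps)
    return observed >= baseline * (1.0 - tolerance), observed


ok, observed = check_throughput([18.1, 18.6, 18.3])
print("PASS" if ok else "REGRESSION", observed)
```

Using the median rather than the mean keeps a single thermally throttled run from dominating the verdict, which matters for a cold-start check.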

Brings in Gemma 4 + TurboQuant KV cache fixes:

- fix/turbo-rope-shift-gemma4 (PR #10)
- fix/iswa-get-can-shift-gemma4 (PR #9)
- fix/mtp-assistant-tensor-prefix (PR #7)

…ma 4 MTP

- Updated benchmark logs for Qwen 3.6 NextN, showing improved throughput of +24-36% on MoE targets and +5-7% on dense models.
- Revised performance notes in NEXTN.md to reflect shared-model draft path optimizations, eliminating the need for a second mmap.
- Enhanced README.md with new Qwen 3.6 NextN features and usage instructions, including shared model configurations.
- Adjusted MTP.md to include updated matrix benchmarks and observations for Gemma 4, highlighting performance gains and acceptance rates.
- Improved clarity in documentation regarding the integration of TurboQuant KV with NextN for optimal performance.

am17an and others added 11 commits May 8, 2026 14:30

add enum for part sequence removal to enable checkpoints

7c3fe1c

review: rename rollback to rs_seq and remove public API

1e18bde

metal: add keep_intermediates=true path for GDN

8ce2b9e

Merge origin/feature/turboquant-kv-cache into b1-mtp-qwen-rebase

00e8d49


Ooooze merged commit 514e600 into feature/turboquant-kv-cache May 12, 2026
27 of 57 checks passed

github-actions bot added the documentation, testing, examples, server, Apple Metal, ggml, python, script, model, Nvidia GPU, and Vulkan labels on May 12, 2026

Ooooze merged 11 commits into feature/turboquant-kv-cache from b1-mtp-qwen-rebase
