server: adaptive low-yield MTP speculation fallback #22931

Closed
leon7609 wants to merge 1 commit into ggml-org:master from leon7609:feat/server-adaptive-mtp-fallback

@leon7609

PR #5: server: adaptive low-yield MTP speculation fallback

Branch suggestion: feat/server-adaptive-mtp-fallback
Target: ggml-org/llama.cpp master — touches tools/server/server-context.cpp (server-side), common/speculative.cpp (counter plumbing), src/llama-context.cpp (gated timing logs). All files exist upstream.
Commits: 7b128ea08 (profiling logs, gated), c224576e0 (adaptive fallback), 6312c198b (faster trip threshold)

Problem

Speculative decoding (--spec-type mtp) catastrophically regresses throughput when the draft model's useful acceptance rate is low. On DSv4 the measured regression is -79.9%:

| Config | tok/s | Notes |
|---|---|---|
| no-MTP baseline | 56.49 | -ub 512 |
| MTP --spec-draft-n-max 8 (naïve) | 11.37 | -79.9% |
| MTP --spec-draft-n-max 1 | 30.92 | first-draft acceptance only 59%, still net negative |

Profiling found the slowdown is target verification token amplification, not sidecar forward / checkpoint overhead. With n_max=8, the target model decodes 2,083 batch tokens for 420 visible output tokens (4.70× amplification). The printed acceptance metric is misleading under FULL-memory checkpoint replay — actual useful drafts are far fewer than the counter suggests.

This is a real production concern: enabling MTP for a model whose draft + verification pattern doesn't align well silently turns a fast server into a slow one. The user has no easy way to detect this before paying the cost.

Solution

Adaptive per-request fallback that measures actual verified draft acceptance (accepted.size() - 1, not the misleading final printed counter) and disables further speculation for the rest of that request once useful acceptance falls below threshold.

Patch shape

tools/server/server-context.cpp:

  • Per-slot actual MTP verification counters (drafts generated vs drafts accepted, separately from the existing acceptance-rate counter).
  • After the first verify batch, if accepted < 75% of drafted, set a per-slot mtp_fallback flag.
  • Subsequent decode steps skip new draft generation; the slot's existing partial-accepted tokens are replayed before fully falling back to normal greedy decode.
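
The per-slot decision described in the bullets above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the struct, its field names other than `mtp_fallback`, and the helper function are hypothetical, and the ratio is assumed to be evaluated on cumulative counts.

```cpp
#include <cstddef>

// Hypothetical per-slot state; only the mtp_fallback flag is named in the description.
struct mtp_slot_state {
    std::size_t n_drafted    = 0;     // draft tokens sent to target verification
    std::size_t n_accepted   = 0;     // draft tokens the target actually accepted
    bool        mtp_fallback = false; // once set, the slot stops drafting for this request
};

// Threshold quoted in the description (accepted < 75% of drafted trips the fallback).
constexpr double MTP_MIN_ACCEPT_RATIO = 0.75;

// Record one verify batch and decide whether the slot may keep speculating.
// Returns true if a new draft should be generated on the next decode step.
inline bool mtp_update_after_verify(mtp_slot_state & s, std::size_t drafted, std::size_t accepted) {
    s.n_drafted  += drafted;
    s.n_accepted += accepted;
    if (!s.mtp_fallback && s.n_drafted > 0 &&
        static_cast<double>(s.n_accepted) < MTP_MIN_ACCEPT_RATIO * static_cast<double>(s.n_drafted)) {
        s.mtp_fallback = true; // sticky for the rest of the request
    }
    return !s.mtp_fallback;
}
```

Keeping the flag sticky matches the "rest of that request" wording: once a slot has fallen back, later batches cannot re-enable drafting.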

common/speculative.cpp:

  • New n_actual_accept counter exposed alongside the existing acceptance metric. Tracks only useful drafts (drafts that contributed an extra token beyond what the target would have produced anyway). Used by the server-side fallback decision.
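
As an illustration of how such a counter might be accumulated, the sketch below applies the `accepted.size() - 1` rule from the Solution section. The `spec_stats` struct, the `spec_record_verify` helper, and the `token_id` alias are assumptions made for this example; only the `n_actual_accept` name comes from the description.

```cpp
#include <cstddef>
#include <vector>

using token_id = int; // stand-in for the real token type (assumption)

// Hypothetical stats holder; only n_actual_accept is named in the description.
struct spec_stats {
    std::size_t n_drafted       = 0; // draft tokens proposed so far
    std::size_t n_actual_accept = 0; // drafts that yielded an extra token
};

// `accepted` is assumed to contain the accepted draft tokens plus the one token
// the target would have produced anyway, hence the "- 1".
inline void spec_record_verify(spec_stats & st, std::size_t n_draft,
                               const std::vector<token_id> & accepted) {
    st.n_drafted += n_draft;
    if (!accepted.empty()) {
        st.n_actual_accept += accepted.size() - 1;
    }
}
```

A verify step that returns a single token therefore counts as zero useful accepts, even though a token was emitted.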

src/llama-context.cpp:

  • Optional timing logs gated behind LLAMA_PHASE7_MTP_TIMING env. Off by default; useful for future MTP perf work.
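
One common way to gate such logs behind an environment variable is sketched below. The PR does not show how `LLAMA_PHASE7_MTP_TIMING` is interpreted, so the "set and not 0" rule, the `env_flag_enabled` helper, and the `scoped_timer` type are all assumptions for illustration.

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>

// True when the variable is set to a non-empty value other than "0".
// The actual semantics used by the PR are not shown; this rule is an assumption.
inline bool env_flag_enabled(const char * name) {
    const char * v = std::getenv(name);
    return v != nullptr && v[0] != '\0' && !(v[0] == '0' && v[1] == '\0');
}

// Scoped timer that prints elapsed milliseconds to stderr on destruction,
// but only when timing is enabled; otherwise it does nothing visible.
struct scoped_timer {
    const char * label;
    bool         enabled;
    std::chrono::steady_clock::time_point t0;

    scoped_timer(const char * label_, bool enabled_)
        : label(label_), enabled(enabled_), t0(std::chrono::steady_clock::now()) {}

    ~scoped_timer() {
        if (!enabled) {
            return;
        }
        const double ms = std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - t0).count();
        std::fprintf(stderr, "%s: %.3f ms\n", label, ms);
    }
};
```

A call site would read the flag once, for example `static const bool timing = env_flag_enabled("LLAMA_PHASE7_MTP_TIMING");`, and then wrap a phase in `scoped_timer t("verify", timing);`, so the environment lookup is not repeated on every decode step.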

Threshold tuning

The initial implementation tripped the fallback after the first low-yield batch (8 draft tokens). A follow-up patch lowered this to trip after any single batch with 0 useful accepts, which catches the catastrophic case faster:

n_max=8: disabling MTP speculation
  ... actual draft acceptance 0/8 after 1 verify batches

Further threshold tuning gained only +2.5% (54.32 → 55.66 tok/s), so it was stopped as not worth the additional effort.
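
The zero-useful-accept trip rule can be sketched as follows. The state struct and function names are hypothetical; the summary string mirrors only the "actual draft acceptance N/M after K verify batches" fragment of the log line shown above, and the elided part of that log is deliberately not reproduced.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical per-request bookkeeping for the faster trip rule.
struct mtp_trip_state {
    std::size_t n_drafted = 0;     // draft tokens verified so far
    std::size_t n_useful  = 0;     // useful accepts so far
    std::size_t n_batches = 0;     // verify batches observed
    bool        disabled  = false; // true once speculation has been turned off
};

// Observe one verify batch; trip immediately if the batch had drafts but no useful accepts.
// Returns true if speculation is now disabled.
inline bool mtp_observe_batch(mtp_trip_state & s, std::size_t drafted, std::size_t useful) {
    s.n_drafted += drafted;
    s.n_useful  += useful;
    s.n_batches += 1;
    if (drafted > 0 && useful == 0) {
        s.disabled = true;
    }
    return s.disabled;
}

// Build the acceptance summary in the same shape as the log fragment.
inline std::string mtp_trip_summary(const mtp_trip_state & s) {
    char buf[128];
    std::snprintf(buf, sizeof(buf), "actual draft acceptance %zu/%zu after %zu verify batches",
                  s.n_useful, s.n_drafted, s.n_batches);
    return std::string(buf);
}
```

With `n_max=8` and a first batch that yields no useful accepts, the state trips after a single batch, which is the case the log excerpt illustrates.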

Performance after fix

| Config | tok/s | Result |
|---|---|---|
| no-MTP baseline | 56.49 | reference |
| MTP n_max=8 (without fallback, pre-fix) | 11.37 | -79.9% |
| MTP n_max=8 + adaptive fallback | 55.66 | -1.5% vs no-MTP |
| MTP n_max=1 + adaptive fallback | 55.46 | -1.8% vs no-MTP |

The fallback does not deliver a positive speedup — useful MTP acceptance on this specific sidecar / model combination is too low to amortize verification cost regardless of n_max. It does, however, prevent a 5× throughput collapse when MTP is enabled, which makes MTP safe to leave on as an experimental mode rather than a deployment risk.

Why upstream

Two reasons:

  1. Safety: any deployment that enables --spec-type mtp without an a-priori measured guarantee that the draft model amortizes verification cost is one bad sidecar away from a 5× throughput regression. The adaptive fallback turns that from a "must verify offline" risk into a runtime self-correcting behavior.

  2. General applicability: nothing in the fallback logic is DSv4-specific. Any model + sidecar combination where target verification cost can exceed useful draft yield will benefit. The n_actual_accept counter is itself a useful general-purpose telemetry addition for MTP / speculative work.

Caveats / non-goals

  • This PR does not make MTP go faster. The path to a publishable speedup requires a better-aligned draft model (higher useful acceptance), which is a model-side problem, not a server-side one.
  • The 75% threshold is currently a constant. A future PR could expose it as a CLI flag (--spec-min-acceptance-ratio) if there's demand.
  • The LLAMA_PHASE7_MTP_TIMING env-gated timing logs are kept off by default; they were used during profiling and have small overhead when enabled.

Hardware tested

  • NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB), CUDA 13.0, driver 595.71.05
  • DSv4 (antirez/deepseek-v4-gguf IQ2XXS, 86 GB) + DSv4 MTP draft (DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, 3.8 GB)

Related

  • Companion: fix/deepseek4-mtp-decode-stabilize (correctness fix for first-decode crash; required for MTP to actually run)
  • Original MTP integration: nisparks' PR #22378 (Wip/deepseek v4 support) / am17an mtp-clean

@ggml-gh-bot

ggml-gh-bot Bot commented May 11, 2026

Hi @leon7609, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@leon7609

Author

Closing temporarily to comply with the 1-open-PR-per-new-contributor policy flagged by ggml-gh-bot. Will reopen after #22932 (the higher-priority cache PR) has been reviewed.

Sorry for the noise — I should have noticed the policy before opening both at once.
