server: adaptive low-yield MTP speculation fallback by leon7609 · Pull Request #22931 · ggml-org/llama.cpp

leon7609 · 2026-05-11T02:04:18Z

PR #5: server: adaptive low-yield MTP speculation fallback

Branch suggestion: feat/server-adaptive-mtp-fallback
Target: ggml-org/llama.cpp master — touches tools/server/server-context.cpp (server-side), common/speculative.cpp (counter plumbing), src/llama-context.cpp (gated timing logs). All files exist upstream.
Commits: 7b128ea08 (profiling logs, gated), c224576e0 (adaptive fallback), 6312c198b (faster trip threshold)

Problem

Speculative decoding (--spec-type mtp) catastrophically regresses throughput when the draft model's useful acceptance rate is low. On DSv4 the measured regression was -79.9%:

| Config | tok/s | Notes |
| --- | --- | --- |
| no-MTP baseline | 56.49 | `-ub 512` |
| MTP `--spec-draft-n-max 8` (naïve) | 11.37 | -79.9% |
| MTP `--spec-draft-n-max 1` | 30.92 | first-draft acceptance only 59%, still net negative |

Profiling showed that the slowdown comes from token amplification during target verification, not from the sidecar forward pass or checkpoint overhead. With n_max=8, the target model decodes 2,083 batch tokens for 420 visible output tokens (4.70× amplification). The printed acceptance metric is misleading under FULL-memory checkpoint replay: far fewer drafts are actually useful than the counter suggests.

This is a real production concern: enabling MTP for a model whose draft + verification pattern doesn't align well silently turns a fast server into a slow one. The user has no easy way to detect this before paying the cost.

Solution

Adaptive per-request fallback that measures actual verified draft acceptance (accepted.size() - 1, not the misleading final printed counter) and disables further speculation for the rest of that request once useful acceptance falls below threshold.
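The `accepted.size() - 1` measurement can be illustrated with a small helper. This is a minimal sketch, not code from the patch: `useful_accepts` is a hypothetical name, and it assumes the accepted-token list returned by a verify step holds the surviving draft tokens plus exactly one token produced by the target model itself.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: number of draft tokens that survived verification.
// Assumption: `accepted` contains the verified draft tokens followed by one
// token sampled by the target model, so size() - 1 of them came from the draft.
static std::size_t useful_accepts(const std::vector<int> & accepted) {
    return accepted.empty() ? 0 : accepted.size() - 1;
}
```

Under that assumption, a verify step that returns a single token means no draft token was useful, even though the request still advanced by one token.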

Patch shape

tools/server/server-context.cpp:

Per-slot actual MTP verification counters (drafts generated vs drafts accepted, separately from the existing acceptance-rate counter).
After the first verify batch, if accepted < 75% of drafted, set a per-slot mtp_fallback flag.
Subsequent decode steps skip new draft generation; the slot's existing partial-accepted tokens are replayed before fully falling back to normal greedy decode.
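The per-slot bookkeeping described above might look roughly like the following. It is a sketch under stated assumptions: only the `mtp_fallback` name comes from the description, the other field and function names are illustrative, the 75% rule is applied to cumulative counts, and the replay of already-accepted tokens is not modeled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-slot state. Only `mtp_fallback` is named in the PR text;
// the remaining names are illustrative.
struct slot_mtp_state {
    int64_t n_drafted        = 0;     // draft tokens sent to verification
    int64_t n_accepted       = 0;     // draft tokens the target actually kept
    int32_t n_verify_batches = 0;     // completed verify batches
    bool    mtp_fallback     = false; // once set, stays set for this request
};

// Record one verify batch and re-evaluate the fallback decision.
// Assumption: the rule uses cumulative counts. The comparison
// accepted / drafted < 3/4 is done in integer arithmetic.
static void record_verify_batch(slot_mtp_state & s, int64_t drafted, int64_t accepted) {
    assert(drafted >= 0 && accepted >= 0 && accepted <= drafted);
    s.n_drafted  += drafted;
    s.n_accepted += accepted;
    s.n_verify_batches++;
    if (s.n_drafted > 0 && 4 * s.n_accepted < 3 * s.n_drafted) {
        s.mtp_fallback = true;
    }
}

// Whether the next decode step should generate new draft tokens.
static bool should_draft(const slot_mtp_state & s) {
    return !s.mtp_fallback;
}
```

Keeping the flag sticky for the rest of the request matches the described behavior of disabling further speculation rather than re-enabling it when acceptance recovers.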

common/speculative.cpp:

New n_actual_accept counter exposed alongside the existing acceptance metric. Tracks only useful drafts (drafts that contributed an extra token beyond what the target would have produced anyway). Used by the server-side fallback decision.

src/llama-context.cpp:

Optional timing logs gated behind LLAMA_PHASE7_MTP_TIMING env. Off by default; useful for future MTP perf work.
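One common way to gate timing logs behind an environment variable is sketched below. Only the variable name `LLAMA_PHASE7_MTP_TIMING` comes from the description. The helper names, the rule that an unset, empty, or "0" value means disabled, and the log format are all assumptions for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <cstdlib>

// Assumed parsing rule: unset, empty, or a value starting with '0' is off.
static bool env_flag_is_on(const char * value) {
    return value != nullptr && value[0] != '\0' && value[0] != '0';
}

// Read the environment once and cache the result, so the check costs a
// single branch on the hot path when timing is disabled.
static bool mtp_timing_enabled() {
    static const bool enabled = env_flag_is_on(std::getenv("LLAMA_PHASE7_MTP_TIMING"));
    return enabled;
}

// Hypothetical scoped timer: logs elapsed microseconds to stderr on scope
// exit, but only when the gate is on.
struct scoped_phase_timer {
    const char * name;
    bool         active;
    std::chrono::steady_clock::time_point t0;

    explicit scoped_phase_timer(const char * n) : name(n), active(mtp_timing_enabled()) {
        assert(name != nullptr);
        if (active) {
            t0 = std::chrono::steady_clock::now();
        }
    }

    ~scoped_phase_timer() {
        if (!active) {
            return;
        }
        const auto dt = std::chrono::steady_clock::now() - t0;
        const auto us = std::chrono::duration_cast<std::chrono::microseconds>(dt).count();
        std::fprintf(stderr, "mtp timing: %s took %lld us\n", name, (long long) us);
    }
};
```

Declaring a `scoped_phase_timer` at the top of a block would time that block; with the variable unset, neither clock reads nor logging take place.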

Threshold tuning

Initial implementation tripped the fallback after the first low-yield batch (8 draft tokens). A follow-up patch lowered this to trip after any single batch with 0 useful accepts, which catches the catastrophic case faster:

n_max=8: disabling MTP speculation
  ... actual draft acceptance 0/8 after 1 verify batches

Further threshold tuning gained only +2.5% (54.32 → 55.66 tok/s), so it was stopped there as not worth the additional effort.
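The faster trip rule can be expressed as a per-batch predicate. The sketch below is illustrative only: both function names are hypothetical, and the summary string is modeled on the log excerpt above rather than copied from the patch, so the real log line may differ.

```cpp
#include <string>

// Hypothetical predicate for the faster trip rule: a single verify batch that
// drafted tokens but produced no useful accepts disables speculation.
static bool batch_trips_fallback(int n_drafted, int n_useful) {
    return n_drafted > 0 && n_useful == 0;
}

// Summary text modeled on the log excerpt; the exact wording is an assumption.
static std::string acceptance_summary(int n_useful, int n_drafted, int n_batches) {
    return "actual draft acceptance " + std::to_string(n_useful) + "/" +
           std::to_string(n_drafted) + " after " + std::to_string(n_batches) +
           " verify batches";
}
```

A batch that drafted nothing does not trip the rule, since there is no evidence about draft quality in that case.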

Performance after fix

| Config | tok/s | Result |
| --- | --- | --- |
| no-MTP baseline | 56.49 | reference |
| MTP n_max=8 (without fallback, pre-fix) | 11.37 | -79.9% |
| MTP n_max=8 + adaptive fallback | 55.66 | -1.5% vs no-MTP |
| MTP n_max=1 + adaptive fallback | 55.46 | -1.8% vs no-MTP |

The fallback does not deliver a positive speedup — useful MTP acceptance on this specific sidecar / model combination is too low to amortize verification cost regardless of n_max. It does, however, prevent a 5× throughput collapse when MTP is enabled, which makes MTP safe to leave on as an experimental mode rather than a deployment risk.
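The percentages in the table and the "5×" figure follow directly from the tok/s values. The helpers below are a small worked check with hypothetical names, not part of the patch.

```cpp
#include <cassert>

// Relative throughput change versus a baseline, in percent.
// Negative values mean the measured run is slower than the baseline.
static double relative_change_pct(double baseline_tps, double measured_tps) {
    assert(baseline_tps > 0.0);
    return (measured_tps - baseline_tps) / baseline_tps * 100.0;
}

// How many times slower the measured run is than the baseline.
static double slowdown_factor(double baseline_tps, double measured_tps) {
    assert(measured_tps > 0.0);
    return baseline_tps / measured_tps;
}
```

With the reported numbers, `relative_change_pct(56.49, 11.37)` is about -79.9, `relative_change_pct(56.49, 55.66)` is about -1.5, and `slowdown_factor(56.49, 11.37)` is just under 5.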

Why upstream

Two reasons:

Safety: any deployment that enables --spec-type mtp without an a-priori measured guarantee that the draft model amortizes verification cost is one bad sidecar away from a 5× throughput regression. The adaptive fallback turns that from a "must verify offline" risk into a runtime self-correcting behavior.
General applicability: nothing in the fallback logic is DSv4-specific. Any model + sidecar combination where target verification cost can exceed useful draft yield will benefit. The n_actual_accept counter is itself a useful general-purpose telemetry addition for MTP / speculative work.

Caveats / non-goals

This PR does not make MTP go faster. The path to a publishable speedup requires a better-aligned draft model (higher useful acceptance), which is a model-side problem, not a server-side one.
The 75% threshold is currently a constant. A future PR could expose it as a CLI flag (--spec-min-acceptance-ratio) if there's demand.
The LLAMA_PHASE7_MTP_TIMING env-gated timing logs are kept off by default; they were used during profiling and have small overhead when enabled.

Hardware tested

NVIDIA RTX PRO 6000 Blackwell Workstation Edition (96 GB), CUDA 13.0, driver 595.71.05
DSv4 (antirez/deepseek-v4-gguf IQ2XXS, 86 GB) + DSv4 MTP draft (DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf, 3.8 GB)

leon7609 wants to merge 1 commit into ggml-org:master from leon7609:feat/server-adaptive-mtp-fallback
