server: adaptive low-yield MTP speculation fallback #22931
Closed
leon7609 wants to merge 1 commit into
Hi @leon7609, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
Author
Closing temporarily to comply with the 1-open-PR-per-new-contributor policy flagged by ggml-gh-bot. Will reopen after #22932 (the higher-priority cache PR) has been reviewed. Sorry for the noise; I should have noticed the policy before opening both at once.
PR #5: server: adaptive low-yield MTP speculation fallback
Problem
Speculative decoding (`--spec-type mtp`) catastrophically regresses throughput when the draft model's useful acceptance rate is low. On DSv4 the regression measured at -79.9%:

Configurations compared: `-ub 512`; `--spec-draft-n-max 8` (naïve); `--spec-draft-n-max 1`.

Profiling found the slowdown is target verification token amplification, not sidecar forward / checkpoint overhead. With `n_max=8`, the target model decodes 2,083 batch tokens for 420 visible output tokens (4.70× amplification). The printed acceptance metric is misleading under FULL-memory checkpoint replay: actual useful drafts are far fewer than the counter suggests.

This is a real production concern: enabling MTP for a model whose draft + verification pattern doesn't align well silently turns a fast server into a slow one. The user has no easy way to detect this before paying the cost.
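
A rough per-step cost model may help show why low useful acceptance amplifies target work. It assumes each verification step submits the last sampled token plus `n_draft` draft tokens to the target and yields `n_useful + 1` visible tokens. That batching model is an assumption for illustration and is not stated in the PR, and `verify_amplification` is a hypothetical helper, not part of the patch.

```cpp
#include <cassert>

// Hypothetical helper (not part of the patch).
// Assumed model: one verification step decodes (n_draft + 1) target tokens
// and produces (n_useful + 1) visible tokens, where n_useful is the average
// number of draft tokens per step that the target actually accepted.
inline double verify_amplification(int n_draft, double n_useful) {
    assert(n_draft >= 0 && n_useful >= 0.0 && n_useful <= n_draft);
    // Target tokens decoded per visible output token.
    return (n_draft + 1.0) / (n_useful + 1.0);
}
```

Under this model, `n_draft = 8` with no useful accepts costs 9 target tokens per visible token, while `n_draft = 1` caps the worst case at 2.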
Solution
Adaptive per-request fallback that measures actual verified draft acceptance (`accepted.size() - 1`, not the misleading final printed counter) and disables further speculation for the rest of that request once useful acceptance falls below threshold.

Patch shape
- `tools/server/server-context.cpp`: when `accepted < 75%` of `drafted`, set a per-slot `mtp_fallback` flag.
- `common/speculative.cpp`: `n_actual_accept` counter exposed alongside the existing acceptance metric. Tracks only useful drafts (drafts that contributed an extra token beyond what the target would have produced anyway). Used by the server-side fallback decision.
- `src/llama-context.cpp`: timing logs gated by the `LLAMA_PHASE7_MTP_TIMING` env. Off by default; useful for future MTP perf work.

Threshold tuning
Initial implementation tripped the fallback after the first low-yield batch (8 draft tokens). A follow-up patch lowered this to trip after any single batch with 0 useful accepts, which catches the catastrophic case faster:
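
A minimal sketch of the per-slot trip logic described in this PR, not the patch itself. Only the `mtp_fallback` flag name, the `accepted.size() - 1` measure, and the 75% threshold come from the description. The struct, the function name, and the way the ratio rule and the zero-accept rule are combined are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-slot bookkeeping; only the mtp_fallback name is from the PR.
struct slot_spec_state {
    int  n_drafted    = 0;     // draft tokens submitted so far in this request
    int  n_useful     = 0;     // draft tokens that yielded an extra output token
    bool mtp_fallback = false; // once set, speculation stays off for the request
};

constexpr double MIN_USEFUL_RATIO = 0.75; // 75% threshold from the PR text

// n_drafted_batch: draft tokens submitted in this verification batch.
// n_accepted_ids:  accepted.size(), which includes the one token the target
//                  would have produced anyway, hence the "- 1" below.
// Returns the updated fallback flag.
inline bool update_mtp_fallback(slot_spec_state & st,
                                int n_drafted_batch,
                                std::size_t n_accepted_ids) {
    assert(n_accepted_ids >= 1);
    if (st.mtp_fallback || n_drafted_batch <= 0) {
        return st.mtp_fallback; // already tripped, or nothing was drafted
    }
    const int useful = static_cast<int>(n_accepted_ids) - 1;
    st.n_drafted += n_drafted_batch;
    st.n_useful  += useful;
    // Assumed combination: trip on any batch with zero useful accepts,
    // or when the running useful ratio drops below the threshold.
    if (useful == 0 || st.n_useful < MIN_USEFUL_RATIO * st.n_drafted) {
        st.mtp_fallback = true;
    }
    return st.mtp_fallback;
}
```

A caller would check `mtp_fallback` before drafting for the slot and skip speculation once it is set, so a single zero-yield batch is the most the request pays.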
Further threshold tuning gave only +2.5% (`54.32 → 55.66` tok/s), so it was stopped given the low ROI.

Performance after fix
The fallback does not deliver a positive speedup: useful MTP acceptance on this specific sidecar / model combination is too low to amortize verification cost regardless of `n_max`. It does, however, prevent a 5× throughput collapse when MTP is enabled, which makes MTP safe to leave on as an experimental mode rather than a deployment risk.

Why upstream
Two reasons:
1. Safety: any deployment that enables `--spec-type mtp` without an a-priori measured guarantee that the draft model amortizes verification cost is one bad sidecar away from a 5× throughput regression. The adaptive fallback turns that from a "must verify offline" risk into a runtime self-correcting behavior.
2. General applicability: nothing in the fallback logic is DSv4-specific. Any model + sidecar combination where target verification cost can exceed useful draft yield will benefit. The `n_actual_accept` counter is itself a useful general-purpose telemetry addition for MTP / speculative work.

Caveats / non-goals
- `--spec-min-acceptance-ratio`) if there's demand.
- `LLAMA_PHASE7_MTP_TIMING` env-gated timing logs are kept off by default; they were used during profiling and have small overhead when enabled.

Hardware tested
- `antirez/deepseek-v4-gguf` IQ2XXS, 86 GB) + DSv4 MTP draft (`DeepSeek-V4-Flash-MTP-Q4K-Q8_0-F32.gguf`, 3.8 GB)

Related
- `fix/deepseek4-mtp-decode-stabilize` (correctness fix for first-decode crash; required for MTP to actually run)
- `mtp-clean`