spec(M32d): KV cache for qwen3_moe inference path — scope + operator decision doc#1826
Merged
…decision doc

Scopes the M32d work that's currently blocking contracts/qwen3-moe-serve-dispatch-v1.yaml V1_004 (CCPA Phase 6 bench non-zero student pass rate against Qwen3-Coder-30B-A3B).

Empirical finding (paiml/claude-code-parity-apr Phase 6, 5 dispatches across the post-#1789 fix chain): 30B-MoE full-prefill-per-token at ~0.5 tok/s cannot fit any reasonable per-turn budget. The dense path already has `OwnedQuantizedKVCache` + `forward_single_with_cache`; the MoE path has neither and re-runs the whole prompt on every token.

This scope doc:
- Surveys the dense KV cache code path (file refs + line numbers)
- Inventories the OwnedQuantizedKVCache API (sufficient as-is)
- Lays out 5 implementation steps (function skeleton → attention helper lift → MoE FFN helper lift → generate-loop wire → tests)
- Identifies 6 risk surfaces (numerical equivalence, dense regression, RoPE offset, GQA shapes, expert routing under cache, streaming SSE)
- Estimates 8 focused engineering hours total
- Presents three operator decisions (greenlight in-session vs engineer-driven follow-up vs skip)

NOT an implementation PR. Documents the work so operator can choose go/no-go without re-deriving the analysis.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Scope doc + operator decision matrix for M32d (KV cache on the qwen3_moe inference path). This is the upstream blocker for FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_004 in `contracts/qwen3-moe-serve-dispatch-v1.yaml` v1.1.1.
Not an implementation PR. It documents the work and estimates effort and risk so the operator can make the go/no-go call without re-deriving the analysis.
Why now
The empirical evidence chain (5 Phase 6 dispatches across the post-#1789 fix series: #1806, #1812, #1814, #1819) shows that the 30B-MoE student fails uniformly at the per-turn timeout, regardless of how the timeouts are tuned. Root cause: the MoE path runs a full prefill for every generated token, at ~0.5 tok/s. The symptom-class progression is documented in paiml/claude-code-parity-apr evidence/phase-6/30b-moe-empirical-2026-05-19.md.
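To make the cost of full-prefill-per-token concrete, here is a rough back-of-the-envelope sketch. It only counts how many token positions go through the model for a given prompt length and number of generated tokens. The function names and the example numbers are illustrative assumptions, not measurements from the Phase 6 runs or code from the repository.

```rust
/// Token positions pushed through the model when every decode step
/// re-runs the whole sequence (prompt plus everything generated so far).
pub fn positions_full_prefill(prompt_len: u64, new_tokens: u64) -> u64 {
    // Step i (0-based) processes the prompt plus the i tokens already generated.
    (0..new_tokens).map(|i| prompt_len + i).sum()
}

/// Token positions when the prompt is prefilled once and each later
/// step feeds only the newest token, reusing cached keys and values.
pub fn positions_with_kv_cache(prompt_len: u64, new_tokens: u64) -> u64 {
    if new_tokens == 0 {
        return 0;
    }
    // One pass over the prompt yields the first token; each of the
    // remaining new_tokens - 1 steps processes a single position.
    prompt_len + (new_tokens - 1)
}
```

For a hypothetical 1000-token prompt and 100 generated tokens, this gives 104950 positions without a cache versus 1099 with one. The count ignores that attention over a longer cache still costs more per step, so it is an indication of scale rather than a speedup prediction.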
The dense path already has `OwnedQuantizedKVCache` + `forward_single_with_cache`. The MoE path has neither.
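The numerical-equivalence property a cached MoE path would need can be illustrated with a toy example. The sketch below is not the `OwnedQuantizedKVCache` API and not `forward_single_with_cache`: `ToyKvCache`, `attend`, and `last_output_full` are hypothetical names, and it assumes a single attention head, unquantized `f32`, and identity Q/K/V projections. It shows only that appending one key/value pair per step gives the same attention output for the newest token as recomputing over the whole sequence.

```rust
/// Toy stand-in for a KV cache: one head, f32, no quantization (assumption).
#[derive(Default)]
pub struct ToyKvCache {
    keys: Vec<Vec<f32>>,
    values: Vec<Vec<f32>>,
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Softmax attention of one query over all given keys and values.
fn attend(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) / scale).collect();
    // Subtract the max score before exponentiating for numerical stability.
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut out = vec![0.0; values[0].len()];
    for (w, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += (w / sum) * x;
        }
    }
    out
}

/// Uncached path: rebuild keys and values from every token seen so far.
pub fn last_output_full(tokens: &[Vec<f32>]) -> Vec<f32> {
    let q = tokens.last().expect("at least one token");
    attend(q, tokens, tokens)
}

impl ToyKvCache {
    /// Cached path: append the new token's key and value, then attend once.
    pub fn step(&mut self, x: &[f32]) -> Vec<f32> {
        self.keys.push(x.to_vec());
        self.values.push(x.to_vec());
        attend(x, &self.keys, &self.values)
    }
}
```

Calling `step` once per token and comparing the result with `last_output_full` over the same prefix is the shape of check a numerical-equivalence test might take; the real path would also have to account for RoPE offsets, GQA shapes, and expert routing.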
What this scope covers
What this does NOT cover
Test plan
🤖 Generated with Claude Code