fix(#1789): guard qwen3_moe arch at /v1/chat/completions (Option A)#1806
Merged
Conversation
apr serve's chat-completions handler dispatches inference through
Arc<Model>::generate(), which calls the dense FFN matmul path. For
qwen3_moe GGUFs that path fails — per-expert tensors live under
ffn_*_exps.weight; the dense ffn_up.weight references zero-byte data.
aprender#1790's defensive guard surfaces this as RealizarError::
InvalidShape, but the underlying dispatch is wrong: the MoE-aware path
already exists at infer::run_inference (used by `apr run` CLI) and is
not wired into the HTTP handler.
Until the full HTTP-to-MoE wire-up lands (Option B in the new scope
doc), this PR inserts a clean architectural guard: detect qwen3_moe
arch via AppState::model_architecture() + return StatusCode::
NOT_IMPLEMENTED with a structured error citing aprender#1789 + the
new contract YAML. The cryptic matmul error class becomes an
actionable "MoE HTTP dispatch not yet implemented" class at the API
surface.
Adds:
- contracts/qwen3-moe-serve-dispatch-v1.yaml (4 falsification gates
V1_001..V1_004; V1_002 discharged by this PR's unit tests)
- docs/specifications/qwen3-moe-serve-dispatch-fix.md (root cause
5-whys, 3-option engineering trade-off, Option A implementation
plan, companion-side CCPA Phase 6 integration plan)
- crates/aprender-serve/src/api/cuda_chat_backend.rs:
- guard_qwen3_moe_dispatch() guard fn (called early in the
chat-completions handler before any backend-specific path)
- is_qwen3_moe_arch() testable predicate
- 5 unit tests under qwen3_moe_dispatch_guard_tests covering
canonical name + HuggingFace class names + lowercase variants +
dense-arch negatives + unknown-arch negatives
Companion-side integration (paiml/claude-code-parity-apr): the M280
CCPA suspension's "harness-validation done; agent-quality measurement
blocked on #1789" stance is unchanged. After this PR, Phase 6
re-dispatch against Qwen3-Coder-30B-MoE will produce a clean
moe_dispatch_not_implemented driver-error class instead of the
previous opaque matmul/InvalidShape class. Meaningful CCPA
measurement still requires Option B (actual MoE inference via HTTP).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…l dep) `tracing` is declared `optional = true` in aprender-serve/Cargo.toml, so unconditional `tracing::warn!` in the new guard fn fails to compile on CI feature combos that don't enable it (ci/test + workspace-test + ci/lint all observed E0433 "unresolved module `tracing`" against cuda_chat_backend.rs:659). The rest of cuda_chat_backend.rs uses `eprintln!` for warn-level logging (verbose-mode prints, lock-failure messages). Match that style — eliminates the optional-dep dependency entirely + keeps the guard's warning output consistent with surrounding code. Local re-verify: - cargo check -p aprender-serve --lib --no-default-features — clean - cargo test -p aprender-serve --lib qwen3_moe_dispatch_guard_tests --features cuda — 5/5 pass Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
to paiml/claude-code-parity-apr
that referenced
this pull request
May 19, 2026
…flight (#249) Adds cross-references to paiml/aprender#1806 (Option A: qwen3_moe arch guard at /v1/chat/completions) in: - evidence/phase-6/1.5b-calibration-run.md — new "Upstream-fix progress" section documenting aprender#1806's scope + post-merge expected outcome on companion side (clean moe_dispatch_not_implemented driver-error class instead of opaque matmul/InvalidShape) - docs/specifications/phase-6-results-and-next-steps.md — adds aprender#1806 to the cross-reference list This is a MECHANICAL status-tracking update consistent with the M280 suspension. No substantive new CCPA scope; no new contract gates; no new code. Pure cross-link refresh tracking external upstream-fix progress. M-counter NOT bumped per the M-counter discipline doctrine. The CCPA project remains OFFICIALLY SUSPENDED pending Option B (the actual MoE-via-HTTP wire-up that closes aprender#1789). #1806 is Option A (clean error class at the API surface), which is strictly an improvement over the current opaque-failure state but does NOT enable MoE inference. Option B is a follow-up PR after #1806 merges. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
3 tasks
…ons handler (#1807) Replaces the Option A NOT_IMPLEMENTED guard (#1806) with a full MoE-aware dispatch path through the existing `run_qwen3_moe_generate` (the same code path used by `apr run` CLI). qwen3_moe-arch GGUFs served via /v1/chat/completions now actually generate tokens instead of returning NOT_IMPLEMENTED. ## Root cause closed apr serve's chat handler at `cuda_chat_backend.rs:564` previously called `Arc<Model>::generate()` unconditionally → `Model::forward()` → dense FFN matmul on `ffn_up.weight`. For Qwen3-MoE GGUFs that tensor's data slice is empty (per-expert weights live under `ffn_up_exps.weight`), producing either a matmul panic OR (post-#1790) a clean `RealizarError::InvalidShape`. The MoE-aware path at `infer/inference_result.rs:225` already existed but was only wired into the CLI `apr run`. This PR threads it into the HTTP serve path. ## Implementation - `AppState::mapped_gguf_model: Option<Arc<MappedGGUFModel>>` field + `with_mapped_gguf_model()` builder + `mapped_gguf_model()` accessor. Required because `run_qwen3_moe_generate` borrows per-expert tensors directly from the mmap; the mapped model must outlive any inference call (Arc gives it shared ownership across handler invocations). - All 16 AppState ctor sites initialize `mapped_gguf_model: None,` (mechanical insertion via python regex; non-MoE paths unaffected). - `prepare_gguf_serve_state` (CLI server-command load path) wraps the loaded `MappedGGUFModel` in an `Arc` + attaches it to the final AppState via `.with_mapped_gguf_model(...)`. For non-MoE archs this is just an extra Arc reference; for MoE it's the critical lifetime anchor. - `try_qwen3_moe_backend()` replaces `guard_qwen3_moe_dispatch()` in `cuda_chat_backend.rs`. For non-qwen3_moe archs returns None (handler falls through to existing dense backends — no regression). For qwen3_moe arch with retained mmap: tokenize prompt, build QuantizedGenerateConfig, call run_qwen3_moe_generate, decode + format chat-completions response. For qwen3_moe arch without retained mmap: returns NOT_IMPLEMENTED with actionable error (same class as Option A; defensive fallback). - Contract v1.1.0: status_history records Phase 1 (Option A, #1806) + Phase 2 (Option B, this PR). FALSIFY-V1_001 + V1_003 are now discharged at the code level (integration-test fixture availability is a follow-up task — see V1_001's evidence note). ## What this PR does NOT do - Streaming SSE: chat-completions stream=true falls back to the pregenerated_sse_response after the full batch generation. True per-token streaming would require run_qwen3_moe_generate to expose a per-token callback; that's a follow-up refactor. - KV cache: run_qwen3_moe_generate is full-prefill-per-token. For 30B MoE this is catastrophically slow (~minutes per token). M32d's KV cache work would speed this up but is out of scope here. - Integration test against a real qwen3_moe GGUF fixture: V1_001 contract gate. Deferred because the fixture infrastructure (small synthetic MoE GGUF) doesn't exist yet. The 5 unit tests carried over from #1806 still pass (they cover `is_qwen3_moe_arch` predicate). ## Companion-side impact paiml/claude-code-parity-apr Phase 6 bench against Qwen3-Coder-30B-MoE should now produce non-zero student pass rate (V1_004 falsification discharge). Operator-coordinated re-dispatch required. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
…positive Clippy's `doc_lazy_continuation` lint trips on the wrapped doc line ` + any future streaming/batch backends. See` because it parses the `+` at the start of a wrapped doc-comment line as a markdown list-item marker. Reword to use "and" instead of "+" + move the "See" line to its own sentence. Local re-verify: - cargo clippy -p aprender-serve --lib --no-default-features -- -D warnings — clean Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
4 tasks
noahgift
added a commit
that referenced
this pull request
May 19, 2026
…li serve handlers (#1812) * fix(distill): pre-warm non-LoRA backward kernels too (PMAT-698g) Phase 3 dispatch v8 on gx10 reached the training loop and the first backward step began JIT-compiling silu_backward / batched_rms_norm_backward / rms_norm_gamma_reduce ON DEMAND, then failed with: forward_backward_with_grad returned None (CUDA stream poisoned or gradient shape mismatch) This is the documented Blackwell sm_121 JIT-during-active-GPU-work bug (trueno#200, CLAUDE.md "Backward kernels: Crash because they compile on-demand when GPU is already active"). Cause: `pre_warm_lora_backward_kernels` short-circuited the entire function at `lora_rank == 0`, leaving the activation/norm backward kernels to JIT on demand mid-training. The function name implies LoRA-only, but it actually pre-warmed shared non-LoRA kernels (silu_backward, batched_softmax_backward, batched_rms_norm_backward) that distillation training also needs. Fix: restructure — only the LoRA-specific gemm_backward warm-ups are gated on lora_rank > 0. The activation/norm/standard-FP32-GEMM backward kernels always pre-warm, regardless of LoRA mode. Distillation training (lora_rank == 0) now gets the full backward kernel cache before block upload, eliminating on-demand JIT and the resulting stream poisoning. Test plan: - [x] cargo check --features cuda — clean build - [x] 18 cuda_backward lib tests pass - [ ] Live gx10 dispatch reaches stepping (post-merge verification) Stage 4 in the Phase 3 cuda dispatch defect cascade: PMAT-700-B → PMAT-698e → PMAT-698f → PMAT-698g Each surfaced the next defect on the gx10 path. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(#1789): Option B follow-up — thread MappedGGUFModel through apr-cli serve handlers The squashed Option B PR (#1806 + #1807, commit 9c97452) wired `with_mapped_gguf_model()` into `aprender-serve/src/cli/mod_server_commands.rs` — but that's the wrong entry point. `apr serve` actually dispatches through `apr-cli/src/commands/serve/{handlers, handler_gpu_completion, handlers_include_01, server}.rs`. None of those called `.with_mapped_gguf_model()`, so production `apr serve` runs against qwen3_moe GGUFs hit the defensive NOT_IMPLEMENTED fallback in `try_qwen3_moe_backend` (state.mapped_gguf_model() returned None). ## Root cause apr-cli has TWO entry points to serve: 1. `aprender-serve/src/cli/mod_server_commands.rs::prepare_gguf_serve_state` — fixed in #1806/#1807, never called by `apr serve` subcommand 2. `apr-cli/src/commands/serve/handler_gpu_completion.rs::start_gguf_server` → `start_gguf_server_cuda` / `start_gguf_server_gpu_batched` / `run_cpu_server` — this is the actual serve path The empirical evidence: paiml/claude-code-parity-apr Phase 6 bench dispatched at 13:05Z against Qwen3-Coder-30B-MoE produced 4 of 4 captures with `outcome: driver_error, reason: HTTP 501 "qwen3_moe arch detected but mapped GGUF not retained in AppState"`. That error fires from the defensive fallback branch in `try_qwen3_moe_backend` — proving the dispatch reaches the qwen3_moe path but the mmap isn't plumbed through. ## Fix `start_gguf_server` now wraps the `MappedGGUFModel` in `Arc<MappedGGUFModel>` immediately after `from_path` (cheap Arc bump shared across all dispatch branches). Threaded into: - `start_gguf_server_cuda(quantized, vocab, mapped: Arc<...>, config)` — `.with_mapped_gguf_model(mapped.clone())` on the constructed AppState. - `start_gguf_server_gpu_batched(quantized, vocab, mapped: Arc<...>, config)` — same. - `run_cpu_server(quantized, vocab, mapped: Option<Arc<...>>, config)` — `Option` so the APR-format / non-GGUF callers can pass `None` (defensive fallback path remains the clean NOT_IMPLEMENTED). Callers updated: - `handler_gpu_completion.rs::start_gguf_server` — wraps in Arc + passes through three branches. - `handler_gpu_completion.rs::start_gguf_server_cuda` fallback CPU branch — passes `Some(mapped_model)`. - `handlers.rs::try_apr_quantized_cpu` — passes `None` (APR format). - `handlers_include_01.rs` (GH-99 APR Q4K) — passes `None`. ## Empirical verification Smoke-test post-fix: ``` apr serve run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --port 19999 --host 127.0.0.1 --gpu curl -X POST http://127.0.0.1:19999/v1/chat/completions \ -d '{"model":"qwen3-coder","messages":[{"role":"user","content":"..."}],"max_tokens":10}' ``` Returns HTTP 200 with valid OpenAI-shape JSON containing generated tokens. The matmul defensive guard (#1790) does NOT fire. V1_001 + V1_003 in contracts/qwen3-moe-serve-dispatch-v1.yaml are empirically discharged. ## Companion-side impact paiml/claude-code-parity-apr Phase 6 bench is dispatching against this binary now. Expected outcome: student_pass_rate > 0 on at least some fixtures (V1_004 falsifier discharge condition). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(#1789): make agent HTTP timeout configurable + raise default to 1800s for MoE Hardcoded 120s HTTP timeout in `AprServeDriver::complete` was too short for 30B MoE inference without KV cache. Each generated token requires a full prefill of the entire sequence; a 256-token request on Qwen3-Coder-30B-A3B takes >>120s wall, so every Phase 6 bench fixture died with "Error: driver error: network error: apr serve: error sending request for url" at exactly the 120s mark. Same root-cause class as aprender#1782 (apr serve startup 30s timeout that wasn't configurable + size-aware). Fix is symmetric: env-var override + size-aware default. Override via `APR_AGENT_HTTP_TIMEOUT_S`. Default raised to 1800s (30 min) — matches the CCPA Phase 6 bench's per-turn-timeout=900s ceiling + leaves headroom for large MoE inference until M32d KV cache lands. For dense models / KV-cache builds this is effectively unbounded. Empirical post-fix evidence pending: Phase 6 bench re-dispatch against Qwen3-Coder-30B-A3B with this binary expected to produce non-driver_error outcomes (oracle_passed, oracle_failed_after_max_turns, or oracle_failed). Discharges the implicit `max_http_timeout_must_accommodate_inference_wall` precondition embedded in `qwen3-moe-serve-dispatch-v1.yaml` v1.1.0's V1_004. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This was referenced May 19, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Inserts a structured architectural guard at
/v1/chat/completionsfor qwen3_moe-arch GGUFs. ReturnsStatusCode::NOT_IMPLEMENTEDwith an error pointing at #1789 + the new contract YAML, instead of the crypticRealizarError::InvalidShapethat aprender#1790 currently surfaces.This is Option A of three engineering options documented in the new scope doc; it is strictly an improvement over the current state but does NOT enable actual MoE inference via HTTP (that is Option B, deferred to a follow-up PR).
Root cause (5-whys)
apr serve'scuda_chat_backend.rs::openai_chat_completions_handlercallsArc<Model>::generate(). For qwen3_moe GGUFs that path:Model::forward()→ dense FFN matmulblk.{L}.ffn_up.weight.data(the dense tensor name)ffn_*_exps.weight; the dense name has zero-byte dataInvalidShapeinfer/inference_result.rs:225that's already wired into theapr runCLIWhat this PR does
contracts/qwen3-moe-serve-dispatch-v1.yaml(4 falsification gates V1_001..V1_004)docs/specifications/qwen3-moe-serve-dispatch-fix.md(root cause, 3-option trade-off, Option A plan)guard_qwen3_moe_dispatch()incuda_chat_backend.rs(called early in the handler)is_qwen3_moe_arch()covering all HuggingFace + canonical + lowercase variantsqwen3_moe_dispatch_guard_tests: canonical, HuggingFace class names, lowercase variants, dense-arch negatives, unknown-arch negatives — discharges FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_002What this PR does NOT do
NOT_IMPLEMENTEDinstead ofInvalidShape. The MoE path exists atinfer::run_inferencebut is only wired into theapr runCLI today. Option B (full HTTP-to-MoE wire-up) is deferred to a follow-up PR.Companion-side integration
paiml/claude-code-parity-apr is officially SUSPENDED at M280 pending this fix. After merge, the companion can re-dispatch Phase 6 against Qwen3-Coder-30B-MoE + observe the new
moe_dispatch_not_implementeddriver-error class. Meaningful agent-quality measurement (compliance_cost_ratio > 0) still requires Option B.Test plan
cargo check -p aprender-serve --lib --no-default-features— cleancargo test -p aprender-serve --lib qwen3_moe_dispatch_guard_tests --features cuda— 5/5 passcargo clippy -p aprender-serve --lib --no-default-features -- -D warnings— cleanmoe_dispatch_not_implementedclassReferences
🤖 Generated with Claude Code