fix(#1789): guard qwen3_moe arch at /v1/chat/completions (Option A) by noahgift · Pull Request #1806 · paiml/aprender

noahgift · 2026-05-19T07:17:25Z

Summary

Inserts a structured architectural guard at /v1/chat/completions for qwen3_moe-arch GGUFs. Returns StatusCode::NOT_IMPLEMENTED with an error pointing at #1789 + the new contract YAML, instead of the cryptic RealizarError::InvalidShape that aprender#1790 currently surfaces.

This is Option A of three engineering options documented in the new scope doc; it is strictly an improvement over the current state but does NOT enable actual MoE inference via HTTP (that is Option B, deferred to a follow-up PR).

Root cause (5-whys)

apr serve's cuda_chat_backend.rs::openai_chat_completions_handler calls Arc<Model>::generate(). For qwen3_moe GGUFs that path:

Routes through Model::forward() → dense FFN matmul
Tries to read blk.{L}.ffn_up.weight.data (the dense tensor name)
Qwen3-MoE stores per-expert tensors under ffn_*_exps.weight; the dense name has zero-byte data
aprender#1790's defensive guard catches the empty-data condition + returns InvalidShape
Underlying issue: the HTTP handler bypasses the MoE-aware path at infer/inference_result.rs:225 that's already wired into the apr run CLI
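
The failure chain above can be illustrated with a toy tensor map. This is a minimal sketch, not the actual aprender code: `LookupError`, `dense_ffn_up`, and `toy_moe_tensors` are hypothetical names, and a plain `HashMap` stands in for the GGUF tensor table. It only shows why a dense-name lookup comes back empty while the per-expert tensor holds the real weights.

```rust
use std::collections::HashMap;

/// Hypothetical error type standing in for RealizarError::InvalidShape.
#[derive(Debug, PartialEq)]
pub enum LookupError {
    InvalidShape(String),
}

/// Dense-path lookup (illustrative): expects non-empty data under the
/// dense tensor name `blk.{layer}.ffn_up.weight`.
pub fn dense_ffn_up(
    tensors: &HashMap<String, Vec<f32>>,
    layer: usize,
) -> Result<&[f32], LookupError> {
    let name = format!("blk.{layer}.ffn_up.weight");
    match tensors.get(&name) {
        Some(data) if !data.is_empty() => Ok(data.as_slice()),
        // An empty or missing tensor is reported instead of indexing into it.
        _ => Err(LookupError::InvalidShape(format!(
            "{name}: empty or missing weight data"
        ))),
    }
}

/// Toy tensor map shaped like the qwen3_moe case described above:
/// the dense name is present but empty, the per-expert name has data.
pub fn toy_moe_tensors() -> HashMap<String, Vec<f32>> {
    let mut tensors = HashMap::new();
    tensors.insert("blk.0.ffn_up.weight".to_string(), Vec::new());
    tensors.insert("blk.0.ffn_up_exps.weight".to_string(), vec![0.5; 8]);
    tensors
}
```

Calling `dense_ffn_up(&toy_moe_tensors(), 0)` yields the `InvalidShape` variant, which mirrors the error class that the #1790 guard surfaces on the dense path.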

What this PR does

New contract: contracts/qwen3-moe-serve-dispatch-v1.yaml (4 falsification gates V1_001..V1_004)
New scope doc: docs/specifications/qwen3-moe-serve-dispatch-fix.md (root cause, 3-option trade-off, Option A plan)
New guard fn guard_qwen3_moe_dispatch() in cuda_chat_backend.rs (called early in the handler)
New testable predicate is_qwen3_moe_arch() covering all HuggingFace + canonical + lowercase variants
5 unit tests in qwen3_moe_dispatch_guard_tests: canonical, HuggingFace class names, lowercase variants, dense-arch negatives, unknown-arch negatives — discharges FALSIFY-QWEN3_MOE_SERVE_DISPATCH_V1_002
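
A rough sketch of how the predicate and guard could fit together follows. The function names `is_qwen3_moe_arch` and `guard_qwen3_moe_dispatch` come from this PR, but everything else is an assumption: the accepted name list, the `GuardRejection` struct, the `Option<&str>` parameter, and the use of a bare `u16` in place of `StatusCode::NOT_IMPLEMENTED` are illustrative only, and the real code in cuda_chat_backend.rs may differ.

```rust
/// Assumed set of architecture strings treated as Qwen3-MoE (lowercased).
/// The real predicate may accept a different list.
const QWEN3_MOE_NAMES: [&str; 3] = ["qwen3_moe", "qwen3moe", "qwen3moeforcausallm"];

/// Returns true when `arch` names a Qwen3-MoE architecture, ignoring case.
pub fn is_qwen3_moe_arch(arch: &str) -> bool {
    let lowered = arch.trim().to_ascii_lowercase();
    QWEN3_MOE_NAMES.contains(&lowered.as_str())
}

/// Hypothetical stand-in for the structured HTTP error the guard returns.
#[derive(Debug, PartialEq)]
pub struct GuardRejection {
    pub status: u16,
    pub message: String,
}

/// Returns Some(rejection) for Qwen3-MoE archs and None otherwise, so the
/// handler can fall through to the existing dense backends.
pub fn guard_qwen3_moe_dispatch(arch: Option<&str>) -> Option<GuardRejection> {
    match arch {
        Some(name) if is_qwen3_moe_arch(name) => Some(GuardRejection {
            status: 501, // HTTP 501 Not Implemented
            message: format!(
                "MoE HTTP dispatch not yet implemented for arch '{name}' (see #1789)"
            ),
        }),
        _ => None,
    }
}
```

Keeping the predicate a pure function of a string is what makes the positive and negative cases listed above testable without loading a model.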

What this PR does NOT do

It does NOT enable MoE inference via HTTP. Qwen3-MoE chat-completions will return NOT_IMPLEMENTED instead of InvalidShape. The MoE path exists at infer::run_inference but is only wired into the apr run CLI today. Option B (full HTTP-to-MoE wire-up) is deferred to a follow-up PR.

Companion-side integration

paiml/claude-code-parity-apr is officially SUSPENDED at M280 pending this fix. After merge, the companion can re-dispatch Phase 6 against Qwen3-Coder-30B-MoE + observe the new moe_dispatch_not_implemented driver-error class. Meaningful agent-quality measurement (compliance_cost_ratio > 0) still requires Option B.

Test plan

cargo check -p aprender-serve --lib --no-default-features — clean
cargo test -p aprender-serve --lib qwen3_moe_dispatch_guard_tests --features cuda — 5/5 pass
cargo clippy -p aprender-serve --lib --no-default-features -- -D warnings — clean
PMAT pre-commit gates — clean
CI: standard workflow (full test suite + clippy + fmt across feature matrix)
Post-merge: companion-side CCPA Phase 6 re-dispatch produces moe_dispatch_not_implemented class

References

Fixes (partially): #1789 (apr serve: matmul_fused.rs:211 panics with 'index out of bounds: len 0' on Qwen3-Coder-30B-MoE F32 weight; Qwen3-MoE F32 routing). Option B in the scope doc closes #1789 fully.
Builds on: #1790 (fix(serve): #1789 matmul defensive guard against empty / undersized weights; MERGED 2026-05-18)
Cross-repo: paiml/claude-code-parity-apr M280 (CCPA suspension)

🤖 Generated with Claude Code

apr serve's chat-completions handler dispatches inference through Arc<Model>::generate(), which calls the dense FFN matmul path. For qwen3_moe GGUFs that path fails: per-expert tensors live under ffn_*_exps.weight; the dense ffn_up.weight references zero-byte data. aprender#1790's defensive guard surfaces this as RealizarError::InvalidShape, but the underlying dispatch is wrong: the MoE-aware path already exists at infer::run_inference (used by `apr run` CLI) and is not wired into the HTTP handler.

Until the full HTTP-to-MoE wire-up lands (Option B in the new scope doc), this PR inserts a clean architectural guard: detect qwen3_moe arch via AppState::model_architecture() + return StatusCode::NOT_IMPLEMENTED with a structured error citing aprender#1789 + the new contract YAML. The cryptic matmul error class becomes an actionable "MoE HTTP dispatch not yet implemented" class at the API surface.

Adds:
- contracts/qwen3-moe-serve-dispatch-v1.yaml (4 falsification gates V1_001..V1_004; V1_002 discharged by this PR's unit tests)
- docs/specifications/qwen3-moe-serve-dispatch-fix.md (root cause 5-whys, 3-option engineering trade-off, Option A implementation plan, companion-side CCPA Phase 6 integration plan)
- crates/aprender-serve/src/api/cuda_chat_backend.rs:
  - guard_qwen3_moe_dispatch() guard fn (called early in the chat-completions handler before any backend-specific path)
  - is_qwen3_moe_arch() testable predicate
  - 5 unit tests under qwen3_moe_dispatch_guard_tests covering canonical name + HuggingFace class names + lowercase variants + dense-arch negatives + unknown-arch negatives

Companion-side integration (paiml/claude-code-parity-apr): the M280 CCPA suspension's "harness-validation done; agent-quality measurement blocked on #1789" stance is unchanged. After this PR, Phase 6 re-dispatch against Qwen3-Coder-30B-MoE will produce a clean moe_dispatch_not_implemented driver-error class instead of the previous opaque matmul/InvalidShape class. Meaningful CCPA measurement still requires Option B (actual MoE inference via HTTP).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…l dep)

`tracing` is declared `optional = true` in aprender-serve/Cargo.toml, so unconditional `tracing::warn!` in the new guard fn fails to compile on CI feature combos that don't enable it (ci/test + workspace-test + ci/lint all observed E0433 "unresolved module `tracing`" against cuda_chat_backend.rs:659).

The rest of cuda_chat_backend.rs uses `eprintln!` for warn-level logging (verbose-mode prints, lock-failure messages). Match that style, which eliminates the optional-dep dependency entirely + keeps the guard's warning output consistent with surrounding code.

Local re-verify:
- cargo check -p aprender-serve --lib --no-default-features: clean
- cargo test -p aprender-serve --lib qwen3_moe_dispatch_guard_tests --features cuda: 5/5 pass

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…flight (#249)

Adds cross-references to paiml/aprender#1806 (Option A: qwen3_moe arch guard at /v1/chat/completions) in:
- evidence/phase-6/1.5b-calibration-run.md: new "Upstream-fix progress" section documenting aprender#1806's scope + post-merge expected outcome on companion side (clean moe_dispatch_not_implemented driver-error class instead of opaque matmul/InvalidShape)
- docs/specifications/phase-6-results-and-next-steps.md: adds aprender#1806 to the cross-reference list

This is a MECHANICAL status-tracking update consistent with the M280 suspension. No substantive new CCPA scope; no new contract gates; no new code. Pure cross-link refresh tracking external upstream-fix progress. M-counter NOT bumped per the M-counter discipline doctrine.

The CCPA project remains OFFICIALLY SUSPENDED pending Option B (the actual MoE-via-HTTP wire-up that closes aprender#1789). #1806 is Option A (clean error class at the API surface), which is strictly an improvement over the current opaque-failure state but does NOT enable MoE inference. Option B is a follow-up PR after #1806 merges.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…ons handler (#1807)

Replaces the Option A NOT_IMPLEMENTED guard (#1806) with a full MoE-aware dispatch path through the existing `run_qwen3_moe_generate` (the same code path used by `apr run` CLI). qwen3_moe-arch GGUFs served via /v1/chat/completions now actually generate tokens instead of returning NOT_IMPLEMENTED.

## Root cause closed

apr serve's chat handler at `cuda_chat_backend.rs:564` previously called `Arc<Model>::generate()` unconditionally → `Model::forward()` → dense FFN matmul on `ffn_up.weight`. For Qwen3-MoE GGUFs that tensor's data slice is empty (per-expert weights live under `ffn_up_exps.weight`), producing either a matmul panic OR (post-#1790) a clean `RealizarError::InvalidShape`. The MoE-aware path at `infer/inference_result.rs:225` already existed but was only wired into the CLI `apr run`. This PR threads it into the HTTP serve path.

## Implementation

- `AppState::mapped_gguf_model: Option<Arc<MappedGGUFModel>>` field + `with_mapped_gguf_model()` builder + `mapped_gguf_model()` accessor. Required because `run_qwen3_moe_generate` borrows per-expert tensors directly from the mmap; the mapped model must outlive any inference call (Arc gives it shared ownership across handler invocations).
- All 16 AppState ctor sites initialize `mapped_gguf_model: None,` (mechanical insertion via python regex; non-MoE paths unaffected).
- `prepare_gguf_serve_state` (CLI server-command load path) wraps the loaded `MappedGGUFModel` in an `Arc` + attaches it to the final AppState via `.with_mapped_gguf_model(...)`. For non-MoE archs this is just an extra Arc reference; for MoE it's the critical lifetime anchor.
- `try_qwen3_moe_backend()` replaces `guard_qwen3_moe_dispatch()` in `cuda_chat_backend.rs`. For non-qwen3_moe archs returns None (handler falls through to existing dense backends, so no regression). For qwen3_moe arch with retained mmap: tokenize prompt, build QuantizedGenerateConfig, call run_qwen3_moe_generate, decode + format chat-completions response. For qwen3_moe arch without retained mmap: returns NOT_IMPLEMENTED with actionable error (same class as Option A; defensive fallback).
- Contract v1.1.0: status_history records Phase 1 (Option A, #1806) + Phase 2 (Option B, this PR). FALSIFY-V1_001 + V1_003 are now discharged at the code level (integration-test fixture availability is a follow-up task; see V1_001's evidence note).

## What this PR does NOT do

- Streaming SSE: chat-completions stream=true falls back to the pregenerated_sse_response after the full batch generation. True per-token streaming would require run_qwen3_moe_generate to expose a per-token callback; that's a follow-up refactor.
- KV cache: run_qwen3_moe_generate is full-prefill-per-token. For 30B MoE this is catastrophically slow (~minutes per token). M32d's KV cache work would speed this up but is out of scope here.
- Integration test against a real qwen3_moe GGUF fixture: V1_001 contract gate. Deferred because the fixture infrastructure (small synthetic MoE GGUF) doesn't exist yet. The 5 unit tests carried over from #1806 still pass (they cover `is_qwen3_moe_arch` predicate).

## Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench against Qwen3-Coder-30B-MoE should now produce non-zero student pass rate (V1_004 falsification discharge). Operator-coordinated re-dispatch required.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
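
The three-way behaviour of `try_qwen3_moe_backend()` described in this commit can be sketched as below. This is a simplified illustration under stated assumptions: the `AppState` and `MappedGGUFModel` shown here are stripped-down stand-ins, `with_architecture` and the `MoeDispatch` enum are hypothetical, the architecture check is a plain string comparison rather than the real predicate, and the "generation" is a placeholder string instead of a call to `run_qwen3_moe_generate`.

```rust
use std::sync::Arc;

/// Stand-in for the memory-mapped GGUF file; the real type owns the mmap.
pub struct MappedGGUFModel {
    pub path: String,
}

/// Stripped-down AppState with only the fields this sketch needs.
#[derive(Default)]
pub struct AppState {
    architecture: Option<String>,
    mapped_gguf_model: Option<Arc<MappedGGUFModel>>,
}

impl AppState {
    /// Hypothetical builder used here to set the architecture string.
    pub fn with_architecture(mut self, arch: &str) -> Self {
        self.architecture = Some(arch.to_string());
        self
    }

    /// Builder that retains shared ownership of the mapped model.
    pub fn with_mapped_gguf_model(mut self, mapped: Arc<MappedGGUFModel>) -> Self {
        self.mapped_gguf_model = Some(mapped);
        self
    }

    pub fn mapped_gguf_model(&self) -> Option<&Arc<MappedGGUFModel>> {
        self.mapped_gguf_model.as_ref()
    }
}

/// Hypothetical outcome type; the real handler returns an HTTP response.
#[derive(Debug, PartialEq)]
pub enum MoeDispatch {
    Generated(String),
    NotImplemented(String),
}

/// None: not a qwen3_moe arch, fall through to the dense backends.
/// Some(Generated): mmap retained, MoE path runs (placeholder output here).
/// Some(NotImplemented): qwen3_moe arch but no retained mmap.
pub fn try_qwen3_moe_backend(state: &AppState, prompt: &str) -> Option<MoeDispatch> {
    let arch = state.architecture.as_deref()?;
    if arch != "qwen3_moe" {
        return None;
    }
    match state.mapped_gguf_model() {
        Some(mapped) => Some(MoeDispatch::Generated(format!(
            "[{}] {}",
            mapped.path, prompt
        ))),
        None => Some(MoeDispatch::NotImplemented(
            "qwen3_moe arch detected but mapped GGUF not retained in AppState".to_string(),
        )),
    }
}
```

Returning `Option` keeps the dense path untouched: the handler only changes behaviour when the function returns `Some`, and the `Arc` clone held by `AppState` keeps the mapping alive for as long as the state itself lives.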

…positive

Clippy's `doc_lazy_continuation` lint trips on the wrapped doc line ` + any future streaming/batch backends. See` because it parses the `+` at the start of a wrapped doc-comment line as a markdown list-item marker. Reword to use "and" instead of "+" + move the "See" line to its own sentence.

Local re-verify:
- cargo clippy -p aprender-serve --lib --no-default-features -- -D warnings: clean

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…li serve handlers (#1812)

* fix(distill): pre-warm non-LoRA backward kernels too (PMAT-698g)

Phase 3 dispatch v8 on gx10 reached the training loop and the first backward step began JIT-compiling silu_backward / batched_rms_norm_backward / rms_norm_gamma_reduce ON DEMAND, then failed with:

forward_backward_with_grad returned None (CUDA stream poisoned or gradient shape mismatch)

This is the documented Blackwell sm_121 JIT-during-active-GPU-work bug (trueno#200, CLAUDE.md "Backward kernels: Crash because they compile on-demand when GPU is already active").

Cause: `pre_warm_lora_backward_kernels` short-circuited the entire function at `lora_rank == 0`, leaving the activation/norm backward kernels to JIT on demand mid-training. The function name implies LoRA-only, but it actually pre-warmed shared non-LoRA kernels (silu_backward, batched_softmax_backward, batched_rms_norm_backward) that distillation training also needs.

Fix: restructure so that only the LoRA-specific gemm_backward warm-ups are gated on lora_rank > 0. The activation/norm/standard-FP32-GEMM backward kernels always pre-warm, regardless of LoRA mode. Distillation training (lora_rank == 0) now gets the full backward kernel cache before block upload, eliminating on-demand JIT and the resulting stream poisoning.

Test plan:
- [x] cargo check --features cuda: clean build
- [x] 18 cuda_backward lib tests pass
- [ ] Live gx10 dispatch reaches stepping (post-merge verification)

Stage 4 in the Phase 3 cuda dispatch defect cascade: PMAT-700-B → PMAT-698e → PMAT-698f → PMAT-698g. Each surfaced the next defect on the gx10 path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): Option B follow-up — thread MappedGGUFModel through apr-cli serve handlers

The squashed Option B PR (#1806 + #1807, commit 9c97452) wired `with_mapped_gguf_model()` into `aprender-serve/src/cli/mod_server_commands.rs`, but that's the wrong entry point. `apr serve` actually dispatches through `apr-cli/src/commands/serve/{handlers, handler_gpu_completion, handlers_include_01, server}.rs`. None of those called `.with_mapped_gguf_model()`, so production `apr serve` runs against qwen3_moe GGUFs hit the defensive NOT_IMPLEMENTED fallback in `try_qwen3_moe_backend` (state.mapped_gguf_model() returned None).

## Root cause

apr-cli has TWO entry points to serve:
1. `aprender-serve/src/cli/mod_server_commands.rs::prepare_gguf_serve_state`: fixed in #1806/#1807, never called by `apr serve` subcommand
2. `apr-cli/src/commands/serve/handler_gpu_completion.rs::start_gguf_server` → `start_gguf_server_cuda` / `start_gguf_server_gpu_batched` / `run_cpu_server`: this is the actual serve path

The empirical evidence: paiml/claude-code-parity-apr Phase 6 bench dispatched at 13:05Z against Qwen3-Coder-30B-MoE produced 4 of 4 captures with `outcome: driver_error, reason: HTTP 501 "qwen3_moe arch detected but mapped GGUF not retained in AppState"`. That error fires from the defensive fallback branch in `try_qwen3_moe_backend`, proving the dispatch reaches the qwen3_moe path but the mmap isn't plumbed through.

## Fix

`start_gguf_server` now wraps the `MappedGGUFModel` in `Arc<MappedGGUFModel>` immediately after `from_path` (cheap Arc bump shared across all dispatch branches). Threaded into:
- `start_gguf_server_cuda(quantized, vocab, mapped: Arc<...>, config)`: `.with_mapped_gguf_model(mapped.clone())` on the constructed AppState.
- `start_gguf_server_gpu_batched(quantized, vocab, mapped: Arc<...>, config)`: same.
- `run_cpu_server(quantized, vocab, mapped: Option<Arc<...>>, config)`: `Option` so the APR-format / non-GGUF callers can pass `None` (defensive fallback path remains the clean NOT_IMPLEMENTED).

Callers updated:
- `handler_gpu_completion.rs::start_gguf_server`: wraps in Arc + passes through three branches.
- `handler_gpu_completion.rs::start_gguf_server_cuda` fallback CPU branch: passes `Some(mapped_model)`.
- `handlers.rs::try_apr_quantized_cpu`: passes `None` (APR format).
- `handlers_include_01.rs` (GH-99 APR Q4K): passes `None`.

## Empirical verification

Smoke-test post-fix:

```
apr serve run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --port 19999 --host 127.0.0.1 --gpu
curl -X POST http://127.0.0.1:19999/v1/chat/completions \
  -d '{"model":"qwen3-coder","messages":[{"role":"user","content":"..."}],"max_tokens":10}'
```

Returns HTTP 200 with valid OpenAI-shape JSON containing generated tokens. The matmul defensive guard (#1790) does NOT fire. V1_001 + V1_003 in contracts/qwen3-moe-serve-dispatch-v1.yaml are empirically discharged.

## Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench is dispatching against this binary now. Expected outcome: student_pass_rate > 0 on at least some fixtures (V1_004 falsifier discharge condition).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(#1789): make agent HTTP timeout configurable + raise default to 1800s for MoE

Hardcoded 120s HTTP timeout in `AprServeDriver::complete` was too short for 30B MoE inference without KV cache. Each generated token requires a full prefill of the entire sequence; a 256-token request on Qwen3-Coder-30B-A3B takes >>120s wall, so every Phase 6 bench fixture died with "Error: driver error: network error: apr serve: error sending request for url" at exactly the 120s mark.

Same root-cause class as aprender#1782 (apr serve startup 30s timeout that wasn't configurable + size-aware). Fix is symmetric: env-var override + size-aware default.

Override via `APR_AGENT_HTTP_TIMEOUT_S`. Default raised to 1800s (30 min): matches the CCPA Phase 6 bench's per-turn-timeout=900s ceiling + leaves headroom for large MoE inference until M32d KV cache lands. For dense models / KV-cache builds this is effectively unbounded.

Empirical post-fix evidence pending: Phase 6 bench re-dispatch against Qwen3-Coder-30B-A3B with this binary expected to produce non-driver_error outcomes (oracle_passed, oracle_failed_after_max_turns, or oracle_failed). Discharges the implicit `max_http_timeout_must_accommodate_inference_wall` precondition embedded in `qwen3-moe-serve-dispatch-v1.yaml` v1.1.0's V1_004.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
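
The env-var override with a raised default, as described in the timeout commit, might look roughly like the following. Only the variable name `APR_AGENT_HTTP_TIMEOUT_S` and the 1800s default come from the commit message; the function names, the constant name, and the choice to fall back to the default on unparsable or zero values are assumptions made for this sketch, not the behaviour of `AprServeDriver::complete`.

```rust
use std::time::Duration;

/// Default from the commit message: 1800 seconds (30 minutes).
pub const DEFAULT_AGENT_HTTP_TIMEOUT_S: u64 = 1800;

/// Pure parsing step, kept separate from the environment read so it can be
/// tested directly. Assumption: invalid or zero values fall back to the default.
pub fn parse_timeout_secs(raw: Option<&str>) -> u64 {
    raw.and_then(|value| value.trim().parse::<u64>().ok())
        .filter(|&secs| secs > 0)
        .unwrap_or(DEFAULT_AGENT_HTTP_TIMEOUT_S)
}

/// Reads APR_AGENT_HTTP_TIMEOUT_S and converts the result to a Duration.
pub fn agent_http_timeout() -> Duration {
    let raw = std::env::var("APR_AGENT_HTTP_TIMEOUT_S").ok();
    Duration::from_secs(parse_timeout_secs(raw.as_deref()))
}
```

With this shape, running with `APR_AGENT_HTTP_TIMEOUT_S=900` would give a 900s timeout, and leaving the variable unset would give the 1800s default.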

This was referenced May 19, 2026

apr serve: matmul_fused.rs:211 panics with 'index out of bounds: len 0' on Qwen3-Coder-30B-MoE F32 weight #1789

Closed

docs(M281): upstream-fix progress note — aprender#1806 (Option A) in flight paiml/claude-code-parity-apr#249

Merged

Merge branch 'main' into fix/1789-qwen3-moe-serve-dispatch

2151e73

noahgift mentioned this pull request May 19, 2026

fix(#1789): Option B — wire run_qwen3_moe_generate into chat-completions handler #1807

Merged

3 tasks

noahgift and others added 3 commits May 19, 2026 10:07

Merge branch 'main' into fix/1789-qwen3-moe-serve-dispatch

351964c

noahgift merged commit 9c97452 into main May 19, 2026
10 checks passed

noahgift deleted the fix/1789-qwen3-moe-serve-dispatch branch May 19, 2026 08:38

noahgift mentioned this pull request May 19, 2026

fix(#1789): Option B follow-up — thread MappedGGUFModel through apr-cli serve handlers #1812

Merged

4 tasks

noahgift merged 6 commits into main from fix/1789-qwen3-moe-serve-dispatch
