fix(#1789): Option B follow-up - thread MappedGGUFModel through apr-cli serve handlers #1812
Merged
Phase 3 dispatch v8 on gx10 reached the training loop, and the first backward step began JIT-compiling silu_backward / batched_rms_norm_backward / rms_norm_gamma_reduce ON DEMAND, then failed with:

    forward_backward_with_grad returned None (CUDA stream poisoned or gradient shape mismatch)

This is the documented Blackwell sm_121 JIT-during-active-GPU-work bug (trueno#200, CLAUDE.md "Backward kernels: Crash because they compile on-demand when GPU is already active").

Cause: `pre_warm_lora_backward_kernels` short-circuited the entire function at `lora_rank == 0`, leaving the activation/norm backward kernels to JIT on demand mid-training. The function name implies LoRA-only, but it actually pre-warmed shared non-LoRA kernels (silu_backward, batched_softmax_backward, batched_rms_norm_backward) that distillation training also needs.

Fix: restructure so that only the LoRA-specific gemm_backward warm-ups are gated on lora_rank > 0. The activation/norm/standard-FP32-GEMM backward kernels always pre-warm, regardless of LoRA mode. Distillation training (lora_rank == 0) now gets the full backward kernel cache before block upload, eliminating on-demand JIT and the resulting stream poisoning.

Test plan:

- [x] cargo check --features cuda: clean build
- [x] 18 cuda_backward lib tests pass
- [ ] Live gx10 dispatch reaches stepping (post-merge verification)

Stage 4 in the Phase 3 cuda dispatch defect cascade: PMAT-700-B → PMAT-698e → PMAT-698f → PMAT-698g. Each surfaced the next defect on the gx10 path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
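The gating change in this commit message can be illustrated with a small sketch. It only models which kernels get selected for pre-warming; it does not compile anything. The three shared kernel names come from the commit message, while `kernels_to_pre_warm` and the two LoRA kernel names are hypothetical placeholders, not the real function or kernel identifiers.

```rust
/// Shared backward kernels named in the commit message. These are needed by
/// both LoRA and distillation training, so they are always selected.
const SHARED_BACKWARD_KERNELS: &[&str] = &[
    "silu_backward",
    "batched_softmax_backward",
    "batched_rms_norm_backward",
];

/// Hypothetical placeholder names for the LoRA-specific gemm_backward warm-ups.
const LORA_BACKWARD_KERNELS: &[&str] = &["lora_gemm_backward_a", "lora_gemm_backward_b"];

/// Returns the kernels to compile before training starts.
/// Only the LoRA-specific entries are gated on `lora_rank > 0`.
fn kernels_to_pre_warm(lora_rank: usize) -> Vec<&'static str> {
    let mut kernels: Vec<&'static str> = SHARED_BACKWARD_KERNELS.to_vec();
    if lora_rank > 0 {
        kernels.extend_from_slice(LORA_BACKWARD_KERNELS);
    }
    kernels
}
```

With `lora_rank == 0` (the distillation case) the shared kernels are still returned, which is the behavior the fix describes; the earlier early-return would correspond to returning an empty list here.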
…li serve handlers

The squashed Option B PR (#1806 + #1807, commit 9c97452) wired `with_mapped_gguf_model()` into `aprender-serve/src/cli/mod_server_commands.rs`, but that's the wrong entry point. `apr serve` actually dispatches through `apr-cli/src/commands/serve/{handlers, handler_gpu_completion, handlers_include_01, server}.rs`. None of those called `.with_mapped_gguf_model()`, so production `apr serve` runs against qwen3_moe GGUFs hit the defensive NOT_IMPLEMENTED fallback in `try_qwen3_moe_backend` (state.mapped_gguf_model() returned None).

## Root cause

apr-cli has TWO entry points to serve:

1. `aprender-serve/src/cli/mod_server_commands.rs::prepare_gguf_serve_state`: fixed in #1806/#1807, never called by the `apr serve` subcommand
2. `apr-cli/src/commands/serve/handler_gpu_completion.rs::start_gguf_server` → `start_gguf_server_cuda` / `start_gguf_server_gpu_batched` / `run_cpu_server`: this is the actual serve path

The empirical evidence: paiml/claude-code-parity-apr Phase 6 bench dispatched at 13:05Z against Qwen3-Coder-30B-MoE produced 4 of 4 captures with `outcome: driver_error, reason: HTTP 501 "qwen3_moe arch detected but mapped GGUF not retained in AppState"`. That error fires from the defensive fallback branch in `try_qwen3_moe_backend`, proving the dispatch reaches the qwen3_moe path but the mmap isn't plumbed through.

## Fix

`start_gguf_server` now wraps the `MappedGGUFModel` in `Arc<MappedGGUFModel>` immediately after `from_path` (cheap Arc bump shared across all dispatch branches). Threaded into:

- `start_gguf_server_cuda(quantized, vocab, mapped: Arc<...>, config)`: `.with_mapped_gguf_model(mapped.clone())` on the constructed AppState.
- `start_gguf_server_gpu_batched(quantized, vocab, mapped: Arc<...>, config)`: same.
- `run_cpu_server(quantized, vocab, mapped: Option<Arc<...>>, config)`: `Option` so the APR-format / non-GGUF callers can pass `None` (defensive fallback path remains the clean NOT_IMPLEMENTED).

Callers updated:

- `handler_gpu_completion.rs::start_gguf_server`: wraps in Arc + passes through three branches.
- `handler_gpu_completion.rs::start_gguf_server_cuda` fallback CPU branch: passes `Some(mapped_model)`.
- `handlers.rs::try_apr_quantized_cpu`: passes `None` (APR format).
- `handlers_include_01.rs` (GH-99 APR Q4K): passes `None`.

## Empirical verification

Smoke-test post-fix:

```
apr serve run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --port 19999 --host 127.0.0.1 --gpu
curl -X POST http://127.0.0.1:19999/v1/chat/completions \
  -d '{"model":"qwen3-coder","messages":[{"role":"user","content":"..."}],"max_tokens":10}'
```

Returns HTTP 200 with valid OpenAI-shape JSON containing generated tokens. The matmul defensive guard (#1790) does NOT fire. V1_001 + V1_003 in contracts/qwen3-moe-serve-dispatch-v1.yaml are empirically discharged.

## Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench is dispatching against this binary now. Expected outcome: student_pass_rate > 0 on at least some fixtures (V1_004 falsifier discharge condition).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
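The Arc threading and the defensive fallback described in this commit message can be sketched roughly as follows. This is a simplified model, not the actual apr-cli code: `MappedGGUFModel` and `AppState` here are stand-in structs, `build_gpu_state` and `build_cpu_state` are hypothetical helpers that mirror the `Arc<...>` versus `Option<Arc<...>>` parameter shapes, and the `(u16, &str)` error is an assumed stand-in for the real HTTP 501 response.

```rust
use std::sync::Arc;

/// Stand-in for the real memory-mapped model type (assumption).
struct MappedGGUFModel {
    path: String,
}

/// Stand-in for the server's shared state (assumption).
#[derive(Default)]
struct AppState {
    mapped: Option<Arc<MappedGGUFModel>>,
}

impl AppState {
    /// Builder-style setter, mirroring the method name used in the PR.
    fn with_mapped_gguf_model(mut self, mapped: Arc<MappedGGUFModel>) -> Self {
        self.mapped = Some(mapped);
        self
    }

    fn mapped_gguf_model(&self) -> Option<&Arc<MappedGGUFModel>> {
        self.mapped.as_ref()
    }
}

/// GPU branches always receive the mapped model (hypothetical helper).
fn build_gpu_state(mapped: Arc<MappedGGUFModel>) -> AppState {
    AppState::default().with_mapped_gguf_model(mapped)
}

/// The CPU branch takes an Option so non-GGUF callers can pass None
/// (hypothetical helper).
fn build_cpu_state(mapped: Option<Arc<MappedGGUFModel>>) -> AppState {
    match mapped {
        Some(m) => AppState::default().with_mapped_gguf_model(m),
        None => AppState::default(),
    }
}

/// Defensive fallback: without a retained mapping, report 501 instead of
/// attempting inference. Returns the model path only as a placeholder result.
fn try_qwen3_moe_backend(state: &AppState) -> Result<&str, (u16, &'static str)> {
    match state.mapped_gguf_model() {
        Some(m) => Ok(m.path.as_str()),
        None => Err((
            501,
            "qwen3_moe arch detected but mapped GGUF not retained in AppState",
        )),
    }
}
```

Cloning the `Arc` only increments a reference count, so each dispatch branch can hold the same mapping without reloading the file. A state built with `None` still reaches the fallback and gets the clean 501 rather than a panic.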
…800s for MoE

Hardcoded 120s HTTP timeout in `AprServeDriver::complete` was too short for 30B MoE inference without KV cache. Each generated token requires a full prefill of the entire sequence; a 256-token request on Qwen3-Coder-30B-A3B takes >>120s wall, so every Phase 6 bench fixture died with "Error: driver error: network error: apr serve: error sending request for url" at exactly the 120s mark.

Same root-cause class as aprender#1782 (apr serve startup 30s timeout that wasn't configurable + size-aware). Fix is symmetric: env-var override + size-aware default.

Override via `APR_AGENT_HTTP_TIMEOUT_S`. Default raised to 1800s (30 min): matches the CCPA Phase 6 bench's per-turn-timeout=900s ceiling + leaves headroom for large MoE inference until M32d KV cache lands. For dense models / KV-cache builds this is effectively unbounded.

Empirical post-fix evidence pending: Phase 6 bench re-dispatch against Qwen3-Coder-30B-A3B with this binary expected to produce non-driver_error outcomes (oracle_passed, oracle_failed_after_max_turns, or oracle_failed).

Discharges the implicit `max_http_timeout_must_accommodate_inference_wall` precondition embedded in `qwen3-moe-serve-dispatch-v1.yaml` v1.1.0's V1_004.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
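The env-var override with a raised default, as described in this commit message, might look something like the sketch below. Only the variable name `APR_AGENT_HTTP_TIMEOUT_S` and the 1800s default come from the commit message. The function names, the treatment of unparsable or zero values as "use the default", and the omission of the size-aware part are assumptions made for illustration.

```rust
use std::time::Duration;

/// Default from the commit message: 1800s (30 min).
const DEFAULT_HTTP_TIMEOUT_S: u64 = 1800;

/// Parses an optional override value, falling back to the default when the
/// value is missing, not a number, or zero (the zero rule is an assumption).
fn parse_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_HTTP_TIMEOUT_S);
    Duration::from_secs(secs)
}

/// Reads the override from the environment (hypothetical helper name).
fn http_timeout_from_env() -> Duration {
    let raw = std::env::var("APR_AGENT_HTTP_TIMEOUT_S").ok();
    parse_timeout(raw.as_deref())
}
```

Keeping the parsing in a pure function separate from the environment lookup makes the fallback rules easy to unit-test without mutating process-wide state.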
## Summary

Companion to #1806/#1807. Option B's `with_mapped_gguf_model()` was wired into `aprender-serve/src/cli/mod_server_commands.rs::prepare_gguf_serve_state`, but that's not the entry point `apr serve` uses. The actual path is `apr-cli/src/commands/serve/handler_gpu_completion.rs::start_gguf_server`, which constructs AppState via `start_gguf_server_cuda` / `start_gguf_server_gpu_batched` / `run_cpu_server`. None of those called `.with_mapped_gguf_model()`, so production `apr serve` runs hit the defensive `NOT_IMPLEMENTED` fallback in `try_qwen3_moe_backend`.

## Empirical evidence the bug existed

paiml/claude-code-parity-apr Phase 6 dispatched against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf at 13:05Z produced 4 of 4 student-side captures with:

    {
      "outcome": {
        "kind": "driver_error",
        "reason": "driver subprocess exited with status 1: ... apr serve HTTP 501: {\"error\":\"qwen3_moe arch detected but mapped GGUF not retained in AppState. See aprender#1789 + contracts/qwen3-moe-serve-dispatch-v1.yaml.\"}\n"
      }
    }

That 501 message comes from THIS PR's branch in `try_qwen3_moe_backend`'s defensive fallback (when `state.mapped_gguf_model()` returns None). Empirical proof the dispatch reaches the qwen3_moe path but the mmap isn't plumbed through.

## Fix

`start_gguf_server` wraps the `MappedGGUFModel::from_path` result in `Arc<MappedGGUFModel>` immediately. Threaded through all dispatch branches:

- `start_gguf_server_cuda(quantized, vocab, mapped: Arc<...>, config)`: `.with_mapped_gguf_model(mapped.clone())` on the constructed AppState.
- `start_gguf_server_gpu_batched(quantized, vocab, mapped: Arc<...>, config)`: same.
- `run_cpu_server(quantized, vocab, mapped: Option<Arc<...>>, config)`: `Option` so APR / SafeTensors callers can pass `None` (the qwen3_moe dispatch guard's defensive fallback returns `NOT_IMPLEMENTED` cleanly).

## Empirical verification post-fix

The matmul defensive guard (#1790) does NOT fire. V1_001 + V1_003 in `contracts/qwen3-moe-serve-dispatch-v1.yaml` are empirically discharged.

paiml/claude-code-parity-apr Phase 6 bench re-dispatched at 13:30Z with this binary; V1_004 verification in flight.

## Test plan

- `cargo check -p apr-cli --features "inference cuda"`: clean
- `apr serve` against Qwen3-Coder-30B-MoE GGUF returns generated tokens via `/v1/chat/completions`

🤖 Generated with Claude Code