fix(#1789): Option B follow-up — thread MappedGGUFModel through apr-cli serve handlers by noahgift · Pull Request #1812 · paiml/aprender

noahgift · 2026-05-19T11:32:57Z

Summary

Companion to #1806/#1807. Option B's with_mapped_gguf_model() was wired into aprender-serve/src/cli/mod_server_commands.rs::prepare_gguf_serve_state — but that's not the entry point apr serve uses. The actual path is apr-cli/src/commands/serve/handler_gpu_completion.rs::start_gguf_server, which constructs AppState via start_gguf_server_cuda / start_gguf_server_gpu_batched / run_cpu_server. None of those called .with_mapped_gguf_model(), so production apr serve runs hit the defensive NOT_IMPLEMENTED fallback in try_qwen3_moe_backend.

Empirical evidence the bug existed

paiml/claude-code-parity-apr Phase 6 dispatched against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf at 13:05Z produced 4 of 4 student-side captures with:

{
  "outcome": {
    "kind": "driver_error",
    "reason": "driver subprocess exited with status 1: ... apr serve HTTP 501: {\"error\":\"qwen3_moe arch detected but mapped GGUF not retained in AppState. See aprender#1789 + contracts/qwen3-moe-serve-dispatch-v1.yaml.\"}\n"
  }
}

That 501 message comes from THIS PR's branch in try_qwen3_moe_backend's defensive fallback (taken when state.mapped_gguf_model() returns None). This is empirical proof that the dispatch reaches the qwen3_moe path but the mmap isn't plumbed through.
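
The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the code in the repository: `AppState`, `MappedGGUFModel`, `DispatchError`, and the `try_moe_backend` signature are hypothetical stand-ins, and only the error text is taken from the 501 response quoted above.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the real memory-mapped model type.
pub struct MappedGGUFModel;

// Hypothetical stand-in for the server state; the real AppState has more fields.
pub struct AppState {
    pub mapped: Option<Arc<MappedGGUFModel>>,
}

#[derive(Debug, PartialEq)]
pub enum DispatchError {
    NotImplemented(String),
}

impl DispatchError {
    // NOT_IMPLEMENTED maps to HTTP 501.
    pub fn http_status(&self) -> u16 {
        match self {
            DispatchError::NotImplemented(_) => 501,
        }
    }
}

// Returns None when the architecture is not handled by this backend,
// Some(Err(..)) when it is handled but the mmap was never stored in the state,
// and Some(Ok(..)) when the mapped model is available.
pub fn try_moe_backend(
    state: &AppState,
    arch: &str,
) -> Option<Result<Arc<MappedGGUFModel>, DispatchError>> {
    if arch != "qwen3_moe" {
        return None;
    }
    Some(match &state.mapped {
        Some(model) => Ok(Arc::clone(model)),
        None => Err(DispatchError::NotImplemented(
            "qwen3_moe arch detected but mapped GGUF not retained in AppState".to_string(),
        )),
    })
}
```

With this shape, a state built without the mapped model fails loudly with a 501 instead of silently falling through to a backend that cannot run the model.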

Fix

start_gguf_server now wraps the result of MappedGGUFModel::from_path in Arc<MappedGGUFModel> immediately, and the Arc is threaded through all dispatch branches:

start_gguf_server_cuda(quantized, vocab, mapped: Arc<...>, config) — .with_mapped_gguf_model(mapped.clone()) on the constructed AppState.
start_gguf_server_gpu_batched(quantized, vocab, mapped: Arc<...>, config) — same.
run_cpu_server(quantized, vocab, mapped: Option<Arc<...>>, config) — Option so APR / SafeTensors callers can pass None (the qwen3_moe dispatch guard's defensive fallback returns NOT_IMPLEMENTED cleanly).
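
A minimal sketch of the threading pattern in the list above, assuming simplified types. The `AppState` and `MappedGGUFModel` definitions here are hypothetical stand-ins, and `build_gpu_state` / `build_cpu_state` are illustrative names for the GPU and CPU branches, not functions in apr-cli. Only the builder name `with_mapped_gguf_model` and the accessor `mapped_gguf_model` come from the PR text.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the real mmap-backed model; the field is illustrative.
pub struct MappedGGUFModel {
    pub path: String,
}

// Hypothetical stand-in for the server state.
#[derive(Default)]
pub struct AppState {
    mapped_gguf_model: Option<Arc<MappedGGUFModel>>,
}

impl AppState {
    // Builder-style setter: retains a shared handle to the mmap.
    pub fn with_mapped_gguf_model(mut self, mapped: Arc<MappedGGUFModel>) -> Self {
        self.mapped_gguf_model = Some(mapped);
        self
    }

    pub fn mapped_gguf_model(&self) -> Option<&Arc<MappedGGUFModel>> {
        self.mapped_gguf_model.as_ref()
    }
}

// GPU branches always have a GGUF mmap, so they take the Arc directly.
pub fn build_gpu_state(mapped: Arc<MappedGGUFModel>) -> AppState {
    AppState::default().with_mapped_gguf_model(mapped)
}

// The CPU branch also serves non-GGUF formats, so the mmap is optional.
pub fn build_cpu_state(mapped: Option<Arc<MappedGGUFModel>>) -> AppState {
    match mapped {
        Some(m) => AppState::default().with_mapped_gguf_model(m),
        None => AppState::default(),
    }
}
```

Wrapping the model in an `Arc` once at the top of the dispatch function means each branch only pays for a reference-count increment, and the `Option` on the CPU path keeps APR / SafeTensors callers compiling without a dummy mmap.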

Empirical verification post-fix

apr serve run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --port 19999 --host 127.0.0.1 --gpu

curl -X POST http://127.0.0.1:19999/v1/chat/completions \
  -d '{"model":"qwen3-coder","messages":[{"role":"user","content":"Reply with exactly: HELLO"}],"max_tokens":10}'

→ HTTP 200, valid OpenAI-shape JSON with non-empty generated content

The matmul defensive guard (#1790) does NOT fire. V1_001 + V1_003 in contracts/qwen3-moe-serve-dispatch-v1.yaml are empirically discharged.

paiml/claude-code-parity-apr Phase 6 bench re-dispatched at 13:30Z with this binary; V1_004 verification in flight.

Test plan

cargo check -p apr-cli --features "inference cuda" — clean
Smoke test: real apr serve against Qwen3-Coder-30B-MoE GGUF returns generated tokens via /v1/chat/completions
CI (sovereign-ci full workflow)
CCPA Phase 6 bench non-zero student pass rate

🤖 Generated with Claude Code

Phase 3 dispatch v8 on gx10 reached the training loop and the first backward step began JIT-compiling silu_backward / batched_rms_norm_backward / rms_norm_gamma_reduce ON DEMAND, then failed with:

forward_backward_with_grad returned None (CUDA stream poisoned or gradient shape mismatch)

This is the documented Blackwell sm_121 JIT-during-active-GPU-work bug (trueno#200, CLAUDE.md "Backward kernels: Crash because they compile on-demand when GPU is already active").

Cause: `pre_warm_lora_backward_kernels` short-circuited the entire function at `lora_rank == 0`, leaving the activation/norm backward kernels to JIT on demand mid-training. The function name implies LoRA-only, but it actually pre-warmed shared non-LoRA kernels (silu_backward, batched_softmax_backward, batched_rms_norm_backward) that distillation training also needs.

Fix: restructure so that only the LoRA-specific gemm_backward warm-ups are gated on lora_rank > 0. The activation/norm/standard-FP32-GEMM backward kernels always pre-warm, regardless of LoRA mode. Distillation training (lora_rank == 0) now gets the full backward kernel cache before block upload, eliminating on-demand JIT and the resulting stream poisoning.

Test plan:
- [x] cargo check --features cuda (clean build)
- [x] 18 cuda_backward lib tests pass
- [ ] Live gx10 dispatch reaches stepping (post-merge verification)

Stage 4 in the Phase 3 cuda dispatch defect cascade: PMAT-700-B → PMAT-698e → PMAT-698f → PMAT-698g. Each surfaced the next defect on the gx10 path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
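
The restructured gating in this commit can be sketched as a pure function that decides which kernels to warm. This is an illustration under assumptions: `backward_kernels_to_prewarm` and the `lora_gemm_backward` label are hypothetical, and the real function compiles CUDA kernels rather than returning names. The three shared kernel names are the ones listed in the commit message.

```rust
// Kernels named in the commit message as shared by LoRA and distillation training.
const SHARED_BACKWARD_KERNELS: [&str; 3] = [
    "silu_backward",
    "batched_softmax_backward",
    "batched_rms_norm_backward",
];

// Hypothetical label standing in for the LoRA-specific gemm_backward warm-ups.
const LORA_BACKWARD_KERNELS: [&str; 1] = ["lora_gemm_backward"];

// Returns the kernel names to compile before any GPU work starts.
// Shared kernels are always included; LoRA kernels only when lora_rank > 0.
pub fn backward_kernels_to_prewarm(lora_rank: usize) -> Vec<&'static str> {
    let mut kernels: Vec<&'static str> = SHARED_BACKWARD_KERNELS.to_vec();
    if lora_rank > 0 {
        kernels.extend_from_slice(&LORA_BACKWARD_KERNELS);
    }
    kernels
}
```

The point of the shape is that `lora_rank == 0` no longer produces an empty list, which is what the earlier early return amounted to.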

…li serve handlers

The squashed Option B PR (#1806 + #1807, commit 9c97452) wired `with_mapped_gguf_model()` into `aprender-serve/src/cli/mod_server_commands.rs`, but that's the wrong entry point. `apr serve` actually dispatches through `apr-cli/src/commands/serve/{handlers, handler_gpu_completion, handlers_include_01, server}.rs`. None of those called `.with_mapped_gguf_model()`, so production `apr serve` runs against qwen3_moe GGUFs hit the defensive NOT_IMPLEMENTED fallback in `try_qwen3_moe_backend` (state.mapped_gguf_model() returned None).

## Root cause

apr-cli has TWO entry points to serve:

1. `aprender-serve/src/cli/mod_server_commands.rs::prepare_gguf_serve_state`: fixed in #1806/#1807, never called by the `apr serve` subcommand
2. `apr-cli/src/commands/serve/handler_gpu_completion.rs::start_gguf_server` → `start_gguf_server_cuda` / `start_gguf_server_gpu_batched` / `run_cpu_server`: this is the actual serve path

The empirical evidence: paiml/claude-code-parity-apr Phase 6 bench dispatched at 13:05Z against Qwen3-Coder-30B-MoE produced 4 of 4 captures with `outcome: driver_error, reason: HTTP 501 "qwen3_moe arch detected but mapped GGUF not retained in AppState"`. That error fires from the defensive fallback branch in `try_qwen3_moe_backend`, proving the dispatch reaches the qwen3_moe path but the mmap isn't plumbed through.

## Fix

`start_gguf_server` now wraps the `MappedGGUFModel` in `Arc<MappedGGUFModel>` immediately after `from_path` (cheap Arc bump shared across all dispatch branches). Threaded into:

- `start_gguf_server_cuda(quantized, vocab, mapped: Arc<...>, config)`: `.with_mapped_gguf_model(mapped.clone())` on the constructed AppState.
- `start_gguf_server_gpu_batched(quantized, vocab, mapped: Arc<...>, config)`: same.
- `run_cpu_server(quantized, vocab, mapped: Option<Arc<...>>, config)`: `Option` so the APR-format / non-GGUF callers can pass `None` (defensive fallback path remains the clean NOT_IMPLEMENTED).

Callers updated:

- `handler_gpu_completion.rs::start_gguf_server`: wraps in Arc + passes through three branches.
- `handler_gpu_completion.rs::start_gguf_server_cuda` fallback CPU branch: passes `Some(mapped_model)`.
- `handlers.rs::try_apr_quantized_cpu`: passes `None` (APR format).
- `handlers_include_01.rs` (GH-99 APR Q4K): passes `None`.

## Empirical verification

Smoke-test post-fix:

```
apr serve run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --port 19999 --host 127.0.0.1 --gpu

curl -X POST http://127.0.0.1:19999/v1/chat/completions \
  -d '{"model":"qwen3-coder","messages":[{"role":"user","content":"..."}],"max_tokens":10}'
```

Returns HTTP 200 with valid OpenAI-shape JSON containing generated tokens. The matmul defensive guard (#1790) does NOT fire. V1_001 + V1_003 in contracts/qwen3-moe-serve-dispatch-v1.yaml are empirically discharged.

## Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench is dispatching against this binary now. Expected outcome: student_pass_rate > 0 on at least some fixtures (V1_004 falsifier discharge condition).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…800s for MoE

Hardcoded 120s HTTP timeout in `AprServeDriver::complete` was too short for 30B MoE inference without KV cache. Each generated token requires a full prefill of the entire sequence; a 256-token request on Qwen3-Coder-30B-A3B takes >>120s wall, so every Phase 6 bench fixture died with "Error: driver error: network error: apr serve: error sending request for url" at exactly the 120s mark.

Same root-cause class as aprender#1782 (apr serve startup 30s timeout that wasn't configurable + size-aware). Fix is symmetric: env-var override + size-aware default.

Override via `APR_AGENT_HTTP_TIMEOUT_S`. Default raised to 1800s (30 min), which matches the CCPA Phase 6 bench's per-turn-timeout=900s ceiling and leaves headroom for large MoE inference until M32d KV cache lands. For dense models / KV-cache builds this is effectively unbounded.

Empirical post-fix evidence pending: Phase 6 bench re-dispatch against Qwen3-Coder-30B-A3B with this binary expected to produce non-driver_error outcomes (oracle_passed, oracle_failed_after_max_turns, or oracle_failed).

Discharges the implicit `max_http_timeout_must_accommodate_inference_wall` precondition embedded in `qwen3-moe-serve-dispatch-v1.yaml` v1.1.0's V1_004.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
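
The env-var override with a raised default could look roughly like the sketch below. The variable name `APR_AGENT_HTTP_TIMEOUT_S` and the 1800s default come from the commit message; the helper names are hypothetical, and treating an unparseable or zero value as "use the default" is an assumption of this sketch, not documented behaviour. The size-aware part of the default is not modelled here.

```rust
use std::time::Duration;

// Default from the commit message: 1800 seconds (30 minutes).
pub const DEFAULT_HTTP_TIMEOUT_S: u64 = 1800;

// Pure helper so the parsing rule can be tested without touching the environment.
// Assumption: missing, non-numeric, or zero values fall back to the default.
pub fn resolve_http_timeout(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_HTTP_TIMEOUT_S);
    Duration::from_secs(secs)
}

// Reads the override from the process environment.
pub fn http_timeout_from_env() -> Duration {
    let raw = std::env::var("APR_AGENT_HTTP_TIMEOUT_S").ok();
    resolve_http_timeout(raw.as_deref())
}
```

Splitting the parsing from the environment read keeps the fallback rule unit-testable, and the resulting `Duration` can be handed to whatever HTTP client the driver uses.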

noahgift and others added 4 commits May 19, 2026 13:04

Merge branch 'main' into fix/1789-option-b-apr-cli-serve-wire

a6a68e8

noahgift merged commit 2552749 into main May 19, 2026
10 checks passed

noahgift deleted the fix/1789-option-b-apr-cli-serve-wire branch May 19, 2026 12:24

This was referenced May 19, 2026

fix(#1789): make AprServeDriver max_tokens cap env-configurable #1814

Merged

spec(M32d): KV cache for qwen3_moe inference path — scope + operator decision doc #1826

Merged

M32d: KV cache for qwen3_moe inference path (engineer-driven, 1-2 week) #1830

Closed
