fix(#1789): Option B follow-up — thread MappedGGUFModel through apr-cli serve handlers#1812

Merged: noahgift merged 4 commits into main from fix/1789-option-b-apr-cli-serve-wire on May 19, 2026
Conversation

@noahgift

Summary

Companion to #1806/#1807. Option B's with_mapped_gguf_model() was wired into aprender-serve/src/cli/mod_server_commands.rs::prepare_gguf_serve_state — but that's not the entry point apr serve uses. The actual path is apr-cli/src/commands/serve/handler_gpu_completion.rs::start_gguf_server, which constructs AppState via start_gguf_server_cuda / start_gguf_server_gpu_batched / run_cpu_server. None of those called .with_mapped_gguf_model(), so production apr serve runs hit the defensive NOT_IMPLEMENTED fallback in try_qwen3_moe_backend.

Empirical evidence the bug existed

paiml/claude-code-parity-apr Phase 6 dispatched against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf at 13:05Z produced 4 of 4 student-side captures with:

{
  "outcome": {
    "kind": "driver_error",
    "reason": "driver subprocess exited with status 1: ... apr serve HTTP 501: {\"error\":\"qwen3_moe arch detected but mapped GGUF not retained in AppState. See aprender#1789 + contracts/qwen3-moe-serve-dispatch-v1.yaml.\"}\n"
  }
}

That 501 message comes from the defensive fallback branch in try_qwen3_moe_backend, which fires when state.mapped_gguf_model() returns None. This is empirical proof that the dispatch reaches the qwen3_moe path but the mmap isn't plumbed through.
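
The guard behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual aprender code: the `AppState` and `MappedGGUFModel` definitions are hypothetical stand-ins, and the return type is an assumption chosen to keep the example small. Only the names `try_qwen3_moe_backend` and `mapped_gguf_model()` and the 501 message text come from the PR.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the memory-mapped GGUF model.
struct MappedGGUFModel;

/// Hypothetical, trimmed-down application state.
#[derive(Default)]
struct AppState {
    mapped: Option<Arc<MappedGGUFModel>>,
}

impl AppState {
    /// Returns the retained mmap, if any handler stored one.
    fn mapped_gguf_model(&self) -> Option<&Arc<MappedGGUFModel>> {
        self.mapped.as_ref()
    }
}

/// Assumed shape: Ok(backend name) when the mmap is present,
/// Err((HTTP status, message)) when it was never plumbed through.
fn try_qwen3_moe_backend(state: &AppState) -> Result<&'static str, (u16, &'static str)> {
    match state.mapped_gguf_model() {
        Some(_) => Ok("qwen3_moe"),
        None => Err((
            501,
            "qwen3_moe arch detected but mapped GGUF not retained in AppState",
        )),
    }
}
```

With a default (empty) state the sketch yields the 501 branch, which is the situation the Phase 6 captures recorded.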

Fix

start_gguf_server now wraps the MappedGGUFModel::from_path result in Arc<MappedGGUFModel> immediately, and the Arc is threaded through all dispatch branches:

  • start_gguf_server_cuda(quantized, vocab, mapped: Arc<...>, config): calls .with_mapped_gguf_model(mapped.clone()) on the constructed AppState.
  • start_gguf_server_gpu_batched(quantized, vocab, mapped: Arc<...>, config): same.
  • run_cpu_server(quantized, vocab, mapped: Option<Arc<...>>, config): takes an Option so APR / SafeTensors callers can pass None (the qwen3_moe dispatch guard's defensive fallback then returns NOT_IMPLEMENTED cleanly).
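
The threading pattern in the list above might look something like the sketch below. It is deliberately simplified and the details are assumptions: the real functions also take `quantized`, `vocab`, and `config`, whereas here they take only the mapped model, and the `cuda_available` flag is a hypothetical parameter standing in for whatever triggers the CPU fallback branch.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the memory-mapped GGUF model.
struct MappedGGUFModel;

/// Hypothetical, trimmed-down application state.
#[derive(Default)]
struct AppState {
    mapped: Option<Arc<MappedGGUFModel>>,
}

impl AppState {
    /// Builder-style setter, mirroring the name used in the PR.
    fn with_mapped_gguf_model(mut self, mapped: Arc<MappedGGUFModel>) -> Self {
        self.mapped = Some(mapped);
        self
    }

    fn mapped_gguf_model(&self) -> Option<&Arc<MappedGGUFModel>> {
        self.mapped.as_ref()
    }
}

/// CPU path: `Option` lets non-GGUF callers (APR, SafeTensors) pass `None`.
fn run_cpu_server(mapped: Option<Arc<MappedGGUFModel>>) -> AppState {
    match mapped {
        Some(m) => AppState::default().with_mapped_gguf_model(m),
        None => AppState::default(),
    }
}

/// GPU path: always has a GGUF mmap; falls back to the CPU path with `Some(..)`.
fn start_gguf_server_cuda(mapped: Arc<MappedGGUFModel>, cuda_available: bool) -> AppState {
    if cuda_available {
        AppState::default().with_mapped_gguf_model(mapped)
    } else {
        run_cpu_server(Some(mapped))
    }
}
```

Cloning an `Arc` only increments a reference count, so handing the same mapping to every branch does not re-map or copy the model file.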

Empirical verification post-fix

apr serve run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --port 19999 --host 127.0.0.1 --gpu

curl -X POST http://127.0.0.1:19999/v1/chat/completions \
  -d '{"model":"qwen3-coder","messages":[{"role":"user","content":"Reply with exactly: HELLO"}],"max_tokens":10}'

→ HTTP 200, valid OpenAI-shape JSON with non-empty generated content

The matmul defensive guard (#1790) does NOT fire. V1_001 + V1_003 in contracts/qwen3-moe-serve-dispatch-v1.yaml are empirically discharged.

paiml/claude-code-parity-apr Phase 6 bench re-dispatched at 13:30Z with this binary; V1_004 verification in flight.

Test plan

  • cargo check -p apr-cli --features "inference cuda" — clean
  • Smoke test: real apr serve against Qwen3-Coder-30B-MoE GGUF returns generated tokens via /v1/chat/completions
  • CI (sovereign-ci full workflow)
  • CCPA Phase 6 bench non-zero student pass rate

🤖 Generated with Claude Code

noahgift and others added 4 commits May 19, 2026 13:04
Phase 3 dispatch v8 on gx10 reached the training loop and the first
backward step began JIT-compiling silu_backward / batched_rms_norm_backward
/ rms_norm_gamma_reduce ON DEMAND, then failed with:

  forward_backward_with_grad returned None (CUDA stream poisoned or
  gradient shape mismatch)

This is the documented Blackwell sm_121 JIT-during-active-GPU-work bug
(trueno#200, CLAUDE.md "Backward kernels: Crash because they compile
on-demand when GPU is already active").

Cause: `pre_warm_lora_backward_kernels` short-circuited the entire
function at `lora_rank == 0`, leaving the activation/norm backward
kernels to JIT on demand mid-training. The function name implies
LoRA-only, but it actually pre-warmed shared non-LoRA kernels
(silu_backward, batched_softmax_backward, batched_rms_norm_backward)
that distillation training also needs.

Fix: restructure — only the LoRA-specific gemm_backward warm-ups are
gated on lora_rank > 0. The activation/norm/standard-FP32-GEMM backward
kernels always pre-warm, regardless of LoRA mode. Distillation training
(lora_rank == 0) now gets the full backward kernel cache before block
upload, eliminating on-demand JIT and the resulting stream poisoning.
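
A rough sketch of the restructured gating, assuming the pre-warm step can be modelled as building a list of kernel names. The function name `backward_kernels_to_prewarm` and the labels `gemm_backward_fp32` and `lora_gemm_backward` are hypothetical; the other kernel names are taken from this commit message.

```rust
/// Returns the backward kernels that should be compiled before training starts.
/// Hypothetical helper: the real function compiles kernels rather than listing them.
fn backward_kernels_to_prewarm(lora_rank: usize) -> Vec<&'static str> {
    // Shared activation/norm/GEMM kernels: always pre-warmed,
    // so distillation (lora_rank == 0) never JITs them mid-training.
    let mut kernels = vec![
        "silu_backward",
        "batched_softmax_backward",
        "batched_rms_norm_backward",
        "rms_norm_gamma_reduce",
        "gemm_backward_fp32", // hypothetical label for the standard FP32 GEMM backward
    ];

    // Only the LoRA-specific warm-ups stay gated on the rank.
    if lora_rank > 0 {
        kernels.push("lora_gemm_backward"); // hypothetical label
    }

    kernels
}
```

The point of the shape is that the early return on `lora_rank == 0` is gone: the rank now controls one extra entry instead of the whole function.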

Test plan:
- [x] cargo check --features cuda — clean build
- [x] 18 cuda_backward lib tests pass
- [ ] Live gx10 dispatch reaches stepping (post-merge verification)

Stage 4 in the Phase 3 cuda dispatch defect cascade:
  PMAT-700-B → PMAT-698e → PMAT-698f → PMAT-698g
Each surfaced the next defect on the gx10 path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…li serve handlers

The squashed Option B PR (#1806 + #1807, commit 9c97452) wired
`with_mapped_gguf_model()` into `aprender-serve/src/cli/mod_server_commands.rs`
— but that's the wrong entry point. `apr serve` actually dispatches
through `apr-cli/src/commands/serve/{handlers, handler_gpu_completion,
handlers_include_01, server}.rs`. None of those called
`.with_mapped_gguf_model()`, so production `apr serve` runs against
qwen3_moe GGUFs hit the defensive NOT_IMPLEMENTED fallback in
`try_qwen3_moe_backend` (state.mapped_gguf_model() returned None).

## Root cause

apr-cli has TWO entry points to serve:
1. `aprender-serve/src/cli/mod_server_commands.rs::prepare_gguf_serve_state`
   — fixed in #1806/#1807, never called by `apr serve` subcommand
2. `apr-cli/src/commands/serve/handler_gpu_completion.rs::start_gguf_server`
   → `start_gguf_server_cuda` / `start_gguf_server_gpu_batched` /
   `run_cpu_server` — this is the actual serve path

The empirical evidence: paiml/claude-code-parity-apr Phase 6 bench
dispatched at 13:05Z against Qwen3-Coder-30B-MoE produced 4 of 4
captures with `outcome: driver_error, reason: HTTP 501 "qwen3_moe arch
detected but mapped GGUF not retained in AppState"`. That error fires
from the defensive fallback branch in `try_qwen3_moe_backend` —
proving the dispatch reaches the qwen3_moe path but the mmap isn't
plumbed through.

## Fix

`start_gguf_server` now wraps the `MappedGGUFModel` in
`Arc<MappedGGUFModel>` immediately after `from_path` (cheap Arc bump
shared across all dispatch branches). Threaded into:

- `start_gguf_server_cuda(quantized, vocab, mapped: Arc<...>, config)` —
  `.with_mapped_gguf_model(mapped.clone())` on the constructed AppState.
- `start_gguf_server_gpu_batched(quantized, vocab, mapped: Arc<...>,
  config)` — same.
- `run_cpu_server(quantized, vocab, mapped: Option<Arc<...>>, config)`
  — `Option` so the APR-format / non-GGUF callers can pass `None`
  (defensive fallback path remains the clean NOT_IMPLEMENTED).

Callers updated:
- `handler_gpu_completion.rs::start_gguf_server` — wraps in Arc + passes
  through three branches.
- `handler_gpu_completion.rs::start_gguf_server_cuda` fallback CPU
  branch — passes `Some(mapped_model)`.
- `handlers.rs::try_apr_quantized_cpu` — passes `None` (APR format).
- `handlers_include_01.rs` (GH-99 APR Q4K) — passes `None`.

## Empirical verification

Smoke-test post-fix:
```
apr serve run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --port 19999 --host 127.0.0.1 --gpu
curl -X POST http://127.0.0.1:19999/v1/chat/completions \
  -d '{"model":"qwen3-coder","messages":[{"role":"user","content":"..."}],"max_tokens":10}'
```
Returns HTTP 200 with valid OpenAI-shape JSON containing generated
tokens. The matmul defensive guard (#1790) does NOT fire. V1_001 +
V1_003 in contracts/qwen3-moe-serve-dispatch-v1.yaml are empirically
discharged.

## Companion-side impact

paiml/claude-code-parity-apr Phase 6 bench is dispatching against
this binary now. Expected outcome: student_pass_rate > 0 on at least
some fixtures (V1_004 falsifier discharge condition).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…800s for MoE

Hardcoded 120s HTTP timeout in `AprServeDriver::complete` was too
short for 30B MoE inference without KV cache. Each generated token
requires a full prefill of the entire sequence; a 256-token request
on Qwen3-Coder-30B-A3B takes >>120s wall, so every Phase 6 bench
fixture died with "Error: driver error: network error: apr serve:
error sending request for url" at exactly the 120s mark.
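
To make the cost argument concrete, here is a small back-of-the-envelope sketch that counts how many token positions the model has to process with and without a KV cache. The helper names are hypothetical, and the prompt length used in the note below (1000 tokens) is an assumed figure for illustration only; the commit message does not state one.

```rust
/// Without a KV cache, generating token `i` re-runs the whole sequence:
/// the prompt plus the `i` tokens generated so far.
fn positions_without_kv_cache(prompt_len: u64, new_tokens: u64) -> u64 {
    (0..new_tokens).map(|i| prompt_len + i).sum()
}

/// With a KV cache, the prompt is processed once and each later
/// token only processes its own single position.
fn positions_with_kv_cache(prompt_len: u64, new_tokens: u64) -> u64 {
    if new_tokens == 0 {
        0
    } else {
        prompt_len + (new_tokens - 1)
    }
}
```

Under the assumed 1000-token prompt, 256 generated tokens cost 288,640 processed positions without a cache versus 1,255 with one, which is why the wall time grows far past a fixed 120s budget.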

Same root-cause class as aprender#1782 (apr serve startup 30s timeout
that wasn't configurable + size-aware). Fix is symmetric: env-var
override + size-aware default.

Override via `APR_AGENT_HTTP_TIMEOUT_S`. Default raised to 1800s
(30 min), which covers the CCPA Phase 6 bench's per-turn-timeout=900s
ceiling and leaves headroom for large MoE inference until M32d KV
cache lands. For dense models / KV-cache builds this is effectively
unbounded.
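
A minimal sketch of how such an override could be resolved, using only the standard library. The helper names are hypothetical, and falling back to the default on an unparsable or zero value is an assumption, not something this commit message specifies; only the variable name `APR_AGENT_HTTP_TIMEOUT_S` and the 1800s default come from the source.

```rust
use std::time::Duration;

/// Default from the commit message: 1800 seconds (30 minutes).
const DEFAULT_HTTP_TIMEOUT_S: u64 = 1800;

/// Pure helper so the parsing rule can be checked without touching the environment.
/// Assumption: invalid or zero values fall back to the default.
fn http_timeout_from(raw: Option<&str>) -> Duration {
    let secs = raw
        .and_then(|s| s.trim().parse::<u64>().ok())
        .filter(|&s| s > 0)
        .unwrap_or(DEFAULT_HTTP_TIMEOUT_S);
    Duration::from_secs(secs)
}

/// Reads the override from the process environment.
fn http_timeout() -> Duration {
    http_timeout_from(std::env::var("APR_AGENT_HTTP_TIMEOUT_S").ok().as_deref())
}
```

Splitting the environment read from the parsing keeps the fallback rule easy to exercise in isolation, for example with `http_timeout_from(Some("60"))`.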

Empirical post-fix evidence pending: Phase 6 bench re-dispatch
against Qwen3-Coder-30B-A3B with this binary expected to produce
non-driver_error outcomes (oracle_passed, oracle_failed_after_max_turns,
or oracle_failed). Discharges the implicit
`max_http_timeout_must_accommodate_inference_wall` precondition
embedded in `qwen3-moe-serve-dispatch-v1.yaml` v1.1.0's V1_004.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 2552749 into main May 19, 2026
10 checks passed
@noahgift noahgift deleted the fix/1789-option-b-apr-cli-serve-wire branch May 19, 2026 12:24