fix(apr-cli serve): try Q4K pool-allocator path before generic GPU dispatch (closes #471) by noahgift · Pull Request #1663 · paiml/aprender

noahgift · 2026-05-14T10:20:03Z

Summary

Closes #471. `apr serve run <model.apr> --gpu` hung indefinitely on 17 GB / 18k-tensor Q4K MoE models (qwen3-coder-30b). `realizar serve --gpu` loaded the same file in ~12 s using `spawn_apr_q4k_inference_thread`'s pool allocator.

Root cause: `start_apr_q4k_server_gpu` (already implemented in `handlers_include_01.rs:96`) was compiled in but NEVER called from `start_apr_server`.

Fix

Dispatch `start_apr_q4k_server_gpu` before `start_apr_server_gpu` when `--gpu` is requested:

```rust
// Try the Q4K pool-allocator path first; on error, log it and fall through.
match start_apr_q4k_server_gpu(model_path, config) {
    Ok(()) => return Ok(()),
    Err(e) => { eprintln!("[GH-471] declined: {e}"); }
}
match start_apr_server_gpu(model_path, config) { /* existing fallback */ }
```

The Q4K path's `parse_apr_q4k_config` + `upload_apr_q4k_weights` already validate the model shape, so a non-Q4K APR passed through it fails cleanly with an error and dispatch falls through to the generic GPU path. The `cuda`-only stub (no `cuda-batch`) errors immediately, preserving existing behavior for non-batch builds.
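
The try-first, fall-through shape described above can be sketched in isolation. This is a minimal illustration, not the code in `apr-cli`: `Backend`, `try_q4k_path`, `generic_gpu_path`, `dispatch_gpu`, and the `is_q4k` / `cuda_batch` flags are all hypothetical stand-ins (the real code validates the model itself and gates the stub at compile time with a Cargo feature rather than a runtime boolean).

```rust
use std::path::Path;

/// Which path ended up serving the model (hypothetical, for illustration only).
#[derive(Debug, PartialEq, Eq)]
pub enum Backend {
    Q4kPool,
    GenericGpu,
}

/// Stand-in for the Q4K fast path. `cuda_batch` models the feature gate as a
/// runtime flag and `is_q4k` replaces real model validation; both are assumptions.
fn try_q4k_path(_model: &Path, is_q4k: bool, cuda_batch: bool) -> Result<Backend, String> {
    if !cuda_batch {
        // Mirrors the stub behavior: error immediately without doing any work.
        return Err("built without cuda-batch".to_string());
    }
    if !is_q4k {
        return Err("model is not Q4K".to_string());
    }
    Ok(Backend::Q4kPool)
}

/// Stand-in for the existing generic GPU path.
fn generic_gpu_path(_model: &Path) -> Result<Backend, String> {
    Ok(Backend::GenericGpu)
}

/// Try the specialized path first; any error is logged and treated as "declined".
pub fn dispatch_gpu(model: &Path, is_q4k: bool, cuda_batch: bool) -> Result<Backend, String> {
    match try_q4k_path(model, is_q4k, cuda_batch) {
        Ok(backend) => return Ok(backend),
        Err(e) => eprintln!("[sketch] Q4K path declined: {e}"),
    }
    generic_gpu_path(model)
}
```

One consequence of this shape is that every Q4K-path error is treated as "declined", so a genuine failure on a real Q4K model also falls through to the generic path; the log line is the only trace of it.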

Test plan

- `cargo check -p apr-cli --features cuda` clean
- `cargo check -p apr-cli --features cuda-batch` clean
- CI: workspace-test
- Optional: with a `cuda-batch` build, `apr serve run qwen3-coder-30b-q4k.apr --gpu` exits load phase in <20s

🤖 Generated with Claude Code

…spatch (closes #471)

`apr serve run <model.apr> --gpu` was routing all APR models through the generic `OwnedQuantizedModel::from_apr()` path, which does per-tensor `cuMemAlloc`. For large Q4K MoE models (e.g. qwen3-coder-30b: 17 GB, 18,867 tensors) that path hangs indefinitely during weight upload.

The ALB-095/098 `start_apr_q4k_server_gpu` function (already implemented via `spawn_apr_q4k_inference_thread` → `upload_apr_q4k_weights`'s pool allocator → single `cuMemAlloc` for all tensors) was compiled in but NEVER dispatched from `start_apr_server`. `realizar serve --gpu` uses this path and loads the same 30B MoE in ~12s.

Fix: in `start_apr_server`, call `start_apr_q4k_server_gpu` first when `--gpu` is requested. If the model isn't Q4K, `parse_apr_q4k_config` or `upload_apr_q4k_weights` returns an error and we fall through to the existing generic GPU path (which still handles dequantized + small non-Q4K APR fine). Without the `cuda-batch` feature the Q4K function is a stub that errors immediately, so non-batch builds keep their existing fallback chain.

Verified:
- `cargo check -p apr-cli --features cuda` builds clean
- `cargo check -p apr-cli --features cuda-batch` builds clean
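
To make the contrast between per-tensor allocation and a single pooled allocation concrete, here is a host-side sketch of the planning step a pool allocator might perform: compute one offset per tensor, then allocate the total once. It is an assumption-laden illustration, not `upload_apr_q4k_weights` itself: `Slot`, `align_up`, and `plan_pool` are hypothetical names, the 256-byte alignment used in the usage note is an arbitrary choice for the example, and no CUDA call is made.

```rust
/// Byte range of one tensor inside the shared pool (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct Slot {
    pub offset: usize,
    pub len: usize,
}

/// Round `n` up to the next multiple of `align` (`align` must be a power of two).
fn align_up(n: usize, align: usize) -> usize {
    (n + align - 1) & !(align - 1)
}

/// Plan one pool covering every tensor.
/// Returns each tensor's slot plus the total pool size in bytes, so a caller
/// could make a single device allocation instead of one per tensor.
pub fn plan_pool(sizes: &[usize], align: usize) -> (Vec<Slot>, usize) {
    assert!(align.is_power_of_two(), "alignment must be a power of two");
    let mut slots = Vec::with_capacity(sizes.len());
    let mut cursor = 0usize;
    for &len in sizes {
        // Each tensor starts at the next aligned offset after the previous one.
        let offset = align_up(cursor, align);
        slots.push(Slot { offset, len });
        cursor = offset + len;
    }
    (slots, align_up(cursor, align))
}
```

For example, `plan_pool(&[100, 300, 7], 256)` places the three tensors at offsets 0, 256, and 768 and reports a 1024-byte pool. With 18,867 tensors, the same planning pass would replace 18,867 separate allocation calls with one.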

noahgift enabled auto-merge (squash) May 14, 2026 10:20

noahgift merged commit caccb68 into main May 14, 2026
11 checks passed

noahgift deleted the fix/471-apr-serve-q4k-gpu-dispatch branch May 14, 2026 10:42
