Skip to content

fix(apr-cli serve): try Q4K pool-allocator path before generic GPU dispatch (closes #471)#1663

Merged
noahgift merged 1 commit into
mainfrom
fix/471-apr-serve-q4k-gpu-dispatch
May 14, 2026
Merged

fix(apr-cli serve): try Q4K pool-allocator path before generic GPU dispatch (closes #471)#1663
noahgift merged 1 commit into
mainfrom
fix/471-apr-serve-q4k-gpu-dispatch

Conversation

@noahgift

Copy link
Copy Markdown
Contributor

Summary

Closes #471. `apr serve run <model.apr> --gpu` hung indefinitely on 17 GB / 18k-tensor Q4K MoE models (qwen3-coder-30b). `realizar serve --gpu` loaded the same file in ~12 s using `spawn_apr_q4k_inference_thread`'s pool allocator.

Root cause: `start_apr_q4k_server_gpu` (already implemented in `handlers_include_01.rs:96`) was compiled in but NEVER called from `start_apr_server`.

Fix

Dispatch `start_apr_q4k_server_gpu` before `start_apr_server_gpu` when `--gpu` is requested:

```rust
match start_apr_q4k_server_gpu(model_path, config) {
Ok(()) => return Ok(()),
Err(e) => { eprintln!("[GH-471] declined: {e}"); }
}
match start_apr_server_gpu(model_path, config) { /* existing fallback */ }
```

The Q4K path's `parse_apr_q4k_config` + `upload_apr_q4k_weights` already validate the model shape, so passing non-Q4K APRs through it fails cleanly. The `cuda`-only stub (no `cuda-batch`) errors immediately, preserving existing behavior for non-batch builds.

Test plan

  • `cargo check -p apr-cli --features cuda` clean
  • `cargo check -p apr-cli --features cuda-batch` clean
  • CI: workspace-test
  • Optional: `apr serve run qwen3-coder-30b-q4k.apr --gpu --features cuda-batch` exits load phase in <20s

🤖 Generated with Claude Code

…spatch (closes #471)

`apr serve run <model.apr> --gpu` was routing all APR models through
the generic `OwnedQuantizedModel::from_apr()` path, which does per-tensor
`cuMemAlloc`. For large Q4K MoE models (e.g. qwen3-coder-30b — 17 GB,
18,867 tensors) that path hangs indefinitely during weight upload.

The ALB-095/098 `start_apr_q4k_server_gpu` function (already implemented
via `spawn_apr_q4k_inference_thread` → `upload_apr_q4k_weights`'s pool
allocator → single `cuMemAlloc` for all tensors) was compiled in but
NEVER dispatched from `start_apr_server`. `realizar serve --gpu` uses
this path and loads the same 30B MoE in ~12s.

Fix: in `start_apr_server`, call `start_apr_q4k_server_gpu` first when
`--gpu` is requested. If the model isn't Q4K, `parse_apr_q4k_config` or
`upload_apr_q4k_weights` returns an error and we fall through to the
existing generic GPU path (which still handles dequantized + small
non-Q4K APR fine).

Without the `cuda-batch` feature the Q4K function is a stub that errors
immediately, so non-batch builds keep their existing fallback chain.

Verified:
- `cargo check -p apr-cli --features cuda` builds clean
- `cargo check -p apr-cli --features cuda-batch` builds clean
@noahgift noahgift enabled auto-merge (squash) May 14, 2026 10:20
@noahgift noahgift merged commit caccb68 into main May 14, 2026
11 checks passed
@noahgift noahgift deleted the fix/471-apr-serve-q4k-gpu-dispatch branch May 14, 2026 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

apr serve run: Q4K APR GPU hangs on large MoE models (30B)

1 participant