feat(apr-cli): bench MoE dispatch — routes Qwen3-MoE GGUFs through forward_qwen3_moe (#1749) by noahgift · Pull Request #1751 · paiml/aprender

noahgift · 2026-05-17T13:24:53Z

Summary

Closes #1749. apr bench against any MoE GGUF (Qwen3-Coder-30B-A3B-Instruct etc.) used to panic in matmul_fused.rs:211 (index out of bounds: len=0 but index ≈ 91M) because the dense forward_single_with_cache path looks up 2D ffn_{gate,up,down}.weight tensors that don't exist on MoE models (which have 3D *_exps instead).

This PR detects MoE via gguf.expert_count().is_some() and routes to forward_qwen3_moe (CPU) / forward_qwen3_moe_cuda (GPU), running them autoregressively. Unblocks M-GPU-MOE-3 PR-4 throughput measurement.
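The routing decision described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `GgufMeta`, `BenchRoute`, `is_moe`, and `pick_route` are hypothetical stand-ins, and only the `expert_count().is_some()` check mirrors the predicate named in the summary.

```rust
/// Hypothetical stand-in for parsed GGUF metadata. Only the
/// `expert_count` accessor mirrors a name used in this PR.
struct GgufMeta {
    expert_count: Option<u32>,
}

impl GgufMeta {
    fn expert_count(&self) -> Option<u32> {
        self.expert_count
    }
}

/// Hypothetical label for the forward path the bench would take.
#[derive(Debug, PartialEq)]
enum BenchRoute {
    Dense,
    MoeCpu,
    MoeCuda,
}

/// A model is treated as MoE when the expert-count metadata is present.
fn is_moe(meta: &GgufMeta) -> bool {
    meta.expert_count().is_some()
}

/// Dense models keep the existing path; MoE models go to the CPU or
/// CUDA MoE path depending on the requested backend.
fn pick_route(meta: &GgufMeta, use_cuda: bool) -> BenchRoute {
    match (is_moe(meta), use_cuda) {
        (false, _) => BenchRoute::Dense,
        (true, false) => BenchRoute::MoeCpu,
        (true, true) => BenchRoute::MoeCuda,
    }
}

fn main() {
    // The expert count here is an illustrative value only.
    let moe = GgufMeta { expert_count: Some(8) };
    let dense = GgufMeta { expert_count: None };
    println!("{:?}", pick_route(&moe, true));
    println!("{:?}", pick_route(&dense, true));
}
```

Keeping the check ahead of any tensor lookup means a dense model never enters the MoE branch, and an MoE model never reaches the dense tensor names that caused the panic.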

Empirical (lambda-vector RTX 4090, 2026-05-17)

apr bench /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --max-tokens 8 --warmup 1 --iterations 4 --json

{
  "total_tokens": 4,
  "total_time_ms": 87085,
  "mean_time_ms": 21771,
  "time_to_first_token_ms": 22839,
  "latency_p50_ms": 22216,
  "latency_p99_ms": 32470
}

0.046 tok/s effective (4 tokens in 87.085 s): autoregressive re-prefill cost dominates because there is no KV cache yet. PR-4's job is to push this to ≥150 tok/s by adding the cache.
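The effective throughput figure follows directly from the two JSON fields above. A minimal sketch of the arithmetic, where `tokens_per_second` is a hypothetical helper and not a function from `apr-cli`:

```rust
/// Effective throughput in tokens per second, from a token count and
/// a wall-clock total in milliseconds.
fn tokens_per_second(total_tokens: u64, total_time_ms: u64) -> f64 {
    total_tokens as f64 / (total_time_ms as f64 / 1000.0)
}

fn main() {
    // Values taken from the bench JSON: total_tokens and total_time_ms.
    let tps = tokens_per_second(4, 87_085);
    println!("{tps:.3} tok/s"); // prints "0.046 tok/s"
}
```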

What's in this PR

crates/apr-cli/src/commands/bench_moe.rs (new): is_moe_gguf predicate + run_gguf_moe_benchmark + CUDA/CPU autoregressive bench helpers using greedy argmax decode.
crates/apr-cli/src/commands/bench.rs: new include!("bench_moe.rs") (same pattern as the existing bench sub-files).
crates/apr-cli/src/commands/benchmark.rs: MoE detection in run_gguf_benchmark + tail-call to run_gguf_moe_benchmark.
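For readers unfamiliar with the greedy argmax decode mentioned for the bench helpers, here is a minimal sketch of the idea. `greedy_argmax` is a hypothetical name, and the NaN-skipping and tie-breaking behaviour are assumptions of this sketch, not something the PR specifies.

```rust
/// Returns the index of the largest logit, or None for an empty slice
/// (or one that contains only NaN). Ties resolve to the lowest index.
fn greedy_argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        // Assumption of this sketch: NaN logits are never selected.
        if v.is_nan() {
            continue;
        }
        // Strict comparison keeps the earliest index on ties.
        if best.map_or(true, |(_, b)| v > b) {
            best = Some((i, v));
        }
    }
    best.map(|(i, _)| i)
}

fn main() {
    let logits = [0.1_f32, 2.5, -1.0, 2.5];
    println!("{:?}", greedy_argmax(&logits)); // prints "Some(1)"
}
```

Greedy selection is deterministic, which suits a benchmark: repeated iterations decode the same token sequence, so timing differences are not caused by sampling.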

What's NOT in this PR

True KV-cache MoE decoding (= M-GPU-MOE-3 PR-4 throughput target proper)
MoE bench for SafeTensors / APR formats (only GGUF MoE today)

Test plan

cargo build -p apr-cli --features 'inference cuda' builds cleanly
apr bench Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --json exits cleanly with valid JSON (instead of panicking)
Dense GGUF bench path unchanged (the route only flips when gguf.expert_count() is present)

Cross-refs

#1583 (M-GPU-MOE-3: throughput ≥150 tok/s on RTX 4090 + VRAM ≤95% + fp-accumulator-order alignment): PR-4 throughput unblocks on this
#1747 (docs(contracts): qwen3-moe-forward-gpu-v1 v1.7.1 → v1.7.2, M-GPU-MOE-3 PR-3 cascade CLOSED, L47 marked KNOWN_DIVERGENCE_NOT_BENIGN): contract v1.7.2, L47 known-divergence + cascade pause

🤖 Generated with Claude Code

…rward_qwen3_moe (#1749)

Closes #1749. Pre-fix, `apr bench` against any MoE GGUF (Qwen3-Coder-30B-A3B-Instruct etc.) routed through the dense `forward_single_with_cache` path, which reaches `matmul_fused.rs:211` with tensor names that don't exist on MoE models (the 3D `*_exps` tensors are stored under different names than the 2D dense `ffn_{gate,up,down}.weight` tensors the dense path looks up). Result: hundreds of parallel thread panics with `index out of bounds: len=0 but index ≈ 91M`.

This PR adds MoE detection via `gguf.expert_count().is_some()` and routes to the MoE-aware forward path:

  CPU:  realizar::gguf::OwnedQuantizedModel::forward_qwen3_moe
  CUDA: realizar::gguf::OwnedQuantizedModelCuda::forward_qwen3_moe_cuda

Neither helper currently exposes a KV cache, so the bench runs them **autoregressively with re-prefill**: each iteration runs a full forward pass over `prompt + previously-generated tokens` and appends the argmax to the prompt for the next iteration. This is O(N²) in N tokens but bounded by `--max-tokens` (default 32).

This is intentionally a stop-gap to unblock M-GPU-MOE-3 PR-4 throughput measurement. True KV-cache MoE decoding is the actual PR-4 work; this PR makes `apr bench` produce a real (if pessimistic) tok/s number for MoE GGUFs instead of panicking.

## Empirical (lambda-vector RTX 4090, 2026-05-17)

  apr bench /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
    --max-tokens 8 --warmup 1 --iterations 4 --json
  → total_time_ms: 87085 ; total_tokens: 4
  → 0.046 tok/s effective (autoregressive re-prefill cost dominates)
  → 22.8s ttft, 22.2s p50, 32.5s p99

The 0.046 tok/s is the upper bound on what `apr bench` can currently measure for MoE without KV cache. PR-4's job is to add the cache and push this to ≥ 150 tok/s.

## What's in this PR

crates/apr-cli/src/commands/bench_moe.rs (new):
- `is_moe_gguf(&GGUFModel)` predicate
- `run_gguf_moe_benchmark`: loads MappedGGUFModel + N Qwen3MoeQuantizedLayer descriptors + (optionally) wraps in OwnedQuantizedModelCuda, then dispatches to the CUDA or CPU bench helper.
- `run_cuda_moe_benchmark`: autoregressive forward_qwen3_moe_cuda + greedy argmax decode.
- `run_cpu_moe_benchmark`: autoregressive forward_qwen3_moe.

crates/apr-cli/src/commands/bench.rs:
+ `include!("bench_moe.rs")` after the existing `include!("bench_safetensors.rs")` (same pattern as the other bench sub-files).

crates/apr-cli/src/commands/benchmark.rs:
+ In `run_gguf_benchmark`, after parsing the GGUF and tokenising the prompt, check `is_moe_gguf(&gguf)`. If true, log the detection (`expert_count` + top-k) and tail-call `run_gguf_moe_benchmark`. Otherwise fall through to the existing dense path.

## What's NOT in this PR

- True KV-cache MoE decoding (= M-GPU-MOE-3 PR-4 throughput target)
- Streaming/per-token JSON output for MoE (existing JSON output works; it just reflects the autoregressive re-prefill cost)
- MoE bench for SafeTensors / APR formats (only GGUF MoE supported today; the other formats don't have MoE production paths in the realizar inference engine)

## Cross-refs

- #1583 M-GPU-MOE-3: PR-4 throughput unblocks on this
- #1747 contract qwen3-moe-forward-gpu-v1 v1.7.2 (just merged): L47 known-divergence + cascade pause point
- The MoE bench helpers reuse `forward_qwen3_moe[_cuda]` directly, which means PR-2 #1737's fp64 q6k_gemv acc is in effect; this bench measures *post-fp64-acc* throughput, not the pre-fix path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
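The re-prefill loop described in the commit message can be sketched as below. This is an illustration under stated assumptions, not the bench helper itself: `generate_with_reprefill` is a hypothetical name, and the `forward` closure is a stand-in for a call such as `forward_qwen3_moe`, whose real signature is not shown in this PR text. The second return value counts token positions processed, which makes the quadratic cost visible.

```rust
/// Greedy decode without a KV cache: every step re-runs `forward` over
/// the whole sequence so far. `forward` is a hypothetical stand-in that
/// maps a token sequence to next-token logits.
fn generate_with_reprefill<F>(
    prompt: &[u32],
    max_tokens: usize,
    mut forward: F,
) -> (Vec<u32>, usize)
where
    F: FnMut(&[u32]) -> Vec<f32>,
{
    let mut tokens = prompt.to_vec();
    let mut positions_processed = 0usize;
    for _ in 0..max_tokens {
        // Full forward pass over prompt + previously generated tokens.
        let logits = forward(&tokens);
        positions_processed += tokens.len();
        // Greedy pick: index of the largest logit.
        let next = logits
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i as u32);
        let Some(next) = next else { break };
        tokens.push(next);
    }
    (tokens, positions_processed)
}

fn main() {
    // Toy model with a vocabulary of 4: always favours (last + 1) % 4.
    let toy = |seq: &[u32]| {
        let mut logits = vec![0.0f32; 4];
        let last = *seq.last().unwrap_or(&0) as usize;
        logits[(last + 1) % 4] = 1.0;
        logits
    };
    let (tokens, work) = generate_with_reprefill(&[0, 1], 8, toy);
    println!("{tokens:?} processed {work} positions");
}
```

With a 2-token prompt and 8 generated tokens, the loop processes 2 + 3 + ... + 9 = 44 positions. A cached decoder would process each position roughly once, which is the gap the KV-cache work is meant to close.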

noahgift enabled auto-merge (squash) May 17, 2026 13:25

noahgift mentioned this pull request May 17, 2026

feat: BERT encoder inference for cross-encoder reranking (.apr) #326

Closed

6 tasks

noahgift added 2 commits May 17, 2026 15:44

Merge branch 'main' into feat/apr-bench-moe-dispatch

1fde42e

Merge branch 'main' into feat/apr-bench-moe-dispatch

824db4a

noahgift merged commit 541c247 into main May 17, 2026
10 checks passed

noahgift deleted the feat/apr-bench-moe-dispatch branch May 17, 2026 15:01

noahgift mentioned this pull request May 18, 2026

apr serve: matmul_fused.rs:211 panics with 'index out of bounds: len 0' on Qwen3-Coder-30B-MoE F32 weight #1789

Closed

noahgift merged 3 commits into main from feat/apr-bench-moe-dispatch
