feat(apr-cli): bench MoE dispatch — routes Qwen3-MoE GGUFs through forward_qwen3_moe (#1749)#1751
Merged
…rward_qwen3_moe (#1749)

Closes #1749.

Pre-fix, `apr bench` against any MoE GGUF (Qwen3-Coder-30B-A3B-Instruct etc.) routed through the dense `forward_single_with_cache` path, which calls `matmul_fused.rs:211` on tensor names that don't exist on MoE models (the 3D `*_exps` tensors are stored under different names than the 2D dense `ffn_{gate,up,down}.weight` tensors the dense path looks up). Result: hundreds of parallel thread panics with `index out of bounds: len=0 but index ≈ 91M`.

This PR adds MoE detection via `gguf.expert_count().is_some()` and routes to the MoE-aware forward path:

- CPU: `realizar::gguf::OwnedQuantizedModel::forward_qwen3_moe`
- CUDA: `realizar::gguf::OwnedQuantizedModelCuda::forward_qwen3_moe_cuda`

Neither helper currently exposes a KV cache, so the bench runs them **autoregressively with re-prefill**: each iteration runs a full forward pass over `prompt + previously-generated tokens` and appends the argmax to the prompt for the next iteration. This is O(N²) in N tokens but bounded by `--max-tokens` (default 32).

This is intentionally a stop-gap to unblock M-GPU-MOE-3 PR-4 throughput measurement. True KV-cache MoE decoding is the actual PR-4 work; this PR makes `apr bench` produce a real (if pessimistic) tok/s number for MoE GGUFs instead of panicking.

## Empirical (lambda-vector RTX 4090, 2026-05-17)

    apr bench /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
      --max-tokens 8 --warmup 1 --iterations 4 --json
    → total_time_ms: 87085 ; total_tokens: 4
    → 0.046 tok/s effective (auto-regressive re-prefill cost dominates)
    → 22.8s ttft, 22.2s p50, 32.5s p99

The 0.046 tok/s is the upper bound on what `apr bench` can currently measure for MoE without KV cache. PR-4's job is to add the cache and push this to ≥ 150 tok/s.
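A minimal sketch of the re-prefill loop described above, assuming a forward pass that takes the whole token sequence and returns last-position logits. `generate_with_reprefill`, `argmax`, and `toy_forward` are hypothetical names for illustration; they are not the PR's `run_cpu_moe_benchmark` / `run_cuda_moe_benchmark`, and the closure only stands in for `forward_qwen3_moe[_cuda]`, whose real signature is not shown here.

```rust
/// Greedy argmax over a logits slice. Ties resolve to the lowest index;
/// an empty slice yields 0.
fn argmax(logits: &[f32]) -> u32 {
    let mut best = 0usize;
    for (i, &v) in logits.iter().enumerate() {
        if v > logits[best] {
            best = i;
        }
    }
    best as u32
}

/// Autoregressive decode without a KV cache: every step re-runs the forward
/// pass over the full sequence so far. `forward` is a hypothetical stand-in
/// for a full-sequence forward pass that returns last-position logits.
/// Returns the generated tokens and the total number of positions processed.
fn generate_with_reprefill<F>(
    prompt: &[u32],
    max_tokens: usize,
    mut forward: F,
) -> (Vec<u32>, usize)
where
    F: FnMut(&[u32]) -> Vec<f32>,
{
    let mut seq = prompt.to_vec();
    let mut positions = 0usize;
    for _ in 0..max_tokens {
        positions += seq.len(); // the whole prefix is recomputed each step
        let logits = forward(&seq);
        seq.push(argmax(&logits)); // greedy decode, appended for the next step
    }
    (seq[prompt.len()..].to_vec(), positions)
}

/// Toy forward pass for illustration only: a vocabulary of 8 tokens where
/// the "model" always predicts (last token + 1) mod 8.
fn toy_forward(seq: &[u32]) -> Vec<f32> {
    let mut logits = vec![0.0f32; 8];
    let next = (seq.last().copied().unwrap_or(0) + 1) % 8;
    logits[next as usize] = 1.0;
    logits
}
```

With a prompt of length P and N generated tokens, this loop processes P·N + N(N−1)/2 positions in total, which is where the quadratic cost comes from; a KV cache would bring that down to roughly P + N.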

## What's in this PR

crates/apr-cli/src/commands/bench_moe.rs (new):
- `is_moe_gguf(&GGUFModel)` predicate
- `run_gguf_moe_benchmark`: loads MappedGGUFModel + N Qwen3MoeQuantizedLayer descriptors + (optionally) wraps in OwnedQuantizedModelCuda, then dispatches to the CUDA or CPU bench helper.
- `run_cuda_moe_benchmark`: autoregressive forward_qwen3_moe_cuda + greedy argmax decode.
- `run_cpu_moe_benchmark`: autoregressive forward_qwen3_moe.

crates/apr-cli/src/commands/bench.rs:
+ `include!("bench_moe.rs")` after the existing `include!("bench_safetensors.rs")` (same pattern as the other bench sub-files).

crates/apr-cli/src/commands/benchmark.rs:
+ In `run_gguf_benchmark`, after parsing the GGUF and tokenising the prompt, check `is_moe_gguf(&gguf)`. If true, log the detection (`expert_count` + top-k) and tail-call `run_gguf_moe_benchmark`. Otherwise fall through to the existing dense path.

## What's NOT in this PR

- True KV-cache MoE decoding (= M-GPU-MOE-3 PR-4 throughput target)
- Streaming/per-token JSON output for MoE (existing JSON output works; just reflects the autoregressive re-prefill cost)
- MoE bench for SafeTensors / APR formats (only GGUF MoE supported today; the other formats don't have MoE production paths in the realizar inference engine)

## Cross-refs

- #1583 M-GPU-MOE-3: PR-4 throughput unblocks on this
- #1747 contract qwen3-moe-forward-gpu-v1 v1.7.2 (just merged): L47 known-divergence + cascade pause point
- The MoE bench helpers reuse `forward_qwen3_moe[_cuda]` directly, which means PR-2 #1737's fp64 q6k_gemv acc is in effect; this bench measures *post-fp64-acc* throughput, not the pre-fix path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
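
The detection and routing listed under "What's in this PR" can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `bench_moe.rs`: `GgufMeta`, `BenchPath`, and `select_bench_path` are hypothetical names, and the only detail taken from the PR is that MoE is detected via `expert_count().is_some()` and then split into a CUDA or CPU helper.

```rust
/// Hypothetical stand-in for parsed GGUF metadata, holding only the field
/// this sketch needs. The type named in the PR is `GGUFModel`.
struct GgufMeta {
    expert_count: Option<u32>,
}

impl GgufMeta {
    /// Mirrors the accessor named in the PR description.
    fn expert_count(&self) -> Option<u32> {
        self.expert_count
    }
}

/// Which benchmark path to take (hypothetical enum for illustration).
#[derive(Debug, PartialEq)]
enum BenchPath {
    Dense,
    MoeCpu,
    MoeCuda,
}

/// A model counts as MoE when the expert-count metadata is present.
fn is_moe(meta: &GgufMeta) -> bool {
    meta.expert_count().is_some()
}

/// Dense models fall through to the existing path; MoE models go to the
/// CUDA or CPU helper depending on whether a GPU was requested.
fn select_bench_path(meta: &GgufMeta, use_cuda: bool) -> BenchPath {
    match (is_moe(meta), use_cuda) {
        (false, _) => BenchPath::Dense,
        (true, true) => BenchPath::MoeCuda,
        (true, false) => BenchPath::MoeCpu,
    }
}
```

Keeping the predicate separate from the path selection means the dense path stays untouched: a model without expert metadata never reaches the MoE helpers.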
Summary
Closes #1749.
`apr bench` against any MoE GGUF (Qwen3-Coder-30B-A3B-Instruct etc.) used to panic in `matmul_fused.rs:211` (`index out of bounds: len=0 but index ≈ 91M`) because the dense `forward_single_with_cache` path looks up 2D `ffn_{gate,up,down}.weight` tensors that don't exist on MoE models (which have 3D `*_exps` instead).

This PR detects MoE via `gguf.expert_count().is_some()` and routes to `forward_qwen3_moe` (CPU) / `forward_qwen3_moe_cuda` (GPU), running them autoregressively. Unblocks M-GPU-MOE-3 PR-4 throughput measurement.

Empirical (lambda-vector RTX 4090, 2026-05-17)

{ "total_tokens": 4, "total_time_ms": 87085, "mean_time_ms": 21771, "time_to_first_token_ms": 22839, "latency_p50_ms": 22216, "latency_p99_ms": 32470 }

0.046 tok/s effective: autoregressive re-prefill cost (no KV cache yet). PR-4's job is to push this to ≥150 tok/s by adding the cache.
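
For reference, the effective rate follows directly from the two totals in the JSON above. The helper names below are hypothetical and only illustrate the arithmetic; they are not part of `apr bench`.

```rust
/// Effective throughput in tokens per second, from a token count and a
/// wall-clock duration in milliseconds.
fn effective_tok_per_s(total_tokens: u64, total_time_ms: u64) -> f64 {
    total_tokens as f64 / (total_time_ms as f64 / 1000.0)
}

/// Speedup factor needed to move from a measured rate to a target rate.
fn required_speedup(measured: f64, target: f64) -> f64 {
    target / measured
}
```

Plugging in `total_tokens = 4` and `total_time_ms = 87085` gives about 0.046 tok/s, so reaching the 150 tok/s target means a speedup of more than 3,000x over this re-prefill baseline.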
What's in this PR
- `crates/apr-cli/src/commands/bench_moe.rs` (new): `is_moe_gguf` predicate + `run_gguf_moe_benchmark` + CUDA/CPU autoregressive bench helpers using greedy argmax decode.
- `crates/apr-cli/src/commands/bench.rs`: new `include!("bench_moe.rs")` (same pattern as the existing bench sub-files).
- `crates/apr-cli/src/commands/benchmark.rs`: MoE detection in `run_gguf_benchmark` + tail-call to `run_gguf_moe_benchmark`.

What's NOT in this PR
Test plan
- `cargo build -p apr-cli --features 'inference cuda'` clean
- `apr bench Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --json` exits cleanly with valid JSON (instead of panicking)
- expert_count > 0)

Cross-refs
🤖 Generated with Claude Code