
feat(apr-cli): bench MoE dispatch — routes Qwen3-MoE GGUFs through forward_qwen3_moe (#1749)#1751

Merged
noahgift merged 3 commits into main from feat/apr-bench-moe-dispatch on May 17, 2026


Summary

Closes #1749. Running apr bench against any MoE GGUF (Qwen3-Coder-30B-A3B-Instruct etc.) used to panic in matmul_fused.rs:211 (index out of bounds: len=0 but index ≈ 91M). The dense forward_single_with_cache path looks up 2D ffn_{gate,up,down}.weight tensors, which do not exist on MoE models; those models store 3D *_exps tensors instead.

This PR detects MoE via gguf.expert_count().is_some() and routes to forward_qwen3_moe (CPU) / forward_qwen3_moe_cuda (GPU), running them autoregressively. Unblocks M-GPU-MOE-3 PR-4 throughput measurement.
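The routing decision described above can be sketched as follows. This is a minimal illustration, not the code in bench_moe.rs: `GgufMeta`, `ForwardPath`, `is_moe`, and `select_path` are hypothetical names. Treating `Some(0)` as dense is an assumption that reconciles the `is_some()` check with the `expert_count > 0` wording in the test plan.

```rust
/// Which forward implementation the bench should call.
#[derive(Debug, PartialEq, Clone, Copy)]
enum ForwardPath {
    Dense,   // existing forward_single_with_cache path
    MoeCpu,  // stands in for forward_qwen3_moe
    MoeCuda, // stands in for forward_qwen3_moe_cuda
}

/// Hypothetical stand-in for the parsed GGUF metadata.
struct GgufMeta {
    expert_count: Option<u32>,
}

/// A model counts as MoE when it declares at least one expert.
/// Assumption: a declared count of zero is treated as dense.
fn is_moe(meta: &GgufMeta) -> bool {
    matches!(meta.expert_count, Some(n) if n > 0)
}

/// Pick the forward path before any tensor lookup happens,
/// so the dense path never sees an MoE model.
fn select_path(meta: &GgufMeta, use_cuda: bool) -> ForwardPath {
    if !is_moe(meta) {
        ForwardPath::Dense
    } else if use_cuda {
        ForwardPath::MoeCuda
    } else {
        ForwardPath::MoeCpu
    }
}
```

Making the decision once, up front, keeps the dense path untouched for models that declare no experts.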

Empirical (lambda-vector RTX 4090, 2026-05-17)

apr bench /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --max-tokens 8 --warmup 1 --iterations 4 --json
{
  "total_tokens": 4,
  "total_time_ms": 87085,
  "mean_time_ms": 21771,
  "time_to_first_token_ms": 22839,
  "latency_p50_ms": 22216,
  "latency_p99_ms": 32470
}

0.046 tok/s effective (4 tokens in 87.085 s). The autoregressive re-prefill cost dominates because there is no KV cache yet. PR-4's job is to push this to ≥150 tok/s by adding the cache.
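The effective throughput quoted above follows directly from the `total_tokens` and `total_time_ms` fields of the JSON output. A small sketch of that arithmetic, where `tokens_per_second` is a hypothetical helper and not part of apr-cli:

```rust
/// Effective throughput from the two JSON fields.
/// Returns None when no time elapsed, to avoid dividing by zero.
fn tokens_per_second(total_tokens: u64, total_time_ms: u64) -> Option<f64> {
    if total_time_ms == 0 {
        return None;
    }
    // Convert milliseconds to seconds before dividing.
    Some(total_tokens as f64 / (total_time_ms as f64 / 1000.0))
}
```

With the numbers from the run above, `tokens_per_second(4, 87085)` evaluates to roughly 0.0459, which rounds to the 0.046 tok/s reported.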

What's in this PR

  • crates/apr-cli/src/commands/bench_moe.rs (new): is_moe_gguf predicate + run_gguf_moe_benchmark + CUDA/CPU autoregressive bench helpers using greedy argmax decode.
  • crates/apr-cli/src/commands/bench.rs: new include!("bench_moe.rs") (same pattern as the existing bench sub-files).
  • crates/apr-cli/src/commands/benchmark.rs: MoE detection in run_gguf_benchmark + tail-call to run_gguf_moe_benchmark.

What's NOT in this PR

  • True KV-cache MoE decoding (= M-GPU-MOE-3 PR-4 throughput target proper)
  • MoE bench for SafeTensors / APR formats (only GGUF MoE today)

Test plan

  • cargo build -p apr-cli --features 'inference cuda' clean
  • apr bench Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --json exits cleanly with valid JSON (instead of panicking)
  • Dense GGUF bench path unchanged (route only flips on expert_count > 0)

Cross-refs

🤖 Generated with Claude Code

…rward_qwen3_moe (#1749)

Closes #1749. Pre-fix, `apr bench` against any MoE GGUF
(Qwen3-Coder-30B-A3B-Instruct etc.) was routed through the dense
`forward_single_with_cache` path, which reaches `matmul_fused.rs:211`
with tensor names that don't exist on MoE models: the 3D `*_exps`
tensors are stored under different names than the 2D
`ffn_{gate,up,down}.weight` tensors the dense path looks up. Result:
hundreds of parallel thread panics with `index out of bounds: len=0
but index ≈ 91M`.
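To make the name mismatch concrete, here is a minimal sketch that classifies a layer's FFN tensors by name and rank before any matmul runs. It is an illustration under stated assumptions, not realizar's loader: `FfnLayout`, `has_all`, and `classify_ffn` are hypothetical, the `blk.{layer}.` prefix follows the common GGUF naming convention and is assumed here, and the tensor table is modelled as a plain map from name to shape.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum FfnLayout {
    Dense,      // 2D ffn_{gate,up,down}.weight
    MoeExperts, // 3D ffn_{gate,up,down}_exps.weight
    Missing,    // neither set is complete
}

/// True when gate, up and down all exist with the expected rank.
fn has_all(tensors: &HashMap<String, Vec<usize>>, layer: usize, suffix: &str, rank: usize) -> bool {
    ["gate", "up", "down"].iter().all(|proj| {
        // Assumed naming scheme; adjust to the real tensor table.
        let name = format!("blk.{layer}.ffn_{proj}{suffix}.weight");
        tensors.get(&name).map_or(false, |shape| shape.len() == rank)
    })
}

/// Decide which FFN layout a layer has instead of assuming dense
/// and indexing into a tensor that was never found.
fn classify_ffn(tensors: &HashMap<String, Vec<usize>>, layer: usize) -> FfnLayout {
    if has_all(tensors, layer, "", 2) {
        FfnLayout::Dense
    } else if has_all(tensors, layer, "_exps", 3) {
        FfnLayout::MoeExperts
    } else {
        FfnLayout::Missing
    }
}
```

A check of this kind turns a missing tensor into an explicit `Missing` result rather than an out-of-bounds index deep inside a kernel.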

This PR adds MoE detection via `gguf.expert_count().is_some()` and
routes to the MoE-aware forward path:

  CPU:  realizar::gguf::OwnedQuantizedModel::forward_qwen3_moe
  CUDA: realizar::gguf::OwnedQuantizedModelCuda::forward_qwen3_moe_cuda

Neither helper currently exposes a KV cache, so the bench runs
them **autoregressively with re-prefill**: each iteration runs a full
forward pass over `prompt + previously-generated tokens` and appends
the argmax to the prompt for the next iteration. That is O(N²) in the
N generated tokens, but bounded by `--max-tokens` (default 32).
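The re-prefill loop can be sketched generically as below. This is a simplified illustration, not the helpers in bench_moe.rs: `generate_with_reprefill` and `argmax` are hypothetical names, the `forward` closure stands in for `forward_qwen3_moe[_cuda]` and is assumed to return logits for the last position, and end-of-sequence handling is omitted.

```rust
/// Index of the largest logit. Ties resolve to the lowest index; NaNs are skipped.
fn argmax(logits: &[f32]) -> Option<u32> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue;
        }
        match best {
            Some((_, b)) if v <= b => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i as u32)
}

/// Greedy decode without a KV cache: every step re-runs the forward
/// pass over the whole context. Returns the generated tokens and the
/// total number of token positions pushed through `forward`.
fn generate_with_reprefill<F>(prompt: &[u32], max_tokens: usize, mut forward: F) -> (Vec<u32>, usize)
where
    F: FnMut(&[u32]) -> Vec<f32>,
{
    let mut context = prompt.to_vec();
    let mut generated = Vec::with_capacity(max_tokens);
    let mut positions = 0usize;
    for _ in 0..max_tokens {
        let logits = forward(&context);
        positions += context.len(); // full context processed again
        match argmax(&logits) {
            Some(next) => {
                context.push(next); // grows the input of the next step
                generated.push(next);
            }
            None => break, // empty or all-NaN logits
        }
    }
    (generated, positions)
}
```

For a prompt of length P and N generated tokens, the loop processes N·P + N(N−1)/2 positions in total, which is where the quadratic cost comes from. A KV cache would process the P prompt positions once and then one new position per generated token.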

This is intentionally a stop-gap to unblock M-GPU-MOE-3 PR-4
throughput measurement. True KV-cache MoE decoding is the actual
PR-4 work; this PR makes `apr bench` produce a real (if pessimistic)
tok/s number for MoE GGUFs instead of panicking.

## Empirical (lambda-vector RTX 4090, 2026-05-17)

  apr bench /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
    --max-tokens 8 --warmup 1 --iterations 4 --json

  → total_time_ms: 87085 ; total_tokens: 4
  → 0.046 tok/s effective (auto-regressive re-prefill cost dominates)
  → 22.8s ttft, 22.2s p50, 32.5s p99

The 0.046 tok/s is the upper bound on what `apr bench` can currently
measure for MoE without KV cache. PR-4's job is to add the cache and
push this to ≥ 150 tok/s.

## What's in this PR

  crates/apr-cli/src/commands/bench_moe.rs (new):
    - `is_moe_gguf(&GGUFModel)` predicate
    - `run_gguf_moe_benchmark` — loads MappedGGUFModel + N
      Qwen3MoeQuantizedLayer descriptors + (optionally) wraps in
      OwnedQuantizedModelCuda, then dispatches to the CUDA or CPU
      bench helper.
    - `run_cuda_moe_benchmark` — autoregressive
      forward_qwen3_moe_cuda + greedy argmax decode.
    - `run_cpu_moe_benchmark` — autoregressive forward_qwen3_moe.

  crates/apr-cli/src/commands/bench.rs:
    + `include!("bench_moe.rs")` after the existing
      `include!("bench_safetensors.rs")` (same pattern as the other
      bench sub-files).

  crates/apr-cli/src/commands/benchmark.rs:
    + In `run_gguf_benchmark`, after parsing the GGUF and tokenising
      the prompt, check `is_moe_gguf(&gguf)`. If true, log the
      detection (`expert_count` + top-k) and tail-call
      `run_gguf_moe_benchmark`. Otherwise fall through to the
      existing dense path.

## What's NOT in this PR

  - True KV-cache MoE decoding (= M-GPU-MOE-3 PR-4 throughput target)
  - Streaming/per-token JSON output for MoE (existing JSON output
    works; just reflects the autoregressive re-prefill cost)
  - MoE bench for SafeTensors / APR formats (only GGUF MoE supported
    today; the other formats don't have MoE production paths in the
    realizar inference engine)

## Cross-refs

- #1583 M-GPU-MOE-3 — PR-4 throughput unblocks on this
- #1747 contract qwen3-moe-forward-gpu-v1 v1.7.2 (just merged) —
  L47 known-divergence + cascade pause point
- The MoE bench helpers reuse `forward_qwen3_moe[_cuda]` directly,
  which means PR-2 #1737's fp64 q6k_gemv acc is in effect; this
  bench measures *post-fp64-acc* throughput, not the pre-fix path.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 541c247 into main May 17, 2026
10 checks passed
@noahgift noahgift deleted the feat/apr-bench-moe-dispatch branch May 17, 2026 15:01

Development

Successfully merging this pull request may close these issues.

defect: apr bench panics on MoE GGUF models (matmul_fused.rs:211 index OOB)
