Repro
apr bench /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
--max-tokens 32 --warmup 1 --iterations 3 --json
Symptom
Hundreds of parallel thread panics:
thread '<unnamed>' panicked at crates/aprender-serve/src/gguf/inference/matmul_fused.rs:211:54:
index out of bounds: the len is 0 but the index is 91324416
Same failure with --fast (which routes through realizar).
Why this matters
This blocks M-GPU-MOE-3 PR-4 throughput tuning (the part-2 acceptance criterion of #1583 — ≥150 tok/s on RTX 4090 + VRAM ≤95%). Without a working bench surface for MoE GGUFs, there's no shippable measurement step for PR-4.
The model itself works end-to-end via apr run:
apr run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
--prompt "def hello():" --max-tokens 32
→ Output: ```python\ndef hello():\n print("Hello, World!")\n```\n...
→ Completed in 215.68s (CPU path — 0.15 tok/s)
So routing + inference work; only apr bench is broken.
Root cause hypothesis
matmul_fused.rs:211 is in the dense FFN matmul path. For MoE models the FFN tensors are stacked (3D: [num_experts, hidden_dim, intermediate]) rather than 2D. The dense kernel looks up the 2D tensor under its dense name, which doesn't exist in an MoE GGUF (only the 3D *_exps tensor does), so it gets an empty buffer (len = 0) and then indexes into it with a flat [hidden_dim * intermediate] offset, hence the out-of-bounds panic. The bench path likely doesn't dispatch to moe_ffn_forward_layer the way the regular forward pass does.
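To illustrate the hypothesis, here is a minimal sketch of the two access patterns. It assumes row-major f32 weights for simplicity (the real tensors are Q4_K_M quantized), and the function names expert_weights and dense_weight_at are hypothetical, not taken from aprender-serve.

```rust
/// Return the weights of expert `e` from a stacked, row-major
/// [num_experts, hidden_dim, intermediate] buffer.
/// Returns None if `e` is out of range or the buffer is too short.
/// (Hypothetical helper; f32 stands in for the quantized blocks.)
fn expert_weights(
    stacked: &[f32],
    num_experts: usize,
    hidden_dim: usize,
    intermediate: usize,
    e: usize,
) -> Option<&[f32]> {
    if e >= num_experts {
        return None;
    }
    // Each expert owns one contiguous hidden_dim * intermediate slab.
    let per_expert = hidden_dim.checked_mul(intermediate)?;
    let start = e.checked_mul(per_expert)?;
    let end = start.checked_add(per_expert)?;
    stacked.get(start..end)
}

/// Checked read from a dense tensor. If the dense tensor is missing
/// (empty slice), this returns an error instead of panicking the way
/// a bare `dense[idx]` would.
fn dense_weight_at(dense: &[f32], idx: usize) -> Result<f32, String> {
    dense.get(idx).copied().ok_or_else(|| {
        format!("index {} out of bounds for tensor of len {}", idx, dense.len())
    })
}
```

With an empty slice, dense_weight_at(&[], 91324416) returns an Err carrying the same information as the panic message above, which is the behaviour a checked lookup in the bench path would give.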
Fix scope
In apr bench codegen (likely crates/apr-cli/src/commands/bench.rs), detect MoE models (presence of *_exps tensors or model.expert_count > 0 metadata) and route to the MoE forward path (forward_qwen3_moe / forward_qwen3_moe_cuda) instead of the dense matmul path.
Alternative quick fix: hard-fail with a clear error message ("MoE models not yet supported by apr bench; use apr run with --time-it") and surface a --moe flag to opt into the MoE-aware bench path.
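A rough sketch of the detection step, assuming the bench command can see the tensor names and an optional expert-count metadata value before choosing a forward path. BenchPath and select_bench_path are hypothetical names, and the tensor names in the usage note are illustrative only.

```rust
/// Which forward path the bench should drive. (Hypothetical type.)
#[derive(Debug, PartialEq)]
enum BenchPath {
    Dense,
    Moe,
}

/// Pick the MoE path if any tensor name contains "_exps" or the
/// expert-count metadata is present and greater than zero.
/// Otherwise fall back to the dense path.
fn select_bench_path(tensor_names: &[&str], expert_count: Option<u32>) -> BenchPath {
    let has_expert_tensors = tensor_names.iter().any(|name| name.contains("_exps"));
    let has_expert_metadata = expert_count.map_or(false, |n| n > 0);
    if has_expert_tensors || has_expert_metadata {
        BenchPath::Moe
    } else {
        BenchPath::Dense
    }
}
```

For example, select_bench_path(&["blk.0.ffn_gate_exps.weight"], None) yields BenchPath::Moe, while a model with only dense FFN tensors and no expert metadata yields BenchPath::Dense. The quick-fix variant would return an error on Moe instead of dispatching.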
Acceptance
apr bench /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
--fast --max-tokens 32 --warmup 1 --iterations 3 --json
- exits 0
- emits JSON with tok_per_sec, p50_latency_ms, p95_latency_ms, p99_latency_ms
- on RTX 4090, decode tok/s reflects the GPU MoE forward path
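For reference, one way the reported fields could be derived from per-iteration timings. This is a sketch using the nearest-rank percentile definition; how apr bench actually computes these values is not stated here, and both function names are hypothetical.

```rust
/// Nearest-rank percentile of latency samples (in ms).
/// Returns None for an empty sample set or p outside [0, 100].
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: ceil(p/100 * n), clamped to at least 1, then 0-based.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Throughput in tokens per second from a token count and elapsed ms.
fn tok_per_sec(tokens: usize, elapsed_ms: f64) -> Option<f64> {
    if elapsed_ms <= 0.0 {
        None
    } else {
        Some(tokens as f64 * 1000.0 / elapsed_ms)
    }
}
```

Under this definition, with --iterations 3 the p95 and p99 values both collapse to the slowest iteration, so a higher iteration count is needed for those fields to be informative.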
Cross-refs
- apr run works against this exact GGUF (CPU path validated above)
Compute lane
Lambda-vector RTX 4090 (sm_89); model cached at /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf.