
defect: apr bench panics on MoE GGUF models (matmul_fused.rs:211 index OOB) #1749

@noahgift

Repro

apr bench /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --max-tokens 32 --warmup 1 --iterations 3 --json

Symptom

Hundreds of parallel thread panics:

thread '<unnamed>' panicked at crates/aprender-serve/src/gguf/inference/matmul_fused.rs:211:54:
index out of bounds: the len is 0 but the index is 91324416

Same failure with --fast (which routes through realizar).

Why this matters

This blocks M-GPU-MOE-3 PR-4 throughput tuning (the part-2 acceptance criterion of #1583 — ≥150 tok/s on RTX 4090 + VRAM ≤95%). Without a working bench surface for MoE GGUFs, there's no shippable measurement step for PR-4.

The model itself works end-to-end via apr run:

apr run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --prompt "def hello():" --max-tokens 32
→ Output: ```python\ndef hello():\n    print("Hello, World!")\n```\n...
→ Completed in 215.68s (CPU path — 0.15 tok/s)

So routing + inference work; only apr bench is broken.

Root cause hypothesis

matmul_fused.rs:211 is the dense FFN matmul path. For MoE models, the FFN tensors are stacked (3D: [num_experts, hidden_dim, intermediate]) rather than 2D. The dense matmul kernel looks up the 2D tensor under its dense name and indexes it as a flat [hidden_dim * intermediate] buffer. On a MoE model that 2D tensor doesn't exist (only the 3D *_exps variant does), so the lookup yields a buffer with len=0 and any indexed access panics. The len=0 is therefore correct; the bug is that the dense path runs at all. The bench path likely doesn't dispatch to moe_ffn_forward_layer like the regular forward does.
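The hypothesis above can be illustrated with a minimal sketch. It is not the code in matmul_fused.rs: `dense_weight_checked` and `expert_offset` are hypothetical helpers, the weight buffer is modelled as plain bytes (ignoring Q4_K block layout), and the row-major [num_experts, hidden_dim, intermediate] ordering is taken at face value from the description above.

```rust
/// Hypothetical stand-in for reading one value from a dense FFN weight buffer.
/// On a MoE model the dense tensor name resolves to nothing, so `data` is empty
/// and a plain `data[byte_offset]` would panic with "index out of bounds".
fn dense_weight_checked(data: &[u8], byte_offset: usize) -> Result<u8, String> {
    data.get(byte_offset).copied().ok_or_else(|| {
        format!(
            "dense FFN weight has len {} but offset {} was requested; \
             is this a MoE model with stacked *_exps tensors?",
            data.len(),
            byte_offset
        )
    })
}

/// Flat element offset of the first weight of expert `e` inside a stacked
/// [num_experts, hidden_dim, intermediate] tensor (row-major is an assumption).
fn expert_offset(e: usize, hidden_dim: usize, intermediate: usize) -> usize {
    e * hidden_dim * intermediate
}
```

With an empty buffer and the offset from the panic message, `dense_weight_checked(&[], 91_324_416)` returns an `Err` instead of aborting the thread, which would at least turn hundreds of panics into one readable diagnostic.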

Fix scope

In apr bench codegen (likely crates/apr-cli/src/commands/bench.rs), detect MoE models (presence of *_exps tensors or model.expert_count > 0 metadata) and route to the MoE forward path (forward_qwen3_moe / forward_qwen3_moe_cuda) instead of the dense matmul path.

Alternative quick fix: hard-fail with a clear error message ("MoE models not yet supported by apr bench; use apr run with --time-it") and surface a --moe flag to opt into the MoE-aware bench path.
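Both options could hang off one detection step, sketched below. Everything here is hypothetical: `ModelInfo`, `BenchPath`, `is_moe`, and `select_bench_path` are illustrative names, not existing apr-cli types, and the `moe_bench_enabled` parameter stands in for either a real MoE bench path or the proposed --moe opt-in. The only assumption about the file format is that stacked expert tensors carry `_exps` in their names, as described above.

```rust
#[derive(Debug, PartialEq)]
enum BenchPath {
    Dense,
    Moe,
}

/// Hypothetical summary of what the bench command knows after loading a GGUF.
struct ModelInfo<'a> {
    tensor_names: &'a [&'a str],
    expert_count: Option<u32>,
}

/// A model counts as MoE if metadata reports experts or any stacked
/// `*_exps` tensor is present.
fn is_moe(info: &ModelInfo<'_>) -> bool {
    info.expert_count.map_or(false, |n| n > 0)
        || info.tensor_names.iter().any(|name| name.contains("_exps"))
}

/// Pick the forward path, or fail with a clear message (the quick-fix option)
/// when the MoE bench path is not enabled.
fn select_bench_path(info: &ModelInfo<'_>, moe_bench_enabled: bool) -> Result<BenchPath, String> {
    if !is_moe(info) {
        return Ok(BenchPath::Dense);
    }
    if moe_bench_enabled {
        Ok(BenchPath::Moe)
    } else {
        Err("MoE models not yet supported by apr bench; use apr run with --time-it".to_string())
    }
}
```

Checking both the metadata and the tensor names is deliberate: either signal alone is enough to keep a MoE model out of the dense matmul path.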

Acceptance

apr bench /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --fast --max-tokens 32 --warmup 1 --iterations 3 --json
  • exits 0
  • emits JSON with tok_per_sec, p50_latency_ms, p95_latency_ms, p99_latency_ms
  • on RTX 4090, decode tok/s reflects the GPU MoE forward path
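For reference, the JSON fields in the acceptance list could be derived from per-iteration samples roughly as follows. This is a sketch under assumptions: it uses the nearest-rank percentile definition, which may not match what apr bench actually implements, and `percentile_ms` and `tok_per_sec` are hypothetical helper names.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// Returns None for an empty sample set. `p` is in the range 0..=100.
fn percentile_ms(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: smallest 1-based rank covering at least p percent of samples.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    let idx = rank.clamp(1, sorted.len()) - 1;
    Some(sorted[idx])
}

/// Decode throughput: generated tokens divided by wall-clock seconds.
fn tok_per_sec(tokens: usize, elapsed_secs: f64) -> f64 {
    tokens as f64 / elapsed_secs
}
```

With only --iterations 3, p95 and p99 both collapse to the slowest sample under this definition, so the tail percentiles only become informative at higher iteration counts.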

Cross-refs

Compute lane

Lambda-vector RTX 4090 (sm_89) — model cached at /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf.
