defect: `apr bench` panics on MoE GGUF models (matmul_fused.rs:211 index OOB) #1749

## Repro

```bash
apr bench /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --max-tokens 32 --warmup 1 --iterations 3 --json
```

## Symptom

Hundreds of parallel thread panics:

```
thread '<unnamed>' panicked at crates/aprender-serve/src/gguf/inference/matmul_fused.rs:211:54:
index out of bounds: the len is 0 but the index is 91324416
```

Same failure with `--fast` (which routes through `realizar`).

## Why this matters

This blocks **M-GPU-MOE-3 PR-4 throughput tuning** (the part-2 acceptance criterion of #1583 — ≥150 tok/s on RTX 4090 + VRAM ≤95%). Without a working bench surface for MoE GGUFs, there's no shippable measurement step for PR-4.

The model itself works end-to-end via `apr run`:

```bash
apr run /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --prompt "def hello():" --max-tokens 32
→ Output: ```python\ndef hello():\n    print("Hello, World!")\n```\n...
→ Completed in 215.68s (CPU path — 0.15 tok/s)
```

So routing + inference work; only `apr bench` is broken.

## Root cause hypothesis

`matmul_fused.rs:211` is the dense FFN matmul path. For MoE models, the FFN weights are stacked 3D tensors (`[num_experts, hidden_dim, intermediate]`, stored under `*_exps` names) rather than 2D tensors. The dense matmul kernel looks up the 2D tensor name and indexes it as a flat `[hidden_dim * intermediate]` buffer. Because no 2D dense tensor exists under that name (only the 3D `*_exps` tensor does), the buffer has len=0 and any index into it is out of bounds, which matches the panic above. The bench path likely doesn't dispatch to `moe_ffn_forward_layer` the way the regular forward path does.
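The suspected failure mode can be reproduced in isolation. The following is a minimal sketch, not the actual `matmul_fused.rs` code: `TensorTable`, `dense_weights`, and `checked_weight` are hypothetical names, and the tensor names are illustrative. It assumes the dense path turns a missing tensor into an empty slice, which would explain `len is 0` in the panic.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the GGUF tensor table: name -> flat weight data.
type TensorTable = HashMap<String, Vec<f32>>;

/// Assumed dense-path behaviour: a missing tensor silently becomes an empty slice.
fn dense_weights<'a>(tensors: &'a TensorTable, name: &str) -> &'a [f32] {
    tensors.get(name).map(|v| v.as_slice()).unwrap_or(&[])
}

/// Guarded lookup: reports a missing tensor or a bad index instead of panicking.
fn checked_weight(tensors: &TensorTable, name: &str, idx: usize) -> Result<f32, String> {
    let w = tensors
        .get(name)
        .ok_or_else(|| format!("tensor `{name}` not found (MoE model? look for `*_exps`)"))?;
    w.get(idx)
        .copied()
        .ok_or_else(|| format!("index {idx} out of bounds for `{name}` (len {})", w.len()))
}

fn main() {
    let mut tensors = TensorTable::new();
    // An MoE checkpoint only carries the stacked expert tensor (illustrative name).
    tensors.insert("blk.0.ffn_gate_exps.weight".to_string(), vec![0.5; 8]);

    // The dense name does not exist, so the slice is empty.
    let w = dense_weights(&tensors, "blk.0.ffn_gate.weight");
    println!("dense len = {}", w.len());
    // `w[3]` here would panic with "index out of bounds: the len is 0 ...".

    // The guarded variant surfaces a readable error instead.
    println!("{:?}", checked_weight(&tensors, "blk.0.ffn_gate.weight", 3));
}
```

If the real lookup behaves this way, returning an error at lookup time would turn the hundreds of thread panics into a single diagnosable message.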

## Fix scope

In `apr bench` codegen (likely `crates/apr-cli/src/commands/bench.rs`), detect MoE models (presence of `*_exps` tensors or `model.expert_count > 0` metadata) and route to the MoE forward path (`forward_qwen3_moe` / `forward_qwen3_moe_cuda`) instead of the dense matmul path.

Alternative quick fix: hard-fail with a clear error message ("MoE models not yet supported by `apr bench`; use `apr run` with --time-it") and surface a `--moe` flag to opt into the MoE-aware bench path.
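Both options can be sketched together. This is a rough illustration under stated assumptions, not the `bench.rs` implementation: `FfnPath`, `select_ffn_path`, and `bench_guard` are hypothetical names, the tensor names are illustrative, and the `_exps` substring match stands in for whatever detection the loader actually exposes.

```rust
/// Which FFN forward path the bench should exercise (hypothetical type).
#[derive(Debug, PartialEq)]
enum FfnPath {
    Dense,
    Moe,
}

/// Detect MoE by either signal named above: `*_exps` tensors or expert_count > 0.
fn select_ffn_path(tensor_names: &[&str], expert_count: Option<u32>) -> FfnPath {
    let has_exps = tensor_names.iter().any(|n| n.contains("_exps"));
    if has_exps || expert_count.unwrap_or(0) > 0 {
        FfnPath::Moe
    } else {
        FfnPath::Dense
    }
}

/// Quick-fix variant: refuse MoE models unless the user opted in (e.g. via `--moe`).
fn bench_guard(path: &FfnPath, moe_opt_in: bool) -> Result<(), String> {
    match path {
        FfnPath::Moe if !moe_opt_in => Err(
            "MoE models not yet supported by `apr bench`; use `apr run` with --time-it"
                .to_string(),
        ),
        _ => Ok(()),
    }
}

fn main() {
    // Illustrative tensor names and metadata for an MoE checkpoint.
    let names = ["blk.0.attn_q.weight", "blk.0.ffn_gate_exps.weight"];
    let path = select_ffn_path(&names, Some(128));
    println!("selected path: {:?}", path);

    // Without the opt-in flag the guard fails cleanly instead of panicking.
    println!("{:?}", bench_guard(&path, false));
}
```

Checking both signals keeps detection robust if a file carries the expert tensors but omits the metadata key, or the reverse.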

## Acceptance

```bash
apr bench /home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  --fast --max-tokens 32 --warmup 1 --iterations 3 --json
```
- exits 0
- emits JSON with `tok_per_sec`, `p50_latency_ms`, `p95_latency_ms`, `p99_latency_ms`
- on RTX 4090, decode tok/s reflects the GPU MoE forward path
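
The JSON criterion could be checked mechanically. The sketch below is a deliberately naive substring check with a hypothetical helper name (`missing_keys`); it assumes the four field names appear as quoted keys in the output, and a real check would parse the JSON properly. The payload in `main` is a dummy, not real bench output.

```rust
/// Field names required by the acceptance criteria above.
const REQUIRED_KEYS: [&str; 4] = [
    "tok_per_sec",
    "p50_latency_ms",
    "p95_latency_ms",
    "p99_latency_ms",
];

/// Returns the required keys that do not appear as quoted strings in `json`.
/// Naive: matches `"key"` anywhere in the text, without parsing the JSON.
fn missing_keys(json: &str) -> Vec<&'static str> {
    REQUIRED_KEYS
        .iter()
        .copied()
        .filter(|k| !json.contains(&format!("\"{k}\"")))
        .collect()
}

fn main() {
    // Dummy payload for illustration only; values are placeholders.
    let sample = r#"{"tok_per_sec": 0.0, "p50_latency_ms": 0.0}"#;
    let missing = missing_keys(sample);
    if missing.is_empty() {
        println!("all required fields present");
    } else {
        println!("missing fields: {:?}", missing);
    }
}
```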

## Cross-refs

- #1583 M-GPU-MOE-3 — PR-4 throughput blocked on this
- #1737 PR-2 fp64 q6k_gemv acc already shipped — GPU path works for forward, just not bench
- `apr run` works against this exact GGUF (CPU path validated above)

## Compute lane

Lambda-vector RTX 4090 (sm_89) — model cached at `/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf`.
