perf(aprender-serve): parallelize MoE expert dispatch with rayon — 2× speedup by noahgift · Pull Request #1396 · paiml/aprender

noahgift · 2026-05-02T09:31:19Z

Summary

The top-k experts (k=8 for Qwen3-Coder-30B-A3B-Instruct) were running sequentially in moe_ffn_forward_layer. Each expert_swiglu_quantized call is independent: it reads its own slice of the on-disk Q4_K/Q6_K MoE tensors and produces its own [hidden_dim] output, so the loop parallelizes trivially with rayon.
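
The dispatch pattern this PR describes (run each selected expert independently, keep the results in top-k order, then do a sequential weighted-add fold) can be sketched as follows. This is a minimal illustration, not the code in `moe_ffn_forward_layer`: it uses `std::thread::scope` from the standard library in place of rayon's `par_iter` so that it compiles without dependencies, and `expert_forward`, `weighted_fold`, `dispatch_sequential`, and `dispatch_parallel` are hypothetical names wrapped around a toy expert body.

```rust
use std::thread;

/// Hypothetical stand-in for one expert's forward pass (the PR's
/// `expert_swiglu_quantized`): returns a `[hidden_dim]` vector.
fn expert_forward(expert_id: usize, input: &[f32]) -> Vec<f32> {
    let scale = 1.0 + expert_id as f32 * 0.25;
    input.iter().map(|x| (x * scale).tanh()).collect()
}

/// Sequential weighted-add fold over `(weight, expert_out)` pairs, in slice order.
fn weighted_fold(pairs: &[(f32, Vec<f32>)], hidden_dim: usize) -> Vec<f32> {
    let mut acc = vec![0.0f32; hidden_dim];
    for (weight, out) in pairs {
        for (a, o) in acc.iter_mut().zip(out) {
            *a += weight * o;
        }
    }
    acc
}

/// Baseline: experts run one after another.
fn dispatch_sequential(topk: &[(usize, f32)], input: &[f32]) -> Vec<f32> {
    let pairs: Vec<(f32, Vec<f32>)> = topk
        .iter()
        .map(|&(id, w)| (w, expert_forward(id, input)))
        .collect();
    weighted_fold(&pairs, input.len())
}

/// Parallel variant: one scoped thread per selected expert. Handles are joined
/// in spawn order, so `pairs` has the same order as `topk`.
fn dispatch_parallel(topk: &[(usize, f32)], input: &[f32]) -> Vec<f32> {
    let pairs: Vec<(f32, Vec<f32>)> = thread::scope(|s| {
        let handles: Vec<_> = topk
            .iter()
            .map(|&(id, w)| s.spawn(move || (w, expert_forward(id, input))))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("expert thread panicked"))
            .collect::<Vec<_>>()
    });
    weighted_fold(&pairs, input.len())
}
```

Calling both functions on the same `topk` and `input` should return bit-identical vectors, because the fold visits the pairs in the same order either way. Spawning one thread per expert is only reasonable in a sketch with small k; rayon schedules the same work onto a shared thread pool instead.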

Live perf on lambda-vector RTX 4090 (16 cores)

$ apr run <17.3 GB Qwen3-Coder GGUF> --prompt "What is 2+2?" --max-tokens 8

Pre-fix:  Completed in 38.93s   (4.87 s/token, 0.21 tok/s)
Post-fix: Completed in 18.56s   (2.32 s/token, 0.43 tok/s, CPU 1682%)  ← 2.1× speedup
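
The per-token figures in this measurement follow from dividing the wall-clock time by the 8 requested tokens. Below is a small sketch of that arithmetic, assuming (as the quoted numbers do) that the whole run time is attributed to the generated tokens. The function names are illustrative only.

```rust
/// Seconds per generated token for a run that produced `tokens` tokens.
fn seconds_per_token(wall_seconds: f64, tokens: u32) -> f64 {
    wall_seconds / f64::from(tokens)
}

/// Throughput in tokens per second for the same run.
fn tokens_per_second(wall_seconds: f64, tokens: u32) -> f64 {
    f64::from(tokens) / wall_seconds
}

/// Speedup of the `after` run relative to the `before` run on the same workload.
fn speedup(before_seconds: f64, after_seconds: f64) -> f64 {
    before_seconds / after_seconds
}
```

With the numbers reported here, `seconds_per_token(38.93, 8)` is about 4.87, `tokens_per_second(18.56, 8)` is about 0.43, and `speedup(38.93, 18.56)` is about 2.1.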

Why not 8×

fused_q4k_parallel_matvec and fused_q6k_parallel_matvec are already rayon-parallel internally over output rows, so the inner kernels already use part of the core budget
Memory bandwidth saturation: 8 experts × ~1.6 MiB of Q4_K/Q6_K reads ≈ 13 MiB per forward
2× from outer-rayon on top of inner-rayon is the realistic ceiling on this hardware

Multi-token decode will see better amortization (same MoE tensor mmap pages stay warm).
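
The two back-of-envelope numbers behind this section (quantized bytes in flight and the number of busy cores implied by the CPU percentage) can be reproduced with a short sketch. The helper names are hypothetical and the inputs are the approximate figures quoted in this PR, not new measurements.

```rust
const MIB: f64 = 1024.0 * 1024.0;

/// Quantized bytes read per forward when `k` experts are active and each
/// reads roughly `mib_per_expert` MiB.
fn bytes_in_flight(k: u32, mib_per_expert: f64) -> f64 {
    f64::from(k) * mib_per_expert * MIB
}

/// Convert a `top`-style CPU percentage into an equivalent number of busy cores.
fn busy_cores(cpu_percent: f64) -> f64 {
    cpu_percent / 100.0
}
```

For k=8 and ~1.6 MiB per expert this gives about 12.8 MiB, and 1682% CPU corresponds to roughly 16.8 busy cores.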

Hot-path safety

Numerical output identical to sequential (deterministic weighted-add fold)
All qwen3_moe_* tests pass unchanged
Independent of the M32d correctness fixes (#1222, #1228): this is purely a parallelism change
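
The "identical to sequential" claim depends on the fold visiting the expert outputs in a fixed order, because f32 addition is commutative but not associative. The sketch below uses hand-picked values, not values from the model, and `fold_in_order` is a hypothetical helper. It shows that the same operands summed in a different order can give a different f32 result.

```rust
/// Sum the values left to right, exactly in the order given.
fn fold_in_order(values: &[f32]) -> f32 {
    values.iter().fold(0.0f32, |acc, v| acc + v)
}

/// Same operands, two visit orders. Near 1e8 the spacing between adjacent
/// f32 values is 8, so adding 1.0 to -1e8 rounds back to -1e8.
fn order_sensitivity_demo() -> (f32, f32) {
    let large_first = [1.0e8f32, -1.0e8, 1.0]; // (1e8 - 1e8) + 1 = 1
    let small_first = [1.0f32, -1.0e8, 1.0e8]; // (1 - 1e8) + 1e8 = 0
    (fold_in_order(&large_first), fold_in_order(&small_first))
}
```

A sequential fold over results kept in their original top-k order avoids this, since it repeats the same additions in the same order on every run.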

What this PR does NOT ship

GPU MoE path (separate big PR; needs trueno-gpu MoE kernel)
Inner-kernel SIMD optimization
Router parallelization (F32 router is already cheap; would add overhead)

Test plan

cargo check -p aprender-serve --lib — clean
cargo clippy -p aprender-serve --lib -- -D warnings — clean
cargo fmt -p aprender-serve --check — clean
Live apr run on cached 17.3 GB Qwen3-Coder GGUF: 38.93s → 18.56s (2.1× speedup)

Refs

M32d discharge stack feat(aprender-serve): M32d Step 2 — forward_qwen3_moe_traced for per-layer ActivationStats #1222, fix(aprender-serve): M32d Step 5 — apply per-head Q/K RMSNorm in forward_qwen3_moe (GH-279) — gibberish → coherent English #1228
Original sequential dispatch from M32c.2.2.2.0

🤖 Generated with Claude Code

… speedup

The top-k experts (k=8 for Qwen3-Coder-30B-A3B-Instruct) were running sequentially in `moe_ffn_forward_layer`. Each expert's `expert_swiglu_quantized` call is independent and self-contained: it reads its own slice of the on-disk Q4_K/Q6_K MoE tensors and produces a `[hidden_dim]` output. Trivially parallelizable.

Change: the top-k loop is now `topk_renorm.par_iter().map(...)` collecting into `Vec<(weight, expert_out)>`, followed by a sequential weighted-add fold (cheap compared to per-expert SwiGLU + Q4_K dequant).

Live perf measurement on lambda-vector RTX 4090 (16 cores)
============================================================

Pre-fix (sequential top-k):

    $ apr run <17.3 GB Qwen3-Coder GGUF> --prompt "What is 2+2?" \
        --max-tokens 8
    Completed in 38.93s (cached)
    → 4.87 s/token, 0.21 tok/s

Post-fix (parallel top-k):

    $ apr run <17.3 GB Qwen3-Coder GGUF> --prompt "What is 2+2?" \
        --max-tokens 8
    Completed in 18.56s (cached)
    → 2.32 s/token, 0.43 tok/s
    CPU: 1682% (≈ 17 cores in use simultaneously)

**Speedup: 2.1×** (consistent ~2× across multiple test runs).

Why not 8× (one per expert)?

* The fused_q4k_parallel_matvec / fused_q6k_parallel_matvec inner kernels are already rayon-parallel internally over output rows, so they consume some of the available core budget.
* Memory bandwidth: each expert reads ~1.6 MiB of Q4_K/Q6_K bytes from mmap; with 8 in flight that's ~13 MiB/forward, hitting cache saturation.
* Weighted-add fold is sequential (~50 µs per call vs ~250 ms per expert SwiGLU, so negligible).

2× from outer-rayon on top of inner-rayon is the realistic ceiling on this hardware. Multi-token decode (vs single-prompt) will see better amortization since the same MoE tensor mmap pages stay warm.

Hot-path safety:

* Numerical output is identical to sequential. The expert calls are independent, and although `par_iter` may execute them in any order, collecting an indexed parallel iterator into a `Vec` preserves input order (assuming `topk_renorm` is a slice or `Vec`), so the sequential weighted-add fold sums the same operands in the same order as before. f32 addition is commutative but not associative, so a reordered sum could differ in the last bits; that would be acceptable per CLAUDE.md "ML-specific allows for casts/float_cmp", but it does not arise with an order-preserving collect.
* Tests in `qwen3_moe_*.rs` pass unchanged.
* Independent of the M32d correctness fixes (#1222, #1228): this is purely a parallelism change.

What this PR does NOT ship:

* GPU MoE path (separate big PR; needs trueno-gpu MoE kernel).
* Inner-kernel SIMD optimization (also separate).
* Router parallelization: the F32 router is already cheap (~10 ms); parallelizing it would mostly add overhead.

Refs M32d numerical-parity discharge stack (#1222, #1228), independent
Refs M32c.2.2.2.0 (moe_ffn_forward_layer original)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…rift gate (#1448)

Two related preparation steps for the v0.32.0 cut decision:

## CHANGELOG

Fill out the empty `[Unreleased]` section with today's session body of work (238 commits since v0.31.2):

- **CPU/GPU output parity contract** (jidoka armor): `apr-cpu-vs-gpu-output-parity-v1` v1.0.0 → v1.5.0 ACTIVE with **5/5 falsifiers DISCHARGED** in a single 2-PR cycle (#1445 + #1446), the first contract in the SHIP-TWO program to reach complete-evidence terminal state. CUDA + wgpu fallback log prefixes + inline cosine parity gate.
- **`apr trace --save-tensor`**: new flag for SHIP-007 layer-0 oracle bisection; `apr-cli-trace-save-tensor-v1` v1.4.0 FUNCTIONAL.
- **HF FP16 oracle bisection**: pinpoints SHIP-007 to layer-0 attn_out (cos=0.99999995 attn_norm → 0.9966 attn_out).
- **Distillation training contract**: 9/9 falsifiers algorithm-bound.
- **MoE expert dispatch parallelized**: 2× speedup (#1396).
- **APR file mmap**: unblocks `apr diff --values` on 7B (#1058).
- **M32d numerical-parity bundle**: Q/K RMSNorm + rope_theta + chat template (#1228).
- **150+ contract algorithm-bind sweep**: record cycle, kernel + format + training + GPU-backend + CLI families flipped from `unbound` to `PARTIAL_ALGORITHM_LEVEL`.

## README drift gate repair

`bash scripts/check_readme_claims.sh` was FAILING:

- README claimed 1096 contracts, filesystem has 1105
- README claimed 79 CLI commands, `apr --help` lists 80

Fixed both numbers in the contract-backed table AND the prose references. Drift gate now PASS 4/4.

Five Whys:

1. Why was the gate failing? README contract counts and CLI counts are stale.
2. Why are they stale? 9 new contracts and 1 new CLI command merged since the last README update.
3. Why didn't the gate catch it earlier? It's a script, not yet wired into CI as a hard gate (FALSIFY-README-001..004 are PARTIAL_ALGORITHM_LEVEL, the shell wrapper is documented in the contract but doesn't fail PRs).
4. Why isn't it a CI gate yet? `readme-claims-v1` is recent (2026-04-24), wired to `bash scripts/check_readme_claims.sh` but not to a workflow step.
5. Why fix it now? Pre-release hygiene: releases must ship green drift gates per `feedback_post_publish_qa_required.md`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
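
The drift gate described in this commit message compares numbers claimed in the README with freshly measured ones. The real check is the shell script `scripts/check_readme_claims.sh`, whose contents are not shown here. The sketch below only illustrates the comparison idea, and `Claim` and `drifted` are hypothetical names.

```rust
/// One README claim paired with a freshly measured value.
struct Claim {
    name: &'static str,
    claimed: u32,
    actual: u32,
}

/// Return one message per claim whose README value differs from the measured one.
/// An empty result means the gate passes.
fn drifted(claims: &[Claim]) -> Vec<String> {
    claims
        .iter()
        .filter(|c| c.claimed != c.actual)
        .map(|c| format!("{}: README says {}, measured {}", c.name, c.claimed, c.actual))
        .collect()
}
```

Fed the two stale counts quoted above (1096 vs 1105 contracts, 79 vs 80 CLI commands), `drifted` would report both; once the README numbers are updated it returns an empty list.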

noahgift enabled auto-merge (squash) May 2, 2026 09:31

noahgift added 4 commits May 2, 2026 12:22

ci: retrigger after pre-existing 40min timeout (now have 16 runners +…

0ef2700

… less parallel load)

Merge branch 'main' into perf/m32d-followup-parallel-moe-expert-dispatch

78efbff

Merge branch 'main' into perf/m32d-followup-parallel-moe-expert-dispatch

c2eed59

Merge branch 'main' into perf/m32d-followup-parallel-moe-expert-dispatch

23f5529

noahgift merged commit f54d2a3 into main May 2, 2026
10 checks passed

noahgift deleted the perf/m32d-followup-parallel-moe-expert-dispatch branch May 2, 2026 14:14

noahgift mentioned this pull request May 4, 2026

docs: pre-v0.32.0 — fill [Unreleased] CHANGELOG + repair README drift gate #1448

Merged

2 tasks

noahgift merged 5 commits into main from perf/m32d-followup-parallel-moe-expert-dispatch
