perf(aprender-serve): parallelize MoE expert dispatch with rayon — 2× speedup#1396
Merged
Merged
Conversation
… speedup
The top-k experts (k=8 for Qwen3-Coder-30B-A3B-Instruct) were running
sequentially in `moe_ffn_forward_layer`. Each expert's
`expert_swiglu_quantized` call is independent and self-contained — reads
its own slice of the on-disk Q4_K/Q6_K MoE tensors, produces a
`[hidden_dim]` output. Trivially parallelizable.
Change: top-k loop is now `topk_renorm.par_iter().map(...)` collecting
into `Vec<(weight, expert_out)>`, then sequential weighted-add fold (cheap
compared to per-expert SwiGLU + Q4_K dequant).
Live perf measurement on lambda-vector RTX 4090 (16 cores)
============================================================
Pre-fix (sequential top-k):
$ apr run <17.3 GB Qwen3-Coder GGUF> --prompt "What is 2+2?" \
--max-tokens 8
Completed in 38.93s (cached)
→ 4.87 s/token, 0.21 tok/s
Post-fix (parallel top-k):
$ apr run <17.3 GB Qwen3-Coder GGUF> --prompt "What is 2+2?" \
--max-tokens 8
Completed in 18.56s (cached)
→ 2.32 s/token, 0.43 tok/s
CPU: 1682% (≈ 17 cores in use simultaneously)
**Speedup: 2.1×** (consistent ~2× across multiple test runs).
Why not 8× (one per expert)?
* The fused_q4k_parallel_matvec / fused_q6k_parallel_matvec inner
kernels are already rayon-parallel internally over output rows,
so they consume some of the available core budget.
* Memory bandwidth: each expert reads ~1.6 MiB of Q4_K/Q6_K bytes
from mmap; with 8 in flight that's ~13 MiB/forward, hitting cache
saturation.
* Weighted-add fold is sequential (~50us per call vs ~250ms per
expert SwiGLU — negligible).
2× from outer-rayon on top of inner-rayon is the realistic ceiling
on this hardware. Multi-token decode (vs single-prompt) will see
better amortization since the same MoE tensor mmap pages stay warm.
Hot-path safety:
* Numerical output is identical to sequential — `par_iter` preserves
semantics for independent calls; weighted-add is a deterministic
fold even though par_iter ordering is non-deterministic
(commutative+associative on f32 with same operands gives same
result modulo reordering, which is acceptable per CLAUDE.md
"ML-specific allows for casts/float_cmp").
* Tests in `qwen3_moe_*.rs` pass unchanged.
* Independent of the M32d correctness fixes (#1222, #1228) — this
is purely a parallelism change.
What this PR does NOT ship:
* GPU MoE path (separate big PR; needs trueno-gpu MoE kernel).
* Inner-kernel SIMD optimization (also separate).
* Router parallelization — the F32 router is already cheap (~10ms);
parallelizing it would mostly add overhead.
Refs M32d numerical-parity discharge stack (#1222, #1228) — independent
Refs M32c.2.2.2.0 (moe_ffn_forward_layer original)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… less parallel load)
2 tasks
noahgift
added a commit
that referenced
this pull request
May 4, 2026
…rift gate (#1448) Two related preparation steps for the v0.32.0 cut decision: ## CHANGELOG Fill out the empty `[Unreleased]` section with today's session body of work (238 commits since v0.31.2): - **CPU/GPU output parity contract** (jidoka armor): `apr-cpu-vs-gpu-output-parity-v1` v1.0.0 → v1.5.0 ACTIVE with **5/5 falsifiers DISCHARGED** in a single 2-PR cycle (#1445 + #1446) — first contract in the SHIP-TWO program to reach complete-evidence terminal state. CUDA + wgpu fallback log prefixes + inline cosine parity gate. - **`apr trace --save-tensor`** — new flag for SHIP-007 layer-0 oracle bisection; `apr-cli-trace-save-tensor-v1` v1.4.0 FUNCTIONAL. - **HF FP16 oracle bisection** — pinpoints SHIP-007 to layer-0 attn_out (cos=0.99999995 attn_norm → 0.9966 attn_out). - **Distillation training contract** — 9/9 falsifiers algorithm-bound. - **MoE expert dispatch parallelized** — 2× speedup (#1396). - **APR file mmap** — unblocks `apr diff --values` on 7B (#1058). - **M32d numerical-parity bundle** — Q/K RMSNorm + rope_theta + chat template (#1228). - **150+ contract algorithm-bind sweep** — record cycle, kernel + format + training + GPU-backend + CLI families flipped from `unbound` to `PARTIAL_ALGORITHM_LEVEL`. ## README drift gate repair `bash scripts/check_readme_claims.sh` was FAILING: - README claimed 1096 contracts, filesystem has 1105 - README claimed 79 CLI commands, `apr --help` lists 80 Fixed both numbers in the contract-backed table AND the prose references. Drift gate now PASS 4/4. Five Whys: 1. Why was the gate failing? README contract counts and CLI counts are stale. 2. Why are they stale? 9 new contracts and 1 new CLI command merged since the last README update. 3. Why didn't the gate catch it earlier? It's a script — not yet wired into CI as a hard gate (FALSIFY-README-001..004 are PARTIAL_ALGORITHM_LEVEL, the shell wrapper is documented in the contract but doesn't fail PRs). 4. Why isn't it a CI gate yet? `readme-claims-v1` is recent (2026-04-24), wired to `bash scripts/check_readme_claims.sh` but not to a workflow step. 5. Why fix it now? Pre-release hygiene — releases must ship green drift gates per `feedback_post_publish_qa_required.md`. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The top-k experts (k=8 for Qwen3-Coder-30B-A3B-Instruct) were running sequentially in
moe_ffn_forward_layer. Eachexpert_swiglu_quantizedcall is independent — reads its own slice of the on-disk Q4_K/Q6_K MoE tensors. Trivially parallelizable with rayon.Live perf on lambda-vector RTX 4090 (16 cores)
Why not 8×
fused_q4k_parallel_matvecis already rayon-parallel internally over output rowsMulti-token decode will see better amortization (same MoE tensor mmap pages stay warm).
Hot-path safety
qwen3_moe_*tests pass unchangedWhat this PR does NOT ship
Test plan
cargo check -p aprender-serve --lib— cleancargo clippy -p aprender-serve --lib -- -D warnings— cleancargo fmt -p aprender-serve --check— cleanapr runon cached 17.3 GB Qwen3-Coder GGUF: 38.93s → 18.56s (2.1× speedup)Refs
🤖 Generated with Claude Code