M-GPU-MOE-3 — throughput ≥150 tok/s on RTX 4090 + VRAM ≤95% + fp-accumulator-order alignment #1583

@noahgift

Context

Final stage of qwen3-moe-forward-gpu-v1 v1.7.0 cascade. After M85 qtype-aware dispatch fix (#1529 squash 89cb26af7) closed the L6 moe_ffn_out NaN root cause, M-GPU-MOE-1.x reached ACTIVE_ALGORITHM_LEVEL (M86 #1530 squash 65bc42577). The remaining work to flip ACTIVE_ALGORITHM_LEVEL → ACTIVE_RUNTIME is M-GPU-MOE-3.

Two-part scope

Part 1: fp-accumulator-order alignment

Post-M85 cosine similarity: ~85% of layers reach cos > 0.99 between CPU forward_qwen3_moe (LAZY-FUSED-MATVEC) and GPU forward_qwen3_moe_cuda. The remaining ~7-8 layers (including L7, L9, L12, L20, L23, L29, L46) sit at cos 0.94-0.987.

Cause: floating-point accumulation-order mismatch between:

  • CPU fused_q6k_parallel_matvec (Rust SIMD via rayon, deterministic per-thread reduction)
  • GPU q6k_gemv (CUDA warp-shuffle reduction)

Both decode the same Q6_K bytes correctly, but f32 addition is non-associative, so summing the same products in a different order gives slightly different results. The fix is kernel-level reduction-order alignment, not an algorithmic change.
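To make the non-associativity point concrete, here is a minimal Rust sketch. It does not reproduce fused_q6k_parallel_matvec or q6k_gemv; the names dot_sequential and dot_tree are hypothetical, and the tree variant only mimics the shape of a halving-stride (warp-shuffle style) reduction on plain f32 slices.

```rust
/// Left-to-right accumulation, as one sequential lane would do it.
fn dot_sequential(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).fold(0.0_f32, |acc, (&x, &y)| acc + x * y)
}

/// Halving-stride (tree) accumulation over the same products.
/// Hypothetical stand-in for the shape of a warp-shuffle reduction.
fn dot_tree(a: &[f32], b: &[f32]) -> f32 {
    // Compute the per-element products first.
    let mut buf: Vec<f32> = a.iter().zip(b).map(|(&x, &y)| x * y).collect();
    // Pad with zeros to a power of two so every step pairs cleanly.
    let n = buf.len().next_power_of_two();
    buf.resize(n, 0.0);
    let mut stride = n / 2;
    while stride > 0 {
        for i in 0..stride {
            buf[i] += buf[i + stride];
        }
        stride /= 2;
    }
    buf[0]
}
```

With a = [1e8, 1.0, -1e8, 1.0] and b = [1.0; 4], the sequential order absorbs the first 1.0 into 1e8 and returns 1.0, while the tree order cancels the large terms first and returns 2.0. Real layer outputs differ far less dramatically, but the mechanism is the same.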

Part 2: throughput + memory target

  • ≥150 tok/s on RTX 4090 (≥5× CPU baseline of ~30 tok/s; allows headroom below dense Q4_K target of ~440 tok/s since MoE has expert-dispatch overhead)
  • VRAM ≤95% on Qwen3-Coder-30B-A3B-Instruct-Q4_K_M (cached 17.3 GB GGUF)
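The targets above translate into a per-token latency budget and a VRAM headroom figure. The sketch below works through that arithmetic. It assumes a 24 GB card (the RTX 4090's nominal VRAM, not stated in this issue) and treats the 17.3 GB GGUF file size as the resident weight footprint, which is only an approximation. The function names are hypothetical.

```rust
/// Figures taken from the targets above.
const TARGET_TOK_PER_S: f64 = 150.0;
const CPU_BASELINE_TOK_PER_S: f64 = 30.0;
const GGUF_SIZE_GB: f64 = 17.3;
const VRAM_CAP_FRACTION: f64 = 0.95;
/// Assumption: nominal RTX 4090 VRAM; not stated in the issue.
const ASSUMED_TOTAL_VRAM_GB: f64 = 24.0;

/// Per-token latency budget in milliseconds for a decode throughput.
fn ms_per_token(tok_per_s: f64) -> f64 {
    1000.0 / tok_per_s
}

/// VRAM left under the utilization cap once the weights are resident.
/// Everything else (KV cache, activations, workspace) must fit in this.
fn vram_headroom_gb(total_gb: f64, cap_fraction: f64, weights_gb: f64) -> f64 {
    total_gb * cap_fraction - weights_gb
}

/// Speedup of the target over the CPU baseline.
fn target_speedup() -> f64 {
    TARGET_TOK_PER_S / CPU_BASELINE_TOK_PER_S
}
```

Under these assumptions, 150 tok/s leaves roughly 6.7 ms per token, and a 95% cap on 24 GB leaves about 5.5 GB beyond the weights.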

Acceptance

  • All 48 layers cos ≥ 0.99 between CPU and GPU forward
  • apr run throughput ≥150 tok/s on the cached 17.3 GB Qwen3-Coder GGUF (RTX 4090, Ada sm_89)
  • VRAM steady-state ≤95% utilization
  • qwen3-moe-forward-gpu-v1 bumped v1.7.0 → v1.8.0, with status flipped ACTIVE_ALGORITHM_LEVEL → ACTIVE_RUNTIME
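The per-layer cosine criterion could be checked along the following lines. This is a sketch, not the project's actual harness: cosine and failing_layers are hypothetical helpers, and it assumes the CPU and GPU forward passes can each dump one f32 activation vector per layer.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
/// Returns None for mismatched lengths, empty input, or a zero norm.
fn cosine(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut na, mut nb) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na.sqrt() * nb.sqrt()))
}

/// Indices of layers whose CPU/GPU cosine is below `threshold`.
/// Undefined or NaN cosines count as failures rather than passes.
fn failing_layers(cpu: &[Vec<f32>], gpu: &[Vec<f32>], threshold: f64) -> Vec<usize> {
    cpu.iter()
        .zip(gpu)
        .enumerate()
        .filter(|(_, (c, g))| cosine(c, g).map_or(true, |v| !(v >= threshold)))
        .map(|(i, _)| i)
        .collect()
}
```

Calling failing_layers(&cpu, &gpu, 0.99) on 48 per-layer dumps would return an empty list when the first acceptance item holds. Writing the comparison as !(v >= threshold) makes a NaN layer, like the pre-M85 moe_ffn_out case, show up as a failure instead of slipping through.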

Cross-refs
