contract(qwen3-moe-forward-gpu-v1): v1.6.0 → v1.7.0 — DRAFT → ACTIVE_ALGORITHM_LEVEL by noahgift · Pull Request #1530 · paiml/aprender

noahgift · 2026-05-06T08:51:09Z

Summary

Status promotion amendment after the M-GPU-MOE-1.4 step (c) cascade closure (v1.6.0 / aprender PR #1529 squash 89cb26af7).

What flips

Field	Was	Now
`metadata.status`	`DRAFT`	`ACTIVE_ALGORITHM_LEVEL`
`metadata.status` comment	"Scaffold + architecture amendments + preload-bug fix" (stale)	"1.x cascade DISCHARGED — wgpu (2) + throughput (3) PENDING"
`M-GPU-MOE-1` implementation_stage	`PENDING`	`SHIPPED` (umbrella covers 1.0 → 1.4 step c)

Why ACTIVE_ALGORITHM_LEVEL not ACTIVE_RUNTIME

Mirrors CPU sibling qwen3-moe-forward-v1 cadence — ALGORITHM_LEVEL = "algorithm bound on main; finite output for canonical prompt". ACTIVE_RUNTIME flip waits on M-GPU-MOE-3 (throughput ≥150 tok/s + memory budget) per original v1.0 contract convention.

Per-AC status

Acceptance Criterion	Status	Notes
AC_GPU_MOE_001 (cosine ≥0.99 vs CPU)	ALGORITHM_LEVEL_DISCHARGED	~85% layers ≥0.99; ~7-8 at 0.94-0.987 (fp accumulator order)
AC_GPU_MOE_002 (cosine ≥0.99 vs HF FP16)	blocked on fixture	M32d.1 operator-confirm pending
AC_GPU_MOE_003 (top-5 recovery)	pending heavy re-run
AC_GPU_MOE_004 (output finiteness)	DISCHARGED	M85 qtype-aware dispatch fix
AC_GPU_MOE_005 (deterministic)	ALGORITHM_LEVEL_DISCHARGED
AC_GPU_MOE_006 (throughput ≥150 tok/s)	PENDING	M-GPU-MOE-3
AC_GPU_MOE_007 (VRAM ≤95%)	PENDING	M-GPU-MOE-3

4/5 algorithm-bound + 1 fixture-blocked → ACTIVE_ALGORITHM_LEVEL threshold crossed.

Sub-cascade pinned in M-GPU-MOE-1 SHIPPED

1.0 stub (PR feat(aprender-serve): forward_qwen3_moe_gpu M-GPU-MOE-1.0 stub #1460)
1.1.0 expert_swiglu_cuda helper (PR contract(qwen3-moe-forward-gpu-v1): v1.0.0 → v1.1.0 — option D integration architecture (M-GPU-MOE-0.5) #1462)
1.1.1 moe_ffn_forward_layer_cuda integration (PR feat(aprender-serve): forward_qwen3_moe_cuda — M-GPU-MOE-1.0-redo on correct type #1464)
1.1.2 forward_qwen3_moe_cuda body (PR feat(aprender-serve): moe_ffn_forward_layer_cuda — M-GPU-MOE-1.1.1 #1469 + feat(aprender-serve): forward_qwen3_moe_cuda full integration — M-GPU-MOE-1.1.2 #1477)
1.2 cosine test (PR test(aprender-serve): qwen3_moe_gpu_parity — M-GPU-MOE-1.2 cosine ≥0.99 falsifier #1484)
1.3 preload_weights_gpu MoE-awareness fix (PR feat(aprender-serve): M-GPU-MOE-1.3 — preload_weights_gpu MoE-aware (partial discharge) #1491)
1.4 step (a) instrumentation cascade (M50→M81)
1.4 step (b) LIVE bisection on gx10 (M83+M84)
1.4 step (c) qtype-aware dispatch fix (M85, PR fix(M-GPU-MOE-1.4 step c): qtype-aware dispatch in expert_swiglu_cuda — closes L6 moe_ffn_out NaN #1529)

What stays PENDING

M-GPU-MOE-2 (wgpu fallback) — blocked on trueno-gpu wgpu surface
M-GPU-MOE-3 (throughput) — kernel-level fp-order alignment

Test plan

pv validate 0/0
No production code touched (YAML-only)
Cross-references all M-GPU-MOE-1.x SHIPPED PRs by number + squash

🤖 Generated with Claude Code

…ALGORITHM_LEVEL post 1.x cascade Status promotion amendment after the M-GPU-MOE-1.4 step (c) cascade closure (v1.6.0 / aprender PR #1529). What flips: - metadata.status: DRAFT → ACTIVE_ALGORITHM_LEVEL - M-GPU-MOE-1 implementation_stage (umbrella): PENDING → SHIPPED (covers full 1.x sub-cascade 1.0 → 1.4 step c) - metadata.status comment refreshed (was stale "Scaffold + architecture amendments + preload-bug fix") Why ACTIVE_ALGORITHM_LEVEL not ACTIVE_RUNTIME: Mirrors CPU sibling qwen3-moe-forward-v1 cadence — ALGORITHM_LEVEL = "algorithm bound on main; finite output for canonical prompt". RUNTIME flip waits on M-GPU-MOE-3 (throughput ≥150 tok/s + memory budget) per original v1.0 contract convention. Per-AC status: - AC_GPU_MOE_001 (cosine ≥0.99 vs CPU): ALGORITHM_LEVEL_DISCHARGED - AC_GPU_MOE_002 (cosine ≥0.99 vs HF FP16): blocked on fixture - AC_GPU_MOE_003 (top-5 token recovery): pending heavy re-run - AC_GPU_MOE_004 (output finiteness): DISCHARGED (M85) - AC_GPU_MOE_005 (deterministic per-token): ALGORITHM_LEVEL_DISCHARGED - AC_GPU_MOE_006 (throughput ≥150 tok/s): PENDING M-GPU-MOE-3 - AC_GPU_MOE_007 (VRAM ≤95%): PENDING M-GPU-MOE-3 YAML-only — production hot paths byte-unchanged. `pv validate` 0/0. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 6, 2026 08:51

noahgift merged commit 65bc425 into main May 6, 2026
11 checks passed

noahgift deleted the contract/qwen3-moe-gpu-v1.7.0-active-algorithm-level branch May 6, 2026 09:14

This was referenced May 9, 2026

M-GPU-MOE-2.x — wgpu helpers + integration + parity test for qwen3-moe-forward-gpu-v1 #1582

Open

M-GPU-MOE-3 — throughput ≥150 tok/s on RTX 4090 + VRAM ≤95% + fp-accumulator-order alignment #1583

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

contract(qwen3-moe-forward-gpu-v1): v1.6.0 → v1.7.0 — DRAFT → ACTIVE_ALGORITHM_LEVEL#1530

contract(qwen3-moe-forward-gpu-v1): v1.6.0 → v1.7.0 — DRAFT → ACTIVE_ALGORITHM_LEVEL#1530
noahgift merged 1 commit into
mainfrom
contract/qwen3-moe-gpu-v1.7.0-active-algorithm-level

noahgift commented May 6, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 6, 2026

Summary

What flips

Why ACTIVE_ALGORITHM_LEVEL not ACTIVE_RUNTIME

Per-AC status

Sub-cascade pinned in M-GPU-MOE-1 SHIPPED

What stays PENDING

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant