evidence: qwen3-moe-streaming-sse-v1 DISCHARGED on gx10 Blackwell (3/3 falsifiers PASS) by noahgift · Pull Request #1855 · paiml/aprender

noahgift · 2026-05-21T08:29:45Z

Summary

Operator-dispatched verification of `contracts/qwen3-moe-streaming-sse-v1.yaml` v1.0.0 (shipped via #1854). All three falsifiers PASS on Qwen3-Coder-30B-A3B (qwen3moe arch) running on gx10 Blackwell GB10.

Results

Falsifier	Test	Verdict
V1_001	`v1_001_callback_fires_per_token`	✅ PASS
V1_002	(regression guard via serve-dispatch test)	GUARD
V1_003	`v1_003_inter_token_latency_floor`	✅ PASS
ancillary	`callback_stop_short_circuits`	✅ PASS

V1_003 throughput on real 30B-MoE: median 338 ms inter-token gap over 32 callbacks (floor 500 ms). ≈ 3 tok/s streamed — comfortably above the 2 tok/s contract floor and consistent with M32d's KV-cache-amortized per-token cost.

Gap distribution: min=250 ms, max=518 ms, p50=338 ms.

Artifacts

`evidence/qwen3-moe-streaming-sse-v1-discharge/findings.json` — machine-readable discharge record
`evidence/qwen3-moe-streaming-sse-v1-discharge/gx10-sse-smoke.log` — full cargo test stdout/stderr (549 lines)

Captured from `/home/noah/runs/sse-smoke-20260521-080640/` on gx10. Binary at commit `6bff4ce2c` (post-#1854 merge).

Reproducer

```bash
QWEN3_MOE_GGUF_PATH=/path/to/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
cargo test --test qwen3_moe_streaming_sse_v1 \
-p aprender-serve --features cuda --release \
-- --ignored --nocapture
```

Test plan

All 3 in-band falsifiers PASS on real 30B-A3B MoE GGUF
Throughput floor (≥2 tok/s) cleared with healthy margin
Negative-path (callback-stop) confirmed

🤖 Generated with Claude Code

…3-moe-streaming-sse-v1) ## Summary Discharges qwen3-moe-streaming-sse-v1.yaml v1.0.0 (landed in #1835). Post-M32d (#1832), MoE per-token generation amortizes to ~100ms; this contract codifies that `stream=true` on a qwen3_moe model emits SSE events per-token instead of buffering the full completion. ## Changes - `infer/qwen3_moe_generate.rs`: add `run_qwen3_moe_generate_streaming` — callback variant of `run_qwen3_moe_generate`. Mirrors the non-streaming function step-for-step, but invokes `on_token(u32) -> bool` after each decoded token (BEFORE the stop check, so the client sees every sampled token). Callback returning `false` short-circuits the loop for client disconnect handling. - `api/cuda_chat_backend.rs`: in `try_qwen3_moe_backend`, branch on `request.stream`. If true, spin up an mpsc channel, run the streaming variant on a `spawn_blocking` worker, and route the channel through the dense path's `true_streaming_sse_response` helper. Non-streaming path unchanged. - `api/openai_handlers.rs`: promote `true_streaming_sse_response` from `fn` to `pub(crate) fn` so the MoE backend can reuse the same SSE framing as the dense path. No behavior change. - `tests/qwen3_moe_streaming_sse_v1.rs`: env-gated integration tests (`QWEN3_MOE_GGUF_PATH`, `#[ignore]`'d) discharging V1_001 + V1_003: * V1_001: streaming callback fires per-token, captured tokens equal the non-streaming greedy baseline. * V1_003: median inter-token gap < 500ms (≥2 tok/s floor, well below M32d's ~5 tok/s). * Bonus: callback returning `false` short-circuits the loop. V1_002 (`stream=false` regression) is covered by `qwen3_moe_serve_dispatch_v1.rs`. ## Why #1832 made KV cache available → per-token gen amortizes to ~100ms. Before this PR, MoE `stream=true` requests on qwen3_moe still went through `run_qwen3_moe_generate` (synchronous) and the client got the full response in a single late SSE event — UX regression vs dense path. Now the client sees the first token within `prefill_wall + 100ms` and subsequent tokens stream at ~M32d throughput. ## Test plan - [x] `cargo check -p aprender-serve --lib` — clean - [x] `cargo test -p aprender-serve --lib qwen3_moe_generate` — 12/12 pass - [ ] Operator-dispatched: `QWEN3_MOE_GGUF_PATH=… cargo test --test qwen3_moe_streaming_sse_v1 -- --ignored --nocapture` (env-gated, requires GGUF) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…streaming-sse

## Summary Operator-dispatched verification of `contracts/qwen3-moe-streaming-sse-v1.yaml` v1.0.0 (shipped via #1854). All three falsifiers PASS on Qwen3-Coder-30B-A3B (qwen3moe arch) running on gx10 Blackwell GB10. ## Results | Falsifier | Test | Verdict | |-------------|-------------------------------------------|---------| | V1_001 | v1_001_callback_fires_per_token | PASS | | V1_002 | (regression guard via serve-dispatch test) | GUARD | | V1_003 | v1_003_inter_token_latency_floor | PASS | V1_003 throughput on real 30B-MoE: **median 338 ms inter-token gap** over 32 callbacks (floor 500 ms), distribution p_min=250 ms / p_max=518 ms. ≈ 3 tok/s streamed — comfortably above the 2 tok/s contract floor and consistent with M32d's KV-cache-amortized per-token cost. Plus the negative-path `callback_stop_short_circuits` test confirmed that returning `false` from the per-token callback short-circuits the decode loop (client-disconnect handling). ## Artifacts - `findings.json` — machine-readable discharge record - `gx10-sse-smoke.log` — full cargo test stdout/stderr (549 lines) Both captured from `/home/noah/runs/sse-smoke-20260521-080640/` on gx10. ## Reproducer ```bash QWEN3_MOE_GGUF_PATH=/path/to/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \ cargo test --test qwen3_moe_streaming_sse_v1 \ -p aprender-serve --features cuda --release \ -- --ignored --nocapture ``` Binary commit: 6bff4ce (post-#1854 merge). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Force-added (matches `.log` gitignore pattern but this one is a load-bearing discharge artifact, not a temp file).

noahgift and others added 4 commits May 21, 2026 03:09

Merge branch 'main' of github.com:paiml/aprender into feat/qwen3-moe-…

2b36b27

…streaming-sse

evidence: add gx10 cargo test log (qwen3-moe-streaming-sse-v1 discharge)

a6dd7ee

Force-added (matches `.log` gitignore pattern but this one is a load-bearing discharge artifact, not a temp file).

noahgift enabled auto-merge (squash) May 21, 2026 08:29

noahgift merged commit c148f5e into main May 21, 2026
11 checks passed

noahgift deleted the evidence/qwen3-moe-streaming-sse-v1-discharge branch May 21, 2026 08:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

evidence: qwen3-moe-streaming-sse-v1 DISCHARGED on gx10 Blackwell (3/3 falsifiers PASS)#1855

evidence: qwen3-moe-streaming-sse-v1 DISCHARGED on gx10 Blackwell (3/3 falsifiers PASS)#1855
noahgift merged 4 commits into
mainfrom
evidence/qwen3-moe-streaming-sse-v1-discharge

noahgift commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 21, 2026

Summary

Results

Artifacts

Reproducer

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant