feat: qwen3-moe streaming SSE impl (qwen3-moe-streaming-sse-v1 V1_001+V1_003) by noahgift · Pull Request #1854 · paiml/aprender

noahgift · 2026-05-21T01:10:12Z

Summary

Discharges contracts/qwen3-moe-streaming-sse-v1.yaml v1.0.0 (spec landed in #1835).

Post-M32d (#1832), MoE per-token generation amortizes to ~100ms; this PR codifies that `stream=true` on a qwen3_moe model emits SSE events per-token instead of buffering the full completion.

Falsifiers discharged

V1_001: per-token callback fires, captured tokens equal non-streaming greedy baseline
V1_003: median inter-token gap < 500ms (≥2 tok/s floor, well below M32d's ~5 tok/s)
V1_002 (`stream=false` regression): already covered by `qwen3_moe_serve_dispatch_v1.rs`

Changes

`infer/qwen3_moe_generate.rs`: new `run_qwen3_moe_generate_streaming` (callback variant of `run_qwen3_moe_generate`). Mirrors the non-streaming function step-for-step but invokes `on_token(u32) -> bool` per decoded token. Callback returning `false` short-circuits for client disconnect.
`api/cuda_chat_backend.rs`: in `try_qwen3_moe_backend`, branch on `request.stream` — if true, spin up mpsc channel, run streaming variant on `spawn_blocking`, route through dense path's `true_streaming_sse_response`.
`api/openai_handlers.rs`: promote `true_streaming_sse_response` from `fn` → `pub(crate) fn` so MoE backend can reuse the same SSE framing.
`tests/qwen3_moe_streaming_sse_v1.rs`: env-gated integration tests (`QWEN3_MOE_GGUF_PATH`, `#[ignore]`).

Test plan

`cargo check -p aprender-serve --lib` — clean
`cargo test -p aprender-serve --lib qwen3_moe_generate` — 12/12 pass
Operator-dispatched: `QWEN3_MOE_GGUF_PATH=… cargo test --test qwen3_moe_streaming_sse_v1 -- --ignored --nocapture`

🤖 Generated with Claude Code

…3-moe-streaming-sse-v1) ## Summary Discharges qwen3-moe-streaming-sse-v1.yaml v1.0.0 (landed in #1835). Post-M32d (#1832), MoE per-token generation amortizes to ~100ms; this contract codifies that `stream=true` on a qwen3_moe model emits SSE events per-token instead of buffering the full completion. ## Changes - `infer/qwen3_moe_generate.rs`: add `run_qwen3_moe_generate_streaming` — callback variant of `run_qwen3_moe_generate`. Mirrors the non-streaming function step-for-step, but invokes `on_token(u32) -> bool` after each decoded token (BEFORE the stop check, so the client sees every sampled token). Callback returning `false` short-circuits the loop for client disconnect handling. - `api/cuda_chat_backend.rs`: in `try_qwen3_moe_backend`, branch on `request.stream`. If true, spin up an mpsc channel, run the streaming variant on a `spawn_blocking` worker, and route the channel through the dense path's `true_streaming_sse_response` helper. Non-streaming path unchanged. - `api/openai_handlers.rs`: promote `true_streaming_sse_response` from `fn` to `pub(crate) fn` so the MoE backend can reuse the same SSE framing as the dense path. No behavior change. - `tests/qwen3_moe_streaming_sse_v1.rs`: env-gated integration tests (`QWEN3_MOE_GGUF_PATH`, `#[ignore]`'d) discharging V1_001 + V1_003: * V1_001: streaming callback fires per-token, captured tokens equal the non-streaming greedy baseline. * V1_003: median inter-token gap < 500ms (≥2 tok/s floor, well below M32d's ~5 tok/s). * Bonus: callback returning `false` short-circuits the loop. V1_002 (`stream=false` regression) is covered by `qwen3_moe_serve_dispatch_v1.rs`. ## Why #1832 made KV cache available → per-token gen amortizes to ~100ms. Before this PR, MoE `stream=true` requests on qwen3_moe still went through `run_qwen3_moe_generate` (synchronous) and the client got the full response in a single late SSE event — UX regression vs dense path. Now the client sees the first token within `prefill_wall + 100ms` and subsequent tokens stream at ~M32d throughput. ## Test plan - [x] `cargo check -p aprender-serve --lib` — clean - [x] `cargo test -p aprender-serve --lib qwen3_moe_generate` — 12/12 pass - [ ] Operator-dispatched: `QWEN3_MOE_GGUF_PATH=… cargo test --test qwen3_moe_streaming_sse_v1 -- --ignored --nocapture` (env-gated, requires GGUF) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…3 falsifiers PASS) (#1855) * feat: qwen3-moe streaming SSE — per-token emit when stream=true (qwen3-moe-streaming-sse-v1) ## Summary Discharges qwen3-moe-streaming-sse-v1.yaml v1.0.0 (landed in #1835). Post-M32d (#1832), MoE per-token generation amortizes to ~100ms; this contract codifies that `stream=true` on a qwen3_moe model emits SSE events per-token instead of buffering the full completion. ## Changes - `infer/qwen3_moe_generate.rs`: add `run_qwen3_moe_generate_streaming` — callback variant of `run_qwen3_moe_generate`. Mirrors the non-streaming function step-for-step, but invokes `on_token(u32) -> bool` after each decoded token (BEFORE the stop check, so the client sees every sampled token). Callback returning `false` short-circuits the loop for client disconnect handling. - `api/cuda_chat_backend.rs`: in `try_qwen3_moe_backend`, branch on `request.stream`. If true, spin up an mpsc channel, run the streaming variant on a `spawn_blocking` worker, and route the channel through the dense path's `true_streaming_sse_response` helper. Non-streaming path unchanged. - `api/openai_handlers.rs`: promote `true_streaming_sse_response` from `fn` to `pub(crate) fn` so the MoE backend can reuse the same SSE framing as the dense path. No behavior change. - `tests/qwen3_moe_streaming_sse_v1.rs`: env-gated integration tests (`QWEN3_MOE_GGUF_PATH`, `#[ignore]`'d) discharging V1_001 + V1_003: * V1_001: streaming callback fires per-token, captured tokens equal the non-streaming greedy baseline. * V1_003: median inter-token gap < 500ms (≥2 tok/s floor, well below M32d's ~5 tok/s). * Bonus: callback returning `false` short-circuits the loop. V1_002 (`stream=false` regression) is covered by `qwen3_moe_serve_dispatch_v1.rs`. ## Why #1832 made KV cache available → per-token gen amortizes to ~100ms. Before this PR, MoE `stream=true` requests on qwen3_moe still went through `run_qwen3_moe_generate` (synchronous) and the client got the full response in a single late SSE event — UX regression vs dense path. Now the client sees the first token within `prefill_wall + 100ms` and subsequent tokens stream at ~M32d throughput. ## Test plan - [x] `cargo check -p aprender-serve --lib` — clean - [x] `cargo test -p aprender-serve --lib qwen3_moe_generate` — 12/12 pass - [ ] Operator-dispatched: `QWEN3_MOE_GGUF_PATH=… cargo test --test qwen3_moe_streaming_sse_v1 -- --ignored --nocapture` (env-gated, requires GGUF) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * evidence: qwen3-moe-streaming-sse-v1 DISCHARGED on gx10 Blackwell ## Summary Operator-dispatched verification of `contracts/qwen3-moe-streaming-sse-v1.yaml` v1.0.0 (shipped via #1854). All three falsifiers PASS on Qwen3-Coder-30B-A3B (qwen3moe arch) running on gx10 Blackwell GB10. ## Results | Falsifier | Test | Verdict | |-------------|-------------------------------------------|---------| | V1_001 | v1_001_callback_fires_per_token | PASS | | V1_002 | (regression guard via serve-dispatch test) | GUARD | | V1_003 | v1_003_inter_token_latency_floor | PASS | V1_003 throughput on real 30B-MoE: **median 338 ms inter-token gap** over 32 callbacks (floor 500 ms), distribution p_min=250 ms / p_max=518 ms. ≈ 3 tok/s streamed — comfortably above the 2 tok/s contract floor and consistent with M32d's KV-cache-amortized per-token cost. Plus the negative-path `callback_stop_short_circuits` test confirmed that returning `false` from the per-token callback short-circuits the decode loop (client-disconnect handling). ## Artifacts - `findings.json` — machine-readable discharge record - `gx10-sse-smoke.log` — full cargo test stdout/stderr (549 lines) Both captured from `/home/noah/runs/sse-smoke-20260521-080640/` on gx10. ## Reproducer ```bash QWEN3_MOE_GGUF_PATH=/path/to/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \ cargo test --test qwen3_moe_streaming_sse_v1 \ -p aprender-serve --features cuda --release \ -- --ignored --nocapture ``` Binary commit: 6bff4ce (post-#1854 merge). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * evidence: add gx10 cargo test log (qwen3-moe-streaming-sse-v1 discharge) Force-added (matches `.log` gitignore pattern but this one is a load-bearing discharge artifact, not a temp file). --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 21, 2026 01:10

Merge branch 'main' into feat/qwen3-moe-streaming-sse

0ac93c7

noahgift merged commit 6bff4ce into main May 21, 2026
10 checks passed

noahgift deleted the feat/qwen3-moe-streaming-sse branch May 21, 2026 06:53

noahgift mentioned this pull request May 21, 2026

evidence: qwen3-moe-streaming-sse-v1 DISCHARGED on gx10 Blackwell (3/3 falsifiers PASS) #1855

Merged

3 tasks

noahgift mentioned this pull request May 22, 2026

Qwen2.5-7B Q4_K GPU inference produces gibberish — 'ampiezza' (wgpu) / '<|im_start|>' (cuBLAS) — regression vs #374 / #559 #1864

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: qwen3-moe streaming SSE impl (qwen3-moe-streaming-sse-v1 V1_001+V1_003)#1854

feat: qwen3-moe streaming SSE impl (qwen3-moe-streaming-sse-v1 V1_001+V1_003)#1854
noahgift merged 2 commits into
mainfrom
feat/qwen3-moe-streaming-sse

noahgift commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 21, 2026

Summary

Falsifiers discharged

Changes

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant