Skip to content

feat: qwen3-moe streaming SSE impl (qwen3-moe-streaming-sse-v1 V1_001+V1_003)#1854

Merged
noahgift merged 2 commits into
mainfrom
feat/qwen3-moe-streaming-sse
May 21, 2026
Merged

feat: qwen3-moe streaming SSE impl (qwen3-moe-streaming-sse-v1 V1_001+V1_003)#1854
noahgift merged 2 commits into
mainfrom
feat/qwen3-moe-streaming-sse

Conversation

@noahgift

Copy link
Copy Markdown
Contributor

Summary

Discharges contracts/qwen3-moe-streaming-sse-v1.yaml v1.0.0 (spec landed in #1835).

Post-M32d (#1832), MoE per-token generation amortizes to ~100ms; this PR codifies that `stream=true` on a qwen3_moe model emits SSE events per-token instead of buffering the full completion.

Falsifiers discharged

  • V1_001: per-token callback fires, captured tokens equal non-streaming greedy baseline
  • V1_003: median inter-token gap < 500ms (≥2 tok/s floor, well below M32d's ~5 tok/s)
  • V1_002 (`stream=false` regression): already covered by `qwen3_moe_serve_dispatch_v1.rs`

Changes

  • `infer/qwen3_moe_generate.rs`: new `run_qwen3_moe_generate_streaming` (callback variant of `run_qwen3_moe_generate`). Mirrors the non-streaming function step-for-step but invokes `on_token(u32) -> bool` per decoded token. Callback returning `false` short-circuits for client disconnect.
  • `api/cuda_chat_backend.rs`: in `try_qwen3_moe_backend`, branch on `request.stream` — if true, spin up mpsc channel, run streaming variant on `spawn_blocking`, route through dense path's `true_streaming_sse_response`.
  • `api/openai_handlers.rs`: promote `true_streaming_sse_response` from `fn` → `pub(crate) fn` so MoE backend can reuse the same SSE framing.
  • `tests/qwen3_moe_streaming_sse_v1.rs`: env-gated integration tests (`QWEN3_MOE_GGUF_PATH`, `#[ignore]`).

Test plan

  • `cargo check -p aprender-serve --lib` — clean
  • `cargo test -p aprender-serve --lib qwen3_moe_generate` — 12/12 pass
  • Operator-dispatched: `QWEN3_MOE_GGUF_PATH=… cargo test --test qwen3_moe_streaming_sse_v1 -- --ignored --nocapture`

🤖 Generated with Claude Code

…3-moe-streaming-sse-v1)

## Summary

Discharges qwen3-moe-streaming-sse-v1.yaml v1.0.0 (landed in #1835).
Post-M32d (#1832), MoE per-token generation amortizes to ~100ms; this
contract codifies that `stream=true` on a qwen3_moe model emits SSE
events per-token instead of buffering the full completion.

## Changes

- `infer/qwen3_moe_generate.rs`: add `run_qwen3_moe_generate_streaming`
  — callback variant of `run_qwen3_moe_generate`. Mirrors the
  non-streaming function step-for-step, but invokes `on_token(u32) -> bool`
  after each decoded token (BEFORE the stop check, so the client sees
  every sampled token). Callback returning `false` short-circuits the
  loop for client disconnect handling.

- `api/cuda_chat_backend.rs`: in `try_qwen3_moe_backend`, branch on
  `request.stream`. If true, spin up an mpsc channel, run the streaming
  variant on a `spawn_blocking` worker, and route the channel through
  the dense path's `true_streaming_sse_response` helper. Non-streaming
  path unchanged.

- `api/openai_handlers.rs`: promote `true_streaming_sse_response` from
  `fn` to `pub(crate) fn` so the MoE backend can reuse the same SSE
  framing as the dense path. No behavior change.

- `tests/qwen3_moe_streaming_sse_v1.rs`: env-gated integration tests
  (`QWEN3_MOE_GGUF_PATH`, `#[ignore]`'d) discharging V1_001 + V1_003:
    * V1_001: streaming callback fires per-token, captured tokens
      equal the non-streaming greedy baseline.
    * V1_003: median inter-token gap < 500ms (≥2 tok/s floor, well
      below M32d's ~5 tok/s).
    * Bonus: callback returning `false` short-circuits the loop.

V1_002 (`stream=false` regression) is covered by
`qwen3_moe_serve_dispatch_v1.rs`.

## Why

#1832 made KV cache available → per-token gen amortizes to ~100ms.
Before this PR, MoE `stream=true` requests on qwen3_moe still went
through `run_qwen3_moe_generate` (synchronous) and the client got the
full response in a single late SSE event — UX regression vs dense path.
Now the client sees the first token within `prefill_wall + 100ms` and
subsequent tokens stream at ~M32d throughput.

## Test plan
- [x] `cargo check -p aprender-serve --lib` — clean
- [x] `cargo test -p aprender-serve --lib qwen3_moe_generate` — 12/12 pass
- [ ] Operator-dispatched: `QWEN3_MOE_GGUF_PATH=… cargo test --test qwen3_moe_streaming_sse_v1 -- --ignored --nocapture` (env-gated, requires GGUF)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 21, 2026 01:10
@noahgift noahgift merged commit 6bff4ce into main May 21, 2026
10 checks passed
@noahgift noahgift deleted the feat/qwen3-moe-streaming-sse branch May 21, 2026 06:53
noahgift added a commit that referenced this pull request May 21, 2026
…3 falsifiers PASS) (#1855)

* feat: qwen3-moe streaming SSE — per-token emit when stream=true (qwen3-moe-streaming-sse-v1)

## Summary

Discharges qwen3-moe-streaming-sse-v1.yaml v1.0.0 (landed in #1835).
Post-M32d (#1832), MoE per-token generation amortizes to ~100ms; this
contract codifies that `stream=true` on a qwen3_moe model emits SSE
events per-token instead of buffering the full completion.

## Changes

- `infer/qwen3_moe_generate.rs`: add `run_qwen3_moe_generate_streaming`
  — callback variant of `run_qwen3_moe_generate`. Mirrors the
  non-streaming function step-for-step, but invokes `on_token(u32) -> bool`
  after each decoded token (BEFORE the stop check, so the client sees
  every sampled token). Callback returning `false` short-circuits the
  loop for client disconnect handling.

- `api/cuda_chat_backend.rs`: in `try_qwen3_moe_backend`, branch on
  `request.stream`. If true, spin up an mpsc channel, run the streaming
  variant on a `spawn_blocking` worker, and route the channel through
  the dense path's `true_streaming_sse_response` helper. Non-streaming
  path unchanged.

- `api/openai_handlers.rs`: promote `true_streaming_sse_response` from
  `fn` to `pub(crate) fn` so the MoE backend can reuse the same SSE
  framing as the dense path. No behavior change.

- `tests/qwen3_moe_streaming_sse_v1.rs`: env-gated integration tests
  (`QWEN3_MOE_GGUF_PATH`, `#[ignore]`'d) discharging V1_001 + V1_003:
    * V1_001: streaming callback fires per-token, captured tokens
      equal the non-streaming greedy baseline.
    * V1_003: median inter-token gap < 500ms (≥2 tok/s floor, well
      below M32d's ~5 tok/s).
    * Bonus: callback returning `false` short-circuits the loop.

V1_002 (`stream=false` regression) is covered by
`qwen3_moe_serve_dispatch_v1.rs`.

## Why

#1832 made KV cache available → per-token gen amortizes to ~100ms.
Before this PR, MoE `stream=true` requests on qwen3_moe still went
through `run_qwen3_moe_generate` (synchronous) and the client got the
full response in a single late SSE event — UX regression vs dense path.
Now the client sees the first token within `prefill_wall + 100ms` and
subsequent tokens stream at ~M32d throughput.

## Test plan
- [x] `cargo check -p aprender-serve --lib` — clean
- [x] `cargo test -p aprender-serve --lib qwen3_moe_generate` — 12/12 pass
- [ ] Operator-dispatched: `QWEN3_MOE_GGUF_PATH=… cargo test --test qwen3_moe_streaming_sse_v1 -- --ignored --nocapture` (env-gated, requires GGUF)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* evidence: qwen3-moe-streaming-sse-v1 DISCHARGED on gx10 Blackwell

## Summary

Operator-dispatched verification of `contracts/qwen3-moe-streaming-sse-v1.yaml`
v1.0.0 (shipped via #1854). All three falsifiers PASS on Qwen3-Coder-30B-A3B
(qwen3moe arch) running on gx10 Blackwell GB10.

## Results

| Falsifier   | Test                                      | Verdict |
|-------------|-------------------------------------------|---------|
| V1_001      | v1_001_callback_fires_per_token            | PASS    |
| V1_002      | (regression guard via serve-dispatch test) | GUARD   |
| V1_003      | v1_003_inter_token_latency_floor           | PASS    |

V1_003 throughput on real 30B-MoE: **median 338 ms inter-token gap** over
32 callbacks (floor 500 ms), distribution p_min=250 ms / p_max=518 ms.
≈ 3 tok/s streamed — comfortably above the 2 tok/s contract floor and
consistent with M32d's KV-cache-amortized per-token cost.

Plus the negative-path `callback_stop_short_circuits` test confirmed
that returning `false` from the per-token callback short-circuits the
decode loop (client-disconnect handling).

## Artifacts

- `findings.json` — machine-readable discharge record
- `gx10-sse-smoke.log` — full cargo test stdout/stderr (549 lines)

Both captured from `/home/noah/runs/sse-smoke-20260521-080640/` on gx10.

## Reproducer

```bash
QWEN3_MOE_GGUF_PATH=/path/to/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  cargo test --test qwen3_moe_streaming_sse_v1 \
    -p aprender-serve --features cuda --release \
    -- --ignored --nocapture
```

Binary commit: 6bff4ce (post-#1854 merge).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* evidence: add gx10 cargo test log (qwen3-moe-streaming-sse-v1 discharge)

Force-added (matches `.log` gitignore pattern but this one is a
load-bearing discharge artifact, not a temp file).

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant