spec: qwen3-moe-streaming-sse-v1 — follow-up contract to M32d KV cache #1835

Merged
noahgift merged 8 commits into main from spec/qwen3-moe-streaming-sse-v1 on May 20, 2026

Conversation

@noahgift

Summary

Registers a new provable contract `qwen3-moe-streaming-sse-v1` for per-token SSE streaming on the qwen3_moe chat-completions path. Natural follow-up to #1832 (M32d KV cache).

Why now

Pre-M32d, streaming SSE was meaningless on qwen3_moe — full-prefill-per-token mode at ~0.5 tok/s meant the client would see no output for ~30 minutes before all 256 tokens arrived at once. Post-M32d at 9.62 tok/s sustained, per-token emits become valuable for chat UX.

Contract gates

  • V1_001: chat-completions with `stream=true` emits per-token SSE events (not buffered)
  • V1_002: `stream=false` still returns a single JSON response (regression check)
  • V1_003: streaming throughput ≥ 2 tok/s (median inter-event time ≤ 500 ms)

Implementation phases (engineer playbook)

  • Phase 1 (~2hr): callback variant of `run_qwen3_moe_generate` accepting `&mut dyn FnMut(u32) -> bool`
  • Phase 2 (~4hr): wire into `try_qwen3_moe_backend` — `tokio::task::spawn_blocking` + mpsc channel + axum SSE stream
  • Phase 3 (~2hr): cargo integration test (env-gated, `#[ignore]` by default; mirrors `qwen3_moe_serve_dispatch_v1.rs`)

Total ~6-8 hours; operator-actionable once #1832 merges.
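
A minimal sketch of how Phases 1 and 2 could fit together, using only the standard library. `generate_with_callback` and `stream_tokens` are hypothetical names, the fixed token list stands in for real MoE decoding, and `std::thread` plus `std::sync::mpsc` stand in for `tokio::task::spawn_blocking` and an async channel. Only the callback shape `&mut dyn FnMut(u32) -> bool` is taken from the playbook.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for the callback variant of `run_qwen3_moe_generate`.
/// The real function would run MoE inference; this one replays a fixed token list.
fn generate_with_callback(
    tokens_to_emit: &[u32],
    on_token: &mut dyn FnMut(u32) -> bool,
) -> Vec<u32> {
    let mut emitted = Vec::new();
    for &tok in tokens_to_emit {
        emitted.push(tok);
        // A `false` return means the consumer is gone, so stop decoding.
        if !on_token(tok) {
            break;
        }
    }
    emitted
}

/// Runs generation on a worker thread and forwards each token over a channel.
/// Assumption: the thread and std channel play the role of `spawn_blocking`
/// and the mpsc channel named in Phase 2.
fn stream_tokens(tokens_to_emit: Vec<u32>) -> mpsc::Receiver<u32> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // `send` fails once the receiver is dropped, which maps to "stop".
        let mut on_token = |tok: u32| tx.send(tok).is_ok();
        let _ = generate_with_callback(&tokens_to_emit, &mut on_token);
    });
    rx
}
```

Draining the receiver, for example with `stream_tokens(tokens).iter()`, yields tokens as the worker produces them; in the real handler each received token would become one SSE event.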

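NOT in scope

  • MoE inference correctness (covered by `qwen3-moe-serve-dispatch-v1`)
  • KV cache mechanics (M32d / #1832)
  • Streaming for dense models (already exists via `OwnedQuantizedModelCachedSync`)
  • Tool-call streaming (separate contract)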

Test plan

  • Pure contract YAML; no code touched
  • CI: `pv validate` (if wired)

🤖 Generated with Claude Code

noahgift and others added 2 commits May 20, 2026 07:52
Registers a new provable contract for per-token SSE streaming on the
qwen3_moe chat-completions path. This is the natural follow-up to
#1832 (M32d KV cache) — pre-M32d, streaming was
meaningless because full-prefill-per-token mode took ~30 minutes per
256-token completion. Post-M32d at 9.62 tok/s sustained, per-token
SSE emits become valuable for chat UX.

## Falsification gates

- V1_001: chat-completions with stream=true emits per-token SSE events
  (not buffered into pregenerated SSE)
- V1_002: stream=false still returns a single JSON response (regression)
- V1_003: streaming throughput ≥ 2 tok/s (median inter-event time ≤ 500 ms)

## Implementation phases (engineer playbook)

- Phase 1 (~2hr): callback variant of run_qwen3_moe_generate
- Phase 2 (~4hr): wire into try_qwen3_moe_backend in cuda_chat_backend.rs
- Phase 3 (~2hr): cargo integration test

Total ~6-8 hours, operator-actionable once #1832 merges.

NOT in scope:
- MoE inference correctness (covered by qwen3-moe-serve-dispatch-v1)
- KV cache mechanics (M32d / #1832)
- Streaming for dense models (already exists via OwnedQuantizedModelCachedSync)
- Tool-call streaming (separate contract)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 20, 2026
…2d (#1837)

Registers a new provable contract for temperature + top_k + top_p
sampling on the qwen3_moe inference path. Sibling to
qwen3-moe-streaming-sse-v1 (#1835); independent follow-up to
M32d (#1832).

## Motivation

`run_qwen3_moe_generate` currently does unconditional greedy argmax.
QuantizedGenerateConfig.temperature/top_k/top_p are silently ignored.
HTTP chat requests with non-zero temperature get the SAME output every
time — unacceptable for production chat.

## Why now

The V1_004 bench (paiml/claude-code-parity-apr Phase 6 against
Qwen3-Coder-30B-A3B) is hitting 900s per-turn timeout because the
30B-MoE is verbose. Temperature scaling (e.g. 0.3) could concentrate
probability mass on high-confidence tokens and reduce rambling.
Greedy-only forces ONE point on the spectrum.

## Falsification gates

- V1_001: greedy-fallback (temperature=0 OR top_k=1) → deterministic
- V1_002: temperature>0 + fixed seed → deterministic
- V1_003: temperature>0 + different seeds → different outputs
- V1_004: top_k=1 with high temperature → still equivalent to greedy

## Implementation phases (engineer playbook)

- Phase 1 (~2hr): lift dense-path sampling block into reusable helper
- Phase 2 (~1hr): wire into run_qwen3_moe_generate decode loop
- Phase 3 (~2-3hr): cargo test battery + optional companion sub-bench

Total ~5-6 hours; operator-actionable any time. Independent of
qwen3-moe-streaming-sse-v1.

NOT in scope:
- Repetition penalty (separate qwen3-moe-repetition-penalty-v1 contract)
- Mirostat / logit bias / streaming (separate concerns)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift enabled auto-merge (squash) May 20, 2026 09:18
noahgift added a commit that referenced this pull request May 20, 2026
…ischarged; supersedes #1843) (#1844)

* spec: qwen3-moe-repetition-penalty-v1 — third 3-knob follow-up to M32d

Registers a new provable contract for repetition penalty
(`repeat_penalty` + `repeat_last_n`) on the qwen3_moe inference path.
Completes the 3-knob toolkit alongside:

- qwen3-moe-sampling-v1 (#1837 contract + #1842 impl) — temperature/top_k/top_p
- qwen3-moe-streaming-sse-v1 (#1835) — per-token SSE streaming
- qwen3-moe-repetition-penalty-v1 (this PR) — repeat_penalty / repeat_last_n

## Motivation

`QuantizedGenerateConfig` has `repeat_penalty` (f32) + `repeat_last_n`
(usize) fields. The dense path's `sample_advanced` applies them. The
qwen3_moe path's new `sample_from_logits` (added by #1842) does NOT.

Empirical observation from M287 evidence: Qwen3-Coder-30B-A3B is
generating REPEATED restatements in its turn-1 output (same Rust
snippet 3× in fixture leetcode__01-two-sum). Repetition penalty
(typically 1.1-1.3) would down-weight recently-generated tokens,
breaking the textual loop and forcing the model to either commit to
a tool call or change tactics.

## Falsification gates

- V1_001: `repeat_penalty == 1.0` is a no-op (backwards compat)
- V1_002: `repeat_penalty > 1.0` down-weights repeated tokens
- V1_003: `repeat_last_n` bounds the window correctly
- V1_004: companion-side bench with penalty produces a measurably
  different outcome distribution than the greedy baseline

## Implementation phases (engineer playbook)

- Phase 1 (~1hr): extend `sample_from_logits` signature with
  `recent_tokens: &[u32]` parameter; apply penalty as Step 1
- Phase 2 (~30min): plumb `&tokens` through decode loop
- Phase 3 (~2hr): unit tests + companion-side bench env-var plumbing

Total ~3-4 hours; operator-actionable any time post-M32d-merge.

## NOT in scope

- Mirostat / DRY / other penalty schemes
- Per-token logit biases
- Dynamic per-position penalty
- Companion-side bench env-var plumbing (separate companion PR;
  this contract is aprender-side only)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat: qwen3-moe-repetition-penalty-v1 implementation (V1_001-V1_003 discharged)

Implements the qwen3-moe-repetition-penalty-v1 contract. Bumps the
contract v1.0.0 → v1.1.0 with status_history. Supersedes #1843
(which was contract-only).

## Implementation

`crates/aprender-serve/src/infer/qwen3_moe_generate.rs`:

- `sample_from_logits` signature extended with `recent_tokens: &[u32]`
  parameter
- Repetition penalty applied as **Step 1** (BEFORE temperature scaling)
- Mirrors Candle's `apply_repeat_penalty` semantics (PMAT-383/384,
  dense path's `sample_advanced` in `gguf/inference/fails.rs:100`):
  - For positive logits: `logits[idx] /= repeat_penalty`
  - For negative logits: `logits[idx] *= repeat_penalty`
- Window: last `repeat_last_n` tokens from `recent_tokens`
- No-op when `repeat_penalty == 1.0` OR `repeat_last_n == 0`
- Decode loop passes `&tokens` slice (cheap borrow; no allocation per token)
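
The penalty rule above can be sketched as a standalone function. This is an illustration under stated assumptions, not the code in `qwen3_moe_generate.rs`: `apply_repeat_penalty` is a hypothetical free function, zero logits are treated as positive, and each distinct token in the window is penalized once.

```rust
use std::collections::HashSet;

/// Applies a repetition penalty in place to the logits of recently seen tokens.
/// Hypothetical helper; the divide/multiply rule follows the description above.
fn apply_repeat_penalty(
    logits: &mut [f32],
    recent_tokens: &[u32],
    repeat_penalty: f32,
    repeat_last_n: usize,
) {
    // No-op cases listed in the implementation notes.
    if repeat_penalty == 1.0 || repeat_last_n == 0 {
        return;
    }
    // Window: the last `repeat_last_n` tokens (or fewer if the history is short).
    let start = recent_tokens.len().saturating_sub(repeat_last_n);
    // Assumption: a token repeated inside the window is penalized only once.
    let mut seen = HashSet::new();
    for &tok in &recent_tokens[start..] {
        if !seen.insert(tok) {
            continue;
        }
        // Token ids outside the vocabulary are ignored rather than panicking.
        if let Some(logit) = logits.get_mut(tok as usize) {
            if *logit >= 0.0 {
                *logit /= repeat_penalty; // positive logit: shrink toward zero
            } else {
                *logit *= repeat_penalty; // negative logit: push further down
            }
        }
    }
}
```

Both branches lower the token's score when `repeat_penalty > 1.0`, which is why the sign check is needed: dividing a negative logit would raise it instead.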

## Test results

12/12 tests pass in `cargo test sample_from_logits_tests`:

- 4 original sampling tests (V1_001 greedy fallback, V1_002 seeded RNG,
  V1_003 seed divergence, V1_004 top_k=1 forces greedy)
- 5 NEW rep-penalty tests:
  - V1_001a: repeat_penalty=1.0 no-op
  - V1_001b: repeat_last_n=0 no-op
  - V1_002a: positive-logit branch (division)
  - V1_002b: negative-logit branch (multiplication; Candle convention)
  - V1_003: repeat_last_n window bounds (n=0/n=2/n=8 different effects)
- 3 edge cases (empty logits error, top_p=1 no-op, signature
  compatibility)

## V1_004 (operator-coordinated bench discharge)

Operator can dispatch companion-side bench with
APR_AGENT_REPEAT_PENALTY=1.2 + APR_AGENT_REPEAT_LAST_N=64 once
companion-side env-var plumbing (Phase 3 of #1843) lands. Bench
script + analyzer already pass through `gen_config` to
`run_qwen3_moe_generate` — this PR makes that pass-through actually
apply repetition penalty.

## Stacked on

#1842 (qwen3-moe-sampling-v1 implementation; the
`sample_from_logits` helper this PR extends).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 20, 2026
…(all 4 falsifiers discharged) (#1842)

* test(distill): fixture-driven integration tests for ShardBatchSource (F-DISTILL-SHARD-BATCH-001/002)

Closes the cross-component contract gap that the Blackwell cascade
post-mortem (lesson #2) identified: cache machinery is silent on
divergences between producer and consumer, until live dispatch
surfaces the failure. Same risk class for ShardBatchSource: its
wrap-around / cursor / chunk semantics need fixture-driven verification.

Adds two tests gated on `shard-batch-source` feature:

  F-DISTILL-SHARD-BATCH-001 — happy path
    Writes a tiny .bin shard with [0, 1, ..., 4095] tokens, opens via
    ShardBatchSource::from_dir, asserts:
      - batch shape (4 rows × 16 tokens)
      - all returned tokens lie in [0, 4096) (fixture range)
      - labels in same range
    Catches: any cursor-off-by-one or layout swap that produces
    garbage outside the fixture range.

  F-DISTILL-SHARD-BATCH-002 — wrap-around
    Writes only 128 tokens (enough for ~1.88 batches at bs=4, seq=16),
    consumes 5 batches in a row. Asserts no error — wrap_around=true is
    the default for ShardBatchSource. Catches: regression where the
    iterator returns None on exhaustion despite the constructor
    setting wrap_around.

Test plan:
- [x] 63 distill lib tests pass (was 61; 2 new)
- [x] `cargo test --features shard-batch-source` clean

These two tests would have caught most ShardBatchSource bugs at PR-time
instead of at gx10-dispatch-time, where each failure costs 5-15min.
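
To make the wrap-around case concrete, here is a simplified model of a wrapping token cursor. `WrappingCursor` is hypothetical and is not the `ShardBatchSource` API. It assumes each row consumes `seq + 1` tokens (inputs plus one shifted label), which is one reading of the "~1.88 batches" figure: 128 / (4 × 17) ≈ 1.88.

```rust
/// Hypothetical, simplified model of a wrap-around token cursor.
struct WrappingCursor {
    tokens: Vec<u32>,
    pos: usize,
    wrap_around: bool,
}

impl WrappingCursor {
    /// Returns `rows` rows of `seq + 1` tokens each, or `None` when the
    /// shard is exhausted and wrapping is disabled.
    fn next_batch(&mut self, rows: usize, seq: usize) -> Option<Vec<Vec<u32>>> {
        let row_len = seq + 1;
        // A shard shorter than one row can never produce a batch.
        if self.tokens.len() < row_len {
            return None;
        }
        let mut batch = Vec::with_capacity(rows);
        for _ in 0..rows {
            if self.pos + row_len > self.tokens.len() {
                if !self.wrap_around {
                    return None;
                }
                // Restart from the beginning instead of reporting exhaustion.
                self.pos = 0;
            }
            batch.push(self.tokens[self.pos..self.pos + row_len].to_vec());
            self.pos += row_len;
        }
        Some(batch)
    }
}
```

With 128 tokens, `bs=4` and `seq=16`, this model serves five consecutive batches when `wrap_around` is true and runs out during the second batch when it is false, which is the regression the second test is meant to catch.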

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat: qwen3-moe-sampling-v1 implementation — temperature/top_k/top_p

Implements the qwen3-moe-sampling-v1 contract (#1837;
v1.0.0 → v1.1.0). Discharges ALL 4 falsifiers via in-session work +
empirical validation against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.

## Empirical results

4-test battery (19.28s total wall):

- V1_001 (greedy determinism): 3 runs returned identical
  [9707, 198, 40, 614, 264, 3405] ("I have a question")
- V1_002 (seeded RNG reproducibility): 2 runs seed=42 temp=0.7
  returned identical [9707, 198, 40, 1079, 264, 220]
- V1_003 (seed divergence): seed=42 vs seed=43 produced
  totally different generated tokens
- V1_004 (top_k=1 forces greedy regardless of temperature):
  byte-identical to pure greedy output

## Implementation

`crates/aprender-serve/src/infer/qwen3_moe_generate.rs`:
- New private `sample_from_logits(logits, config, rng) -> Result<u32>`
- Pipeline mirrors the dense `Self::sample_advanced` (in
  `gguf/inference/fails.rs:100`): temperature scale → top_k filter
  → top_p filter → multinomial draw
- Greedy fallback when `temperature == 0` OR `top_k == 1`
- Uses `rand::rngs::StdRng` (ChaCha12; seedable from u64) for
  reproducibility. NOT `rand::thread_rng()` like the dense path
  (intentional — V1_002 requires deterministic re-runs from same seed)
- Decode loop seeds the RNG from `QuantizedGenerateConfig.seed`
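
The pipeline above (temperature scale, top_k filter, top_p filter, multinomial draw, with a greedy fallback) can be sketched without external crates. Several things here are assumptions for illustration: the parameters are passed individually rather than through `QuantizedGenerateConfig`, the return type is `Option<u32>` rather than `Result<u32>`, `top_k == 0` means "disabled", and `SplitMix64` is a tiny seedable generator standing in for `rand::rngs::StdRng`.

```rust
/// Tiny seedable generator (SplitMix64), standing in for `rand::rngs::StdRng`.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f32 in [0, 1), built from the top 24 bits.
    fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u64 << 24) as f32
    }
}

/// Index of the largest logit (first one wins on ties); `None` if empty.
fn argmax(logits: &[f32]) -> Option<u32> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &l) in logits.iter().enumerate() {
        if best.map_or(true, |(_, b)| l > b) {
            best = Some((i, l));
        }
    }
    best.map(|(i, _)| i as u32)
}

/// Hypothetical sampler following the pipeline described above.
fn sample_from_logits(
    logits: &[f32],
    temperature: f32,
    top_k: usize,
    top_p: f32,
    rng: &mut SplitMix64,
) -> Option<u32> {
    if logits.is_empty() {
        return None;
    }
    // Greedy fallback: the RNG is never consulted on this path.
    if temperature == 0.0 || top_k == 1 {
        return argmax(logits);
    }
    // Temperature scale, then sort candidates by descending logit.
    let mut cands: Vec<(u32, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| (i as u32, l / temperature))
        .collect();
    cands.sort_by(|a, b| b.1.total_cmp(&a.1));
    // top_k filter (assumption: 0 disables it).
    if top_k > 0 {
        cands.truncate(top_k);
    }
    // Softmax over the surviving candidates.
    let max = cands[0].1;
    let mut probs: Vec<f32> = cands.iter().map(|c| (c.1 - max).exp()).collect();
    let total: f32 = probs.iter().sum();
    for p in probs.iter_mut() {
        *p /= total;
    }
    // top_p filter: keep the shortest prefix whose mass reaches `top_p`.
    let mut keep = probs.len();
    let mut cumulative = 0.0_f32;
    for (i, p) in probs.iter().enumerate() {
        cumulative += *p;
        if cumulative >= top_p {
            keep = i + 1;
            break;
        }
    }
    // Multinomial draw over the kept prefix.
    let kept_mass: f32 = probs[..keep].iter().sum();
    let mut r = rng.next_f32() * kept_mass;
    for (cand, p) in cands.iter().zip(&probs).take(keep) {
        if r < *p {
            return Some(cand.0);
        }
        r -= *p;
    }
    Some(cands[keep - 1].0)
}
```

Because all randomness comes from the generator passed in, two runs seeded identically make the same sequence of draws, which is the property V1_002 relies on.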

New test: `crates/aprender-serve/tests/qwen3_moe_sampling_v1.rs` —
4 falsifier tests, env-gated on QWEN3_MOE_GGUF_PATH (mirrors the
existing qwen3_moe_serve_dispatch_v1 + moe_kv_cache_equivalence tests).

## NOT in scope

- Repetition penalty (separate contract qwen3-moe-repetition-penalty-v1;
  still pending operator authorization)
- Mirostat / logit bias / streaming (separate concerns)

## Companion-side downstream

The V1_004 bench (paiml/claude-code-parity-apr Phase 6) is hitting
900s per-turn timeout because the 30B-MoE is verbose under greedy
decoding. With sampling shipped, operator can dispatch a follow-up
bench with `temperature=0.3` (or similar) which may concentrate
probability mass on action tokens and reduce rambling. The bench
script + analyzer already pass temperature through `gen_config` to
`run_qwen3_moe_generate` — this PR makes that pass-through actually
take effect.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(qwen3-moe-sampling-v1): add CI-runnable unit tests (no GGUF needed)

Complements the env-gated integration tests in
`crates/aprender-serve/tests/qwen3_moe_sampling_v1.rs` (which require
a real Qwen3-MoE GGUF) with pure-Rust unit tests against synthetic
logits arrays. These run unconditionally in CI and pin all 4
falsifiers as runnable gates without requiring `QWEN3_MOE_GGUF_PATH`.

7 tests total in `sample_from_logits_tests`:

- v1_001_temperature_zero_is_argmax_deterministic — greedy fallback
  via temperature == 0
- v1_001_top_k_one_is_argmax_deterministic — greedy fallback via
  top_k == 1 (independent path)
- v1_002_seeded_rng_is_reproducible — seed=42 produces same token
  across 5 invocations
- v1_003_different_seeds_diverge — 32 seeds produce ≥ 3 distinct
  tokens (statistical bound)
- v1_004_top_k_one_equals_pure_greedy — top_k=1 with high temp
  byte-identical to pure greedy
- empty_logits_returns_error — edge case, no panic
- top_p_one_is_no_op — edge case, top_p=1.0 equivalent to top_p
  sentinel-off path

Empirical:
```
cargo test -p aprender-serve --lib sample_from_logits_tests --features cuda
test result: ok. 7 passed; 0 failed; 0 ignored; 0 measured;
16890 filtered out; finished in 0.00s
```

CI gates the sampling invariants at every PR going forward.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat: qwen3-moe-repetition-penalty-v1 implementation (V1_001-V1_003 discharged; supersedes #1843) (#1844)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift merged commit 9ee74c5 into main May 20, 2026
10 checks passed
noahgift deleted the spec/qwen3-moe-streaming-sse-v1 branch May 20, 2026 11:35
noahgift added a commit that referenced this pull request May 21, 2026
…3-moe-streaming-sse-v1) (#1854)

## Summary

Discharges qwen3-moe-streaming-sse-v1.yaml v1.0.0 (landed in #1835).
Post-M32d (#1832), MoE per-token generation amortizes to ~100ms; this
contract codifies that `stream=true` on a qwen3_moe model emits SSE
events per-token instead of buffering the full completion.

## Changes

- `infer/qwen3_moe_generate.rs`: add `run_qwen3_moe_generate_streaming`
  — callback variant of `run_qwen3_moe_generate`. Mirrors the
  non-streaming function step-for-step, but invokes `on_token(u32) -> bool`
  after each decoded token (BEFORE the stop check, so the client sees
  every sampled token). Callback returning `false` short-circuits the
  loop for client disconnect handling.

- `api/cuda_chat_backend.rs`: in `try_qwen3_moe_backend`, branch on
  `request.stream`. If true, spin up an mpsc channel, run the streaming
  variant on a `spawn_blocking` worker, and route the channel through
  the dense path's `true_streaming_sse_response` helper. Non-streaming
  path unchanged.

- `api/openai_handlers.rs`: promote `true_streaming_sse_response` from
  `fn` to `pub(crate) fn` so the MoE backend can reuse the same SSE
  framing as the dense path. No behavior change.

- `tests/qwen3_moe_streaming_sse_v1.rs`: env-gated integration tests
  (`QWEN3_MOE_GGUF_PATH`, `#[ignore]`'d) discharging V1_001 + V1_003:
    * V1_001: streaming callback fires per-token, captured tokens
      equal the non-streaming greedy baseline.
    * V1_003: median inter-token gap < 500ms (≥2 tok/s floor, well
      below M32d's ~5 tok/s).
    * Bonus: callback returning `false` short-circuits the loop.

V1_002 (`stream=false` regression) is covered by
`qwen3_moe_serve_dispatch_v1.rs`.
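
The ordering described for the streaming variant (callback first, stop check second) can be illustrated with a small skeleton. Everything here is a sketch under assumptions: `decode_streaming` and `sse_event` are hypothetical names, `next_token` stands in for one forward pass plus sampling, and the SSE payload is a plain single-line string rather than whatever JSON chunk `true_streaming_sse_response` produces.

```rust
/// Hypothetical decode-loop skeleton showing the emit-then-stop ordering.
/// Returns every token that was handed to the callback.
fn decode_streaming(
    mut next_token: impl FnMut() -> u32,
    stop_token: u32,
    max_tokens: usize,
    on_token: &mut dyn FnMut(u32) -> bool,
) -> Vec<u32> {
    let mut emitted = Vec::new();
    for _ in 0..max_tokens {
        let tok = next_token();
        emitted.push(tok);
        // Emit BEFORE the stop check, so the stop token also reaches the client.
        if !on_token(tok) {
            break; // consumer asked to stop, e.g. the client disconnected
        }
        if tok == stop_token {
            break;
        }
    }
    emitted
}

/// Frames a single-line payload as one SSE event: a `data:` line followed
/// by a blank line. Multi-line payloads would need one `data:` line each.
fn sse_event(payload: &str) -> String {
    format!("data: {}\n\n", payload)
}
```

Putting the callback ahead of the stop check means the consumer decides what to do with a stop token, rather than the loop silently swallowing it.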

## Why

#1832 made KV cache available → per-token gen amortizes to ~100ms.
Before this PR, MoE `stream=true` requests on qwen3_moe still went
through `run_qwen3_moe_generate` (synchronous) and the client got the
full response in a single late SSE event — UX regression vs dense path.
Now the client sees the first token within `prefill_wall + 100ms` and
subsequent tokens stream at ~M32d throughput.

## Test plan
- [x] `cargo check -p aprender-serve --lib` — clean
- [x] `cargo test -p aprender-serve --lib qwen3_moe_generate` — 12/12 pass
- [ ] Operator-dispatched: `QWEN3_MOE_GGUF_PATH=… cargo test --test qwen3_moe_streaming_sse_v1 -- --ignored --nocapture` (env-gated, requires GGUF)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 21, 2026
…3 falsifiers PASS) (#1855)

* feat: qwen3-moe streaming SSE — per-token emit when stream=true (qwen3-moe-streaming-sse-v1)

* evidence: qwen3-moe-streaming-sse-v1 DISCHARGED on gx10 Blackwell

## Summary

Operator-dispatched verification of `contracts/qwen3-moe-streaming-sse-v1.yaml`
v1.0.0 (shipped via #1854). All three falsifiers PASS on Qwen3-Coder-30B-A3B
(qwen3moe arch) running on gx10 Blackwell GB10.

## Results

| Falsifier   | Test                                       | Verdict |
|-------------|--------------------------------------------|---------|
| V1_001      | v1_001_callback_fires_per_token            | PASS    |
| V1_002      | (regression guard via serve-dispatch test) | GUARD   |
| V1_003      | v1_003_inter_token_latency_floor           | PASS    |

V1_003 throughput on real 30B-MoE: **median 338 ms inter-token gap** over
32 callbacks (floor 500 ms), distribution p_min=250 ms / p_max=518 ms.
≈ 3 tok/s streamed — comfortably above the 2 tok/s contract floor and
consistent with M32d's KV-cache-amortized per-token cost.
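
The V1_003 arithmetic can be checked with a short sketch. The helper names are hypothetical and this is not the code in `qwen3_moe_streaming_sse_v1.rs`; it only shows how a median inter-token gap maps to a tok/s figure and a floor check.

```rust
use std::time::Duration;

/// Median of the recorded inter-token gaps; `None` if nothing was recorded.
fn median_gap(gaps: &[Duration]) -> Option<Duration> {
    if gaps.is_empty() {
        return None;
    }
    let mut sorted = gaps.to_vec();
    sorted.sort();
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        // Even count: average the two middle samples.
        (sorted[mid - 1] + sorted[mid]) / 2
    })
}

/// Converts a per-token gap into tokens per second.
fn tokens_per_second(gap: Duration) -> f64 {
    1.0 / gap.as_secs_f64()
}

/// True when the median gap is strictly below the floor (500 ms for 2 tok/s).
fn meets_floor(gaps: &[Duration], floor: Duration) -> bool {
    median_gap(gaps).map_or(false, |m| m < floor)
}
```

For the reported median, `tokens_per_second(Duration::from_millis(338))` is about 2.96, which matches the "≈ 3 tok/s" figure. A single slow gap such as the 518 ms maximum does not fail the gate, because the check uses the median.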

Plus the negative-path `callback_stop_short_circuits` test confirmed
that returning `false` from the per-token callback short-circuits the
decode loop (client-disconnect handling).

## Artifacts

- `findings.json` — machine-readable discharge record
- `gx10-sse-smoke.log` — full cargo test stdout/stderr (549 lines)

Both captured from `/home/noah/runs/sse-smoke-20260521-080640/` on gx10.

## Reproducer

```bash
QWEN3_MOE_GGUF_PATH=/path/to/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
  cargo test --test qwen3_moe_streaming_sse_v1 \
    -p aprender-serve --features cuda --release \
    -- --ignored --nocapture
```

Binary commit: 6bff4ce (post-#1854 merge).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* evidence: add gx10 cargo test log (qwen3-moe-streaming-sse-v1 discharge)

Force-added (matches `.log` gitignore pattern but this one is a
load-bearing discharge artifact, not a temp file).

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>