fix(try_qwen3_moe_backend): populate stop_tokens with EOS — fixes M287 runaway 'Human:' generation#1852
Merged
Merged
Conversation
1b1e456 to
0ed601c
Compare
…7 runaway The dense chat backend at cuda_chat_backend.rs:113 populates `QuantizedGenerateConfig.stop_tokens` with the model's EOS token so generation halts at natural turn-end. `try_qwen3_moe_backend` (the MoE chat path, added in #1807) was missing this, causing qwen3_moe to burn the full max_tokens budget per turn. Root cause from M287: 30B-MoE student emits self-prompted 'Human: ...' text. With empty stop_tokens, the decoder ignored EOS and kept generating ~1024 tokens of runaway per turn — bench hit per-turn timeout after 4-7 turns of useless output. Fix: mirror the dense path. Read state.model_eos_token_id() and populate stop_tokens. For Qwen3 GGUF models the EOS is typically <|im_end|> (token 151645). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
0ed601c to
fbbc79c
Compare
This was referenced May 21, 2026
noahgift
added a commit
that referenced
this pull request
May 21, 2026
…:" prefix (#1853) The existing stop sequences in `clean_chat_output` anchor on "\nHuman:" / "\n\nHuman:" — requiring a preceding newline. When a model leaks the turn marker at start-of-string (no newline before it), the truncate-at- earliest loop misses the marker and the prefix bleeds into the captured chat reply verbatim. Empirically observed at paiml/claude-code-parity-apr Phase 6 sub-bench B (M291) on Qwen3-Coder-30B-A3B: the model began turns 1-20 with "Human: ..." which the dense-path / MoE-path cleaner alike failed to strip. (The new stop_token + EOS plumbing in #1852 stops generation at "<|im_end|>" — but if the model emitted "Human:" *before* the EOS, the prefix stayed.) Fix: before the stop-sequence pass, trim leading whitespace and strip a leading "Human:" / "User:" / "Assistant:" prefix if present. The mid-sentence "Human:" case is preserved (only stripped at start-of- string), and the existing "\nHuman:" truncation still fires for embedded turn-boundary leaks. 6 new pin tests in api::tests::format_chat_02: test_clean_chat_output_leading_human_prefix test_clean_chat_output_leading_user_prefix test_clean_chat_output_leading_assistant_prefix test_clean_chat_output_leading_prefix_with_whitespace test_clean_chat_output_inline_human_after_leading_strip test_clean_chat_output_no_false_positive_on_human_in_middle Refs: CCPA M291 V1_004 follow-up snapshot. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift
added a commit
that referenced
this pull request
May 21, 2026
…on invariants (#1859) Author the provable contract behind `clean_chat_output` so the six invariants established by the M287 → #1852 → #1853 cascade are falsifier-backed instead of merely tested. ## Why The cascade fixed three things in concert: - #1852: EOS stop-token detection (`<|im_end|>` / `<|endoftext|>`) - #1853: leading "Human:"/"User:"/"Assistant:" prefix strip - M287 surface: 'Human: I need to...' runaway pattern post-EOS-miss The implementation (`crates/aprender-serve/src/api/realize_handlers.rs::clean_chat_output`) already lives; this contract retroactively codifies its guarantees so future stop-sequence changes require a contract bump alongside the code change. Hooks `pv lint` / contract-coverage audits onto a previously-uncontracted sanitization layer. ## Six falsifiers - V1_001: leading "Human:" / "User:" / "Assistant:" stripped - V1_002: stop sequence inside body truncates at first occurrence - V1_003: earliest stop sequence wins when multiple are present - V1_004: clean text passes through (trim-only) - V1_005: empty / whitespace / stop-only collapses to "" - V1_006: STOP_SEQUENCES code constant ↔ contract YAML stay synced (manual audit for now; could be pv-lint check later) ## Evidence All six are already covered by existing unit tests in `crates/aprender-serve/src/api/realize_handlers_clean_chat.rs` and `crates/aprender-serve/src/api/tests/format_chat_02.rs`. ## Validation `pv validate contracts/clean-chat-output-v1.yaml` → "Contract is valid." 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The dense chat backend at `cuda_chat_backend.rs:113` populates `QuantizedGenerateConfig.stop_tokens` with the model's EOS token so generation halts at natural turn-end. `try_qwen3_moe_backend` (the MoE chat path, added in #1807) was MISSING this — causing qwen3_moe to burn the full max_tokens budget per turn even after the model emits EOS.
Root cause (paiml/claude-code-parity-apr M287 evidence)
Fixture 10 (`oo__05-builder-pattern`) captured 7 turns of student output where turn 2+ was literally:
The model emits its own ChatML user-turn boundary tokens (`<|im_end|>`, `<|im_start|>user`, ...) but since `stop_tokens` was empty, the decoder ignored them and kept generating. Bench hit per-turn timeout after 4-7 turns of useless self-prompted runaway.
Fix
Mirror the dense path. Read `state.model_eos_token_id()` and populate `stop_tokens`. For Qwen3 GGUF models the EOS is typically `<|im_end|>` (token 151645). When the model emits it, the decode loop's existing check fires and the turn ends cleanly:
```rust
// run_qwen3_moe_generate already has this:
if gen_config.stop_tokens.contains(&next_token) {
break;
}
```
This is complementary to the other M287 follow-ups:
THIS PR makes EOS detection actually work. Without it, the other fixes are downstream of a leaking-faucet bug.
Test plan
🤖 Generated with Claude Code