Skip to content

fix(try_qwen3_moe_backend): populate stop_tokens with EOS — fixes M287 runaway 'Human:' generation#1852

Merged
noahgift merged 3 commits into
mainfrom
feat/qwen3-moe-eos-stop-tokens
May 20, 2026
Merged

fix(try_qwen3_moe_backend): populate stop_tokens with EOS — fixes M287 runaway 'Human:' generation#1852
noahgift merged 3 commits into
mainfrom
feat/qwen3-moe-eos-stop-tokens

Conversation

@noahgift

Copy link
Copy Markdown
Contributor

Summary

The dense chat backend at `cuda_chat_backend.rs:113` populates `QuantizedGenerateConfig.stop_tokens` with the model's EOS token so generation halts at natural turn-end. `try_qwen3_moe_backend` (the MoE chat path, added in #1807) was MISSING this — causing qwen3_moe to burn the full max_tokens budget per turn even after the model emits EOS.

Root cause (paiml/claude-code-parity-apr M287 evidence)

Fixture 10 (`oo__05-builder-pattern`) captured 7 turns of student output where turn 2+ was literally:

Human: I need to see the full implementation of the build method...
Human: I need to see the full implementation of the build method...
Human: I need to see the full implementation of the build method...

The model emits its own ChatML user-turn boundary tokens (`<|im_end|>`, `<|im_start|>user`, ...) but since `stop_tokens` was empty, the decoder ignored them and kept generating. Bench hit per-turn timeout after 4-7 turns of useless self-prompted runaway.

Fix

Mirror the dense path. Read `state.model_eos_token_id()` and populate `stop_tokens`. For Qwen3 GGUF models the EOS is typically `<|im_end|>` (token 151645). When the model emits it, the decode loop's existing check fires and the turn ends cleanly:

```rust
// run_qwen3_moe_generate already has this:
if gen_config.stop_tokens.contains(&next_token) {
break;
}
```

This is complementary to the other M287 follow-ups:

THIS PR makes EOS detection actually work. Without it, the other fixes are downstream of a leaking-faucet bug.

Test plan

  • `cargo check -p aprender-serve --lib --features cuda` — clean
  • CI
  • Companion-side V1_004 sub-bench post-rebuild expected to show shorter turns + cleaner output

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) May 20, 2026 14:47
@noahgift noahgift force-pushed the feat/qwen3-moe-eos-stop-tokens branch 2 times, most recently from 1b1e456 to 0ed601c Compare May 20, 2026 15:15
…7 runaway

The dense chat backend at cuda_chat_backend.rs:113 populates
`QuantizedGenerateConfig.stop_tokens` with the model's EOS token so
generation halts at natural turn-end. `try_qwen3_moe_backend` (the
MoE chat path, added in #1807) was missing this, causing qwen3_moe
to burn the full max_tokens budget per turn.

Root cause from M287: 30B-MoE student emits self-prompted 'Human: ...'
text. With empty stop_tokens, the decoder ignored EOS and kept
generating ~1024 tokens of runaway per turn — bench hit per-turn
timeout after 4-7 turns of useless output.

Fix: mirror the dense path. Read state.model_eos_token_id() and
populate stop_tokens. For Qwen3 GGUF models the EOS is typically
<|im_end|> (token 151645).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift force-pushed the feat/qwen3-moe-eos-stop-tokens branch from 0ed601c to fbbc79c Compare May 20, 2026 15:33
@noahgift noahgift merged commit 65f4af0 into main May 20, 2026
10 checks passed
@noahgift noahgift deleted the feat/qwen3-moe-eos-stop-tokens branch May 20, 2026 21:41
noahgift added a commit that referenced this pull request May 21, 2026
…:" prefix (#1853)

The existing stop sequences in `clean_chat_output` anchor on "\nHuman:" /
"\n\nHuman:" — requiring a preceding newline. When a model leaks the
turn marker at start-of-string (no newline before it), the truncate-at-
earliest loop misses the marker and the prefix bleeds into the captured
chat reply verbatim.

Empirically observed at paiml/claude-code-parity-apr Phase 6 sub-bench
B (M291) on Qwen3-Coder-30B-A3B: the model began turns 1-20 with
"Human: ..." which the dense-path / MoE-path cleaner alike failed to
strip. (The new stop_token + EOS plumbing in #1852 stops generation
at "<|im_end|>" — but if the model emitted "Human:" *before* the EOS,
the prefix stayed.)

Fix: before the stop-sequence pass, trim leading whitespace and strip
a leading "Human:" / "User:" / "Assistant:" prefix if present. The
mid-sentence "Human:" case is preserved (only stripped at start-of-
string), and the existing "\nHuman:" truncation still fires for
embedded turn-boundary leaks.

6 new pin tests in api::tests::format_chat_02:
  test_clean_chat_output_leading_human_prefix
  test_clean_chat_output_leading_user_prefix
  test_clean_chat_output_leading_assistant_prefix
  test_clean_chat_output_leading_prefix_with_whitespace
  test_clean_chat_output_inline_human_after_leading_strip
  test_clean_chat_output_no_false_positive_on_human_in_middle

Refs: CCPA M291 V1_004 follow-up snapshot.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 21, 2026
…on invariants (#1859)

Author the provable contract behind `clean_chat_output` so the six
invariants established by the M287 → #1852#1853 cascade are
falsifier-backed instead of merely tested.

## Why

The cascade fixed three things in concert:
- #1852: EOS stop-token detection (`<|im_end|>` / `<|endoftext|>`)
- #1853: leading "Human:"/"User:"/"Assistant:" prefix strip
- M287 surface: 'Human: I need to...' runaway pattern post-EOS-miss

The implementation (`crates/aprender-serve/src/api/realize_handlers.rs::clean_chat_output`)
already lives; this contract retroactively codifies its guarantees so
future stop-sequence changes require a contract bump alongside the
code change. Hooks `pv lint` / contract-coverage audits onto a
previously-uncontracted sanitization layer.

## Six falsifiers

- V1_001: leading "Human:" / "User:" / "Assistant:" stripped
- V1_002: stop sequence inside body truncates at first occurrence
- V1_003: earliest stop sequence wins when multiple are present
- V1_004: clean text passes through (trim-only)
- V1_005: empty / whitespace / stop-only collapses to ""
- V1_006: STOP_SEQUENCES code constant ↔ contract YAML stay synced
  (manual audit for now; could be pv-lint check later)

## Evidence

All six are already covered by existing unit tests in
`crates/aprender-serve/src/api/realize_handlers_clean_chat.rs` and
`crates/aprender-serve/src/api/tests/format_chat_02.rs`.

## Validation

`pv validate contracts/clean-chat-output-v1.yaml` → "Contract is valid."

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant