fix(try_qwen3_moe_backend): populate stop_tokens with EOS — fixes M287 runaway 'Human:' generation by noahgift · Pull Request #1852 · paiml/aprender

noahgift · 2026-05-20T14:36:59Z

Summary

The dense chat backend at `cuda_chat_backend.rs:113` populates `QuantizedGenerateConfig.stop_tokens` with the model's EOS token so generation halts at natural turn-end. `try_qwen3_moe_backend` (the MoE chat path, added in #1807) was MISSING this — causing qwen3_moe to burn the full max_tokens budget per turn even after the model emits EOS.

Root cause (paiml/claude-code-parity-apr M287 evidence)

Fixture 10 (`oo__05-builder-pattern`) captured 7 turns of student output where turn 2+ was literally:

Human: I need to see the full implementation of the build method...
Human: I need to see the full implementation of the build method...
Human: I need to see the full implementation of the build method...

The model emits its own ChatML user-turn boundary tokens (`<|im_end|>`, `<|im_start|>user`, ...) but since `stop_tokens` was empty, the decoder ignored them and kept generating. Bench hit per-turn timeout after 4-7 turns of useless self-prompted runaway.

Fix

Mirror the dense path. Read `state.model_eos_token_id()` and populate `stop_tokens`. For Qwen3 GGUF models the EOS is typically `<|im_end|>` (token 151645). When the model emits it, the decode loop's existing check fires and the turn ends cleanly:

```rust
// run_qwen3_moe_generate already has this:
if gen_config.stop_tokens.contains(&next_token) {
break;
}
```

This is complementary to the other M287 follow-ups:

feat: qwen3-moe-sampling-v1 implementation — temperature/top_k/top_p (all 4 falsifiers discharged) #1842 (sampling) — shapes which token is chosen
feat: qwen3-moe-repetition-penalty-v1 implementation (V1_001-V1_003 discharged; supersedes #1843) #1844 (rep penalty) — discourages loops
feat: 3-knob HTTP wire-up — operator-actionable sampling/penalty via apr code env vars #1846 (3-knob HTTP wire-up) — exposes the knobs
feat(code-prompt): few-shot <tool_call> examples + anti-rambling guideline (V1_004 follow-up) #1849 (few-shot tool_call prompt) — anti-Markdown training

THIS PR makes EOS detection actually work. Without it, the other fixes are downstream of a leaking-faucet bug.

Test plan

`cargo check -p aprender-serve --lib --features cuda` — clean
CI
Companion-side V1_004 sub-bench post-rebuild expected to show shorter turns + cleaner output

🤖 Generated with Claude Code

…7 runaway The dense chat backend at cuda_chat_backend.rs:113 populates `QuantizedGenerateConfig.stop_tokens` with the model's EOS token so generation halts at natural turn-end. `try_qwen3_moe_backend` (the MoE chat path, added in #1807) was missing this, causing qwen3_moe to burn the full max_tokens budget per turn. Root cause from M287: 30B-MoE student emits self-prompted 'Human: ...' text. With empty stop_tokens, the decoder ignored EOS and kept generating ~1024 tokens of runaway per turn — bench hit per-turn timeout after 4-7 turns of useless output. Fix: mirror the dense path. Read state.model_eos_token_id() and populate stop_tokens. For Qwen3 GGUF models the EOS is typically <|im_end|> (token 151645). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…:" prefix (#1853) The existing stop sequences in `clean_chat_output` anchor on "\nHuman:" / "\n\nHuman:" — requiring a preceding newline. When a model leaks the turn marker at start-of-string (no newline before it), the truncate-at- earliest loop misses the marker and the prefix bleeds into the captured chat reply verbatim. Empirically observed at paiml/claude-code-parity-apr Phase 6 sub-bench B (M291) on Qwen3-Coder-30B-A3B: the model began turns 1-20 with "Human: ..." which the dense-path / MoE-path cleaner alike failed to strip. (The new stop_token + EOS plumbing in #1852 stops generation at "<|im_end|>" — but if the model emitted "Human:" *before* the EOS, the prefix stayed.) Fix: before the stop-sequence pass, trim leading whitespace and strip a leading "Human:" / "User:" / "Assistant:" prefix if present. The mid-sentence "Human:" case is preserved (only stripped at start-of- string), and the existing "\nHuman:" truncation still fires for embedded turn-boundary leaks. 6 new pin tests in api::tests::format_chat_02: test_clean_chat_output_leading_human_prefix test_clean_chat_output_leading_user_prefix test_clean_chat_output_leading_assistant_prefix test_clean_chat_output_leading_prefix_with_whitespace test_clean_chat_output_inline_human_after_leading_strip test_clean_chat_output_no_false_positive_on_human_in_middle Refs: CCPA M291 V1_004 follow-up snapshot. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…on invariants (#1859) Author the provable contract behind `clean_chat_output` so the six invariants established by the M287 → #1852 → #1853 cascade are falsifier-backed instead of merely tested. ## Why The cascade fixed three things in concert: - #1852: EOS stop-token detection (`<|im_end|>` / `<|endoftext|>`) - #1853: leading "Human:"/"User:"/"Assistant:" prefix strip - M287 surface: 'Human: I need to...' runaway pattern post-EOS-miss The implementation (`crates/aprender-serve/src/api/realize_handlers.rs::clean_chat_output`) already lives; this contract retroactively codifies its guarantees so future stop-sequence changes require a contract bump alongside the code change. Hooks `pv lint` / contract-coverage audits onto a previously-uncontracted sanitization layer. ## Six falsifiers - V1_001: leading "Human:" / "User:" / "Assistant:" stripped - V1_002: stop sequence inside body truncates at first occurrence - V1_003: earliest stop sequence wins when multiple are present - V1_004: clean text passes through (trim-only) - V1_005: empty / whitespace / stop-only collapses to "" - V1_006: STOP_SEQUENCES code constant ↔ contract YAML stay synced (manual audit for now; could be pv-lint check later) ## Evidence All six are already covered by existing unit tests in `crates/aprender-serve/src/api/realize_handlers_clean_chat.rs` and `crates/aprender-serve/src/api/tests/format_chat_02.rs`. ## Validation `pv validate contracts/clean-chat-output-v1.yaml` → "Contract is valid." 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift enabled auto-merge (squash) May 20, 2026 14:47

noahgift force-pushed the feat/qwen3-moe-eos-stop-tokens branch 2 times, most recently from 1b1e456 to 0ed601c Compare May 20, 2026 15:15

noahgift force-pushed the feat/qwen3-moe-eos-stop-tokens branch from 0ed601c to fbbc79c Compare May 20, 2026 15:33

noahgift added 2 commits May 20, 2026 22:50

Merge branch 'main' into feat/qwen3-moe-eos-stop-tokens

09f415d

Merge branch 'main' into feat/qwen3-moe-eos-stop-tokens

b0a037b

noahgift merged commit 65f4af0 into main May 20, 2026
10 checks passed

noahgift deleted the feat/qwen3-moe-eos-stop-tokens branch May 20, 2026 21:41

This was referenced May 21, 2026

fix(clean_chat_output): strip leading turn-marker prefix at start-of-string #1853

Merged

docs(M291): V1_004 sub-bench B empirical pattern shift + aprender#1853 fix paiml/claude-code-parity-apr#259

Merged

noahgift mentioned this pull request May 21, 2026

feat(contract): clean-chat-output-v1 — codify M287 cascade sanitization invariants #1859

Merged

noahgift mentioned this pull request May 22, 2026

Qwen2.5-7B Q4_K GPU inference produces gibberish — 'ampiezza' (wgpu) / '<|im_start|>' (cuBLAS) — regression vs #374 / #559 #1864

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(try_qwen3_moe_backend): populate stop_tokens with EOS — fixes M287 runaway 'Human:' generation#1852

fix(try_qwen3_moe_backend): populate stop_tokens with EOS — fixes M287 runaway 'Human:' generation#1852
noahgift merged 3 commits into
mainfrom
feat/qwen3-moe-eos-stop-tokens

noahgift commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 20, 2026

Summary

Root cause (paiml/claude-code-parity-apr M287 evidence)

Fix

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant