spec: qwen3-moe-repetition-penalty-v1 — third 3-knob follow-up to M32d by noahgift · Pull Request #1843 · paiml/aprender

noahgift · 2026-05-20T09:48:24Z

Summary

Third sibling contract completing the 3-knob toolkit for tackling V1_004's verbosity bottleneck:

`qwen3-moe-sampling-v1` (#1837 contract + #1842 impl): temperature/top_k/top_p
`qwen3-moe-streaming-sse-v1` (#1835): per-token SSE streaming
`qwen3-moe-repetition-penalty-v1` (this PR): repeat_penalty / repeat_last_n

Motivation

`QuantizedGenerateConfig` has `repeat_penalty` (f32) + `repeat_last_n` (usize) fields. The dense path applies them; the qwen3_moe `sample_from_logits` (from #1842) does NOT.

The M287 evidence doc shows Qwen3-Coder-30B-A3B generating repeated restatements in turn 1 (the same Rust snippet 3× in fixture leetcode__01-two-sum). A repetition penalty (~1.1-1.3) would down-weight recently generated tokens, breaking the loop and pushing the model to commit to a tool call.

Falsification gates

V1_001: `repeat_penalty == 1.0` is a no-op (backwards-compat regression guard)
V1_002: `repeat_penalty > 1.0` down-weights repeated tokens
V1_003: `repeat_last_n` bounds the window correctly
V1_004: companion-side bench with penalty produces measurably different outcome distribution than greedy baseline
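
A minimal sketch of how gates V1_001 to V1_003 could be pinned against synthetic logits. It assumes the Candle-style convention described in the #1844 commit message (divide positive logits, multiply negative ones). The free function `apply_repeat_penalty` and the once-per-distinct-token deduplication are assumptions made for illustration, not the code in `qwen3_moe_generate.rs`.

```rust
use std::collections::HashSet;

/// Hypothetical helper: applies a repetition penalty to `logits` in place.
/// Only the last `repeat_last_n` entries of `recent_tokens` are considered.
fn apply_repeat_penalty(
    logits: &mut [f32],
    recent_tokens: &[u32],
    repeat_penalty: f32,
    repeat_last_n: usize,
) {
    // V1_001: a penalty of 1.0 (or an empty window) must leave logits untouched.
    if repeat_penalty == 1.0 || repeat_last_n == 0 {
        return;
    }
    // V1_003: the window is the tail of the token history.
    let start = recent_tokens.len().saturating_sub(repeat_last_n);
    // Assumption: each distinct token is penalized once, however often it repeats.
    let mut seen = HashSet::new();
    for &tok in &recent_tokens[start..] {
        if !seen.insert(tok) {
            continue;
        }
        // Token ids outside the vocabulary are ignored rather than panicking.
        if let Some(logit) = logits.get_mut(tok as usize) {
            // V1_002: both branches move the logit toward "less likely".
            if *logit >= 0.0 {
                *logit /= repeat_penalty;
            } else {
                *logit *= repeat_penalty;
            }
        }
    }
}
```

For example, with `repeat_penalty = 2.0` and a window of 8, logits `[2.0, -2.0, 1.0, 4.0]` with recent tokens `[0, 1]` become `[1.0, -4.0, 1.0, 4.0]`: the positive logit is halved, the negative one doubled, and tokens outside the window are untouched.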

Implementation phases (engineer playbook)

Phase 1 (~1hr): extend `sample_from_logits` with `recent_tokens: &[u32]` parameter; apply penalty as Step 1 (before temperature)
Phase 2 (~30min): plumb `&tokens` through decode loop
Phase 3 (~2hr): unit tests + companion-side bench env-var plumbing

Total ~3-4 hours; operator-actionable any time.
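
Phases 1 and 2 might look roughly like the sketch below. `GenConfig`, `decode`, and the `forward` closure are hypothetical stand-ins (the real `sample_from_logits` also takes an RNG and runs top_k/top_p filtering, omitted here), and passing the prompt as part of `recent_tokens` is an assumption. The sketch only shows the penalty running as Step 1 and the decode loop lending `&tokens` to the sampler without allocating per token.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the relevant `QuantizedGenerateConfig` fields.
struct GenConfig {
    temperature: f32,
    repeat_penalty: f32,
    repeat_last_n: usize,
}

/// Sketch of the extended signature: `recent_tokens` is the new parameter.
fn sample_from_logits(
    logits: &[f32],
    config: &GenConfig,
    recent_tokens: &[u32],
) -> Result<u32, String> {
    if logits.is_empty() {
        return Err("empty logits".to_string());
    }
    let mut logits = logits.to_vec();
    // Step 1: repetition penalty, applied before temperature scaling.
    if config.repeat_penalty != 1.0 && config.repeat_last_n > 0 {
        let start = recent_tokens.len().saturating_sub(config.repeat_last_n);
        let mut seen = HashSet::new();
        for &tok in &recent_tokens[start..] {
            if !seen.insert(tok) {
                continue; // assumption: penalize each distinct token once
            }
            if let Some(l) = logits.get_mut(tok as usize) {
                if *l >= 0.0 { *l /= config.repeat_penalty } else { *l *= config.repeat_penalty }
            }
        }
    }
    // Step 2: temperature scaling (skipped on the greedy path).
    if config.temperature > 0.0 {
        for l in logits.iter_mut() {
            *l /= config.temperature;
        }
    }
    // Step 3: greedy argmax only; the sampling stages are out of scope here.
    let mut best = 0usize;
    for (i, &l) in logits.iter().enumerate() {
        if l > logits[best] {
            best = i;
        }
    }
    Ok(best as u32)
}

/// Phase 2: the decode loop borrows the running token buffer for each step.
fn decode(
    prompt: &[u32],
    max_new: usize,
    config: &GenConfig,
    forward: impl Fn(&[u32]) -> Vec<f32>,
) -> Result<Vec<u32>, String> {
    let mut tokens = prompt.to_vec();
    for _ in 0..max_new {
        let logits = forward(&tokens);
        let next = sample_from_logits(&logits, config, &tokens)?;
        tokens.push(next);
    }
    Ok(tokens[prompt.len()..].to_vec())
}
```

With a toy `forward` that always returns `[3.0, 2.9, 2.8]`, greedy decoding with `repeat_penalty = 1.0` emits token 0 on every step, while `repeat_penalty = 1.2` moves on to tokens 1 and 2 once token 0 has been penalized.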

NOT in scope

Mirostat / DRY / other penalty schemes
Per-token logit biases / dynamic penalty
Companion-side bench env-var plumbing (separate companion PR)

Test plan

Pure contract YAML; no code touched
CI: `pv validate` if wired

🤖 Generated with Claude Code

Registers a new provable contract for repetition penalty (`repeat_penalty` + `repeat_last_n`) on the qwen3_moe inference path.

Completes the 3-knob toolkit alongside:
- qwen3-moe-sampling-v1 (#1837 contract + #1842 impl): temperature/top_k/top_p
- qwen3-moe-streaming-sse-v1 (#1835): per-token SSE streaming
- qwen3-moe-repetition-penalty-v1 (this PR): repeat_penalty / repeat_last_n

## Motivation

`QuantizedGenerateConfig` has `repeat_penalty` (f32) + `repeat_last_n` (usize) fields. The dense path's `sample_advanced` applies them. The qwen3_moe path's new `sample_from_logits` (added by #1842) does NOT.

Empirical observation from M287 evidence: Qwen3-Coder-30B-A3B is generating REPEATED restatements in its turn-1 output (same Rust snippet 3× in fixture leetcode__01-two-sum). Repetition penalty (typically 1.1-1.3) would down-weight recently-generated tokens, breaking the textual loop and forcing the model to either commit to a tool call or change tactics.

## Falsification gates

- V1_001: `repeat_penalty == 1.0` is a no-op (backwards compat)
- V1_002: `repeat_penalty > 1.0` down-weights repeated tokens
- V1_003: `repeat_last_n` bounds the window correctly
- V1_004: companion-side bench with penalty produces a measurably different outcome distribution than the greedy baseline

## Implementation phases (engineer playbook)

- Phase 1 (~1hr): extend `sample_from_logits` signature with `recent_tokens: &[u32]` parameter; apply penalty as Step 1
- Phase 2 (~30min): plumb `&tokens` through decode loop
- Phase 3 (~2hr): unit tests + companion-side bench env-var plumbing

Total ~3-4 hours; operator-actionable any time post-M32d-merge.

## NOT in scope

- Mirostat / DRY / other penalty schemes
- Per-token logit biases
- Dynamic per-position penalty
- Companion-side bench env-var plumbing (separate companion PR; this contract is aprender-side only)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

noahgift · 2026-05-20T09:57:01Z

Superseded by #1844 which includes both the contract registration AND the implementation + 5 unit tests in one PR. Closing as superseded.

…ischarged; supersedes #1843) (#1844)

* spec: qwen3-moe-repetition-penalty-v1 - third 3-knob follow-up to M32d

* feat: qwen3-moe-repetition-penalty-v1 implementation (V1_001-V1_003 discharged)

Implements the qwen3-moe-repetition-penalty-v1 contract. Bumps the contract v1.0.0 → v1.1.0 with status_history. Supersedes #1843 (which was contract-only).

## Implementation

`crates/aprender-serve/src/infer/qwen3_moe_generate.rs`:
- `sample_from_logits` signature extended with `recent_tokens: &[u32]` parameter
- Repetition penalty applied as **Step 1** (BEFORE temperature scaling)
- Mirrors Candle's `apply_repeat_penalty` semantics (PMAT-383/384, dense path's `sample_advanced` in `gguf/inference/fails.rs:100`):
  - For positive logits: `logits[idx] /= repeat_penalty`
  - For negative logits: `logits[idx] *= repeat_penalty`
- Window: last `repeat_last_n` tokens from `recent_tokens`
- No-op when `repeat_penalty == 1.0` OR `repeat_last_n == 0`
- Decode loop passes `&tokens` slice (cheap borrow; no allocation per token)

## Test results

12/12 tests pass in `cargo test sample_from_logits_tests`:
- 4 original sampling tests (V1_001 greedy fallback, V1_002 seeded RNG, V1_003 seed divergence, V1_004 top_k=1 forces greedy)
- 5 NEW rep-penalty tests:
  - V1_001a: repeat_penalty=1.0 no-op
  - V1_001b: repeat_last_n=0 no-op
  - V1_002a: positive-logit branch (division)
  - V1_002b: negative-logit branch (multiplication; Candle convention)
  - V1_003: repeat_last_n window bounds (n=0/n=2/n=8 different effects)
- 3 edge cases (empty logits error, top_p=1 no-op, signature compatibility)

## V1_004 (operator-coordinated bench discharge)

Operator can dispatch companion-side bench with APR_AGENT_REPEAT_PENALTY=1.2 + APR_AGENT_REPEAT_LAST_N=64 once companion-side env-var plumbing (Phase 3 of #1843) lands. Bench script + analyzer already pass through `gen_config` to `run_qwen3_moe_generate`; this PR makes that pass-through actually apply repetition penalty.

## Stacked on

#1842 (qwen3-moe-sampling-v1 implementation; the `sample_from_logits` helper this PR extends).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…(all 4 falsifiers discharged) (#1842)

* test(distill): fixture-driven integration tests for ShardBatchSource (F-DISTILL-SHARD-BATCH-001/002)

Closes the cross-component contract gap that the Blackwell cascade post-mortem (lesson #2) identified: cache machinery is silent on divergences between producer and consumer, until live dispatch surfaces the failure. Same risk class for ShardBatchSource: its wrap-around / cursor / chunk semantics need fixture-driven verification.

Adds two tests gated on `shard-batch-source` feature:

F-DISTILL-SHARD-BATCH-001 (happy path)
Writes a tiny .bin shard with [0, 1, ..., 4095] tokens, opens via ShardBatchSource::from_dir, asserts:
- batch shape (4 rows × 16 tokens)
- all returned tokens lie in [0, 4096) (fixture range)
- labels in same range
Catches: any cursor-off-by-one or layout swap that produces garbage outside the fixture range.

F-DISTILL-SHARD-BATCH-002 (wrap-around)
Writes only 128 tokens (enough for ~1.88 batches at bs=4, seq=16), consumes 5 batches in a row. Asserts no error; wrap_around=true is the default for ShardBatchSource.
Catches: regression where the iterator returns None on exhaustion despite the constructor setting wrap_around.

Test plan:
- [x] 63 distill lib tests pass (was 61; 2 new)
- [x] `cargo test --features shard-batch-source` clean

These two tests would have caught most ShardBatchSource bugs at PR-time instead of at gx10-dispatch-time, where each failure costs 5-15min.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat: qwen3-moe-sampling-v1 implementation - temperature/top_k/top_p

Implements the qwen3-moe-sampling-v1 contract (#1837; v1.0.0 → v1.1.0). Discharges ALL 4 falsifiers via in-session work + empirical validation against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.

## Empirical results

4-test battery (19.28s total wall):
- V1_001 (greedy determinism): 3 runs returned identical [9707, 198, 40, 614, 264, 3405] ("I have a question")
- V1_002 (seeded RNG reproducibility): 2 runs seed=42 temp=0.7 returned identical [9707, 198, 40, 1079, 264, 220]
- V1_003 (seed divergence): seed=42 vs seed=43 produced totally different generated tokens
- V1_004 (top_k=1 forces greedy regardless of temperature): byte-identical to pure greedy output

## Implementation

`crates/aprender-serve/src/infer/qwen3_moe_generate.rs`:
- New private `sample_from_logits(logits, config, rng) -> Result<u32>`
- Pipeline mirrors the dense `Self::sample_advanced` (in `gguf/inference/fails.rs:100`): temperature scale → top_k filter → top_p filter → multinomial draw
- Greedy fallback when `temperature == 0` OR `top_k == 1`
- Uses `rand::rngs::StdRng` (ChaCha12; seedable from u64) for reproducibility. NOT `rand::thread_rng()` like the dense path (intentional: V1_002 requires deterministic re-runs from same seed)
- Decode loop seeds the RNG from `QuantizedGenerateConfig.seed`

New test: `crates/aprender-serve/tests/qwen3_moe_sampling_v1.rs`, with 4 falsifier tests, env-gated on QWEN3_MOE_GGUF_PATH (mirrors the existing qwen3_moe_serve_dispatch_v1 + moe_kv_cache_equivalence tests).

## NOT in scope

- Repetition penalty (separate contract qwen3-moe-repetition-penalty-v1; still pending operator authorization)
- Mirostat / logit bias / streaming (separate concerns)

## Companion-side downstream

The V1_004 bench (paiml/claude-code-parity-apr Phase 6) is hitting 900s per-turn timeout because the 30B-MoE is verbose under greedy decoding. With sampling shipped, operator can dispatch a follow-up bench with `temperature=0.3` (or similar) which may concentrate probability mass on action tokens and reduce rambling. The bench script + analyzer already pass temperature through `gen_config` to `run_qwen3_moe_generate`; this PR makes that pass-through actually take effect.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(qwen3-moe-sampling-v1): add CI-runnable unit tests (no GGUF needed)

Complements the env-gated integration tests in `crates/aprender-serve/tests/qwen3_moe_sampling_v1.rs` (which require a real Qwen3-MoE GGUF) with pure-Rust unit tests against synthetic logits arrays. These run unconditionally in CI and pin all 4 falsifiers as runnable gates without requiring `QWEN3_MOE_GGUF_PATH`.

7 tests total in `sample_from_logits_tests`:
- v1_001_temperature_zero_is_argmax_deterministic: greedy fallback via temperature == 0
- v1_001_top_k_one_is_argmax_deterministic: greedy fallback via top_k == 1 (independent path)
- v1_002_seeded_rng_is_reproducible: seed=42 produces same token across 5 invocations
- v1_003_different_seeds_diverge: 32 seeds produce ≥ 3 distinct tokens (statistical bound)
- v1_004_top_k_one_equals_pure_greedy: top_k=1 with high temp byte-identical to pure greedy
- empty_logits_returns_error: edge case, no panic
- top_p_one_is_no_op: edge case, top_p=1.0 equivalent to top_p sentinel-off path

Empirical:
```
cargo test -p aprender-serve --lib sample_from_logits_tests --features cuda
test result: ok. 7 passed; 0 failed; 0 ignored; 0 measured; 16890 filtered out; finished in 0.00s
```

CI gates the sampling invariants at every PR going forward.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat: qwen3-moe-repetition-penalty-v1 implementation (V1_001-V1_003 discharged; supersedes #1843) (#1844)

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
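
The sampling pipeline described in the #1842 commit message (temperature scale, top_k filter, top_p filter, multinomial draw, with a greedy fallback) can be sketched as below. This is an illustration under assumptions, not the code in `qwen3_moe_generate.rs`: `SplitMix64` is a dependency-free stand-in for `rand::rngs::StdRng`, the function is named `sample` and returns `Option<u32>` instead of `Result<u32>`, and `top_k == 0` is assumed to mean "filter disabled".

```rust
/// Tiny seedable generator (SplitMix64), a stand-in so the sketch needs no crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f32 in [0, 1), built from the top 24 bits.
    fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u64 << 24) as f32
    }
}

fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &l) in logits.iter().enumerate() {
        if l > logits[best] {
            best = i;
        }
    }
    best
}

/// Hypothetical sampler: temperature -> top_k -> top_p -> multinomial draw.
fn sample(
    logits: &[f32],
    temperature: f32,
    top_k: usize,
    top_p: f32,
    rng: &mut SplitMix64,
) -> Option<u32> {
    if logits.is_empty() {
        return None;
    }
    // Greedy fallback: temperature == 0 or top_k == 1.
    if temperature == 0.0 || top_k == 1 {
        return Some(argmax(logits) as u32);
    }
    // Temperature scale, then order candidates by descending logit.
    let mut cand: Vec<(usize, f32)> =
        logits.iter().map(|&l| l / temperature).enumerate().collect();
    cand.sort_by(|a, b| b.1.total_cmp(&a.1));
    // top_k filter (assumption: 0 disables it).
    if top_k > 0 {
        cand.truncate(top_k);
    }
    // Softmax over the surviving candidates.
    let max = cand[0].1;
    let mut probs: Vec<f32> = cand.iter().map(|c| (c.1 - max).exp()).collect();
    let sum: f32 = probs.iter().sum();
    for p in probs.iter_mut() {
        *p /= sum;
    }
    // top_p filter: keep the smallest prefix whose cumulative mass reaches top_p.
    let mut cum = 0.0f32;
    let mut keep = probs.len();
    for (i, p) in probs.iter().enumerate() {
        cum += *p;
        if cum >= top_p {
            keep = i + 1;
            break;
        }
    }
    probs.truncate(keep);
    // Multinomial draw over the remaining (unnormalized) mass.
    let total: f32 = probs.iter().sum();
    let mut r = rng.next_f32() * total;
    for (i, p) in probs.iter().enumerate() {
        if r < *p {
            return Some(cand[i].0 as u32);
        }
        r -= *p;
    }
    // Rounding fallback: return the last kept candidate.
    Some(cand[keep - 1].0 as u32)
}
```

Because the generator is seeded explicitly, two runs started from the same seed walk through the same draws, which is the property the seeded-RNG falsifier relies on; the greedy paths never touch the generator at all.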

noahgift enabled auto-merge (squash) May 20, 2026 09:51

noahgift mentioned this pull request May 20, 2026

feat: qwen3-moe-repetition-penalty-v1 implementation (V1_001-V1_003 discharged; supersedes #1843) #1844

Merged

4 tasks

noahgift closed this May 20, 2026

auto-merge was automatically disabled May 20, 2026 09:57
Pull request was closed

noahgift wants to merge 1 commit into main from spec/qwen3-moe-repetition-penalty-v1
