feat: qwen3-moe-repetition-penalty-v1 implementation (V1_001-V1_003 discharged; supersedes #1843) by noahgift · Pull Request #1844 · paiml/aprender

noahgift · 2026-05-20T09:56:48Z

Summary

Implements `qwen3-moe-repetition-penalty-v1` (third 3-knob sibling). Bumps contract v1.0.0 → v1.1.0. Supersedes #1843 (which was contract-only) by including both contract + implementation + tests in one PR.

Stacked on: #1842 (qwen3-moe-sampling-v1 impl). Target branch: `feat/m32d-followup-sampling-impl`. Retarget to `main` once #1842 merges.

Empirical results

12/12 tests pass in `cargo test -p aprender-serve --lib sample_from_logits_tests`:

Block	Tests	What's tested
Sampling (#1842)	4	V1_001..V1_004 greedy/seed/top_k
Rep penalty (NEW)	5	V1_001..V1_003 + 2 branches
Edge cases	3	empty logits, top_p=1 no-op, signature compat

Rep-penalty tests

`rep_penalty_v1_001_no_op_at_one`: `repeat_penalty=1.0` is no-op even with repeats
`rep_penalty_v1_001_no_op_when_repeat_last_n_zero`: `repeat_last_n=0` is no-op
`rep_penalty_v1_002_down_weights_repeated`: positive-logit branch (division)
`rep_penalty_v1_002_negative_logit_branch`: negative-logit branch (multiplication; Candle convention)
`rep_penalty_v1_003_window_bounds`: `repeat_last_n` bounds (n=0/n=2/n=8)

Implementation

`sample_from_logits` signature extended with `recent_tokens: &[u32]`. Penalty applied as Step 1 (before temperature). Mirrors Candle's `apply_repeat_penalty` (PMAT-383/384). Decode loop passes `&tokens` slice — cheap borrow, no allocation per token.

Pipeline: repetition penalty → temperature scale → top_k filter → top_p filter → multinomial draw / greedy argmax.

V1_004 (operator-coordinated)

Once this lands, operator can dispatch companion-side bench with:

```
APR_AGENT_REPEAT_PENALTY=1.2 APR_AGENT_REPEAT_LAST_N=64 \
bash scripts/phase-6-bench.sh
```

(Companion-side env-var plumbing tracked separately in #1843's Phase 3.)

Test plan

`cargo check -p aprender-serve --lib --features cuda` — clean
`cargo test sample_from_logits_tests --features cuda` — 12/12 pass
All previously-passing sampling tests still pass (backwards compat)
CI

🤖 Generated with Claude Code

Registers a new provable contract for repetition penalty (`repeat_penalty` + `repeat_last_n`) on the qwen3_moe inference path. Completes the 3-knob toolkit alongside: - qwen3-moe-sampling-v1 (#1837 contract + #1842 impl) — temperature/top_k/top_p - qwen3-moe-streaming-sse-v1 (#1835) — per-token SSE streaming - qwen3-moe-repetition-penalty-v1 (this PR) — repeat_penalty / repeat_last_n ## Motivation `QuantizedGenerateConfig` has `repeat_penalty` (f32) + `repeat_last_n` (usize) fields. The dense path's `sample_advanced` applies them. The qwen3_moe path's new `sample_from_logits` (added by #1842) does NOT. Empirical observation from M287 evidence: Qwen3-Coder-30B-A3B is generating REPEATED restatements in its turn-1 output (same Rust snippet 3× in fixture leetcode__01-two-sum). Repetition penalty (typically 1.1-1.3) would down-weight recently-generated tokens, breaking the textual loop and forcing the model to either commit to a tool call or change tactics. ## Falsification gates - V1_001: `repeat_penalty == 1.0` is a no-op (backwards compat) - V1_002: `repeat_penalty > 1.0` down-weights repeated tokens - V1_003: `repeat_last_n` bounds the window correctly - V1_004: companion-side bench with penalty produces a measurably different outcome distribution than the greedy baseline ## Implementation phases (engineer playbook) - Phase 1 (~1hr): extend `sample_from_logits` signature with `recent_tokens: &[u32]` parameter; apply penalty as Step 1 - Phase 2 (~30min): plumb `&tokens` through decode loop - Phase 3 (~2hr): unit tests + companion-side bench env-var plumbing Total ~3-4 hours; operator-actionable any time post-M32d-merge. ## NOT in scope - Mirostat / DRY / other penalty schemes - Per-token logit biases - Dynamic per-position penalty - Companion-side bench env-var plumbing (separate companion PR; this contract is aprender-side only) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ischarged) Implements the qwen3-moe-repetition-penalty-v1 contract. Bumps the contract v1.0.0 → v1.1.0 with status_history. Supersedes #1843 (which was contract-only). ## Implementation `crates/aprender-serve/src/infer/qwen3_moe_generate.rs`: - `sample_from_logits` signature extended with `recent_tokens: &[u32]` parameter - Repetition penalty applied as **Step 1** (BEFORE temperature scaling) - Mirrors Candle's `apply_repeat_penalty` semantics (PMAT-383/384, dense path's `sample_advanced` in `gguf/inference/fails.rs:100`): - For positive logits: `logits[idx] /= repeat_penalty` - For negative logits: `logits[idx] *= repeat_penalty` - Window: last `repeat_last_n` tokens from `recent_tokens` - No-op when `repeat_penalty == 1.0` OR `repeat_last_n == 0` - Decode loop passes `&tokens` slice (cheap borrow; no allocation per token) ## Test results 12/12 tests pass in `cargo test sample_from_logits_tests`: - 4 original sampling tests (V1_001 greedy fallback, V1_002 seeded RNG, V1_003 seed divergence, V1_004 top_k=1 forces greedy) - 5 NEW rep-penalty tests: - V1_001a: repeat_penalty=1.0 no-op - V1_001b: repeat_last_n=0 no-op - V1_002a: positive-logit branch (division) - V1_002b: negative-logit branch (multiplication; Candle convention) - V1_003: repeat_last_n window bounds (n=0/n=2/n=8 different effects) - 3 edge cases (empty logits error, top_p=1 no-op, signature compatibility) ## V1_004 (operator-coordinated bench discharge) Operator can dispatch companion-side bench with APR_AGENT_REPEAT_PENALTY=1.2 + APR_AGENT_REPEAT_LAST_N=64 once companion-side env-var plumbing (Phase 3 of #1843) lands. Bench script + analyzer already pass through `gen_config` to `run_qwen3_moe_generate` — this PR makes that pass-through actually apply repetition penalty. ## Stacked on #1842 (qwen3-moe-sampling-v1 implementation; the `sample_from_logits` helper this PR extends). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…(all 4 falsifiers discharged) (#1842) * test(distill): fixture-driven integration tests for ShardBatchSource (F-DISTILL-SHARD-BATCH-001/002) Closes the cross-component contract gap that the Blackwell cascade post-mortem (lesson #2) identified: cache machinery is silent on divergences between producer and consumer, until live dispatch surfaces the failure. Same risk class for ShardBatchSource: its wrap-around / cursor / chunk semantics need fixture-driven verification. Adds two tests gated on `shard-batch-source` feature: F-DISTILL-SHARD-BATCH-001 — happy path Writes a tiny .bin shard with [0, 1, ..., 4095] tokens, opens via ShardBatchSource::from_dir, asserts: - batch shape (4 rows × 16 tokens) - all returned tokens lie in [0, 4096) (fixture range) - labels in same range Catches: any cursor-off-by-one or layout swap that produces garbage outside the fixture range. F-DISTILL-SHARD-BATCH-002 — wrap-around Writes only 128 tokens (enough for ~1.88 batches at bs=4, seq=16), consumes 5 batches in a row. Asserts no error — wrap_around=true is the default for ShardBatchSource. Catches: regression where the iterator returns None on exhaustion despite the constructor setting wrap_around. Test plan: - [x] 63 distill lib tests pass (was 61; 2 new) - [x] `cargo test --features shard-batch-source` clean These two tests would have caught most ShardBatchSource bugs at PR-time instead of at gx10-dispatch-time, where each failure costs 5-15min. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat: qwen3-moe-sampling-v1 implementation — temperature/top_k/top_p Implements the qwen3-moe-sampling-v1 contract (#1837; v1.0.0 → v1.1.0). Discharges ALL 4 falsifiers via in-session work + empirical validation against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M. ## Empirical results 4-test battery (19.28s total wall): - V1_001 (greedy determinism): 3 runs returned identical [9707, 198, 40, 614, 264, 3405] ("I have a question") - V1_002 (seeded RNG reproducibility): 2 runs seed=42 temp=0.7 returned identical [9707, 198, 40, 1079, 264, 220] - V1_003 (seed divergence): seed=42 vs seed=43 produced totally different generated tokens - V1_004 (top_k=1 forces greedy regardless of temperature): byte-identical to pure greedy output ## Implementation `crates/aprender-serve/src/infer/qwen3_moe_generate.rs`: - New private `sample_from_logits(logits, config, rng) -> Result<u32>` - Pipeline mirrors the dense `Self::sample_advanced` (in `gguf/inference/fails.rs:100`): temperature scale → top_k filter → top_p filter → multinomial draw - Greedy fallback when `temperature == 0` OR `top_k == 1` - Uses `rand::rngs::StdRng` (ChaCha12; seedable from u64) for reproducibility. NOT `rand::thread_rng()` like the dense path (intentional — V1_002 requires deterministic re-runs from same seed) - Decode loop seeds the RNG from `QuantizedGenerateConfig.seed` New test: `crates/aprender-serve/tests/qwen3_moe_sampling_v1.rs` — 4 falsifier tests, env-gated on QWEN3_MOE_GGUF_PATH (mirrors the existing qwen3_moe_serve_dispatch_v1 + moe_kv_cache_equivalence tests). ## NOT in scope - Repetition penalty (separate contract qwen3-moe-repetition-penalty-v1; still pending operator authorization) - Mirostat / logit bias / streaming (separate concerns) ## Companion-side downstream The V1_004 bench (paiml/claude-code-parity-apr Phase 6) is hitting 900s per-turn timeout because the 30B-MoE is verbose under greedy decoding. With sampling shipped, operator can dispatch a follow-up bench with `temperature=0.3` (or similar) which may concentrate probability mass on action tokens and reduce rambling. The bench script + analyzer already pass temperature through `gen_config` to `run_qwen3_moe_generate` — this PR makes that pass-through actually take effect. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(qwen3-moe-sampling-v1): add CI-runnable unit tests (no GGUF needed) Complements the env-gated integration tests in `crates/aprender-serve/tests/qwen3_moe_sampling_v1.rs` (which require a real Qwen3-MoE GGUF) with pure-Rust unit tests against synthetic logits arrays. These run unconditionally in CI and pin all 4 falsifiers as runnable gates without requiring `QWEN3_MOE_GGUF_PATH`. 7 tests total in `sample_from_logits_tests`: - v1_001_temperature_zero_is_argmax_deterministic — greedy fallback via temperature == 0 - v1_001_top_k_one_is_argmax_deterministic — greedy fallback via top_k == 1 (independent path) - v1_002_seeded_rng_is_reproducible — seed=42 produces same token across 5 invocations - v1_003_different_seeds_diverge — 32 seeds produce ≥ 3 distinct tokens (statistical bound) - v1_004_top_k_one_equals_pure_greedy — top_k=1 with high temp byte-identical to pure greedy - empty_logits_returns_error — edge case, no panic - top_p_one_is_no_op — edge case, top_p=1.0 equivalent to top_p sentinel-off path Empirical: ``` cargo test -p aprender-serve --lib sample_from_logits_tests --features cuda test result: ok. 7 passed; 0 failed; 0 ignored; 0 measured; 16890 filtered out; finished in 0.00s ``` CI gates the sampling invariants at every PR going forward. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat: qwen3-moe-repetition-penalty-v1 implementation (V1_001-V1_003 discharged; supersedes #1843) (#1844) * spec: qwen3-moe-repetition-penalty-v1 — third 3-knob follow-up to M32d Registers a new provable contract for repetition penalty (`repeat_penalty` + `repeat_last_n`) on the qwen3_moe inference path. Completes the 3-knob toolkit alongside: - qwen3-moe-sampling-v1 (#1837 contract + #1842 impl) — temperature/top_k/top_p - qwen3-moe-streaming-sse-v1 (#1835) — per-token SSE streaming - qwen3-moe-repetition-penalty-v1 (this PR) — repeat_penalty / repeat_last_n ## Motivation `QuantizedGenerateConfig` has `repeat_penalty` (f32) + `repeat_last_n` (usize) fields. The dense path's `sample_advanced` applies them. The qwen3_moe path's new `sample_from_logits` (added by #1842) does NOT. Empirical observation from M287 evidence: Qwen3-Coder-30B-A3B is generating REPEATED restatements in its turn-1 output (same Rust snippet 3× in fixture leetcode__01-two-sum). Repetition penalty (typically 1.1-1.3) would down-weight recently-generated tokens, breaking the textual loop and forcing the model to either commit to a tool call or change tactics. ## Falsification gates - V1_001: `repeat_penalty == 1.0` is a no-op (backwards compat) - V1_002: `repeat_penalty > 1.0` down-weights repeated tokens - V1_003: `repeat_last_n` bounds the window correctly - V1_004: companion-side bench with penalty produces a measurably different outcome distribution than the greedy baseline ## Implementation phases (engineer playbook) - Phase 1 (~1hr): extend `sample_from_logits` signature with `recent_tokens: &[u32]` parameter; apply penalty as Step 1 - Phase 2 (~30min): plumb `&tokens` through decode loop - Phase 3 (~2hr): unit tests + companion-side bench env-var plumbing Total ~3-4 hours; operator-actionable any time post-M32d-merge. ## NOT in scope - Mirostat / DRY / other penalty schemes - Per-token logit biases - Dynamic per-position penalty - Companion-side bench env-var plumbing (separate companion PR; this contract is aprender-side only) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * feat: qwen3-moe-repetition-penalty-v1 implementation (V1_001-V1_003 discharged) Implements the qwen3-moe-repetition-penalty-v1 contract. Bumps the contract v1.0.0 → v1.1.0 with status_history. Supersedes #1843 (which was contract-only). ## Implementation `crates/aprender-serve/src/infer/qwen3_moe_generate.rs`: - `sample_from_logits` signature extended with `recent_tokens: &[u32]` parameter - Repetition penalty applied as **Step 1** (BEFORE temperature scaling) - Mirrors Candle's `apply_repeat_penalty` semantics (PMAT-383/384, dense path's `sample_advanced` in `gguf/inference/fails.rs:100`): - For positive logits: `logits[idx] /= repeat_penalty` - For negative logits: `logits[idx] *= repeat_penalty` - Window: last `repeat_last_n` tokens from `recent_tokens` - No-op when `repeat_penalty == 1.0` OR `repeat_last_n == 0` - Decode loop passes `&tokens` slice (cheap borrow; no allocation per token) ## Test results 12/12 tests pass in `cargo test sample_from_logits_tests`: - 4 original sampling tests (V1_001 greedy fallback, V1_002 seeded RNG, V1_003 seed divergence, V1_004 top_k=1 forces greedy) - 5 NEW rep-penalty tests: - V1_001a: repeat_penalty=1.0 no-op - V1_001b: repeat_last_n=0 no-op - V1_002a: positive-logit branch (division) - V1_002b: negative-logit branch (multiplication; Candle convention) - V1_003: repeat_last_n window bounds (n=0/n=2/n=8 different effects) - 3 edge cases (empty logits error, top_p=1 no-op, signature compatibility) ## V1_004 (operator-coordinated bench discharge) Operator can dispatch companion-side bench with APR_AGENT_REPEAT_PENALTY=1.2 + APR_AGENT_REPEAT_LAST_N=64 once companion-side env-var plumbing (Phase 3 of #1843) lands. Bench script + analyzer already pass through `gen_config` to `run_qwen3_moe_generate` — this PR makes that pass-through actually apply repetition penalty. ## Stacked on #1842 (qwen3-moe-sampling-v1 implementation; the `sample_from_logits` helper this PR extends). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…apr code env vars (#1846) * feat: 3-knob HTTP wire-up — operator-actionable sampling/penalty via apr code Closes the gap between the 3-knob implementation (sampling #1842 + rep-penalty #1844) and the HTTP chat-completions interface. Without this PR, the new QuantizedGenerateConfig fields (top_k, top_p, repeat_penalty, repeat_last_n, seed) were silently hardcoded to defaults in `try_qwen3_moe_backend` — the impl was on main but unreachable from the HTTP path. ## Changes ### `crates/aprender-serve/src/api/mod_create_demo.rs` `ChatCompletionRequest` gains 5 optional fields (aprender extensions to the OpenAI schema): - `top_k: Option<usize>` — qwen3-moe-sampling-v1 V1_001 knob - `repeat_penalty: Option<f32>` — qwen3-moe-repetition-penalty-v1 - `repeat_last_n: Option<usize>` — penalty window - `seed: Option<u64>` — qwen3-moe-sampling-v1 V1_002 reproducibility All `#[serde(default)]` so existing clients are unaffected. ### `crates/aprender-serve/src/api/cuda_chat_backend.rs` `try_qwen3_moe_backend` thread all 5 new fields from the HTTP request into `QuantizedGenerateConfig`. When unset, falls back to the QuantizedGenerateConfig::default() values (greedy decoding). ### `crates/aprender-orchestrate/src/agent/driver/apr_serve.rs` `AprServeDriver::build_openai_body` reads 6 env vars and includes them in the HTTP request body when set: - `APR_AGENT_TEMPERATURE` — overrides CompletionRequest.temperature - `APR_AGENT_TOP_K` - `APR_AGENT_TOP_P` - `APR_AGENT_REPEAT_PENALTY` - `APR_AGENT_REPEAT_LAST_N` - `APR_AGENT_SEED` Operator can now dispatch the CCPA Phase 6 bench with sampling/penalty: ```bash APR_AGENT_TEMPERATURE=0.3 \ APR_AGENT_TOP_K=50 \ APR_AGENT_TOP_P=0.95 \ APR_AGENT_REPEAT_PENALTY=1.2 \ APR_AGENT_REPEAT_LAST_N=64 \ bash scripts/phase-6-bench.sh ``` (Per paiml/claude-code-parity-apr M288 v1004-3knob-dispatch-recipe.md.) ## What this is NOT - NOT new contract gates — V1_001..V1_004 of qwen3-moe-sampling-v1 + qwen3-moe-repetition-penalty-v1 are already discharged. This is PURE PLUMBING. - NOT companion-side env-var plumbing — apr code in this PR already reads env vars; companion bench script just needs to set them (mechanical, no aprender change). ## Cross-references - aprender#1832 (M32d, MERGED) - aprender#1842 (sampling impl, MERGED via squash that also absorbed #1844) - aprender#1844 (rep-penalty impl, MERGED) - aprender#1835 (streaming SSE contract, OPEN) - paiml/claude-code-parity-apr M288 (3-knob dispatch recipe) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(aprender-serve): add 4 new fields to 22 ChatCompletionRequest test sites PR #1846 added top_k/repeat_penalty/repeat_last_n/seed to ChatCompletionRequest but missed updating the existing test construction sites. CI failed with 11 E0063 errors ("missing fields ... in initializer of api::ChatCompletionRequest"). Surgical fix: insert the 4 new fields (all set to None) at every test-site struct literal. Behavior preserved — these fields are Option<T> defaulting to None matches existing behavior. Sites patched: - src/api/tests/ (10 files, 11 sites) — built by `cargo test --lib` - tests/api_coverage.rs, tests/api_deep_coverage.rs, tests/property_api.rs (3 files, 15 sites) — built by integration test path Closes the workspace-test failure on b2333b8. --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…4 follow-up) (#1849) Adds 3 concrete few-shot <tool_call> examples to CODE_SYSTEM_PROMPT (the 7B+ branch used for Qwen3-Coder-30B-A3B). Empirical context: paiml/claude-code-parity-apr M287 evidence showed the 30B model emits Markdown ```rust``` code blocks (in turn-1 text) instead of <tool_call> JSON. The parser at realizar.rs:144-149 accepts <tool_call> + ```json``` but NOT ```rust``` — so the model's turns are silently text-only, bench hits per-turn timeout after 4 turns of rambling. The 3-knob toolkit (sampling/penalty/streaming) tunes probability distributions but can't change format adherence. THIS PR addresses the format adherence directly by: 1. Showing the model 3 concrete <tool_call> examples in-context (file_read, file_edit, shell) 2. Adding an explicit "ALWAYS gets a tool-call response" rule 3. Adding "Be concise — DO NOT narrate" guideline 4. Adding "DO NOT use Markdown ```rust``` code blocks" anti-rule ## Why few-shot examples work Large language models are pattern-matchers. Showing them the exact format they should emit (rather than just describing it) drastically improves format adherence on coder-finetuned models. The 30B-Coder has strong "Markdown code block" priors from training; explicit counter-examples + the negative rule pull it toward the <tool_call> format. ## Empirical context M287 (Phase 6 bench, fixtures 1-10 + greedy decoding): uniform driver_error / turns_before_error=4 pattern. Every turn was text with Rust code in Markdown, no tool calls extracted. Operator playbook calls for sampling/penalty sub-bench (#1842 + #1844 + #1846 shipped). This PR is COMPLEMENTARY: prompt fix + sampling together have the best chance of breaking the rambling pattern. ## Companion-side dispatch (post-merge) After this PR + rebuild, operator can run a NEW sub-bench (call it Sub-bench E in M288 nomenclature) that combines: - 3-knob sampling (temperature=0.3, top_k=50, top_p=0.95) - Repetition penalty (repeat_penalty=1.2, repeat_last_n=64) - THIS PR's few-shot prompt (active by default; no env var needed) If Sub-bench E shows ANY fixture pass, V1_004 discharges. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

noahgift and others added 2 commits May 20, 2026 11:52

noahgift mentioned this pull request May 20, 2026

spec: qwen3-moe-repetition-penalty-v1 — third 3-knob follow-up to M32d #1843

Closed

2 tasks

noahgift merged commit b0a5220 into feat/m32d-followup-sampling-impl May 20, 2026
1 check passed

noahgift deleted the feat/m32d-followup-repetition-penalty-impl branch May 20, 2026 09:57

noahgift mentioned this pull request May 20, 2026

feat: 3-knob HTTP wire-up — operator-actionable sampling/penalty via apr code env vars #1846

Merged

3 tasks

This was referenced May 20, 2026

feat(code-prompt): few-shot <tool_call> examples + anti-rambling guideline (V1_004 follow-up) #1849

Merged

fix(try_qwen3_moe_backend): populate stop_tokens with EOS — fixes M287 runaway 'Human:' generation #1852

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: qwen3-moe-repetition-penalty-v1 implementation (V1_001-V1_003 discharged; supersedes #1843)#1844

feat: qwen3-moe-repetition-penalty-v1 implementation (V1_001-V1_003 discharged; supersedes #1843)#1844
noahgift merged 2 commits into
feat/m32d-followup-sampling-implfrom
feat/m32d-followup-repetition-penalty-impl

noahgift commented May 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

noahgift commented May 20, 2026

Summary

Empirical results

Rep-penalty tests

Implementation

V1_004 (operator-coordinated)

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant