feat: qwen3-moe-sampling-v1 implementation — temperature/top_k/top_p (all 4 falsifiers discharged)#1842

Merged: noahgift merged 7 commits into `main` from `feat/m32d-followup-sampling-impl` on May 20, 2026

@noahgift

Summary

Implements the qwen3-moe-sampling-v1 contract (registered in #1837; bumps v1.0.0 → v1.1.0). Discharges ALL 4 falsifiers via in-session work + empirical validation against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.

Empirical results

4-test battery (`crates/aprender-serve/tests/qwen3_moe_sampling_v1.rs`, 19.28s total wall, all PASS):

| Gate | Test | Tokens (excerpt) |
|---|---|---|
| V1_001 (greedy determinism) | 3 runs | returned same `[9707, 198, 40, 614, 264, 3405]` ("I have a question") |
| V1_002 (seeded RNG reproducibility) | 2 runs, seed=42, temp=0.7 | returned identical `[9707, 198, 40, 1079, 264, 220]` |
| V1_003 (seed divergence) | seed=42 vs seed=43 | totally different generated tokens |
| V1_004 (top_k=1 forces greedy) | high-temp + top_k=1 | byte-identical to pure greedy |

Implementation

`crates/aprender-serve/src/infer/qwen3_moe_generate.rs`:

  • New `sample_from_logits(logits, config, rng) -> Result` helper
  • Pipeline mirrors the dense `sample_advanced` (in `gguf/inference/fails.rs:100`): temperature scale → top_k filter → top_p filter → multinomial draw
  • Greedy fallback when `temperature == 0` OR `top_k == 1`
  • Uses `rand::rngs::StdRng` (ChaCha12; seedable from `u64`) for reproducibility — NOT `rand::thread_rng()` like the dense path. V1_002 requires deterministic re-runs from same seed.
  • Decode loop seeds the RNG from `QuantizedGenerateConfig.seed`
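
The pipeline in the list above can be sketched as follows. This is a minimal illustration, not the code in `qwen3_moe_generate.rs`: `SamplingConfig` is a hypothetical stand-in for the relevant `QuantizedGenerateConfig` fields, the `0` and `1.0` "filter off" sentinels are assumptions, and a small SplitMix64 generator replaces `rand::rngs::StdRng` only so the sketch has no dependencies.

```rust
/// Hypothetical stand-in for the sampling fields of `QuantizedGenerateConfig`.
pub struct SamplingConfig {
    pub temperature: f32,
    pub top_k: usize, // assumed sentinel: 0 disables the filter
    pub top_p: f32,   // assumed sentinel: 1.0 disables the filter
}

/// SplitMix64, a tiny seedable PRNG used only to keep this sketch
/// dependency-free. The PR itself uses `rand::rngs::StdRng`.
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform value in [0, 1) built from the top 24 bits.
    fn next_f32(&mut self) -> f32 {
        (self.next_u64() >> 40) as f32 / (1u64 << 24) as f32
    }
}

pub fn sample_from_logits(
    logits: &[f32],
    cfg: &SamplingConfig,
    rng: &mut SplitMix64,
) -> Result<u32, String> {
    if logits.is_empty() {
        return Err("empty logits".to_string());
    }
    // Greedy fallback: temperature == 0 OR top_k == 1 returns the argmax.
    if cfg.temperature == 0.0 || cfg.top_k == 1 {
        let mut best = 0;
        for (i, &l) in logits.iter().enumerate() {
            if l > logits[best] {
                best = i;
            }
        }
        return Ok(best as u32);
    }
    // Step 1: temperature scale, then sort candidates highest-first.
    let mut cand: Vec<(usize, f32)> =
        logits.iter().map(|&l| l / cfg.temperature).enumerate().collect();
    cand.sort_by(|a, b| b.1.total_cmp(&a.1));
    // Step 2: top_k filter keeps the k highest-scoring candidates.
    if cfg.top_k > 0 && cfg.top_k < cand.len() {
        cand.truncate(cfg.top_k);
    }
    // Softmax over the survivors (max subtracted for numerical stability).
    let max = cand[0].1;
    let mut probs: Vec<f32> = cand.iter().map(|&(_, l)| (l - max).exp()).collect();
    let sum: f32 = probs.iter().sum();
    probs.iter_mut().for_each(|p| *p /= sum);
    // Step 3: top_p filter keeps the smallest prefix with mass >= top_p.
    if cfg.top_p < 1.0 {
        let mut cum = 0.0_f32;
        let mut keep = probs.len();
        for (i, p) in probs.iter().enumerate() {
            cum += *p;
            if cum >= cfg.top_p {
                keep = i + 1;
                break;
            }
        }
        probs.truncate(keep);
    }
    // Step 4: multinomial draw over the (unnormalized) remaining mass.
    let total: f32 = probs.iter().sum();
    let mut r = rng.next_f32() * total;
    for (i, p) in probs.iter().enumerate() {
        if r < *p {
            return Ok(cand[i].0 as u32);
        }
        r -= *p;
    }
    Ok(cand[probs.len() - 1].0 as u32)
}
```

Because the generator is constructed from an explicit seed, two calls that start from `SplitMix64(42)` see the same random stream, which is the property V1_002 relies on; a thread-local generator would not give that guarantee.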

Companion-side downstream

The V1_004 bench (paiml/claude-code-parity-apr Phase 6) is currently hitting the 900s per-turn timeout because the 30B-MoE is verbose under greedy decoding. With sampling shipped, the operator can dispatch a follow-up bench with `temperature=0.3` (or similar), which may concentrate probability mass on action tokens and reduce rambling.

The bench script + analyzer already pass temperature through `gen_config` to `run_qwen3_moe_generate` — this PR makes that pass-through actually take effect.

NOT in scope

  • Repetition penalty (separate contract `qwen3-moe-repetition-penalty-v1`)
  • Mirostat / logit bias / streaming (separate concerns)

Test plan

  • `cargo check -p aprender-serve --lib --features cuda` — clean
  • `cargo check --test qwen3_moe_sampling_v1 -p aprender-serve --features cuda` — clean
  • All 4 falsifier tests pass against real Qwen3-Coder-30B-A3B (live, 19.28s wall)
  • CI (sovereign-ci full workflow)

🤖 Generated with Claude Code

noahgift and others added 2 commits May 20, 2026 10:57
…(F-DISTILL-SHARD-BATCH-001/002)

Closes the cross-component contract gap that the Blackwell cascade
post-mortem (lesson #2) identified: cache machinery is silent on
divergences between producer and consumer, until live dispatch
surfaces the failure. Same risk class for ShardBatchSource: its
wrap-around / cursor / chunk semantics need fixture-driven verification.

Adds two tests gated on `shard-batch-source` feature:

  F-DISTILL-SHARD-BATCH-001 — happy path
    Writes a tiny .bin shard with [0, 1, ..., 4095] tokens, opens via
    ShardBatchSource::from_dir, asserts:
      - batch shape (4 rows × 16 tokens)
      - all returned tokens lie in [0, 4096) (fixture range)
      - labels in same range
    Catches: any cursor-off-by-one or layout swap that produces
    garbage outside the fixture range.

  F-DISTILL-SHARD-BATCH-002 — wrap-around
    Writes only 128 tokens (enough for ~1.88 batches at bs=4, seq=16),
    consumes 5 batches in a row. Asserts no error — wrap_around=true is
    the default for ShardBatchSource. Catches: regression where the
    iterator returns None on exhaustion despite the constructor
    setting wrap_around.
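
The wrap-around behavior that F-DISTILL-SHARD-BATCH-002 pins can be sketched with an in-memory cursor. `WrapCursor` and `next_batch` are hypothetical names; the real `ShardBatchSource` reads `.bin` shards and its API is not shown here. The sketch also assumes labels are the inputs shifted by one, so each row consumes `seq_len + 1` tokens; under that assumption 128 tokens cover 128 / (4 × 17) ≈ 1.88 batches, which is consistent with the figure quoted above.

```rust
/// Hypothetical in-memory stand-in for `ShardBatchSource`.
pub struct WrapCursor {
    tokens: Vec<u32>,
    cursor: usize,
    wrap_around: bool,
}

impl WrapCursor {
    pub fn new(tokens: Vec<u32>, wrap_around: bool) -> Self {
        Self { tokens, cursor: 0, wrap_around }
    }

    /// Returns `(inputs, labels)`, each `batch_size` rows of `seq_len` tokens.
    /// Assumption: labels are inputs shifted by one position, so every row
    /// consumes `seq_len + 1` tokens from the shard.
    pub fn next_batch(
        &mut self,
        batch_size: usize,
        seq_len: usize,
    ) -> Option<(Vec<Vec<u32>>, Vec<Vec<u32>>)> {
        let row = seq_len + 1;
        if self.tokens.len() < row {
            return None; // shard too small to produce even one row
        }
        let mut inputs = Vec::with_capacity(batch_size);
        let mut labels = Vec::with_capacity(batch_size);
        for _ in 0..batch_size {
            if self.cursor + row > self.tokens.len() {
                if !self.wrap_around {
                    return None; // exhausted and wrapping is disabled
                }
                self.cursor = 0; // wrap back to the start of the shard
            }
            let chunk = &self.tokens[self.cursor..self.cursor + row];
            inputs.push(chunk[..seq_len].to_vec());
            labels.push(chunk[1..].to_vec());
            self.cursor += row;
        }
        Some((inputs, labels))
    }
}
```

With `wrap_around` set, five consecutive batches from a 128-token buffer all succeed; with it cleared, the second batch returns `None`, which is the regression the test is meant to catch.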

Test plan:
- [x] 63 distill lib tests pass (was 61; 2 new)
- [x] `cargo test --features shard-batch-source` clean

These two tests would have caught most ShardBatchSource bugs at PR-time
instead of at gx10-dispatch-time, where each failure costs 5-15min.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements the qwen3-moe-sampling-v1 contract (#1837;
v1.0.0 → v1.1.0). Discharges ALL 4 falsifiers via in-session work +
empirical validation against Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.

## Empirical results

4-test battery (19.28s total wall):

- V1_001 (greedy determinism): 3 runs returned identical
  [9707, 198, 40, 614, 264, 3405] ("I have a question")
- V1_002 (seeded RNG reproducibility): 2 runs seed=42 temp=0.7
  returned identical [9707, 198, 40, 1079, 264, 220]
- V1_003 (seed divergence): seed=42 vs seed=43 produced
  totally different generated tokens
- V1_004 (top_k=1 forces greedy regardless of temperature):
  byte-identical to pure greedy output

## Implementation

`crates/aprender-serve/src/infer/qwen3_moe_generate.rs`:
- New private `sample_from_logits(logits, config, rng) -> Result<u32>`
- Pipeline mirrors the dense `Self::sample_advanced` (in
  `gguf/inference/fails.rs:100`): temperature scale → top_k filter
  → top_p filter → multinomial draw
- Greedy fallback when `temperature == 0` OR `top_k == 1`
- Uses `rand::rngs::StdRng` (ChaCha12; seedable from u64) for
  reproducibility. NOT `rand::thread_rng()` like the dense path
  (intentional — V1_002 requires deterministic re-runs from same seed)
- Decode loop seeds the RNG from `QuantizedGenerateConfig.seed`

New test: `crates/aprender-serve/tests/qwen3_moe_sampling_v1.rs` —
4 falsifier tests, env-gated on QWEN3_MOE_GGUF_PATH (mirrors the
existing qwen3_moe_serve_dispatch_v1 + moe_kv_cache_equivalence tests).
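
A minimal sketch of the env-gating pattern, assuming the tests skip (rather than fail) when `QWEN3_MOE_GGUF_PATH` is unset. `model_path_from` and `run_if_model_present` are hypothetical helpers for illustration; the real tests load the GGUF and run the decode loop.

```rust
use std::path::PathBuf;

/// Resolves the model path from an optional env value. Returns `None` when
/// the variable is unset or blank so the caller can skip instead of fail.
pub fn model_path_from(value: Option<String>) -> Option<PathBuf> {
    value.filter(|v| !v.trim().is_empty()).map(PathBuf::from)
}

/// Shape of an env-gated test body (hypothetical). Returns `false` when the
/// test was skipped because no model path was provided.
pub fn run_if_model_present() -> bool {
    let Some(path) = model_path_from(std::env::var("QWEN3_MOE_GGUF_PATH").ok()) else {
        eprintln!("QWEN3_MOE_GGUF_PATH not set; skipping");
        return false;
    };
    eprintln!("would load {}", path.display());
    true
}
```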

## NOT in scope

- Repetition penalty (separate contract qwen3-moe-repetition-penalty-v1;
  still pending operator authorization)
- Mirostat / logit bias / streaming (separate concerns)

## Companion-side downstream

The V1_004 bench (paiml/claude-code-parity-apr Phase 6) is hitting
900s per-turn timeout because the 30B-MoE is verbose under greedy
decoding. With sampling shipped, operator can dispatch a follow-up
bench with `temperature=0.3` (or similar) which may concentrate
probability mass on action tokens and reduce rambling. The bench
script + analyzer already pass temperature through `gen_config` to
`run_qwen3_moe_generate` — this PR makes that pass-through actually
take effect.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 20, 2026 09:22
noahgift and others added 2 commits May 20, 2026 11:22
Complements the env-gated integration tests in
`crates/aprender-serve/tests/qwen3_moe_sampling_v1.rs` (which require
a real Qwen3-MoE GGUF) with pure-Rust unit tests against synthetic
logits arrays. These run unconditionally in CI and pin all 4
falsifiers as runnable gates without requiring `QWEN3_MOE_GGUF_PATH`.

7 tests total in `sample_from_logits_tests`:

- v1_001_temperature_zero_is_argmax_deterministic — greedy fallback
  via temperature == 0
- v1_001_top_k_one_is_argmax_deterministic — greedy fallback via
  top_k == 1 (independent path)
- v1_002_seeded_rng_is_reproducible — seed=42 produces same token
  across 5 invocations
- v1_003_different_seeds_diverge — 32 seeds produce ≥ 3 distinct
  tokens (statistical bound)
- v1_004_top_k_one_equals_pure_greedy — top_k=1 with high temp
  byte-identical to pure greedy
- empty_logits_returns_error — edge case, no panic
- top_p_one_is_no_op — edge case, top_p=1.0 equivalent to top_p
  sentinel-off path

Empirical:
```
cargo test -p aprender-serve --lib sample_from_logits_tests --features cuda
test result: ok. 7 passed; 0 failed; 0 ignored; 0 measured;
16890 filtered out; finished in 0.00s
```

CI gates the sampling invariants at every PR going forward.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
noahgift and others added 3 commits May 20, 2026 11:57
…ischarged; supersedes #1843) (#1844)

* spec: qwen3-moe-repetition-penalty-v1 — third 3-knob follow-up to M32d

Registers a new provable contract for repetition penalty
(`repeat_penalty` + `repeat_last_n`) on the qwen3_moe inference path.
Completes the 3-knob toolkit alongside:

- qwen3-moe-sampling-v1 (#1837 contract + #1842 impl) — temperature/top_k/top_p
- qwen3-moe-streaming-sse-v1 (#1835) — per-token SSE streaming
- qwen3-moe-repetition-penalty-v1 (this PR) — repeat_penalty / repeat_last_n

## Motivation

`QuantizedGenerateConfig` has `repeat_penalty` (f32) + `repeat_last_n`
(usize) fields. The dense path's `sample_advanced` applies them. The
qwen3_moe path's new `sample_from_logits` (added by #1842) does NOT.

Empirical observation from M287 evidence: Qwen3-Coder-30B-A3B is
generating REPEATED restatements in its turn-1 output (same Rust
snippet 3× in fixture leetcode__01-two-sum). Repetition penalty
(typically 1.1-1.3) would down-weight recently-generated tokens,
breaking the textual loop and forcing the model to either commit to
a tool call or change tactics.

## Falsification gates

- V1_001: `repeat_penalty == 1.0` is a no-op (backwards compat)
- V1_002: `repeat_penalty > 1.0` down-weights repeated tokens
- V1_003: `repeat_last_n` bounds the window correctly
- V1_004: companion-side bench with penalty produces a measurably
  different outcome distribution than the greedy baseline

## Implementation phases (engineer playbook)

- Phase 1 (~1hr): extend `sample_from_logits` signature with
  `recent_tokens: &[u32]` parameter; apply penalty as Step 1
- Phase 2 (~30min): plumb `&tokens` through decode loop
- Phase 3 (~2hr): unit tests + companion-side bench env-var plumbing

Total ~3-4 hours; operator-actionable any time post-M32d-merge.

## NOT in scope

- Mirostat / DRY / other penalty schemes
- Per-token logit biases
- Dynamic per-position penalty
- Companion-side bench env-var plumbing (separate companion PR;
  this contract is aprender-side only)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* feat: qwen3-moe-repetition-penalty-v1 implementation (V1_001-V1_003 discharged)

Implements the qwen3-moe-repetition-penalty-v1 contract. Bumps the
contract v1.0.0 → v1.1.0 with status_history. Supersedes #1843
(which was contract-only).

## Implementation

`crates/aprender-serve/src/infer/qwen3_moe_generate.rs`:

- `sample_from_logits` signature extended with `recent_tokens: &[u32]`
  parameter
- Repetition penalty applied as **Step 1** (BEFORE temperature scaling)
- Mirrors Candle's `apply_repeat_penalty` semantics (PMAT-383/384,
  dense path's `sample_advanced` in `gguf/inference/fails.rs:100`):
  - For positive logits: `logits[idx] /= repeat_penalty`
  - For negative logits: `logits[idx] *= repeat_penalty`
- Window: last `repeat_last_n` tokens from `recent_tokens`
- No-op when `repeat_penalty == 1.0` OR `repeat_last_n == 0`
- Decode loop passes `&tokens` slice (cheap borrow; no allocation per token)
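
The penalty step described above can be sketched as a standalone function. `apply_repeat_penalty` here is an illustrative free function, not the private code in `qwen3_moe_generate.rs`; penalizing each distinct token once and ignoring out-of-vocabulary ids are assumptions of the sketch.

```rust
use std::collections::HashSet;

/// Down-weights tokens seen in the last `repeat_last_n` entries of
/// `recent_tokens`. Positive logits are divided by the penalty and negative
/// logits are multiplied by it, so both move toward "less likely".
pub fn apply_repeat_penalty(
    logits: &mut [f32],
    recent_tokens: &[u32],
    repeat_penalty: f32,
    repeat_last_n: usize,
) {
    // No-op when the penalty is neutral or the window is empty.
    if repeat_penalty == 1.0 || repeat_last_n == 0 {
        return;
    }
    // Window: the last `repeat_last_n` tokens (or all of them if fewer).
    let start = recent_tokens.len().saturating_sub(repeat_last_n);
    let mut seen = HashSet::new();
    for &tok in &recent_tokens[start..] {
        let idx = tok as usize;
        // Assumption: skip out-of-range ids and penalize each token once.
        if idx >= logits.len() || !seen.insert(idx) {
            continue;
        }
        if logits[idx] >= 0.0 {
            logits[idx] /= repeat_penalty;
        } else {
            logits[idx] *= repeat_penalty;
        }
    }
}
```

For example, with logits `[2.0, -2.0, 1.0, 4.0]`, recent tokens `[0, 1, 3]`, a penalty of `2.0`, and a window of 2, only tokens 1 and 3 are in the window: the logits become `[2.0, -4.0, 1.0, 2.0]`. Dividing a negative logit would raise its probability, which is why the sign decides the operation.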

## Test results

12/12 tests pass in `cargo test sample_from_logits_tests`:

- 4 original sampling tests (V1_001 greedy fallback, V1_002 seeded RNG,
  V1_003 seed divergence, V1_004 top_k=1 forces greedy)
- 5 NEW rep-penalty tests:
  - V1_001a: repeat_penalty=1.0 no-op
  - V1_001b: repeat_last_n=0 no-op
  - V1_002a: positive-logit branch (division)
  - V1_002b: negative-logit branch (multiplication; Candle convention)
  - V1_003: repeat_last_n window bounds (n=0/n=2/n=8 different effects)
- 3 edge cases (empty logits error, top_p=1 no-op, signature
  compatibility)

## V1_004 (operator-coordinated bench discharge)

Operator can dispatch companion-side bench with
APR_AGENT_REPEAT_PENALTY=1.2 + APR_AGENT_REPEAT_LAST_N=64 once
companion-side env-var plumbing (Phase 3 of #1843) lands. Bench
script + analyzer already pass through `gen_config` to
`run_qwen3_moe_generate` — this PR makes that pass-through actually
apply repetition penalty.

## Stacked on

#1842 (qwen3-moe-sampling-v1 implementation; the
`sample_from_logits` helper this PR extends).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 88fe858 into main May 20, 2026
10 checks passed
@noahgift noahgift deleted the feat/m32d-followup-sampling-impl branch May 20, 2026 11:07
noahgift added a commit that referenced this pull request May 20, 2026
…apr code env vars (#1846)

* feat: 3-knob HTTP wire-up — operator-actionable sampling/penalty via apr code

Closes the gap between the 3-knob implementation (sampling #1842 +
rep-penalty #1844) and the HTTP chat-completions interface. Without
this PR, the new QuantizedGenerateConfig fields (top_k, top_p,
repeat_penalty, repeat_last_n, seed) were silently hardcoded to
defaults in `try_qwen3_moe_backend` — the impl was on main but
unreachable from the HTTP path.

## Changes

### `crates/aprender-serve/src/api/mod_create_demo.rs`

`ChatCompletionRequest` gains 4 optional fields (aprender extensions
to the OpenAI schema):

- `top_k: Option<usize>` — qwen3-moe-sampling-v1 V1_001 knob
- `repeat_penalty: Option<f32>` — qwen3-moe-repetition-penalty-v1
- `repeat_last_n: Option<usize>` — penalty window
- `seed: Option<u64>` — qwen3-moe-sampling-v1 V1_002 reproducibility

All `#[serde(default)]` so existing clients are unaffected.

### `crates/aprender-serve/src/api/cuda_chat_backend.rs`

`try_qwen3_moe_backend` threads 5 fields (`top_k`, `top_p`,
`repeat_penalty`, `repeat_last_n`, `seed`) from the HTTP request into
`QuantizedGenerateConfig`. When a field is unset, it falls back to its
`QuantizedGenerateConfig::default()` value (greedy decoding).
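
The fallback rule can be sketched with plain `Option` handling. `GenConfig` and `ChatRequest` are hypothetical, trimmed-down stand-ins for `QuantizedGenerateConfig` and `ChatCompletionRequest`, and the default values below are placeholders chosen for the illustration, not the crate's actual defaults.

```rust
/// Hypothetical subset of `QuantizedGenerateConfig`.
#[derive(Debug, Clone, PartialEq)]
pub struct GenConfig {
    pub top_k: usize,
    pub top_p: f32,
    pub repeat_penalty: f32,
    pub repeat_last_n: usize,
    pub seed: u64,
}

impl Default for GenConfig {
    fn default() -> Self {
        // Placeholder defaults (assumed): top_k = 1 gives greedy decoding.
        Self { top_k: 1, top_p: 1.0, repeat_penalty: 1.0, repeat_last_n: 64, seed: 42 }
    }
}

/// Hypothetical subset of the HTTP request: every knob is optional.
#[derive(Debug, Default)]
pub struct ChatRequest {
    pub top_k: Option<usize>,
    pub top_p: Option<f32>,
    pub repeat_penalty: Option<f32>,
    pub repeat_last_n: Option<usize>,
    pub seed: Option<u64>,
}

/// Copies each field that the client set and keeps the default otherwise.
pub fn config_from_request(req: &ChatRequest) -> GenConfig {
    let d = GenConfig::default();
    GenConfig {
        top_k: req.top_k.unwrap_or(d.top_k),
        top_p: req.top_p.unwrap_or(d.top_p),
        repeat_penalty: req.repeat_penalty.unwrap_or(d.repeat_penalty),
        repeat_last_n: req.repeat_last_n.unwrap_or(d.repeat_last_n),
        seed: req.seed.unwrap_or(d.seed),
    }
}
```

A request that sets nothing therefore produces exactly the default config, which is why existing clients see no behavior change.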

### `crates/aprender-orchestrate/src/agent/driver/apr_serve.rs`

`AprServeDriver::build_openai_body` reads 6 env vars and includes them
in the HTTP request body when set:

- `APR_AGENT_TEMPERATURE` — overrides CompletionRequest.temperature
- `APR_AGENT_TOP_K`
- `APR_AGENT_TOP_P`
- `APR_AGENT_REPEAT_PENALTY`
- `APR_AGENT_REPEAT_LAST_N`
- `APR_AGENT_SEED`
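
A minimal sketch of how such env-var overrides could be collected, assuming unset or unparsable values are simply omitted from the body. `sampling_overrides` and `to_json_fields` are hypothetical helpers, not the code in `AprServeDriver::build_openai_body`; the lookup is passed in as a closure so the logic can be exercised without touching the process environment.

```rust
/// Collects `(json_key, json_number)` pairs for every env var that is set
/// and parses cleanly. Invalid or non-finite values are skipped (assumed).
pub fn sampling_overrides<F>(lookup: F) -> Vec<(&'static str, String)>
where
    F: Fn(&str) -> Option<String>,
{
    const FLOAT_VARS: [(&str, &str); 3] = [
        ("APR_AGENT_TEMPERATURE", "temperature"),
        ("APR_AGENT_TOP_P", "top_p"),
        ("APR_AGENT_REPEAT_PENALTY", "repeat_penalty"),
    ];
    const INT_VARS: [(&str, &str); 3] = [
        ("APR_AGENT_TOP_K", "top_k"),
        ("APR_AGENT_REPEAT_LAST_N", "repeat_last_n"),
        ("APR_AGENT_SEED", "seed"),
    ];
    let mut out = Vec::new();
    for (var, key) in FLOAT_VARS {
        if let Some(v) = lookup(var).and_then(|s| s.trim().parse::<f32>().ok()) {
            if v.is_finite() {
                out.push((key, v.to_string()));
            }
        }
    }
    for (var, key) in INT_VARS {
        if let Some(v) = lookup(var).and_then(|s| s.trim().parse::<u64>().ok()) {
            out.push((key, v.to_string()));
        }
    }
    out
}

/// Renders the pairs as a flat JSON object. Values are numbers only, so no
/// string escaping is needed.
pub fn to_json_fields(fields: &[(&str, String)]) -> String {
    let parts: Vec<String> = fields
        .iter()
        .map(|(k, v)| format!("\"{}\":{}", k, v))
        .collect();
    format!("{{{}}}", parts.join(","))
}
```

In a driver, the lookup would typically be `|name| std::env::var(name).ok()`.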

Operator can now dispatch the CCPA Phase 6 bench with sampling/penalty:

```bash
APR_AGENT_TEMPERATURE=0.3 \
APR_AGENT_TOP_K=50 \
APR_AGENT_TOP_P=0.95 \
APR_AGENT_REPEAT_PENALTY=1.2 \
APR_AGENT_REPEAT_LAST_N=64 \
bash scripts/phase-6-bench.sh
```

(Per paiml/claude-code-parity-apr M288 v1004-3knob-dispatch-recipe.md.)

## What this is NOT

- NOT new contract gates — V1_001..V1_004 of qwen3-moe-sampling-v1 +
  qwen3-moe-repetition-penalty-v1 are already discharged. This is
  PURE PLUMBING.
- NOT companion-side env-var plumbing — apr code in this PR already
  reads env vars; companion bench script just needs to set them
  (mechanical, no aprender change).

## Cross-references

- aprender#1832 (M32d, MERGED)
- aprender#1842 (sampling impl, MERGED via squash that also absorbed #1844)
- aprender#1844 (rep-penalty impl, MERGED)
- aprender#1835 (streaming SSE contract, OPEN)
- paiml/claude-code-parity-apr M288 (3-knob dispatch recipe)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(aprender-serve): add 4 new fields to 22 ChatCompletionRequest test sites

PR #1846 added top_k/repeat_penalty/repeat_last_n/seed to ChatCompletionRequest
but missed updating the existing test construction sites. CI failed with 11
E0063 errors ("missing fields ... in initializer of api::ChatCompletionRequest").

Surgical fix: insert the 4 new fields (all set to None) at every test-site
struct literal. Behavior is preserved: the fields are Option<T>, and setting
them to None matches the existing behavior.

Sites patched:
- src/api/tests/ (10 files, 11 sites) — built by `cargo test --lib`
- tests/api_coverage.rs, tests/api_deep_coverage.rs, tests/property_api.rs
  (3 files, 15 sites) — built by integration test path

Closes the workspace-test failure on b2333b8.

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
noahgift added a commit that referenced this pull request May 20, 2026
…4 follow-up) (#1849)

Adds 3 concrete few-shot <tool_call> examples to CODE_SYSTEM_PROMPT
(the 7B+ branch used for Qwen3-Coder-30B-A3B). Empirical context:
paiml/claude-code-parity-apr M287 evidence showed the 30B model
emits Markdown ```rust``` code blocks (in turn-1 text) instead of
<tool_call> JSON. The parser at realizar.rs:144-149 accepts
<tool_call> + ```json``` but NOT ```rust``` — so the model's
turns are silently text-only, and the bench hits the per-turn timeout after 4
turns of rambling.
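
The failure mode can be illustrated with a simplified extractor. This is a sketch of the accept/reject behavior described above, not the parser in `realizar.rs`; `extract_tool_call` and `between` are hypothetical names, and the fence marker is built from escapes only to avoid literal triple backticks inside the example.

```rust
/// Three backticks, written with escapes so this block stays well-formed.
const FENCE: &str = "\x60\x60\x60";

/// Returns the text between the first `open` marker and the next `close`.
fn between<'a>(text: &'a str, open: &str, close: &str) -> Option<&'a str> {
    let start = text.find(open)? + open.len();
    let end = text[start..].find(close)? + start;
    Some(&text[start..end])
}

/// Accepts a `<tool_call>` element or a json-tagged fence. Anything else,
/// including a rust-tagged fence, yields `None` and the turn is text-only.
pub fn extract_tool_call(text: &str) -> Option<&str> {
    if let Some(body) = between(text, "<tool_call>", "</tool_call>") {
        return Some(body.trim());
    }
    let json_open = format!("{}json", FENCE);
    between(text, &json_open, FENCE).map(str::trim)
}
```

A reply that wraps its answer in a rust-tagged fence matches neither marker, so no tool call is extracted even when the code inside is correct.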

The 3-knob toolkit (sampling/penalty/streaming) tunes probability
distributions but can't change format adherence. THIS PR addresses
the format adherence directly by:

1. Showing the model 3 concrete <tool_call> examples in-context
   (file_read, file_edit, shell)
2. Adding an explicit "ALWAYS gets a tool-call response" rule
3. Adding "Be concise — DO NOT narrate" guideline
4. Adding "DO NOT use Markdown ```rust``` code blocks" anti-rule

## Why few-shot examples work

Large language models are pattern-matchers. Showing them the exact
format they should emit (rather than just describing it) tends to
improve format adherence on coder-finetuned models. The 30B-Coder
has strong "Markdown code block" priors from training; explicit
counter-examples + the negative rule pull it toward the <tool_call>
format.

## Empirical context

M287 (Phase 6 bench, fixtures 1-10 + greedy decoding): uniform
driver_error / turns_before_error=4 pattern. Every turn was text
with Rust code in Markdown, no tool calls extracted. Operator
playbook calls for sampling/penalty sub-bench (#1842 + #1844 + #1846
shipped). This PR is COMPLEMENTARY: prompt fix + sampling together
have the best chance of breaking the rambling pattern.

## Companion-side dispatch (post-merge)

After this PR + rebuild, operator can run a NEW sub-bench (call it
Sub-bench E in M288 nomenclature) that combines:
- 3-knob sampling (temperature=0.3, top_k=50, top_p=0.95)
- Repetition penalty (repeat_penalty=1.2, repeat_last_n=64)
- THIS PR's few-shot prompt (active by default; no env var needed)

If Sub-bench E shows ANY fixture pass, V1_004 discharges.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>