feat: 3-knob HTTP wire-up — operator-actionable sampling/penalty via apr code env vars#1846

Merged
noahgift merged 5 commits into main from feat/3knob-http-wireup on May 20, 2026

@noahgift

Summary

Closes the gap between the 3-knob implementation (sampling #1842 + rep-penalty #1844) and the HTTP chat-completions interface. Without this PR, the new `QuantizedGenerateConfig` fields (top_k, top_p, repeat_penalty, repeat_last_n, seed) were silently hardcoded to defaults in `try_qwen3_moe_backend` — the impl was on main but unreachable from the HTTP path.

Changes

`ChatCompletionRequest` extension (`mod_create_demo.rs`)

4 new optional fields (aprender extensions to the OpenAI schema; all `#[serde(default)]`, so existing clients are unaffected):

  • `top_k: Option<usize>`
  • `repeat_penalty: Option<f32>`
  • `repeat_last_n: Option<usize>`
  • `seed: Option<u64>`
  • (`top_p` already existed)

`try_qwen3_moe_backend` (`cuda_chat_backend.rs`)

Threads all 5 sampling fields (the 4 new ones plus the existing `top_p`) from the HTTP request into `QuantizedGenerateConfig`. Unset fields fall back to the `QuantizedGenerateConfig::default()` values (greedy decoding).

`AprServeDriver::build_openai_body` (`apr_serve.rs`)

Reads 6 env vars and includes each one in the HTTP body when set:

  • `APR_AGENT_TEMPERATURE` (overrides `CompletionRequest.temperature`)
  • `APR_AGENT_TOP_K`, `APR_AGENT_TOP_P`
  • `APR_AGENT_REPEAT_PENALTY`, `APR_AGENT_REPEAT_LAST_N`
  • `APR_AGENT_SEED`

Operator dispatch (post-merge)

APR_AGENT_TEMPERATURE=0.3 APR_AGENT_TOP_K=50 APR_AGENT_TOP_P=0.95 \
APR_AGENT_REPEAT_PENALTY=1.2 APR_AGENT_REPEAT_LAST_N=64 \
bash scripts/phase-6-bench.sh

Per paiml/claude-code-parity-apr M288 `v1004-3knob-dispatch-recipe-2026-05-20.md`.

What this is NOT

  • NOT new contract gates — V1_001..V1_004 are already discharged. PURE PLUMBING.
  • NOT companion-side bench changes — companion bench just inherits env from operator's shell.

Test plan

  • `cargo check -p aprender-serve --lib --features cuda` — clean
  • `cargo check -p aprender-orchestrate` — clean
  • CI

🤖 Generated with Claude Code

…apr code

Closes the gap between the 3-knob implementation (sampling #1842 +
rep-penalty #1844) and the HTTP chat-completions interface. Without
this PR, the new QuantizedGenerateConfig fields (top_k, top_p,
repeat_penalty, repeat_last_n, seed) were silently hardcoded to
defaults in `try_qwen3_moe_backend` — the impl was on main but
unreachable from the HTTP path.

## Changes

### `crates/aprender-serve/src/api/mod_create_demo.rs`

`ChatCompletionRequest` gains 4 optional fields (aprender extensions
to the OpenAI schema; `top_p` already existed):

- `top_k: Option<usize>` — qwen3-moe-sampling-v1 V1_001 knob
- `repeat_penalty: Option<f32>` — qwen3-moe-repetition-penalty-v1
- `repeat_last_n: Option<usize>` — penalty window
- `seed: Option<u64>` — qwen3-moe-sampling-v1 V1_002 reproducibility

All `#[serde(default)]` so existing clients are unaffected.

### `crates/aprender-serve/src/api/cuda_chat_backend.rs`

`try_qwen3_moe_backend` threads all 5 sampling fields (the 4 new ones
plus `top_p`) from the HTTP request into `QuantizedGenerateConfig`.
Unset fields fall back to the `QuantizedGenerateConfig::default()`
values (greedy decoding).
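
The fallback described above can be sketched as follows. This is a minimal illustration, not the code in `cuda_chat_backend.rs`: both structs are simplified stand-ins, `config_from_request` is a hypothetical helper name, and the default values are assumptions chosen only to represent greedy decoding.

```rust
// Hypothetical, simplified stand-in for the request type: only the five
// sampling fields named in this PR are modelled.
#[derive(Debug, Default)]
struct ChatCompletionRequest {
    top_k: Option<usize>,
    top_p: Option<f32>,
    repeat_penalty: Option<f32>,
    repeat_last_n: Option<usize>,
    seed: Option<u64>,
}

// Hypothetical stand-in for the generation config.
#[derive(Debug, Clone, PartialEq)]
struct QuantizedGenerateConfig {
    top_k: usize,
    top_p: f32,
    repeat_penalty: f32,
    repeat_last_n: usize,
    seed: u64,
}

impl Default for QuantizedGenerateConfig {
    // Assumed greedy-style defaults; the real values live in aprender.
    fn default() -> Self {
        Self {
            top_k: 1,
            top_p: 1.0,
            repeat_penalty: 1.0,
            repeat_last_n: 0,
            seed: 0,
        }
    }
}

// Each field takes the request value when present, else the default.
fn config_from_request(req: &ChatCompletionRequest) -> QuantizedGenerateConfig {
    let d = QuantizedGenerateConfig::default();
    QuantizedGenerateConfig {
        top_k: req.top_k.unwrap_or(d.top_k),
        top_p: req.top_p.unwrap_or(d.top_p),
        repeat_penalty: req.repeat_penalty.unwrap_or(d.repeat_penalty),
        repeat_last_n: req.repeat_last_n.unwrap_or(d.repeat_last_n),
        seed: req.seed.unwrap_or(d.seed),
    }
}
```

A request that sets only `top_k` yields a config that differs from the default in that one field, so clients that send none of the extensions keep the previous behavior.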

### `crates/aprender-orchestrate/src/agent/driver/apr_serve.rs`

`AprServeDriver::build_openai_body` reads 6 env vars and includes them
in the HTTP request body when set:

- `APR_AGENT_TEMPERATURE` — overrides CompletionRequest.temperature
- `APR_AGENT_TOP_K`
- `APR_AGENT_TOP_P`
- `APR_AGENT_REPEAT_PENALTY`
- `APR_AGENT_REPEAT_LAST_N`
- `APR_AGENT_SEED`
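
As a rough sketch of the "include when set" behavior, assuming unparsable values are skipped rather than rejected (the PR does not state the policy): the env var names come from the list above, while `sampling_overrides`, `KNOBS`, and the injected `lookup` closure are hypothetical names used only for illustration.

```rust
use std::collections::BTreeMap;

// (env var, JSON field, is_integer). Env var names are from this PR; the
// JSON field names are assumed to match the request struct fields.
const KNOBS: [(&str, &str, bool); 6] = [
    ("APR_AGENT_TEMPERATURE", "temperature", false),
    ("APR_AGENT_TOP_K", "top_k", true),
    ("APR_AGENT_TOP_P", "top_p", false),
    ("APR_AGENT_REPEAT_PENALTY", "repeat_penalty", false),
    ("APR_AGENT_REPEAT_LAST_N", "repeat_last_n", true),
    ("APR_AGENT_SEED", "seed", true),
];

// Collects validated overrides. `lookup` is injected so the logic can be
// exercised without touching the process environment.
fn sampling_overrides<F>(lookup: F) -> BTreeMap<&'static str, String>
where
    F: Fn(&str) -> Option<String>,
{
    let mut body = BTreeMap::new();
    for (var, field, is_int) in KNOBS {
        let raw = match lookup(var) {
            Some(r) => r,
            None => continue, // unset: leave the field out of the body
        };
        let raw = raw.trim();
        let value = if is_int {
            raw.parse::<u64>().ok().map(|n| n.to_string())
        } else {
            // Reject NaN/inf, which are not valid JSON numbers.
            raw.parse::<f64>()
                .ok()
                .filter(|x| x.is_finite())
                .map(|x| x.to_string())
        };
        // Assumed policy: silently skip values that fail to parse.
        if let Some(v) = value {
            body.insert(field, v);
        }
    }
    body
}

// Wrapper that reads the real process environment.
fn sampling_overrides_from_env() -> BTreeMap<&'static str, String> {
    sampling_overrides(|name| std::env::var(name).ok())
}
```

Only the fields present in the returned map would be written into the request body, so an operator who exports nothing gets the same request as before this PR.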

Operator can now dispatch the CCPA Phase 6 bench with sampling/penalty:

```bash
APR_AGENT_TEMPERATURE=0.3 \
APR_AGENT_TOP_K=50 \
APR_AGENT_TOP_P=0.95 \
APR_AGENT_REPEAT_PENALTY=1.2 \
APR_AGENT_REPEAT_LAST_N=64 \
bash scripts/phase-6-bench.sh
```

(Per paiml/claude-code-parity-apr M288 v1004-3knob-dispatch-recipe.md.)

## What this is NOT

- NOT new contract gates — V1_001..V1_004 of qwen3-moe-sampling-v1 +
  qwen3-moe-repetition-penalty-v1 are already discharged. This is
  PURE PLUMBING.
- NOT companion-side env-var plumbing — apr code in this PR already
  reads env vars; companion bench script just needs to set them
  (mechanical, no aprender change).

## Cross-references

- aprender#1832 (M32d, MERGED)
- aprender#1842 (sampling impl, MERGED via squash that also absorbed #1844)
- aprender#1844 (rep-penalty impl, MERGED)
- aprender#1835 (streaming SSE contract, OPEN)
- paiml/claude-code-parity-apr M288 (3-knob dispatch recipe)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift enabled auto-merge (squash) May 20, 2026 11:23
noahgift added 2 commits May 20, 2026 13:55
…t sites

PR #1846 added top_k/repeat_penalty/repeat_last_n/seed to ChatCompletionRequest
but missed updating the existing test construction sites. CI failed with 11
E0063 errors ("missing fields ... in initializer of api::ChatCompletionRequest").

Surgical fix: insert the 4 new fields (all set to None) at every test-site
struct literal. Behavior preserved: these fields are Option<T>, and setting
them to None matches the existing behavior.

Sites patched:
- src/api/tests/ (10 files, 11 sites) — built by `cargo test --lib`
- tests/api_coverage.rs, tests/api_deep_coverage.rs, tests/property_api.rs
  (3 files, 15 sites) — built by integration test path

Closes the workspace-test failure on b2333b8.
noahgift added a commit to paiml/claude-code-parity-apr that referenced this pull request May 20, 2026
paiml/aprender#1846 closes the env-var plumbing gap noted in M288.
M288's "NOT YET shipped" caveat is now resolved end-to-end.

## End-to-end flow now wired

```
operator shell ENV
  → bench script (inherits)
    → ccpa-arena-bench (inherits)
      → apr code (inherits)
        → AprServeDriver::build_openai_body (READS env vars)
          → HTTP POST /v1/chat/completions {temperature, top_k, ...}
            → try_qwen3_moe_backend (PARSES request)
              → QuantizedGenerateConfig {...}
                → run_qwen3_moe_generate
                  → sample_from_logits (APPLIES sampling + penalty)
```
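
To make the last link concrete, here is a sketch of what applying the penalty and top-k knobs to logits can look like. It is not the `sample_from_logits` implementation, which this page does not show. It assumes one widely used formulation of repetition penalty (divide positive logits, multiply non-positive ones) and finishes with an argmax instead of a seeded random draw to stay deterministic. All function names are hypothetical.

```rust
use std::collections::HashSet;

// Penalize every token id that appears in the last `last_n` entries of
// `history`. Each distinct token is penalized once.
fn apply_repeat_penalty(logits: &mut [f32], history: &[u32], penalty: f32, last_n: usize) {
    let start = history.len().saturating_sub(last_n);
    let mut seen = HashSet::new();
    for &tok in &history[start..] {
        let idx = tok as usize;
        if idx < logits.len() && seen.insert(idx) {
            let l = logits[idx];
            // Both branches make the token less likely when penalty > 1.
            logits[idx] = if l > 0.0 { l / penalty } else { l * penalty };
        }
    }
}

// Keep the k largest logits and mask the rest with -inf.
// Ties at the threshold are all kept, so more than k may survive.
fn top_k_filter(logits: &mut [f32], k: usize) {
    if k == 0 || k >= logits.len() {
        return; // nothing to filter
    }
    let mut sorted = logits.to_vec();
    sorted.sort_by(|a, b| b.total_cmp(a)); // descending
    let threshold = sorted[k - 1];
    for l in logits.iter_mut() {
        if *l < threshold {
            *l = f32::NEG_INFINITY;
        }
    }
}

// Index of the largest logit (first one wins on ties).
// Expects a non-empty slice.
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &l) in logits.iter().enumerate() {
        if l > logits[best] {
            best = i;
        }
    }
    best
}
```

With logits `[2.0, 1.9, -1.0, 0.5]`, history `[0, 0, 2]`, `penalty = 2.0`, and `last_n = 2`, token 0 drops from 2.0 to 1.0 and token 1 becomes the argmax, which is the kind of shift the repetition-penalty knob is meant to produce.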

Every link wired. Operator's `APR_AGENT_TEMPERATURE=0.3` (etc) now
flows through to actual logit sampling, no longer a no-op.

## Status reconciliation

8/10 aprender M32d-arc PRs MERGED. 2 OPEN:
- #1835 (streaming SSE contract; workspace-test pending)
- #1846 (this M289's prerequisite; just opened)

## Companion-side state

CCPA M281-M288 + M289 = 9 docs tracking the full upstream arc +
dispatch recipe + plumbing confirmation.

## What's NOT done

- V1_004 sub-bench not yet dispatched (operator-coordinated; needs
  #1846 merge + apr rebuild + ~10-15hr wall)
- Currently-running greedy baseline bench should finish first; don't
  start the 3-knob bench until the baseline scores.json lands (the
  COMPARISON is the value)

Mechanical doc. M-counter NOT bumped.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 1910e9e into main May 20, 2026
10 checks passed
@noahgift noahgift deleted the feat/3knob-http-wireup branch May 20, 2026 13:45
noahgift added a commit that referenced this pull request May 20, 2026
…4 follow-up) (#1849)

Adds 3 concrete few-shot <tool_call> examples to CODE_SYSTEM_PROMPT
(the 7B+ branch used for Qwen3-Coder-30B-A3B). Empirical context:
paiml/claude-code-parity-apr M287 evidence showed the 30B model
emits Markdown ```rust``` code blocks (in turn-1 text) instead of
<tool_call> JSON. The parser at realizar.rs:144-149 accepts
<tool_call> + ```json``` but NOT ```rust``` — so the model's
turns are silently text-only, and the bench hits the per-turn timeout
after 4 turns of rambling.

The 3-knob toolkit (sampling/penalty/streaming) tunes probability
distributions but can't change format adherence. THIS PR addresses
the format adherence directly by:

1. Showing the model 3 concrete <tool_call> examples in-context
   (file_read, file_edit, shell)
2. Adding an explicit "ALWAYS gets a tool-call response" rule
3. Adding "Be concise — DO NOT narrate" guideline
4. Adding "DO NOT use Markdown ```rust``` code blocks" anti-rule

## Why few-shot examples work

Large language models are pattern-matchers. Showing them the exact
format they should emit (rather than just describing it) drastically
improves format adherence on coder-finetuned models. The 30B-Coder
has strong "Markdown code block" priors from training; explicit
counter-examples + the negative rule pull it toward the <tool_call>
format.

## Empirical context

M287 (Phase 6 bench, fixtures 1-10 + greedy decoding): uniform
driver_error / turns_before_error=4 pattern. Every turn was text
with Rust code in Markdown, no tool calls extracted. Operator
playbook calls for sampling/penalty sub-bench (#1842 + #1844 + #1846
shipped). This PR is COMPLEMENTARY: prompt fix + sampling together
have the best chance of breaking the rambling pattern.

## Companion-side dispatch (post-merge)

After this PR + rebuild, operator can run a NEW sub-bench (call it
Sub-bench E in M288 nomenclature) that combines:
- 3-knob sampling (temperature=0.3, top_k=50, top_p=0.95)
- Repetition penalty (repeat_penalty=1.2, repeat_last_n=64)
- THIS PR's few-shot prompt (active by default; no env var needed)

If Sub-bench E shows ANY fixture pass, V1_004 discharges.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>