feat: 3-knob HTTP wire-up - operator-actionable sampling/penalty via apr code env vars #1846
Merged
…apr code

Closes the gap between the 3-knob implementation (sampling #1842 + rep-penalty #1844) and the HTTP chat-completions interface. Without this PR, the new `QuantizedGenerateConfig` fields (`top_k`, `top_p`, `repeat_penalty`, `repeat_last_n`, `seed`) were silently hardcoded to defaults in `try_qwen3_moe_backend`: the implementation was on main but unreachable from the HTTP path.

## Changes

### `crates/aprender-serve/src/api/mod_create_demo.rs`

`ChatCompletionRequest` gains 4 optional fields (aprender extensions to the OpenAI schema, which already defines `top_p`):

- `top_k: Option<usize>` (qwen3-moe-sampling-v1 V1_001 knob)
- `repeat_penalty: Option<f32>` (qwen3-moe-repetition-penalty-v1)
- `repeat_last_n: Option<usize>` (penalty window)
- `seed: Option<u64>` (qwen3-moe-sampling-v1 V1_002 reproducibility)

All are `#[serde(default)]`, so existing clients are unaffected.

### `crates/aprender-serve/src/api/cuda_chat_backend.rs`

`try_qwen3_moe_backend` threads all 5 knobs (`top_k`, `top_p`, `repeat_penalty`, `repeat_last_n`, `seed`) from the HTTP request into `QuantizedGenerateConfig`. Unset fields fall back to the `QuantizedGenerateConfig::default()` values (greedy decoding).

### `crates/aprender-orchestrate/src/agent/driver/apr_serve.rs`

`AprServeDriver::build_openai_body` reads 6 env vars and includes each in the HTTP request body when set:

- `APR_AGENT_TEMPERATURE` (overrides `CompletionRequest.temperature`)
- `APR_AGENT_TOP_K`
- `APR_AGENT_TOP_P`
- `APR_AGENT_REPEAT_PENALTY`
- `APR_AGENT_REPEAT_LAST_N`
- `APR_AGENT_SEED`

Operators can now dispatch the CCPA Phase 6 bench with sampling/penalty:

```bash
APR_AGENT_TEMPERATURE=0.3 \
APR_AGENT_TOP_K=50 \
APR_AGENT_TOP_P=0.95 \
APR_AGENT_REPEAT_PENALTY=1.2 \
APR_AGENT_REPEAT_LAST_N=64 \
bash scripts/phase-6-bench.sh
```

(Per paiml/claude-code-parity-apr M288 v1004-3knob-dispatch-recipe.md.)

## What this is NOT

- NOT new contract gates: V1_001..V1_004 of qwen3-moe-sampling-v1 + qwen3-moe-repetition-penalty-v1 are already discharged. This is PURE PLUMBING.
- NOT companion-side env-var plumbing: as of this PR, apr code already reads the env vars; the companion bench script just needs to set them (mechanical, no aprender change).

## Cross-references

- aprender#1832 (M32d, MERGED)
- aprender#1842 (sampling impl, MERGED via a squash that also absorbed #1844)
- aprender#1844 (rep-penalty impl, MERGED)
- aprender#1835 (streaming SSE contract, OPEN)
- paiml/claude-code-parity-apr M288 (3-knob dispatch recipe)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t sites

PR #1846 added `top_k`/`repeat_penalty`/`repeat_last_n`/`seed` to `ChatCompletionRequest` but missed updating the existing test construction sites. CI failed with 11 E0063 errors ("missing fields ... in initializer of api::ChatCompletionRequest").

Surgical fix: insert the 4 new fields (all set to `None`) at every test-site struct literal. Behavior is preserved: the fields are `Option<T>`, and `None` matches the existing behavior.

Sites patched:

- src/api/tests/ (10 files, 11 sites), built by `cargo test --lib`
- tests/api_coverage.rs, tests/api_deep_coverage.rs, tests/property_api.rs (3 files, 15 sites), built by the integration test path

Closes the workspace-test failure on b2333b8.
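A minimal sketch of why the struct literals broke, using a hypothetical, trimmed-down `ChatCompletionRequest` (the real struct has more fields, and the `model` value is a placeholder). Rust requires every field in a struct literal, so adding fields raises E0063 at each construction site; spelling the new fields out as `None`, as the fix does, restores compilation. Struct update syntax (`..Default::default()`) is shown as an alternative that would shield sites from future additions, assuming the struct implements `Default`, which the PR does not confirm.

```rust
/// Hypothetical, trimmed-down stand-in for `api::ChatCompletionRequest`.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ChatCompletionRequest {
    pub model: String,
    pub temperature: Option<f32>,
    pub top_p: Option<f32>,
    // The four fields added in #1846. A struct literal that omits any of
    // them fails with E0063 ("missing fields ... in initializer").
    pub top_k: Option<usize>,
    pub repeat_penalty: Option<f32>,
    pub repeat_last_n: Option<usize>,
    pub seed: Option<u64>,
}

/// A test-site literal patched the way the follow-up commit describes:
/// every new field is written out explicitly as `None`.
pub fn explicit_site() -> ChatCompletionRequest {
    ChatCompletionRequest {
        model: "placeholder-model".to_string(),
        temperature: Some(0.0),
        top_p: None,
        top_k: None,
        repeat_penalty: None,
        repeat_last_n: None,
        seed: None,
    }
}

/// Alternative (not what the fix does): struct update syntax fills every
/// unspecified field from `Default`, so later field additions don't break it.
pub fn defaulted_site() -> ChatCompletionRequest {
    ChatCompletionRequest {
        model: "placeholder-model".to_string(),
        temperature: Some(0.0),
        ..Default::default()
    }
}
```

Both styles build the same value; the explicit form documents each knob at the call site, while the defaulted form trades that visibility for resilience to schema growth.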
noahgift added a commit to paiml/claude-code-parity-apr that referenced this pull request on May 20, 2026
paiml/aprender#1846 closes the env-var plumbing gap noted in M288. M288's "NOT YET shipped" caveat is now resolved end-to-end.

## End-to-end flow now wired

```
operator shell ENV
  → bench script (inherits)
  → ccpa-arena-bench (inherits)
  → apr code (inherits)
  → AprServeDriver::build_openai_body (READS env vars)
  → HTTP POST /v1/chat/completions {temperature, top_k, ...}
  → try_qwen3_moe_backend (PARSES request)
  → QuantizedGenerateConfig {...}
  → run_qwen3_moe_generate
  → sample_from_logits (APPLIES sampling + penalty)
```

Every link is wired. The operator's `APR_AGENT_TEMPERATURE=0.3` (etc.) now flows through to actual logit sampling instead of being a no-op.

## Status reconciliation

8/10 aprender M32d-arc PRs MERGED. 2 OPEN:

- #1835 (streaming SSE contract; workspace-test pending)
- #1846 (this M289's prerequisite; just opened)

## Companion-side state

CCPA M281-M288 + M289 = 9 docs tracking the full upstream arc + dispatch recipe + plumbing confirmation.

## What's NOT done

- V1_004 sub-bench not yet dispatched (operator-coordinated; needs #1846 merge + apr rebuild + ~10-15hr wall)
- The currently running greedy baseline bench should finish first; don't start the 3-knob bench until the baseline scores.json lands (the COMPARISON is the value)

Mechanical doc. M-counter NOT bumped.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
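To make the last hop of that flow concrete, here is a minimal, dependency-free sketch of what a `sample_from_logits`-style step could do with the knobs. It is not aprender's implementation: `SamplingKnobs` and `XorShift` are hypothetical names, the penalty uses the common convention (divide positive logits by the penalty, multiply negative ones), the stage ordering is one common choice among several, and a tiny xorshift64* RNG stands in for a real seeded generator. A temperature of 0 falls back to greedy argmax, mirroring the "greedy when unset" default. Logits are assumed non-empty and finite.

```rust
/// Hypothetical knob bundle mirroring the request fields (not aprender's type).
#[derive(Debug, Clone, Copy)]
pub struct SamplingKnobs {
    pub temperature: f32,     // <= 0.0 means greedy
    pub top_k: usize,         // 0 disables top-k
    pub top_p: f32,           // 1.0 disables nucleus sampling
    pub repeat_penalty: f32,  // 1.0 disables the penalty
    pub repeat_last_n: usize, // penalty window, in tokens
    pub seed: u64,            // used by the caller to seed XorShift
}

/// Minimal xorshift64* generator: dependency-free and reproducible per seed.
pub struct XorShift(u64);

impl XorShift {
    pub fn new(seed: u64) -> Self {
        XorShift(seed.max(1)) // xorshift state must be non-zero
    }

    /// Uniform f32 in [0, 1) built from the top 24 bits of the output.
    pub fn next_f32(&mut self) -> f32 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        let x = self.0.wrapping_mul(0x2545_F491_4F6C_DD1D);
        (x >> 40) as f32 / (1u64 << 24) as f32
    }
}

fn argmax(v: &[f32]) -> usize {
    (0..v.len()).fold(0, |best, i| if v[i] > v[best] { i } else { best })
}

pub fn sample_from_logits(
    logits: &[f32],
    history: &[usize],
    k: &SamplingKnobs,
    rng: &mut XorShift,
) -> usize {
    let mut l = logits.to_vec();

    // 1. Repetition penalty: each distinct token in the last `repeat_last_n`
    //    positions is penalized once.
    if k.repeat_penalty != 1.0 {
        let start = history.len().saturating_sub(k.repeat_last_n);
        let mut seen = vec![false; l.len()];
        for &t in &history[start..] {
            if t < l.len() && !seen[t] {
                seen[t] = true;
                l[t] = if l[t] > 0.0 { l[t] / k.repeat_penalty } else { l[t] * k.repeat_penalty };
            }
        }
    }

    // 2. Greedy fallback (the default when no temperature is set).
    if k.temperature <= 0.0 {
        return argmax(&l);
    }

    // 3. Temperature-scaled softmax weights, sorted by descending weight.
    let max = l.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut probs: Vec<(usize, f32)> = l
        .iter()
        .enumerate()
        .map(|(i, &x)| (i, ((x - max) / k.temperature).exp()))
        .collect();
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());

    // 4. Top-k truncation.
    if k.top_k > 0 && k.top_k < probs.len() {
        probs.truncate(k.top_k);
    }

    // 5. Top-p (nucleus): keep the smallest prefix whose mass reaches top_p.
    let total: f32 = probs.iter().map(|p| p.1).sum();
    let mut cum = 0.0f32;
    let mut keep = probs.len();
    for (i, p) in probs.iter().enumerate() {
        cum += p.1 / total;
        if cum >= k.top_p {
            keep = i + 1;
            break;
        }
    }
    probs.truncate(keep);

    // 6. Draw from the renormalized survivors.
    let mass: f32 = probs.iter().map(|p| p.1).sum();
    let mut r = rng.next_f32() * mass;
    for &(i, p) in &probs {
        if r < p {
            return i;
        }
        r -= p;
    }
    probs.last().unwrap().0 // guard against float round-off
}
```

Creating the RNG once from `seed` and reusing it across decode steps is what makes a whole generation reproducible; with the knobs at their neutral values, the function reduces to plain argmax.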
noahgift added a commit that referenced this pull request on May 20, 2026
…4 follow-up) (#1849)

Adds 3 concrete few-shot `<tool_call>` examples to CODE_SYSTEM_PROMPT (the 7B+ branch used for Qwen3-Coder-30B-A3B).

Empirical context: paiml/claude-code-parity-apr M287 evidence showed the 30B model emits Markdown `` ```rust `` code blocks (in turn-1 text) instead of `<tool_call>` JSON. The parser at realizar.rs:144-149 accepts `<tool_call>` and `` ```json `` but NOT `` ```rust ``, so the model's turns are silently text-only and the bench hits the per-turn timeout after 4 turns of rambling.

The 3-knob toolkit (sampling/penalty/streaming) tunes probability distributions but can't change format adherence. THIS PR addresses format adherence directly by:

1. Showing the model 3 concrete `<tool_call>` examples in-context (file_read, file_edit, shell)
2. Adding an explicit "ALWAYS gets a tool-call response" rule
3. Adding a "Be concise — DO NOT narrate" guideline
4. Adding a "DO NOT use Markdown ```rust``` code blocks" anti-rule

## Why few-shot examples work

Large language models are pattern-matchers. Showing them the exact format they should emit (rather than just describing it) tends to improve format adherence substantially on coder-finetuned models. The 30B-Coder has strong "Markdown code block" priors from training; explicit counter-examples plus the negative rule pull it toward the `<tool_call>` format.

## Empirical context

M287 (Phase 6 bench, fixtures 1-10 + greedy decoding): uniform driver_error / turns_before_error=4 pattern. Every turn was text with Rust code in Markdown; no tool calls were extracted.

The operator playbook calls for a sampling/penalty sub-bench (#1842 + #1844 + #1846 shipped). This PR is COMPLEMENTARY: the prompt fix and sampling together have the best chance of breaking the rambling pattern.
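The parser behavior described above can be sketched as a small extractor. This is a hypothetical stand-in, not the code at realizar.rs:144-149: `extract_tool_call` returns the payload of the first `<tool_call>...</tool_call>` span, else of the first `` ```json `` fence, and deliberately returns `None` for a `` ```rust `` fence. The JSON shape in the usage example (`name` plus `arguments`) is an assumption for illustration.

```rust
/// Hypothetical sketch of a tool-call extractor with the acceptance rules
/// described in the PR: <tool_call> tags and ```json fences are accepted;
/// any other fence (e.g. ```rust) is treated as plain text.
pub fn extract_tool_call(text: &str) -> Option<&str> {
    // Preferred format: <tool_call>{...}</tool_call>
    const OPEN: &str = "<tool_call>";
    if let Some(start) = text.find(OPEN) {
        let body = &text[start + OPEN.len()..];
        if let Some(end) = body.find("</tool_call>") {
            return Some(body[..end].trim());
        }
    }
    // Fallback format: a ```json fenced block.
    const FENCE: &str = "```json";
    if let Some(start) = text.find(FENCE) {
        let body = &text[start + FENCE.len()..];
        if let Some(end) = body.find("```") {
            return Some(body[..end].trim());
        }
    }
    // A ```rust block (or plain prose) yields no tool call, so the turn
    // counts as text-only.
    None
}
```

Under rules like these, a turn consisting only of a `` ```rust `` block produces no action at all, which matches the text-only, timeout-after-4-turns pattern seen in M287.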
## Companion-side dispatch (post-merge)

After this PR + rebuild, the operator can run a NEW sub-bench (call it Sub-bench E in M288 nomenclature) that combines:

- 3-knob sampling (temperature=0.3, top_k=50, top_p=0.95)
- Repetition penalty (repeat_penalty=1.2, repeat_last_n=64)
- THIS PR's few-shot prompt (active by default; no env var needed)

If Sub-bench E shows ANY fixture pass, V1_004 discharges.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Closes the gap between the 3-knob implementation (sampling #1842 + rep-penalty #1844) and the HTTP chat-completions interface. Without this PR, the new `QuantizedGenerateConfig` fields (top_k, top_p, repeat_penalty, repeat_last_n, seed) were silently hardcoded to defaults in `try_qwen3_moe_backend` — the impl was on main but unreachable from the HTTP path.
Changes
`ChatCompletionRequest` extension (`mod_create_demo.rs`)
4 new optional fields (`top_k`, `repeat_penalty`, `repeat_last_n`, `seed`), aprender extensions to the OpenAI schema. All are `#[serde(default)]`, so existing clients are unaffected.
`try_qwen3_moe_backend` (`cuda_chat_backend.rs`)
Threads all 5 knobs (`top_k`, `top_p`, `repeat_penalty`, `repeat_last_n`, `seed`) from the HTTP request into `QuantizedGenerateConfig`, falling back to `QuantizedGenerateConfig::default()` (greedy decoding) when unset.
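A minimal sketch of that request-to-config overlay, using hypothetical stand-ins `SamplingRequest` and `GenerateConfig` (field names taken from the PR). The default values are placeholders chosen to amount to greedy decoding; they are not `QuantizedGenerateConfig::default()`'s actual values. Temperature is included for completeness even though it presumably was threaded before this PR. Each `Some` from the request overrides the matching field; `None` keeps the default.

```rust
/// Hypothetical stand-in for the HTTP request's sampling fields.
#[derive(Debug, Default)]
pub struct SamplingRequest {
    pub temperature: Option<f32>,
    pub top_k: Option<usize>,
    pub top_p: Option<f32>,
    pub repeat_penalty: Option<f32>,
    pub repeat_last_n: Option<usize>,
    pub seed: Option<u64>,
}

/// Hypothetical stand-in for `QuantizedGenerateConfig`.
#[derive(Debug, Clone, PartialEq)]
pub struct GenerateConfig {
    pub temperature: f32,
    pub top_k: usize,
    pub top_p: f32,
    pub repeat_penalty: f32,
    pub repeat_last_n: usize,
    pub seed: u64,
}

impl Default for GenerateConfig {
    /// Placeholder defaults that amount to greedy decoding:
    /// temperature 0, no truncation, no penalty.
    fn default() -> Self {
        GenerateConfig {
            temperature: 0.0,
            top_k: 0,
            top_p: 1.0,
            repeat_penalty: 1.0,
            repeat_last_n: 64, // placeholder window; inert while the penalty is 1.0
            seed: 0,
        }
    }
}

/// Overlay request fields onto the defaults; unset fields keep their default.
pub fn config_from_request(req: &SamplingRequest) -> GenerateConfig {
    let d = GenerateConfig::default();
    GenerateConfig {
        temperature: req.temperature.unwrap_or(d.temperature),
        top_k: req.top_k.unwrap_or(d.top_k),
        top_p: req.top_p.unwrap_or(d.top_p),
        repeat_penalty: req.repeat_penalty.unwrap_or(d.repeat_penalty),
        repeat_last_n: req.repeat_last_n.unwrap_or(d.repeat_last_n),
        seed: req.seed.unwrap_or(d.seed),
    }
}
```

Because every field goes through `unwrap_or`, a client that sends none of the extension fields gets exactly the pre-PR greedy behavior.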
`AprServeDriver::build_openai_body` (`apr_serve.rs`)
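Reads 6 env vars (`APR_AGENT_TEMPERATURE`, `APR_AGENT_TOP_K`, `APR_AGENT_TOP_P`, `APR_AGENT_REPEAT_PENALTY`, `APR_AGENT_REPEAT_LAST_N`, `APR_AGENT_SEED`) and includes each in the HTTP body when set.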
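A hedged sketch of the env-var overlay. The helper names `knob_fields`, `knob_fields_from_env`, `Kind`, and `KNOBS` are hypothetical; the real driver presumably builds the body with a JSON library, whereas this version renders `"key": value` fragments by hand to stay dependency-free. It takes a lookup closure so it can be tested without mutating the process environment. Values that fail to parse are skipped here; whether `build_openai_body` skips or rejects them is not stated in the PR.

```rust
/// How an env var's value must parse before it is copied into the body.
#[derive(Clone, Copy)]
enum Kind {
    Float,
    UInt,
}

/// (env var, JSON key, expected kind). Keys mirror the request field names.
const KNOBS: [(&str, &str, Kind); 6] = [
    ("APR_AGENT_TEMPERATURE", "temperature", Kind::Float),
    ("APR_AGENT_TOP_K", "top_k", Kind::UInt),
    ("APR_AGENT_TOP_P", "top_p", Kind::Float),
    ("APR_AGENT_REPEAT_PENALTY", "repeat_penalty", Kind::Float),
    ("APR_AGENT_REPEAT_LAST_N", "repeat_last_n", Kind::UInt),
    ("APR_AGENT_SEED", "seed", Kind::UInt),
];

/// Collect `"key": value` JSON fragments for every knob whose env var is set
/// and parses; `lookup` abstracts the environment for testability.
pub fn knob_fields(lookup: impl Fn(&str) -> Option<String>) -> Vec<String> {
    let mut out = Vec::new();
    for (var, key, kind) in KNOBS {
        let Some(raw) = lookup(var) else { continue };
        let raw = raw.trim();
        // Re-render the parsed number so raw env text never reaches the body.
        let value = match kind {
            Kind::Float => raw
                .parse::<f32>()
                .ok()
                .filter(|v| v.is_finite())
                .map(|v| v.to_string()),
            Kind::UInt => raw.parse::<u64>().ok().map(|v| v.to_string()),
        };
        if let Some(v) = value {
            out.push(format!("\"{key}\": {v}"));
        }
    }
    out
}

/// Production wiring: read from the real process environment.
pub fn knob_fields_from_env() -> Vec<String> {
    knob_fields(|k| std::env::var(k).ok())
}
```

A caller would splice these fragments into the request body after its own fields; for temperature, the PR says the env var overrides `CompletionRequest.temperature`, so in a real body the env-derived value should replace the request's value rather than appear alongside it.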
Operator dispatch (post-merge)
Per paiml/claude-code-parity-apr M288 `v1004-3knob-dispatch-recipe-2026-05-20.md`.
What this is NOT
Test plan
🤖 Generated with Claude Code