feat(code-prompt): few-shot <tool_call> examples + anti-rambling guideline (V1_004 follow-up) by noahgift · Pull Request #1849 · paiml/aprender

noahgift · 2026-05-20T14:27:55Z

Summary

Adds 3 few-shot examples + 2 anti-rambling rules to `CODE_SYSTEM_PROMPT` (the 7B+ branch used by Qwen3-Coder-30B-A3B). Targets the M287 verbosity pattern where the model emits Markdown ```rust``` blocks instead of `<tool_call>` JSON.

Empirical context

paiml/claude-code-parity-apr M287 evidence: 10/20 Phase 6 fixtures uniformly `driver_error, turns_before_error=4`. Every turn was text-only — the 30B-Coder model has strong "Markdown code block" priors from training and was emitting Rust code in markdown rather than the `<tool_call>` JSON the parser at `realizar.rs:144-149` expects.
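To make the failure mode concrete, here is a minimal sketch of tag-based extraction. It is not the parser at `realizar.rs:144-149`: the function name `extract_tool_calls`, the closing `</tool_call>` tag, and the decision to return raw payload strings are all assumptions made for illustration. It only shows why a reply made entirely of Markdown Rust code yields zero tool calls.

```rust
/// Returns the trimmed payload of every `<tool_call>...</tool_call>` span.
/// Hypothetical helper: the closing tag and return type are assumptions,
/// not the behavior of the real parser in realizar.rs.
fn extract_tool_calls(text: &str) -> Vec<&str> {
    const OPEN: &str = "<tool_call>";
    const CLOSE: &str = "</tool_call>";

    let mut calls = Vec::new();
    let mut rest = text;

    // Scan left to right, consuming one tagged span per iteration.
    while let Some(start) = rest.find(OPEN) {
        let after_open = &rest[start + OPEN.len()..];
        match after_open.find(CLOSE) {
            Some(end) => {
                calls.push(after_open[..end].trim());
                rest = &after_open[end + CLOSE.len()..];
            }
            // Unterminated tag: stop and ignore the remainder.
            None => break,
        }
    }
    calls
}
```

Called on a turn that contains only a fenced `rust` block, this returns an empty vector, so a driver built on it would see a text-only turn even though the model produced plausible code.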

The 3-knob toolkit (#1842 sampling + #1844 rep-penalty + #1846 HTTP wire-up) tunes probability distributions but can't change format adherence. This PR addresses format adherence directly.

Changes

`CODE_SYSTEM_PROMPT` gains:

- 3 concrete examples showing the exact `<tool_call>` JSON format (file_read, file_edit, shell)
- Rule: "The user message ALWAYS gets a tool-call response. NEVER reply with explanations only."
- Guideline: "Be concise — DO NOT narrate what you're about to do; just emit the `<tool_call>`"
- Anti-rule: "DO NOT use Markdown ```rust``` code blocks for file edits; ALWAYS use file_edit or file_write tool_calls"
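As a rough sketch of how three few-shot examples of this kind could be rendered into a prompt section: the helper names `few_shot_example` and `few_shot_section`, the JSON field names `name` and `input`, the argument keys, and the closing `</tool_call>` tag are all hypothetical. The actual text added to `CODE_SYSTEM_PROMPT` is not reproduced here.

```rust
/// Renders one few-shot example as a tagged JSON object.
/// The field names "name" and "input" are assumptions for illustration.
fn few_shot_example(tool: &str, input_json: &str) -> String {
    format!("<tool_call>\n{{\"name\": \"{tool}\", \"input\": {input_json}}}\n</tool_call>")
}

/// Builds a prompt section with one example per tool named in the PR
/// (file_read, file_edit, shell). Argument keys and values are made up.
fn few_shot_section() -> String {
    let examples = [
        ("file_read", r#"{"path": "src/lib.rs"}"#),
        ("file_edit", r#"{"path": "src/lib.rs", "old": "foo", "new": "bar"}"#),
        ("shell", r#"{"command": "cargo check"}"#),
    ];
    examples
        .iter()
        .map(|(tool, input)| few_shot_example(tool, input))
        .collect::<Vec<_>>()
        .join("\n\n")
}
```

Generating the examples from one helper keeps the three blocks byte-for-byte consistent in shape, which matters when the goal is for the model to copy the format exactly.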

Why few-shot examples work

Large language models pattern-match on their context. Showing them the exact format to emit, rather than only describing it, tends to improve format adherence on coder-finetuned models. The 30B-Coder model has strong "Markdown code block" priors from training; explicit counter-examples plus the negative rule pull it toward the `<tool_call>` format.

Test plan

- `cargo check -p aprender-orchestrate`: clean
- CI
- Combined sub-bench E (operator-coordinated): post-merge dispatch with 3-knob sampling + THIS PR's prompt to test for V1_004 discharge

If Sub-bench E shows ANY fixture pass, V1_004 discharges + M280 suspension lifts.

🤖 Generated with Claude Code

…4 follow-up)

Adds 3 concrete few-shot <tool_call> examples to CODE_SYSTEM_PROMPT (the 7B+ branch used for Qwen3-Coder-30B-A3B).

Empirical context: paiml/claude-code-parity-apr M287 evidence showed the 30B model emits Markdown ```rust``` code blocks (in turn-1 text) instead of <tool_call> JSON. The parser at realizar.rs:144-149 accepts <tool_call> + ```json``` but NOT ```rust```, so the model's turns are silently text-only and the bench hits the per-turn timeout after 4 turns of rambling.

The 3-knob toolkit (sampling/penalty/streaming) tunes probability distributions but can't change format adherence. THIS PR addresses the format adherence directly by:

1. Showing the model 3 concrete <tool_call> examples in-context (file_read, file_edit, shell)
2. Adding an explicit "ALWAYS gets a tool-call response" rule
3. Adding "Be concise — DO NOT narrate" guideline
4. Adding "DO NOT use Markdown ```rust``` code blocks" anti-rule

## Why few-shot examples work

Large language models are pattern-matchers. Showing them the exact format they should emit (rather than just describing it) drastically improves format adherence on coder-finetuned models. The 30B-Coder has strong "Markdown code block" priors from training; explicit counter-examples + the negative rule pull it toward the <tool_call> format.

## Empirical context

M287 (Phase 6 bench, fixtures 1-10 + greedy decoding): uniform driver_error / turns_before_error=4 pattern. Every turn was text with Rust code in Markdown, no tool calls extracted.

Operator playbook calls for sampling/penalty sub-bench (#1842 + #1844 + #1846 shipped). This PR is COMPLEMENTARY: prompt fix + sampling together have the best chance of breaking the rambling pattern.

## Companion-side dispatch (post-merge)

After this PR + rebuild, operator can run a NEW sub-bench (call it Sub-bench E in M288 nomenclature) that combines:

- 3-knob sampling (temperature=0.3, top_k=50, top_p=0.95)
- Repetition penalty (repeat_penalty=1.2, repeat_last_n=64)
- THIS PR's few-shot prompt (active by default; no env var needed)

If Sub-bench E shows ANY fixture pass, V1_004 discharges.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
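The Sub-bench E settings in the commit message can be gathered into a single config value, and the repetition-penalty knob can be illustrated with the commonly used formulation (positive logits divided by the penalty, negative logits multiplied by it, applied to tokens in the recent window). This is a sketch under assumptions: `SamplingConfig`, `sub_bench_e`, and `apply_repeat_penalty` are hypothetical names, and nothing here is confirmed to match how aprender implements `repeat_penalty` or `repeat_last_n`.

```rust
use std::collections::HashSet;

/// Sampling settings listed for Sub-bench E. Field names mirror the
/// parameter names in the text; the struct itself is hypothetical.
#[derive(Debug, Clone, PartialEq)]
struct SamplingConfig {
    temperature: f32,
    top_k: usize,
    top_p: f32,
    repeat_penalty: f32,
    repeat_last_n: usize,
}

impl SamplingConfig {
    /// Values quoted in the commit message for Sub-bench E.
    fn sub_bench_e() -> Self {
        Self {
            temperature: 0.3,
            top_k: 50,
            top_p: 0.95,
            repeat_penalty: 1.2,
            repeat_last_n: 64,
        }
    }
}

/// Penalizes each distinct token seen in the last `repeat_last_n` history
/// entries. Assumed formulation: divide positive logits, multiply others.
fn apply_repeat_penalty(logits: &mut [f32], history: &[u32], cfg: &SamplingConfig) {
    let start = history.len().saturating_sub(cfg.repeat_last_n);
    let mut seen = HashSet::new();
    for &token in &history[start..] {
        let idx = token as usize;
        // Skip out-of-vocabulary ids and tokens already penalized once.
        if idx >= logits.len() || !seen.insert(idx) {
            continue;
        }
        if logits[idx] > 0.0 {
            logits[idx] /= cfg.repeat_penalty;
        } else {
            logits[idx] *= cfg.repeat_penalty;
        }
    }
}
```

As the commit message notes, knobs like these reshape the token distribution but do not by themselves teach the model a format, which is why the prompt change is treated as complementary.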

noahgift enabled auto-merge (squash) May 20, 2026 14:28

noahgift mentioned this pull request May 20, 2026

fix(try_qwen3_moe_backend): populate stop_tokens with EOS — fixes M287 runaway 'Human:' generation #1852

Merged

3 tasks

Merge branch 'main' into feat/code-prompt-few-shot-tool-calls

170c8b3

noahgift merged commit 24f0de5 into main May 20, 2026
10 checks passed

noahgift deleted the feat/code-prompt-few-shot-tool-calls branch May 20, 2026 15:42

noahgift mentioned this pull request May 21, 2026

docs(M291): V1_004 sub-bench B empirical pattern shift + aprender#1853 fix paiml/claude-code-parity-apr#259

Merged

4 tasks

noahgift merged 2 commits into main from feat/code-prompt-few-shot-tool-calls
