Skip to content

feat(code-prompt): few-shot <tool_call> examples + anti-rambling guideline (V1_004 follow-up)#1849

Merged
noahgift merged 2 commits into
mainfrom
feat/code-prompt-few-shot-tool-calls
May 20, 2026
Merged

feat(code-prompt): few-shot <tool_call> examples + anti-rambling guideline (V1_004 follow-up)#1849
noahgift merged 2 commits into
mainfrom
feat/code-prompt-few-shot-tool-calls

Conversation

@noahgift

Copy link
Copy Markdown
Contributor

Summary

Adds 3 few-shot examples + 2 anti-rambling rules to `CODE_SYSTEM_PROMPT` (the 7B+ branch used by Qwen3-Coder-30B-A3B). Targets the M287 verbosity pattern where the model emits Markdown ```rust``` blocks instead of `<tool_call>` JSON.

Empirical context

paiml/claude-code-parity-apr M287 evidence: 10/20 Phase 6 fixtures uniformly `driver_error, turns_before_error=4`. Every turn was text-only — the 30B-Coder model has strong "Markdown code block" priors from training and was emitting Rust code in markdown rather than the `<tool_call>` JSON the parser at `realizar.rs:144-149` expects.

The 3-knob toolkit (#1842 sampling + #1844 rep-penalty + #1846 HTTP wire-up) tunes probability distributions but can't change format adherence. This PR addresses format adherence directly.

Changes

`CODE_SYSTEM_PROMPT` gains:

  1. 3 concrete examples showing the exact `<tool_call>` JSON format (file_read, file_edit, shell)
  2. Rule: "The user message ALWAYS gets a tool-call response. NEVER reply with explanations only."
  3. Guideline: "Be concise — DO NOT narrate what you're about to do; just emit the `<tool_call>`"
  4. Anti-rule: "DO NOT use Markdown ```rust``` code blocks for file edits; ALWAYS use file_edit or file_write tool_calls"

Why few-shot examples work

Large language models pattern-match. Showing them the exact format (rather than just describing it) drastically improves format adherence on coder-finetuned models. Counter-examples + negative rules pull the model toward the desired format.

Test plan

  • `cargo check -p aprender-orchestrate` — clean
  • CI
  • Combined sub-bench E (operator-coordinated): post-merge dispatch with 3-knob sampling + THIS PR's prompt to test for V1_004 discharge

If Sub-bench E shows ANY fixture pass, V1_004 discharges + M280 suspension lifts.

🤖 Generated with Claude Code

…4 follow-up)

Adds 3 concrete few-shot <tool_call> examples to CODE_SYSTEM_PROMPT
(the 7B+ branch used for Qwen3-Coder-30B-A3B). Empirical context:
paiml/claude-code-parity-apr M287 evidence showed the 30B model
emits Markdown ```rust``` code blocks (in turn-1 text) instead of
<tool_call> JSON. The parser at realizar.rs:144-149 accepts
<tool_call> + ```json``` but NOT ```rust``` — so the model's
turns are silently text-only, bench hits per-turn timeout after 4
turns of rambling.

The 3-knob toolkit (sampling/penalty/streaming) tunes probability
distributions but can't change format adherence. THIS PR addresses
the format adherence directly by:

1. Showing the model 3 concrete <tool_call> examples in-context
   (file_read, file_edit, shell)
2. Adding an explicit "ALWAYS gets a tool-call response" rule
3. Adding "Be concise — DO NOT narrate" guideline
4. Adding "DO NOT use Markdown ```rust``` code blocks" anti-rule

## Why few-shot examples work

Large language models are pattern-matchers. Showing them the exact
format they should emit (rather than just describing it) drastically
improves format adherence on coder-finetuned models. The 30B-Coder
has strong "Markdown code block" priors from training; explicit
counter-examples + the negative rule pull it toward the <tool_call>
format.

## Empirical context

M287 (Phase 6 bench, fixtures 1-10 + greedy decoding): uniform
driver_error / turns_before_error=4 pattern. Every turn was text
with Rust code in Markdown, no tool calls extracted. Operator
playbook calls for sampling/penalty sub-bench (#1842 + #1844 + #1846
shipped). This PR is COMPLEMENTARY: prompt fix + sampling together
have the best chance of breaking the rambling pattern.

## Companion-side dispatch (post-merge)

After this PR + rebuild, operator can run a NEW sub-bench (call it
Sub-bench E in M288 nomenclature) that combines:
- 3-knob sampling (temperature=0.3, top_k=50, top_p=0.95)
- Repetition penalty (repeat_penalty=1.2, repeat_last_n=64)
- THIS PR's few-shot prompt (active by default; no env var needed)

If Sub-bench E shows ANY fixture pass, V1_004 discharges.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@noahgift noahgift merged commit 24f0de5 into main May 20, 2026
10 checks passed
@noahgift noahgift deleted the feat/code-prompt-few-shot-tool-calls branch May 20, 2026 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant