feat(code-prompt): few-shot <tool_call> examples + anti-rambling guideline (V1_004 follow-up)#1849
Merged
Merged
Conversation
…4 follow-up) Adds 3 concrete few-shot <tool_call> examples to CODE_SYSTEM_PROMPT (the 7B+ branch used for Qwen3-Coder-30B-A3B). Empirical context: paiml/claude-code-parity-apr M287 evidence showed the 30B model emits Markdown ```rust``` code blocks (in turn-1 text) instead of <tool_call> JSON. The parser at realizar.rs:144-149 accepts <tool_call> + ```json``` but NOT ```rust``` — so the model's turns are silently text-only, bench hits per-turn timeout after 4 turns of rambling. The 3-knob toolkit (sampling/penalty/streaming) tunes probability distributions but can't change format adherence. THIS PR addresses the format adherence directly by: 1. Showing the model 3 concrete <tool_call> examples in-context (file_read, file_edit, shell) 2. Adding an explicit "ALWAYS gets a tool-call response" rule 3. Adding "Be concise — DO NOT narrate" guideline 4. Adding "DO NOT use Markdown ```rust``` code blocks" anti-rule ## Why few-shot examples work Large language models are pattern-matchers. Showing them the exact format they should emit (rather than just describing it) drastically improves format adherence on coder-finetuned models. The 30B-Coder has strong "Markdown code block" priors from training; explicit counter-examples + the negative rule pull it toward the <tool_call> format. ## Empirical context M287 (Phase 6 bench, fixtures 1-10 + greedy decoding): uniform driver_error / turns_before_error=4 pattern. Every turn was text with Rust code in Markdown, no tool calls extracted. Operator playbook calls for sampling/penalty sub-bench (#1842 + #1844 + #1846 shipped). This PR is COMPLEMENTARY: prompt fix + sampling together have the best chance of breaking the rambling pattern. ## Companion-side dispatch (post-merge) After this PR + rebuild, operator can run a NEW sub-bench (call it Sub-bench E in M288 nomenclature) that combines: - 3-knob sampling (temperature=0.3, top_k=50, top_p=0.95) - Repetition penalty (repeat_penalty=1.2, repeat_last_n=64) - THIS PR's few-shot prompt (active by default; no env var needed) If Sub-bench E shows ANY fixture pass, V1_004 discharges. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds 3 few-shot examples + 2 anti-rambling rules to `CODE_SYSTEM_PROMPT` (the 7B+ branch used by Qwen3-Coder-30B-A3B). Targets the M287 verbosity pattern where the model emits Markdown ```rust``` blocks instead of `<tool_call>` JSON.
Empirical context
paiml/claude-code-parity-apr M287 evidence: 10/20 Phase 6 fixtures uniformly `driver_error, turns_before_error=4`. Every turn was text-only — the 30B-Coder model has strong "Markdown code block" priors from training and was emitting Rust code in markdown rather than the `<tool_call>` JSON the parser at `realizar.rs:144-149` expects.
The 3-knob toolkit (#1842 sampling + #1844 rep-penalty + #1846 HTTP wire-up) tunes probability distributions but can't change format adherence. This PR addresses format adherence directly.
Changes
`CODE_SYSTEM_PROMPT` gains:
Why few-shot examples work
Large language models pattern-match. Showing them the exact format (rather than just describing it) drastically improves format adherence on coder-finetuned models. Counter-examples + negative rules pull the model toward the desired format.
Test plan
If Sub-bench E shows ANY fixture pass, V1_004 discharges + M280 suspension lifts.
🤖 Generated with Claude Code