fix(evidence): stop rejecting real complete_step verifications over command-string drift #3982

Merged: esengine merged 2 commits into main-v2 from fix/complete-step-evidence-matching on Jun 11, 2026

@esengine (Owner)

Problem

Users keep reporting that complete_step "fails all the time". Session forensics (local logs + #2917's attached transcript) show 5 of 18 real complete_step calls rejected, every one a false negative: the step was genuinely done and verified, but the citation didn't byte-match the receipt. Each rejection then cascades into a todo_write failure (its guard wants a prior complete_step), and in the worst case the model gives up and overclaims in the final answer (#2917).

The observed failure shapes, all from real transcripts:

  1. Dropped cd prefix: ran cd /repo && git merge upstream/main-v2 --ff-only, cited git merge upstream/main-v2 --ff-only (#2917, "[Bug]: after todo_write / complete_step fails, the Agent is still allowed to give a final answer of 'all done'"). The citation is not a prefix of the command that ran, so the session fallback from #3587 (fix(agent): reduce evidence verification false signals) missed it too.
  2. Flag/quote drift: ran rm -v x && ls -la x 2>&1 || true, cited rm -v x && ls x 2>&1; ran echo "deleted", cited echo 'deleted'.
  3. Compound-segment citation: ran seq …; cat x; wc -l x, cited just wc -l x.
  4. Negative verification dead end: ran ls deleted-file to prove a deletion; it exits 2, so no successful receipt can ever exist. The old error was indistinguishable from "never ran".
  5. Schema mismatch: the schema said command/paths were optional while the host requires them, so models follow the schema and get rejected.
  6. Shell-redirect files: seq … > file leaves no reader/writer receipt, so files evidence for the file always failed; models recovered by re-writing the whole file with write_file just to mint a receipt.

Fix

  • Segment matching (evidence.CommandMatches): split cited and ran commands on &&/||/;/|/newlines, quote-strip and whitespace-normalize tokens; a cited segment is covered by a ran segment that equals it or token-supersets it under the same head token. One-token citations still require exact equality; an aggregated citation that no single command covers is still rejected (anti-fabrication holds — a condensed made-up command in the e2e run was still refused).
  • Session fallback hardened: uses the same matcher and now skips calls whose recorded result is an error or block, closing the false positive where any attempted command counted as proof.
  • Actionable rejections: ran-but-nonzero is now distinguished from never-ran (with a || true hint for negative verification); never-ran rejections list the turn's actual receipts; todo_write's cascade error explains the complete_step-first order. In e2e, every remaining legitimate rejection self-corrected in 1–2 rounds instead of looping blind.
  • Schema truthful: command is marked REQUIRED for verification, paths for diff/files.
  • Redirect-created files: a successful bash command whose text names the path now counts as a files receipt.
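The segment-matching rule in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code of evidence.CommandMatches: the names splitSegments, covers, and commandMatches are hypothetical, the tokenizer only handles simple quoting, and treating "token-supersets" as an order-insensitive multiset comparison is an assumption. The seq arguments in the example are placeholders, since the PR elides them.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSegments breaks a shell command on unquoted &&, ||, ;, | and newlines,
// returning each non-empty segment as a list of quote-stripped tokens.
// Hypothetical helper; escapes and nested quoting are not handled.
func splitSegments(cmd string) [][]string {
	var segs [][]string
	var toks []string
	var cur strings.Builder
	inTok := false
	var quote rune // 0 while outside quotes

	endTok := func() {
		if inTok {
			toks = append(toks, cur.String())
			cur.Reset()
			inTok = false
		}
	}
	endSeg := func() {
		endTok()
		if len(toks) > 0 {
			segs = append(segs, toks)
			toks = nil
		}
	}

	rs := []rune(cmd)
	for i := 0; i < len(rs); i++ {
		r := rs[i]
		switch {
		case quote != 0: // inside quotes: copy until the closing quote
			if r == quote {
				quote = 0
			} else {
				cur.WriteRune(r)
			}
		case r == '\'' || r == '"':
			quote = r
			inTok = true
		case r == ';' || r == '\n' || r == '|': // "||" is two '|' in a row
			endSeg()
		case r == '&' && i+1 < len(rs) && rs[i+1] == '&':
			endSeg()
			i++
		case r == ' ' || r == '\t':
			endTok()
		default: // includes a lone '&', as in 2>&1
			cur.WriteRune(r)
			inTok = true
		}
	}
	endSeg()
	return segs
}

// covers reports whether a ran segment covers a cited segment: same head
// token, and every cited token also present in the ran segment. A one-token
// citation must equal the ran segment exactly.
func covers(ran, cited []string) bool {
	if len(ran) == 0 || len(cited) == 0 || ran[0] != cited[0] {
		return false
	}
	if len(cited) == 1 {
		return len(ran) == 1
	}
	have := make(map[string]int)
	for _, t := range ran {
		have[t]++
	}
	for _, t := range cited {
		if have[t] == 0 {
			return false
		}
		have[t]--
	}
	return true
}

// commandMatches accepts a citation only when every cited segment is covered
// by some segment of this single ran command.
func commandMatches(cited, ran string) bool {
	citedSegs := splitSegments(cited)
	if len(citedSegs) == 0 {
		return false
	}
	ranSegs := splitSegments(ran)
	for _, c := range citedSegs {
		found := false
		for _, r := range ranSegs {
			if covers(r, c) {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(commandMatches("git merge upstream/main-v2 --ff-only",
		"cd /repo && git merge upstream/main-v2 --ff-only"))
	fmt.Println(commandMatches("wc -l x", "seq 20 > x; cat x; wc -l x"))
	fmt.Println(commandMatches("ls", "ls -la x"))
}
```

Requiring a single ran command to cover all cited segments is what keeps an aggregated, made-up citation from matching in this sketch, while the head-token check stops a citation from borrowing tokens from an unrelated command.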

Verification

  • Unit replays of all observed failures, verbatim from the transcripts (completestep_test.go, commandmatch_test.go).
  • End-to-end A/B against the real DeepSeek API (deepseek-v4-flash, the same scenario the June 2 sessions failed on — create 20-line file → append → delete lines → delete file → negative-verify):
    • base (origin/main-v2): 4/8 complete_step calls rejected, 21 rounds.
    • fixed: 4/4 complete_step first-try, 0 todo_write failures, 16 rounds, all sign-offs host-verified 1, manual/unverified 0.
  • go test ./internal/evidence/... ./internal/tool/builtin/... ./internal/agent/... ./internal/cli/... ./internal/control/... green; go vet clean.

Related: #2917, #3469, #3911, #3587.

reasonix added 2 commits June 11, 2026 00:22
…lete_step self-correction

complete_step rejected real verifications whenever the cited command string
was not byte-identical to the bash receipt: a dropped cd prefix (#2917), a
flag or quote-style drift, or a piped tail all failed both the ledger match
and the #3587 session fallback's prefix matching. Local session forensics
show 5 of 18 real complete_step calls rejected this way, each cascading into
todo_write failures and final answers that overclaim.

Match commands by shell segment instead: split cited and ran commands on
&&/||/;/|/newlines, quote-strip and whitespace-normalize tokens, and accept a
cited segment when a ran segment equals it or supersets its tokens under the
same head token. One-token citations still require exact equality, and an
aggregated citation that no single command covers is still rejected. The
session fallback now uses the same matcher and skips calls whose recorded
result is an error or block, closing the false positive where any attempted
command counted as proof.

Rejections now carry recovery context: ran-but-nonzero commands are
distinguished from never-ran (with a '|| true' hint for negative
verification, e.g. proving a file is gone), never-ran rejections list the
turn's actual receipts, and the schema marks command/paths as required for
their kinds instead of advertising them optional.

Files created or edited through shell redirection (seq … > file, sed -i)
leave no reader/writer receipt, so files evidence for them was always
rejected and the model had to re-write the file with write_file just to
mint a receipt. A successful bash command whose text names the path now
counts as having touched it.
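
The two receipt rules described in these commit messages (nonzero exits are reported differently from commands that never ran, and a successful command that names a path counts as touching it) might look roughly like this. The Receipt type, the Verdict values, and both function names are hypothetical, and interpreting "names the path" as a plain substring check is an assumption; the real ledger may be stricter. The matcher is passed in as a function so the sketch stays independent of how commands are compared.

```go
package main

import (
	"fmt"
	"strings"
)

// Receipt is a hypothetical record of one bash call made during the turn.
type Receipt struct {
	Command  string
	ExitCode int
}

// Verdict separates the three outcomes a rejection message needs to explain.
type Verdict int

const (
	NeverRan   Verdict = iota // no receipt matches the citation
	RanNonzero                // a matching command ran but exited nonzero
	Verified                  // a matching command ran and exited 0
)

// classify checks a cited command against the turn's receipts using the
// supplied matcher. A successful match wins over a failed one.
func classify(cited string, receipts []Receipt, match func(cited, ran string) bool) Verdict {
	v := NeverRan
	for _, r := range receipts {
		if !match(cited, r.Command) {
			continue
		}
		if r.ExitCode == 0 {
			return Verified
		}
		v = RanNonzero
	}
	return v
}

// touchesPath reports whether any successful command's text names the path.
// Assumption: "names" means the command string contains the path verbatim.
func touchesPath(path string, receipts []Receipt) bool {
	for _, r := range receipts {
		if r.ExitCode == 0 && strings.Contains(r.Command, path) {
			return true
		}
	}
	return false
}

func main() {
	exact := func(cited, ran string) bool { return cited == ran }
	receipts := []Receipt{
		{Command: "seq 20 > notes.txt", ExitCode: 0},
		{Command: "ls deleted.txt", ExitCode: 2},
	}
	fmt.Println(classify("ls deleted.txt", receipts, exact) == RanNonzero)
	fmt.Println(classify("cat notes.txt", receipts, exact) == NeverRan)
	fmt.Println(touchesPath("notes.txt", receipts))
}
```

With the RanNonzero verdict available, a rejection can suggest re-running the negative check as ls deleted.txt || true, which exits 0 and so produces a usable receipt.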
@esengine esengine requested a review from SivanCola as a code owner June 11, 2026 07:32
@github-actions (bot) added the labels v2 (Go rewrite (1.x), main-v2 branch, active development) and skills (Skill system: internal/skill, internal/tool) on Jun 11, 2026
@esengine esengine merged commit d3fbcb6 into main-v2 Jun 11, 2026
14 checks passed
@esengine esengine deleted the fix/complete-step-evidence-matching branch June 11, 2026 07:37
esengine added a commit that referenced this pull request Jun 11, 2026
… todos (#4006)

* fix(evidence): tolerate citation drift when matching complete_step to todos

The todo-step matcher demanded byte-exact (case-folded) equality between
complete_step.step and a todo's text, so a fullwidth/halfwidth colon or
whitespace drift ("Phase 5:…" cited as "Phase 5: …") could never match
and the model looped on "no matching todo_write item" retries, burning
tokens (discussion #3970). Same disease #3982 cured for command
citations, different limb.

Normalize both sides (fullwidth ASCII → halfwidth, whitespace dropped,
case-folded) before comparing, fall back to unique substring containment
(≥6 runes; ambiguous citations stay unmatched), and list this turn's
todos in the rejection so the model can self-correct by verbatim content
or index instead of guessing.

* style: gofmt evidence_test (CJK-width map alignment)

---------

Co-authored-by: reasonix <reasonix@deepseek.com>
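
The todo-matching rule from the #4006 commit message above can be sketched as follows. The names normalize and matchTodo are hypothetical, unicode.ToLower stands in for full case folding, and the direction of the substring fallback (the cited step contained in the todo text) is an assumption, since the commit message does not specify it.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// normalize maps fullwidth ASCII (U+FF01..U+FF5E) to halfwidth, drops all
// whitespace, and lowercases what remains.
func normalize(s string) string {
	var b strings.Builder
	for _, r := range s {
		if r >= 0xFF01 && r <= 0xFF5E {
			r -= 0xFEE0 // e.g. fullwidth colon U+FF1A becomes ':'
		}
		if unicode.IsSpace(r) {
			continue
		}
		b.WriteRune(unicode.ToLower(r))
	}
	return b.String()
}

// matchTodo returns the index of the todo a cited step refers to, or -1.
// Normalized equality is tried first; otherwise a citation of at least
// 6 runes matches if exactly one todo contains it.
func matchTodo(step string, todos []string) int {
	want := normalize(step)
	if want == "" {
		return -1
	}
	for i, t := range todos {
		if normalize(t) == want {
			return i
		}
	}
	if utf8.RuneCountInString(want) < 6 {
		return -1
	}
	hit := -1
	for i, t := range todos {
		if strings.Contains(normalize(t), want) {
			if hit != -1 {
				return -1 // ambiguous: more than one todo contains it
			}
			hit = i
		}
	}
	return hit
}

func main() {
	todos := []string{
		"Phase 5:run integration tests",
		"Phase 6: run integration benchmarks",
	}
	fmt.Println(matchTodo("phase 5: Run integration tests", todos))
	fmt.Println(matchTodo("run integration", todos))
}
```

Returning -1 for an ambiguous substring keeps the fallback from silently picking the wrong todo; the rejection can then list the turn's todos so the model can cite one verbatim or by index.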