fix(evidence): stop rejecting real complete_step verifications over command-string drift #3982
Merged
added 2 commits
June 11, 2026 00:22
…lete_step self-correction

complete_step rejected real verifications whenever the cited command string was not byte-identical to the bash receipt: a dropped cd prefix (#2917), a flag or quote-style drift, or a piped tail all failed both the ledger match and the #3587 session fallback's prefix matching. Local session forensics show 5 of 18 real complete_step calls rejected this way, each cascading into todo_write failures and final answers that overclaim.

Match commands by shell segment instead: split cited and ran commands on &&/||/;/|/newlines, quote-strip and whitespace-normalize tokens, and accept a cited segment when a ran segment equals it or supersets its tokens under the same head token. One-token citations still require exact equality, and an aggregated citation that no single command covers is still rejected.

The session fallback now uses the same matcher and skips calls whose recorded result is an error or block, closing the false positive where any attempted command counted as proof.

Rejections now carry recovery context: ran-but-nonzero commands are distinguished from never-ran (with a '|| true' hint for negative verification, e.g. proving a file is gone), never-ran rejections list the turn's actual receipts, and the schema marks command/paths as required for their kinds instead of advertising them optional.
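The segment rule in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code of `evidence.CommandMatches`: the helper names (`splitSegments`, `covers`, `commandMatches`) are hypothetical, separators inside quotes are not protected, and the `seq 1 20 > x` command is an invented stand-in for the elided transcript command.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSegments breaks a command on &&, ||, ;, | and newlines, then
// whitespace-normalizes each segment into quote-stripped tokens.
// Simplification (assumption): separators inside quotes are not protected.
func splitSegments(cmd string) [][]string {
	sep := strings.NewReplacer("&&", "\n", "||", "\n", ";", "\n", "|", "\n")
	var segs [][]string
	for _, part := range strings.Split(sep.Replace(cmd), "\n") {
		var toks []string
		for _, f := range strings.Fields(part) {
			if t := strings.Trim(f, `"'`); t != "" {
				toks = append(toks, t)
			}
		}
		if len(toks) > 0 {
			segs = append(segs, toks)
		}
	}
	return segs
}

// covers reports whether a ran segment equals the cited segment or
// supersets its tokens under the same head token. A one-token citation
// must match exactly.
func covers(cited, ran []string) bool {
	if cited[0] != ran[0] {
		return false
	}
	if len(cited) == 1 {
		return len(ran) == 1
	}
	have := make(map[string]bool, len(ran))
	for _, t := range ran {
		have[t] = true
	}
	for _, t := range cited {
		if !have[t] {
			return false
		}
	}
	return true
}

// commandMatches accepts a citation only when every cited segment is
// covered by some segment of the same ran command.
func commandMatches(cited, ran string) bool {
	citedSegs := splitSegments(cited)
	if len(citedSegs) == 0 {
		return false
	}
	ranSegs := splitSegments(ran)
	for _, c := range citedSegs {
		ok := false
		for _, r := range ranSegs {
			if covers(c, r) {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	// Dropped cd prefix.
	fmt.Println(commandMatches("git merge upstream/main-v2 --ff-only",
		"cd /repo && git merge upstream/main-v2 --ff-only")) // true
	// Flag and quote-style drift.
	fmt.Println(commandMatches("rm -v x && ls x 2>&1", "rm -v x && ls -la x 2>&1 || true")) // true
	fmt.Println(commandMatches(`echo 'deleted'`, `echo "deleted"`))                         // true
	// Cited tail of a chain.
	fmt.Println(commandMatches("wc -l x", "seq 1 20 > x; cat x; wc -l x")) // true
	// One-token citation needs exact equality.
	fmt.Println(commandMatches("ls", "ls -la x")) // false
	// Aggregated citation not covered by a single ran command.
	fmt.Println(commandMatches("go test ./... && go vet ./...", "go test ./...")) // false
}
```

Requiring the same head token keeps a citation such as `ls x` from being satisfied by an unrelated command that merely mentions `x`.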
Files created or edited through shell redirection (seq … > file, sed -i) leave no reader/writer receipt, so files evidence for them was always rejected and the model had to re-write the file with write_file just to mint a receipt. A successful bash command whose text names the path now counts as having touched it.
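One way to picture the path check described above is the sketch below. It is an assumption-laden illustration: the `bashReceipt` type and the `mentionsPath` and `touchedByBash` helpers are hypothetical names, and the token handling (quote stripping, leading `>`/`<`, trailing `;`) is a simplification rather than the project's actual rule.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// bashReceipt is a hypothetical record of one bash call in the turn.
type bashReceipt struct {
	Command  string
	ExitCode int
}

// mentionsPath reports whether any token of cmd names path once quotes,
// leading redirection operators and a trailing semicolon are stripped.
func mentionsPath(cmd, path string) bool {
	want := filepath.Clean(path)
	for _, f := range strings.Fields(cmd) {
		tok := strings.Trim(f, `"'`)
		tok = strings.TrimLeft(tok, "<>") // handles ">file" and ">>file"
		tok = strings.TrimRight(tok, ";")
		if tok == "" {
			continue
		}
		if filepath.Clean(tok) == want {
			return true
		}
	}
	return false
}

// touchedByBash accepts files evidence for path when a successful bash
// command names it; failed commands never count.
func touchedByBash(receipts []bashReceipt, path string) bool {
	for _, r := range receipts {
		if r.ExitCode == 0 && mentionsPath(r.Command, path) {
			return true
		}
	}
	return false
}

func main() {
	turn := []bashReceipt{
		{Command: "seq 1 20 > out.txt", ExitCode: 0},
		{Command: "sed -i 2d ./notes.txt", ExitCode: 0},
		{Command: "rm gone.txt", ExitCode: 1},
	}
	fmt.Println(touchedByBash(turn, "out.txt"))   // true
	fmt.Println(touchedByBash(turn, "notes.txt")) // true
	fmt.Println(touchedByBash(turn, "gone.txt"))  // false
	fmt.Println(touchedByBash(turn, "other.txt")) // false
}
```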
esengine added a commit that referenced this pull request Jun 11, 2026
… todos (#4006)

* fix(evidence): tolerate citation drift when matching complete_step to todos

  The todo-step matcher demanded byte-exact (case-folded) equality between complete_step.step and a todo's text, so a fullwidth/halfwidth colon or whitespace drift ("Phase 5:…" cited as "Phase 5: …") could never match and the model looped on "no matching todo_write item" retries, burning tokens (discussion #3970). Same disease #3982 cured for command citations, different limb.

  Normalize both sides (fullwidth ASCII → halfwidth, whitespace dropped, case-folded) before comparing, fall back to unique substring containment (≥6 runes; ambiguous citations stay unmatched), and list this turn's todos in the rejection so the model can self-correct by verbatim content or index instead of guessing.

* style: gofmt evidence_test (CJK-width map alignment)

---------

Co-authored-by: reasonix <reasonix@deepseek.com>
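The normalization and fallback described in this commit can be sketched as below. This is a hedged illustration rather than the code from #4006: `normalize` and `matchTodo` are hypothetical names, and the direction of the substring check (citation contained in the todo text) is an assumption, since the commit message does not state it.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// normalize maps fullwidth ASCII (U+FF01..U+FF5E) to halfwidth, drops
// all whitespace and lower-cases the rest.
func normalize(s string) string {
	var b strings.Builder
	for _, r := range s {
		if r >= 0xFF01 && r <= 0xFF5E {
			r -= 0xFEE0 // fullwidth form to its ASCII counterpart
		}
		if unicode.IsSpace(r) {
			continue
		}
		b.WriteRune(unicode.ToLower(r))
	}
	return b.String()
}

// matchTodo returns the index of the todo a citation refers to, or -1.
// Exact normalized equality wins; otherwise a citation of at least six
// runes may match by containment if exactly one todo contains it.
func matchTodo(step string, todos []string) int {
	want := normalize(step)
	if want == "" {
		return -1
	}
	for i, t := range todos {
		if normalize(t) == want {
			return i
		}
	}
	if utf8.RuneCountInString(want) < 6 {
		return -1
	}
	found := -1
	for i, t := range todos {
		if strings.Contains(normalize(t), want) {
			if found != -1 {
				return -1 // ambiguous citation stays unmatched
			}
			found = i
		}
	}
	return found
}

func main() {
	todos := []string{
		"Phase 5\uFF1Arun integration tests", // fullwidth colon
		"Phase 6: write release notes",
		"Phase 7: write release blog",
	}
	fmt.Println(matchTodo("phase 5: run integration tests", todos)) // 0
	fmt.Println(matchTodo("release notes", todos))                  // 1
	fmt.Println(matchTodo("write release", todos))                  // -1 (ambiguous)
	fmt.Println(matchTodo("Phase", todos))                          // -1 (too short)
}
```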
Problem
Users keep reporting that `complete_step` "fails all the time". Session forensics (local logs plus the transcript attached to #2917) show 5 of 18 real `complete_step` calls rejected, every one a false negative: the step was genuinely done and verified, but the citation didn't byte-match the receipt. Each rejection then cascades into a `todo_write` failure (its guard wants a prior `complete_step`), and in the worst case the model gives up and overclaims in the final answer (#2917).

The observed failure shapes, all from real transcripts:
- Dropped `cd` prefix: ran `cd /repo && git merge upstream/main-v2 --ff-only`, cited `git merge upstream/main-v2 --ff-only` (#2917, "[Bug]: after todo_write / complete_step fails, the agent is still allowed to give a final answer of 'all done'"). Not a prefix relation, so the #3587 session fallback ("fix(agent): reduce evidence verification false signals") missed it too.
- Flag or quote-style drift: ran `rm -v x && ls -la x 2>&1 || true`, cited `rm -v x && ls x 2>&1`; ran `echo "deleted"`, cited `echo 'deleted'`.
- Chained or piped command: ran `seq …; cat x; wc -l x`, cited just `wc -l x`.
- Negative verification: the model runs `ls deleted-file` to prove a deletion; it exits 2, so no successful receipt can ever exist. The old error was indistinguishable from "never ran".
- Schema mismatch: the schema advertised `command`/`paths` as optional while the host requires them, so models follow the schema and get rejected.
- Shell redirection: `seq … > file` leaves no reader/writer receipt, so `files` evidence for the file always failed; models recovered by re-writing the whole file with `write_file` just to mint a receipt.

Fix
- Segment matcher (`evidence.CommandMatches`): split cited and ran commands on `&&`/`||`/`;`/`|`/newlines, quote-strip and whitespace-normalize tokens; a cited segment is covered by a ran segment that equals it or token-supersets it under the same head token. One-token citations still require exact equality; an aggregated citation that no single command covers is still rejected (anti-fabrication holds: a condensed made-up command in the e2e run was still refused).
- Rejections carry recovery context: ran-but-nonzero commands are distinguished from never-ran (with a `|| true` hint for negative verification); never-ran rejections list the turn's actual receipts; `todo_write`'s cascade error explains the `complete_step`-first order. In e2e, every remaining legitimate rejection self-corrected in 1–2 rounds instead of looping blind.
- Schema: `command` is marked REQUIRED for `verification`, `paths` for `diff`/`files`.
- Shell-written files: a successful bash command whose text names the path now counts as a `files` receipt.

Verification
- Unit tests (`completestep_test.go`, `commandmatch_test.go`).
- E2E (`deepseek-v4-flash`, the same scenario the June 2 sessions failed on: create 20-line file → append → delete lines → delete file → negative-verify):
  - Before: … `complete_step` calls rejected, 21 rounds.
  - After: … `complete_step` first-try, 0 `todo_write` failures, 16 rounds, all sign-offs `host-verified 1, manual/unverified 0`.
- `go test ./internal/evidence/... ./internal/tool/builtin/... ./internal/agent/... ./internal/cli/... ./internal/control/...` green; `go vet` clean.

Related: #2917, #3469, #3911, #3587.