fix(evidence): stop rejecting real complete_step verifications over command-string drift by esengine · Pull Request #3982 · esengine/DeepSeek-Reasonix

esengine · 2026-06-11T07:32:15Z

Problem

Users keep reporting that complete_step "fails all the time". Session forensics (local logs + #2917's attached transcript) show 5 of 18 real complete_step calls rejected, every one a false negative: the step was genuinely done and verified, but the citation didn't byte-match the receipt. Each rejection then cascades into a todo_write failure (its guard wants a prior complete_step), and in the worst case the model gives up and overclaims in the final answer (#2917).

The observed failure shapes, all from real transcripts:

- Dropped cd prefix: ran cd /repo && git merge upstream/main-v2 --ff-only, cited git merge upstream/main-v2 --ff-only (#2917, "[Bug]: after todo_write / complete_step fails, the Agent still allows the final answer to say 'all done'"). This is not a prefix relation, so the session fallback from #3587 ("fix(agent): reduce evidence verification false signals") missed it too.
- Flag/quote drift: ran rm -v x && ls -la x 2>&1 || true, cited rm -v x && ls x 2>&1; ran echo "deleted", cited echo 'deleted'.
- Compound-segment citation: ran seq …; cat x; wc -l x, cited just wc -l x.
- Negative verification dead end: ran ls deleted-file to prove a deletion; it exits 2, so no successful receipt can ever exist. The old error was indistinguishable from "never ran".
- Schema mismatch: the schema said command/paths were optional while the host requires them, so models follow the schema and get rejected.
- Shell-redirect files: seq … > file leaves no reader/writer receipt, so files evidence for the file always failed; models recovered by re-writing the whole file with write_file just to mint a receipt.

Fix

- Segment matching (evidence.CommandMatches): split cited and ran commands on &&/||/;/|/newlines, quote-strip and whitespace-normalize tokens; a cited segment is covered by a ran segment that equals it or token-supersets it under the same head token. One-token citations still require exact equality; an aggregated citation that no single command covers is still rejected (anti-fabrication holds: a condensed made-up command in the e2e run was still refused).
- Session fallback hardened: uses the same matcher and now skips calls whose recorded result is an error or block, closing the false positive where any attempted command counted as proof.
- Actionable rejections: ran-but-nonzero is now distinguished from never-ran (with a || true hint for negative verification); never-ran rejections list the turn's actual receipts; todo_write's cascade error explains the complete_step-first order. In e2e, every remaining legitimate rejection self-corrected in 1–2 rounds instead of looping blind.
- Schema truthful: command is marked REQUIRED for verification, paths for diff/files.
- Redirect-created files: a successful bash command whose text names the path now counts as a files receipt.
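
The segment-matching rule above can be sketched roughly as follows. This is a minimal illustration written from the description in this PR, not the actual evidence.CommandMatches code: the helper names (splitSegments, tokens, covers), the quote handling, and the (cited, ran) argument order are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// splitSegments breaks a command line on &&, ||, ;, | and newlines,
// ignoring separators that appear inside single or double quotes.
func splitSegments(cmd string) []string {
	var segs []string
	var cur strings.Builder
	var quote rune // 0 while outside quotes
	flush := func() {
		if s := strings.TrimSpace(cur.String()); s != "" {
			segs = append(segs, s)
		}
		cur.Reset()
	}
	rs := []rune(cmd)
	for i := 0; i < len(rs); i++ {
		r := rs[i]
		switch {
		case quote != 0:
			if r == quote {
				quote = 0
			}
			cur.WriteRune(r)
		case r == '\'' || r == '"':
			quote = r
			cur.WriteRune(r)
		case r == ';' || r == '\n':
			flush()
		case r == '&' || r == '|':
			if i+1 < len(rs) && rs[i+1] == r { // "&&" or "||"
				i++
				flush()
			} else if r == '|' { // plain pipe
				flush()
			} else { // lone "&", e.g. inside "2>&1", stays in the segment
				cur.WriteRune(r)
			}
		default:
			cur.WriteRune(r)
		}
	}
	flush()
	return segs
}

// tokens splits a segment on whitespace and strips surrounding quotes.
func tokens(seg string) []string {
	var out []string
	for _, f := range strings.Fields(seg) {
		if t := strings.Trim(f, `"'`); t != "" {
			out = append(out, t)
		}
	}
	return out
}

// covers reports whether the ran segment equals the cited one, or shares its
// head token and contains every cited token. One-token citations must be equal.
func covers(ran, cited []string) bool {
	if len(ran) == 0 || len(cited) == 0 {
		return false
	}
	if strings.Join(ran, " ") == strings.Join(cited, " ") {
		return true
	}
	if len(cited) == 1 || ran[0] != cited[0] {
		return false
	}
	have := make(map[string]bool, len(ran))
	for _, t := range ran {
		have[t] = true
	}
	for _, t := range cited {
		if !have[t] {
			return false
		}
	}
	return true
}

// CommandMatches (hypothetical signature) accepts a citation only when every
// cited segment is covered by some segment of the single command that ran.
func CommandMatches(cited, ran string) bool {
	citedSegs := splitSegments(cited)
	if len(citedSegs) == 0 {
		return false
	}
	ranSegs := splitSegments(ran)
	for _, c := range citedSegs {
		ok := false
		for _, r := range ranSegs {
			if covers(tokens(r), tokens(c)) {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(CommandMatches("git merge upstream/main-v2 --ff-only",
		"cd /repo && git merge upstream/main-v2 --ff-only"))
	fmt.Println(CommandMatches("rm -v x && ls x 2>&1", "rm -v x && ls -la x 2>&1 || true"))
	fmt.Println(CommandMatches("ls", "ls -la x"))
}
```

Requiring every cited segment to be covered by one ran command is what keeps an aggregated, never-executed citation from passing, while still tolerating a dropped cd prefix or an extra flag on the ran side.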

Verification

- Unit replays of all observed failures, verbatim from the transcripts (completestep_test.go, commandmatch_test.go).
- End-to-end A/B against the real DeepSeek API (deepseek-v4-flash, the same scenario the June 2 sessions failed on: create 20-line file → append → delete lines → delete file → negative-verify):
  - base (origin/main-v2): 4/8 complete_step calls rejected, 21 rounds.
  - fixed: 4/4 complete_step first-try, 0 todo_write failures, 16 rounds, all sign-offs host-verified 1, manual/unverified 0.
- go test ./internal/evidence/... ./internal/tool/builtin/... ./internal/agent/... ./internal/cli/... ./internal/control/... green; go vet clean.

Related: #2917, #3469, #3911, #3587.

…lete_step self-correction

complete_step rejected real verifications whenever the cited command string was not byte-identical to the bash receipt: a dropped cd prefix (#2917), a flag or quote-style drift, or a piped tail all failed both the ledger match and the #3587 session fallback's prefix matching. Local session forensics show 5 of 18 real complete_step calls rejected this way, each cascading into todo_write failures and final answers that overclaim.

Match commands by shell segment instead: split cited and ran commands on &&/||/;/|/newlines, quote-strip and whitespace-normalize tokens, and accept a cited segment when a ran segment equals it or supersets its tokens under the same head token. One-token citations still require exact equality, and an aggregated citation that no single command covers is still rejected.

The session fallback now uses the same matcher and skips calls whose recorded result is an error or block, closing the false positive where any attempted command counted as proof.

Rejections now carry recovery context: ran-but-nonzero commands are distinguished from never-ran (with a '|| true' hint for negative verification, e.g. proving a file is gone), never-ran rejections list the turn's actual receipts, and the schema marks command/paths as required for their kinds instead of advertising them optional.

Files created or edited through shell redirection (seq … > file, sed -i) leave no reader/writer receipt, so files evidence for them was always rejected and the model had to re-write the file with write_file just to mint a receipt. A successful bash command whose text names the path now counts as having touched it.
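
A rough sketch of the redirect rule described above, under stated assumptions: the BashReceipt type, its fields, and the touchesPath helper are hypothetical names, and matching the path as a whole shell word (rather than as a raw substring of the command text) is a choice made here for illustration, not something the PR specifies.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// BashReceipt is a hypothetical record of one executed bash command.
type BashReceipt struct {
	Command  string
	ExitCode int
}

// touchesPath reports whether a successful command names the path.
// The command is split on whitespace, quotes, and shell operators so that
// "out.txt" does not match "out.txt.bak".
func touchesPath(r BashReceipt, path string) bool {
	if r.ExitCode != 0 || path == "" {
		return false // failed commands never count as a files receipt
	}
	isSep := func(c rune) bool {
		return unicode.IsSpace(c) || strings.ContainsRune(`;|&<>"'`, c)
	}
	for _, word := range strings.FieldsFunc(r.Command, isSep) {
		if word == path {
			return true
		}
	}
	return false
}

func main() {
	ok := BashReceipt{Command: "seq 1 20 > out.txt", ExitCode: 0}
	failed := BashReceipt{Command: "sed -i 's/a/b/' out.txt", ExitCode: 1}
	fmt.Println(touchesPath(ok, "out.txt"))
	fmt.Println(touchesPath(failed, "out.txt"))
}
```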

… todos (#4006)

* fix(evidence): tolerate citation drift when matching complete_step to todos

The todo-step matcher demanded byte-exact (case-folded) equality between complete_step.step and a todo's text, so a fullwidth/halfwidth colon or whitespace drift ("Phase 5：…" cited as "Phase 5: …") could never match and the model looped on "no matching todo_write item" retries, burning tokens (discussion #3970). Same disease #3982 cured for command citations, different limb.

Normalize both sides (fullwidth ASCII → halfwidth, whitespace dropped, case-folded) before comparing, fall back to unique substring containment (≥6 runes; ambiguous citations stay unmatched), and list this turn's todos in the rejection so the model can self-correct by verbatim content or index instead of guessing.

* style: gofmt evidence_test (CJK-width map alignment)

---------

Co-authored-by: reasonix <reasonix@deepseek.com>
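
The todo normalization described in the commit message above might look roughly like this. It is a sketch under assumptions: the names normalize and matchTodo are hypothetical, unicode.ToLower stands in for full case folding, and containment is checked in both directions because the message does not say which side must contain the other.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

const minContainRunes = 6 // shortest string allowed in the substring fallback

// normalize maps fullwidth ASCII (U+FF01..U+FF5E) to halfwidth,
// drops all whitespace, and lowercases the result.
func normalize(s string) string {
	var b strings.Builder
	for _, r := range s {
		switch {
		case unicode.IsSpace(r):
			continue
		case r >= 0xFF01 && r <= 0xFF5E:
			r -= 0xFEE0
		}
		b.WriteRune(unicode.ToLower(r))
	}
	return b.String()
}

// matchTodo returns the index of the todo the cited step refers to.
// Exact normalized equality wins; otherwise a substring match is accepted
// only when exactly one todo qualifies, so ambiguous citations stay unmatched.
func matchTodo(step string, todos []string) (int, bool) {
	want := normalize(step)
	if want == "" {
		return -1, false
	}
	norm := make([]string, len(todos))
	for i, t := range todos {
		norm[i] = normalize(t)
		if norm[i] == want {
			return i, true
		}
	}
	found := -1
	for i, n := range norm {
		short, long := want, n
		if utf8.RuneCountInString(n) < utf8.RuneCountInString(want) {
			short, long = n, want
		}
		if utf8.RuneCountInString(short) < minContainRunes || !strings.Contains(long, short) {
			continue
		}
		if found != -1 {
			return -1, false // a second candidate makes the citation ambiguous
		}
		found = i
	}
	return found, found != -1
}

func main() {
	// The todo texts are invented for illustration.
	todos := []string{"Phase 5：run the migration", "Phase 6: write docs"}
	fmt.Println(matchTodo("phase 5: run the migration", todos))
	fmt.Println(matchTodo("Phase", todos))
}
```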

reasonix added 2 commits June 11, 2026 00:22

esengine requested a review from SivanCola as a code owner June 11, 2026 07:32

github-actions bot added the v2 and skills labels Jun 11, 2026

esengine merged commit d3fbcb6 into main-v2 Jun 11, 2026
14 checks passed

esengine deleted the fix/complete-step-evidence-matching branch June 11, 2026 07:37

esengine mentioned this pull request Jun 11, 2026

fix(evidence): tolerate citation drift when matching complete_step to todos #4006

Merged
