fix(agent): reduce evidence verification false signals #3587
Merged
…cate finalReadinessCheck, add session fallback for complete_step command matching, and drop stale todo indicator

Three changes that all reduce false signals in the evidence/complete_step/todo verification system:

1. Remove duplicate complete_step ordering check from finalReadinessCheck. The todo_write tool's own guard (verifyTodoCompletionTransitions) already rejects marking items completed without a matching complete_step receipt, and is order-independent. The finalReadinessCheck duplicate was stricter (required complete_step after the latest writer), causing false positives when the model correctly completes steps in a different order.

2. Add session-fallback command verification in complete_step. The per-turn Ledger is reset each turn, so cross-turn references, non-bash tool calls (e.g. the `ls` tool cited as command="ls ."), and truncated command strings all fail the strict receipt match. verifyCommandFromSession scans the full conversation transcript with normalized matching (strip …, collapse whitespace, prefix match with 8-char minimum) and tool-name fallback, eliminating three common false negatives.

3. Remove stale "Progress may be stale" indicator from desktop TodoPanel. The heuristic (≥2 tools after last todo_write, final assistant message, readiness notices) triggered false alerts in normal completion flows. The todo count and transcript itself provide better progress visibility.

Fixes esengine#3469
esengine
approved these changes
Jun 9, 2026
esengine
left a comment
Owner
Solid — reducing the false readiness/evidence signals with the added flow + readiness tests is a good tightening. Thanks!
dorokuma
pushed a commit
to dorokuma/DeepSeek-Reasonix
that referenced
this pull request
Jun 10, 2026
…cate finalReadinessCheck, add session fallback for complete_step command matching, and drop stale todo indicator (esengine#3587)

Co-authored-by: wufengfan <wufengfan@wufengfandeMacBook-Air.local>
esengine
added a commit
that referenced
this pull request
Jun 11, 2026
…ommand-string drift (#3982)

* fix(evidence): match paraphrased verification commands and guide complete_step self-correction

  complete_step rejected real verifications whenever the cited command string was not byte-identical to the bash receipt: a dropped cd prefix (#2917), a flag or quote-style drift, or a piped tail all failed both the ledger match and the #3587 session fallback's prefix matching. Local session forensics show 5 of 18 real complete_step calls rejected this way, each cascading into todo_write failures and final answers that overclaim.

  Match commands by shell segment instead: split cited and ran commands on &&/||/;/|/newlines, quote-strip and whitespace-normalize tokens, and accept a cited segment when a ran segment equals it or supersets its tokens under the same head token. One-token citations still require exact equality, and an aggregated citation that no single command covers is still rejected. The session fallback now uses the same matcher and skips calls whose recorded result is an error or block, closing the false positive where any attempted command counted as proof.

  Rejections now carry recovery context: ran-but-nonzero commands are distinguished from never-ran (with a '|| true' hint for negative verification, e.g. proving a file is gone), never-ran rejections list the turn's actual receipts, and the schema marks command/paths as required for their kinds instead of advertising them optional.

* fix(evidence): count bash commands naming a path as files receipts

  Files created or edited through shell redirection (seq … > file, sed -i) leave no reader/writer receipt, so files evidence for them was always rejected and the model had to re-write the file with write_file just to mint a receipt. A successful bash command whose text names the path now counts as having touched it.

---------

Co-authored-by: reasonix <reasonix@deepseek.com>
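The segment-based matching described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from #3982: the names `segments`, `segmentCovers`, and `commandCovers` are hypothetical, the splitter ignores shell quoting (a `|` inside quotes would still split), and token comparison is set-based, so order and repetition are not checked.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// segmentSep splits on &&, ||, ;, | and newlines. "||" is listed before the
// single "|" so that it is consumed as one separator.
var segmentSep = regexp.MustCompile(`&&|\|\||[;|\n]`)

// segments splits a command line into shell segments and returns the
// quote-stripped, whitespace-normalized tokens of each non-empty segment.
// Hypothetical helper: it does not understand quoting or escapes.
func segments(cmd string) [][]string {
	var out [][]string
	for _, part := range segmentSep.Split(cmd, -1) {
		var toks []string
		for _, f := range strings.Fields(part) {
			if t := strings.Trim(f, `"'`); t != "" {
				toks = append(toks, t)
			}
		}
		if len(toks) > 0 {
			out = append(out, toks)
		}
	}
	return out
}

// segmentCovers reports whether a ran segment proves a cited segment: both
// must share the head token, and the ran tokens must include every cited
// token. A one-token citation only matches a one-token ran segment.
func segmentCovers(ran, cited []string) bool {
	if ran[0] != cited[0] {
		return false
	}
	if len(cited) == 1 {
		return len(ran) == 1
	}
	have := make(map[string]bool, len(ran))
	for _, t := range ran {
		have[t] = true
	}
	for _, t := range cited {
		if !have[t] {
			return false
		}
	}
	return true
}

// commandCovers reports whether a single ran command covers every segment
// of the cited command, so an aggregated citation that no single command
// covers is rejected.
func commandCovers(ran, cited string) bool {
	citedSegs := segments(cited)
	if len(citedSegs) == 0 {
		return false
	}
	ranSegs := segments(ran)
	for _, c := range citedSegs {
		found := false
		for _, r := range ranSegs {
			if segmentCovers(r, c) {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	ran := "cd /repo && go test ./internal/... -run TestFoo | tail -5"
	fmt.Println(commandCovers(ran, "go test ./internal/... -run TestFoo"))
	fmt.Println(commandCovers("ls -la", "ls"))
}
```

In this sketch the first call accepts a citation that dropped the `cd` prefix and the piped `tail`, while the second rejects a one-token citation because the ran segment is not exactly `ls`.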
esengine
added a commit
that referenced
this pull request
Jun 11, 2026
Per-turn evidence ledger reset made complete_step reject cross-turn citations and let the final gate miss an unfinished plan. diff/files evidence now falls back to the full session (like commands, #3587); the host keeps a canonical todo list (survives turns + compaction) the gate consults; a successful complete_step advances that list so the model no longer batches todo_write (#3909). Real-API A/B confirmed base rejects/blocks where the PR accepts/advances. Closes #2917
## Summary

Remove three false signal sources in the evidence/complete_step/todo verification system that either blocked the model's final answer incorrectly, rejected valid `complete_step` evidence, or showed a misleading UI indicator.

## Changes
### 1. Remove duplicate complete_step ordering check from finalReadinessCheck

`finalReadinessCheck()` used `HasSuccessfulCompleteStepAfter(writer)` to require a `complete_step` at a receipt index greater than the last writer tool. But `todo_write`'s own `verifyTodoCompletionTransitions` already rejects marking an item completed without a matching `complete_step` receipt, and it is order-independent (it searches all receipts). The duplicate was stricter, causing false positives when the model called `complete_step → write_file → todo_write(completed)`. Removed entirely.
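The difference between the two rules can be sketched as below. This is a minimal illustration, not the project's code: the `Receipt` type and the helpers `lastIndexOf`, `completeStepAfter`, and `completeStepAnywhere` are hypothetical, and the sketch assumes `write_file` is the writer tool in question.

```go
package main

import "fmt"

// Receipt is a hypothetical, simplified record of one tool call in a turn.
type Receipt struct {
	Tool    string
	Success bool
}

// lastIndexOf returns the index of the last receipt for tool, or -1.
func lastIndexOf(receipts []Receipt, tool string) int {
	for i := len(receipts) - 1; i >= 0; i-- {
		if receipts[i].Tool == tool {
			return i
		}
	}
	return -1
}

// completeStepAfter models the stricter, order-dependent rule: a successful
// complete_step must appear at an index greater than idx.
func completeStepAfter(receipts []Receipt, idx int) bool {
	for i := idx + 1; i < len(receipts); i++ {
		if receipts[i].Tool == "complete_step" && receipts[i].Success {
			return true
		}
	}
	return false
}

// completeStepAnywhere models the order-independent rule: any successful
// complete_step receipt counts, wherever it sits.
func completeStepAnywhere(receipts []Receipt) bool {
	return completeStepAfter(receipts, -1)
}

func main() {
	// The flow from the description: complete_step, then write_file,
	// then todo_write(completed).
	flow := []Receipt{
		{Tool: "complete_step", Success: true},
		{Tool: "write_file", Success: true},
		{Tool: "todo_write", Success: true},
	}
	writer := lastIndexOf(flow, "write_file")
	fmt.Println("ordered check passes:", completeStepAfter(flow, writer))
	fmt.Println("order-independent check passes:", completeStepAnywhere(flow))
}
```

For this flow the ordered check fails while the order-independent one passes, which is the false positive the change removes.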
### 2. Add session-fallback command verification in complete_step
The per-turn `Ledger` is reset each turn, so `complete_step` verification has three blind spots:

| Blind spot | Example |
| --- | --- |
| Cross-turn reference | `bash("grep ...")` in turn 1, cited in turn 2 |
| Non-bash tool call | `ls` tool (not bash), cited as `command="ls ."` |
| Truncated command string | `find . -type f ...` (real), `find . -type f ...` (cited) |

Added `verifyCommandFromSession()`: it normalizes both sides (strips `…`, collapses whitespace), then tries an exact match, a prefix match (minimum 8 characters), and a tool-name match. The receipt match is still the fast path; the session fallback only runs when the receipt misses.
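A rough sketch of that fallback is shown below. It is not the actual `verifyCommandFromSession()` implementation: the `ToolCall` type, the `matchesSessionCall` helper, and the function signature are hypothetical, and two details are assumptions the description does not spell out, namely that the prefix match works in either direction and that the tool-name match compares the first cited token against non-bash tool names only.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// ToolCall is a hypothetical, simplified view of one tool call in the
// conversation transcript.
type ToolCall struct {
	Tool    string // e.g. "bash", "ls"
	Command string // command text for bash calls; empty for other tools
}

// minPrefixLen is the 8-character minimum for prefix matches.
const minPrefixLen = 8

// normalize strips the ellipsis character and collapses whitespace runs.
func normalize(s string) string {
	s = strings.ReplaceAll(s, "…", "")
	return strings.Join(strings.Fields(s), " ")
}

// matchesSessionCall tries, in order: exact match, prefix match in either
// direction (assumption), and tool-name match for non-bash tools (assumption).
func matchesSessionCall(cited string, call ToolCall) bool {
	c := normalize(cited)
	if c == "" {
		return false
	}
	if ran := normalize(call.Command); ran != "" {
		if c == ran {
			return true
		}
		shorter, longer := c, ran
		if len(shorter) > len(longer) {
			shorter, longer = longer, shorter
		}
		if utf8.RuneCountInString(shorter) >= minPrefixLen && strings.HasPrefix(longer, shorter) {
			return true
		}
	}
	// Tool-name fallback: a citation such as "ls ." names the ls tool.
	return call.Tool != "bash" && strings.Fields(c)[0] == call.Tool
}

// verifyCommandFromSession scans every recorded call for a match. The real
// function's signature is not given in the description; this one is a guess.
func verifyCommandFromSession(cited string, calls []ToolCall) bool {
	for _, call := range calls {
		if matchesSessionCall(cited, call) {
			return true
		}
	}
	return false
}

func main() {
	session := []ToolCall{
		{Tool: "bash", Command: "find . -type f -name '*.go' | head"},
		{Tool: "ls"},
	}
	fmt.Println(verifyCommandFromSession("find . -type f …", session))
	fmt.Println(verifyCommandFromSession("ls .", session))
	fmt.Println(verifyCommandFromSession("grep", session))
}
```

The 8-character floor keeps very short citations such as `grep` from prefix-matching any longer command that happens to start with them.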
### 3. Remove stale "Progress may be stale" indicator
The `todoStale` heuristic in the desktop `TodoPanel` checked 4 conditions (≥2 tools after the last `todo_write`, final assistant message, readiness notices, 90s timeout). All except the timeout triggered false positives in normal completion flows: the model finishing its work and producing a final answer always lit the "stale" flag. Removed entirely. The todo count (`{done}/{total}`) and the transcript itself provide better progress visibility.
## Files (13 files, +123 / -82)

| File | Change |
| --- | --- |
| `internal/agent/agent.go` | `HasSuccessfulCompleteStepAfter(writer)` check; add `WithSessionMessages()` injection |
| `internal/evidence/evidence.go` | `WithSessionMessages()` / `SessionMessagesFromContext()` |
| `internal/evidence/readiness_audit.go` | `MissingCompleteStep` from struct |
| `internal/tool/builtin/completestep.go` | `verifyCommandFromSession()` as fallback |
| `internal/cli/run_metrics.go` | `ReadinessMissingCompleteSteps` |
| `internal/cli/run_metrics_test.go` | |
| `internal/agent/evidence_flow_test.go` | |
| `internal/agent/final_readiness_test.go` | |
| `desktop/frontend/src/App.tsx` | `todoStale` `useMemo` |
| `desktop/frontend/src/components/TodoPanel.tsx` | `stale` prop |
| `desktop/frontend/src/locales/en.ts` | `"todo.stale"` |
| `desktop/frontend/src/locales/zh.ts` | `"todo.stale"` |
| `desktop/frontend/src/styles.css` | `.todobar__stale` |

## Testing
## Related