fix(agent): reduce evidence verification false signals by CVEngineer66 · Pull Request #3587 · esengine/DeepSeek-Reasonix

CVEngineer66 · 2026-06-08T12:45:55Z

Summary

Remove three sources of false signals in the evidence/complete_step/todo verification system. Respectively, they blocked the model's final answer incorrectly, rejected valid complete_step evidence, and showed a misleading UI indicator.

Changes

1. Remove duplicate complete_step ordering check from finalReadinessCheck

finalReadinessCheck() used HasSuccessfulCompleteStepAfter(writer) to require a complete_step at a receipt index greater than that of the last writer tool. But todo_write's own verifyTodoCompletionTransitions already enforces the complete_step requirement, and it is order-independent (it searches all receipts). The duplicate check was stricter, so it raised false positives on normal flows such as complete_step → write_file → todo_write(completed). The duplicate check is removed entirely.
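The difference between the two checks can be sketched as follows. This is a minimal illustration, not the project's code: the `Receipt` type, both function names, and the assumption that only `write_file` counts as a writer are hypothetical.

```go
package main

import "fmt"

// Receipt is a hypothetical, simplified record of one tool call in a turn.
type Receipt struct {
	Tool string
	OK   bool
}

// strictCompleteStepAfterWriter mimics the removed check: a successful
// complete_step must sit at a higher index than the last writer tool.
func strictCompleteStepAfterWriter(rs []Receipt, writers map[string]bool) bool {
	last := -1
	for i, r := range rs {
		if writers[r.Tool] {
			last = i
		}
	}
	for i, r := range rs {
		if i > last && r.Tool == "complete_step" && r.OK {
			return true
		}
	}
	return false
}

// anyCompleteStep mimics the order-independent guard: any successful
// complete_step receipt counts, wherever it appears.
func anyCompleteStep(rs []Receipt) bool {
	for _, r := range rs {
		if r.Tool == "complete_step" && r.OK {
			return true
		}
	}
	return false
}

func main() {
	// The flow from the description: complete_step, then write_file, then todo_write.
	flow := []Receipt{
		{Tool: "complete_step", OK: true},
		{Tool: "write_file", OK: true},
		{Tool: "todo_write", OK: true},
	}
	writers := map[string]bool{"write_file": true} // assumption: only write_file is a writer

	fmt.Println(strictCompleteStepAfterWriter(flow, writers)) // false: the strict check blocks this flow
	fmt.Println(anyCompleteStep(flow))                        // true: the order-independent guard accepts it
}
```

The strict variant rejects the flow only because of where the receipt sits, which is the false positive this change removes.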

2. Add session-fallback command verification in complete_step

The per-turn Ledger is reset each turn and command citations are matched strictly against bash receipts, so complete_step verification has three blind spots:

Scenario	Example	Before	After
Cross-turn command	`bash("grep ...")` in turn 1, cited in turn 2	❌ no matching receipt	✅ scans session transcript
Non-bash tool name	`ls` tool (not bash), cited as command="ls ."	❌ no bash receipt	✅ matches by tool name
Truncated command	`find . -type f ...` (real), `find . -type f ...` (cited)	❌ string mismatch	✅ normalized + prefix match

Added verifyCommandFromSession(), which normalizes both sides (strips …, collapses whitespace) and then tries an exact match, a prefix match (minimum 8 characters), and a tool-name match. The receipt match remains the fast path; the session fallback runs only when the receipt lookup misses.
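The matching rules described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual verifyCommandFromSession(): the `ToolCall` type, the `matchesSession` name, and the choice to allow the prefix match in either direction are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// ToolCall is a hypothetical, simplified view of one tool call in the transcript.
type ToolCall struct {
	Tool    string // e.g. "bash" or "ls"
	Command string // command text for bash calls; may be empty for other tools
}

// minPrefixLen is the 8-character minimum mentioned in the description.
const minPrefixLen = 8

// normalize strips the "…" truncation marker and collapses whitespace runs.
func normalize(s string) string {
	s = strings.ReplaceAll(s, "…", " ")
	return strings.Join(strings.Fields(s), " ")
}

// matchesSession reports whether a cited command is backed by any call in the
// session: exact match, prefix match (either direction, an assumption), or a
// non-bash tool whose name equals the first word of the citation.
func matchesSession(cited string, calls []ToolCall) bool {
	c := normalize(cited)
	if c == "" {
		return false
	}
	head := strings.Fields(c)[0]
	for _, call := range calls {
		ran := normalize(call.Command)
		switch {
		case ran != "" && ran == c:
			return true
		case len(c) >= minPrefixLen && len(ran) >= minPrefixLen &&
			(strings.HasPrefix(ran, c) || strings.HasPrefix(c, ran)):
			return true
		case call.Tool != "bash" && call.Tool == head:
			return true
		}
	}
	return false
}

func main() {
	calls := []ToolCall{
		{Tool: "bash", Command: "find . -type f -name '*.go'   | head"},
		{Tool: "ls"},
		{Tool: "bash", Command: "git status"},
	}
	fmt.Println(matchesSession("find . -type f …", calls)) // true: normalized prefix match
	fmt.Println(matchesSession("ls .", calls))             // true: tool-name match
	fmt.Println(matchesSession("git", calls))              // false: too short for a prefix match
}
```

In the flow the description outlines, a function like this would run only after the per-turn receipt lookup has missed.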

3. Remove the "Progress may be stale" indicator

The todoStale heuristic in the desktop TodoPanel checked 4 conditions (≥2 tools after the last todo_write, a final assistant message, readiness notices, and a 90s timeout). All except the timeout triggered false positives in normal completion flows: whenever the model finished its work and produced a final answer, the "stale" flag lit up. The heuristic is removed entirely. The todo count ({done}/{total}) and the transcript itself provide better progress visibility.

Files (13 files, +123 / -82)

File	Δ	Change
`internal/agent/agent.go`	+8 -1	Remove `HasSuccessfulCompleteStepAfter(writer)` check; add `WithSessionMessages()` injection
`internal/evidence/evidence.go`	+17 -0	Add `WithSessionMessages()` / `SessionMessagesFromContext()`
`internal/evidence/readiness_audit.go`	-1	Remove `MissingCompleteStep` from struct
`internal/tool/builtin/completestep.go`	+98 -2	Add `verifyCommandFromSession()` as fallback
`internal/cli/run_metrics.go`	-4	Remove `ReadinessMissingCompleteSteps`
`internal/cli/run_metrics_test.go`	-5	Update assertion
`internal/agent/evidence_flow_test.go`	+2 -2	Update assertion
`internal/agent/final_readiness_test.go`	+1 -1	Update assertion
`desktop/frontend/src/App.tsx`	-30	Remove `todoStale` useMemo
`desktop/frontend/src/components/TodoPanel.tsx`	-8	Remove `stale` prop
`desktop/frontend/src/locales/en.ts`	-1	Remove `"todo.stale"`
`desktop/frontend/src/locales/zh.ts`	-1	Remove `"todo.stale"`
`desktop/frontend/src/styles.css`	-15	Remove `.todobar__stale`

Testing

ok  internal/agent       5.520s
ok  internal/evidence    2.367s
ok  internal/tool/builtin 6.959s
ok  internal/cli         5.324s

# Cache guard tests all pass (zero cache impact):
ok  TestCacheHitPrefixStable        ✅
ok  TestCacheHitClimbsWithoutCompaction  ✅
ok  TestCompactRewriteVersionFeedsCacheDiagnostics  ✅

Related

Fixes bug: finalReadinessCheck enforces strict ordering (complete_step must follow writer), causing false positive on correctly ordered tool calls #3469

…cate finalReadinessCheck, add session fallback for complete_step command matching, and drop stale todo indicator

Three changes that all reduce false signals in the evidence/complete_step/todo verification system:

1. Remove duplicate complete_step ordering check from finalReadinessCheck. The todo_write tool's own guard (verifyTodoCompletionTransitions) already rejects marking items completed without a matching complete_step receipt, and is order-independent. The finalReadinessCheck duplicate was stricter (required complete_step after the latest writer), causing false positives when the model correctly completes steps in a different order.

2. Add session-fallback command verification in complete_step. The per-turn Ledger is reset each turn, so cross-turn references, non-bash tool calls (e.g. the `ls` tool cited as command="ls ."), and truncated command strings all fail the strict receipt match. verifyCommandFromSession scans the full conversation transcript with normalized matching (strip …, collapse whitespace, prefix match with 8-char minimum) and tool-name fallback, eliminating three common false negatives.

3. Remove stale "Progress may be stale" indicator from desktop TodoPanel. The heuristic (≥2 tools after last todo_write, final assistant message, readiness notices) triggered false alerts in normal completion flows. The todo count and transcript itself provide better progress visibility.

Fixes esengine#3469

esengine

Solid — reducing the false readiness/evidence signals with the added flow + readiness tests is a good tightening. Thanks!

…cate finalReadinessCheck, add session fallback for complete_step command matching, and drop stale todo indicator (esengine#3587)

Three changes that all reduce false signals in the evidence/complete_step/todo verification system:

1. Remove duplicate complete_step ordering check from finalReadinessCheck. The todo_write tool's own guard (verifyTodoCompletionTransitions) already rejects marking items completed without a matching complete_step receipt, and is order-independent. The finalReadinessCheck duplicate was stricter (required complete_step after the latest writer), causing false positives when the model correctly completes steps in a different order.

2. Add session-fallback command verification in complete_step. The per-turn Ledger is reset each turn, so cross-turn references, non-bash tool calls (e.g. the `ls` tool cited as command="ls ."), and truncated command strings all fail the strict receipt match. verifyCommandFromSession scans the full conversation transcript with normalized matching (strip …, collapse whitespace, prefix match with 8-char minimum) and tool-name fallback, eliminating three common false negatives.

3. Remove stale "Progress may be stale" indicator from desktop TodoPanel. The heuristic (≥2 tools after last todo_write, final assistant message, readiness notices) triggered false alerts in normal completion flows. The todo count and transcript itself provide better progress visibility.

Fixes esengine#3469

Co-authored-by: wufengfan <wufengfan@wufengfandeMacBook-Air.local>

…ommand-string drift (#3982)

* fix(evidence): match paraphrased verification commands and guide complete_step self-correction

complete_step rejected real verifications whenever the cited command string was not byte-identical to the bash receipt: a dropped cd prefix (#2917), a flag or quote-style drift, or a piped tail all failed both the ledger match and the #3587 session fallback's prefix matching. Local session forensics show 5 of 18 real complete_step calls rejected this way, each cascading into todo_write failures and final answers that overclaim.

Match commands by shell segment instead: split cited and ran commands on &&/||/;/|/newlines, quote-strip and whitespace-normalize tokens, and accept a cited segment when a ran segment equals it or supersets its tokens under the same head token. One-token citations still require exact equality, and an aggregated citation that no single command covers is still rejected. The session fallback now uses the same matcher and skips calls whose recorded result is an error or block, closing the false positive where any attempted command counted as proof.

Rejections now carry recovery context: ran-but-nonzero commands are distinguished from never-ran (with a '|| true' hint for negative verification, e.g. proving a file is gone), never-ran rejections list the turn's actual receipts, and the schema marks command/paths as required for their kinds instead of advertising them optional.

* fix(evidence): count bash commands naming a path as files receipts

Files created or edited through shell redirection (seq … > file, sed -i) leave no reader/writer receipt, so files evidence for them was always rejected and the model had to re-write the file with write_file just to mint a receipt. A successful bash command whose text names the path now counts as having touched it.

---------

Co-authored-by: reasonix <reasonix@deepseek.com>
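The segment-based matching described in the #3982 commit message can be sketched as below. This is a simplified illustration, not the merged matcher: all names are hypothetical, the splitter is not quote-aware (a `|` inside quotes would still split), and quote stripping is applied per whitespace-separated token only.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// segmentSep splits on &&, ||, ;, | and newlines. "||" is listed before "|"
// so that it is consumed as one separator.
var segmentSep = regexp.MustCompile(`&&|\|\||;|\||\n`)

// segments splits a command into shell segments of quote-stripped tokens.
func segments(cmd string) [][]string {
	var out [][]string
	for _, part := range segmentSep.Split(cmd, -1) {
		var toks []string
		for _, f := range strings.Fields(part) {
			if f = strings.Trim(f, `"'`); f != "" {
				toks = append(toks, f)
			}
		}
		if len(toks) > 0 {
			out = append(out, toks)
		}
	}
	return out
}

// segmentCovers reports whether a ran segment equals the cited segment or
// contains all of its tokens under the same head token. One-token citations
// require exact equality.
func segmentCovers(ran, cited []string) bool {
	if len(ran) == 0 || len(cited) == 0 || ran[0] != cited[0] {
		return false
	}
	if len(cited) == 1 {
		return len(ran) == 1
	}
	have := make(map[string]bool, len(ran))
	for _, t := range ran {
		have[t] = true
	}
	for _, t := range cited {
		if !have[t] {
			return false
		}
	}
	return true
}

// commandMatches accepts a citation only when every cited segment is covered
// by some segment of the single ran command.
func commandMatches(cited, ran string) bool {
	citedSegs, ranSegs := segments(cited), segments(ran)
	if len(citedSegs) == 0 {
		return false
	}
	for _, c := range citedSegs {
		covered := false
		for _, r := range ranSegs {
			if segmentCovers(r, c) {
				covered = true
				break
			}
		}
		if !covered {
			return false
		}
	}
	return true
}

func main() {
	ran := "cd repo && go test ./internal/agent -run TestX | tail -5"
	fmt.Println(commandMatches("go test ./internal/agent", ran))                   // true: dropped cd prefix and pipe
	fmt.Println(commandMatches("go", "go test ./internal/agent"))                  // false: one-token citation
	fmt.Println(commandMatches("go test ./a && go vet ./a", "go test ./a"))        // false: aggregated citation
	fmt.Println(commandMatches(`grep "foo" file.txt`, "grep 'foo' file.txt"))      // true: quote-style drift
}
```

Requiring the same head token keeps a citation such as `go vet` from being satisfied by an unrelated command that merely happens to contain the same words.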

Per-turn evidence ledger reset made complete_step reject cross-turn citations and let the final gate miss an unfinished plan. diff/files evidence now falls back to the full session (like commands, #3587); the host keeps a canonical todo list (survives turns + compaction) the gate consults; a successful complete_step advances that list so the model no longer batches todo_write (#3909). Real-API A/B confirmed base rejects/blocks where the PR accepts/advances. Closes #2917

CVEngineer66 requested review from SivanCola and esengine as code owners June 8, 2026 12:45

CVEngineer66 force-pushed the main-v2 branch from 8a5c159 to 3890e57 Compare June 8, 2026 14:40

esengine approved these changes Jun 9, 2026

esengine merged commit 4441e6b into esengine:main-v2 Jun 9, 2026
19 checks passed

esengine mentioned this pull request Jun 11, 2026

fix(evidence): stop rejecting real complete_step verifications over command-string drift #3982

Merged

esengine mentioned this pull request Jun 11, 2026

fix(agent): close complete_step cross-turn evidence + loop gaps #4014

Merged
