fix(agent): reduce evidence verification false signals#3587

Merged
esengine merged 1 commit into esengine:main-v2 from CVEngineer66:main-v2
Jun 9, 2026

@CVEngineer66

Contributor

Summary

Remove three sources of false signals in the evidence / `complete_step` / todo verification system. Respectively, they incorrectly blocked the model's final answer, rejected valid `complete_step` evidence, and showed a misleading UI indicator.


Changes

1. Remove duplicate complete_step ordering check from finalReadinessCheck

`finalReadinessCheck()` used `HasSuccessfulCompleteStepAfter(writer)` to require a `complete_step` receipt at an index greater than that of the last writer tool. But `todo_write`'s own guard, `verifyTodoCompletionTransitions`, already rejects marking an item completed without a matching `complete_step` receipt, and it is order-independent (it searches all receipts). The duplicate check was stricter, so it produced false positives when the model called `complete_step` → `write_file` → `todo_write(completed)`. It is removed entirely.
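The difference between the two checks can be illustrated with a minimal sketch. The `Receipt` type and both function bodies below are hypothetical stand-ins written for this example, not the code in `internal/agent` or `internal/evidence`; only the behaviour described above is assumed.

```go
package main

import "fmt"

// Receipt is a hypothetical stand-in for one ledger entry.
type Receipt struct {
	Tool string
	OK   bool
}

// strictAfterWriter mimics the removed check: a successful complete_step
// must sit at an index greater than the last writer receipt.
func strictAfterWriter(rs []Receipt, writers map[string]bool) bool {
	lastWriter := -1
	for i, r := range rs {
		if writers[r.Tool] {
			lastWriter = i
		}
	}
	for i, r := range rs {
		if r.Tool == "complete_step" && r.OK && i > lastWriter {
			return true
		}
	}
	return false
}

// anyCompleteStep mimics the order-independent guard: any successful
// complete_step receipt counts, wherever it appears.
func anyCompleteStep(rs []Receipt) bool {
	for _, r := range rs {
		if r.Tool == "complete_step" && r.OK {
			return true
		}
	}
	return false
}

func main() {
	// The flow from the description: complete_step, then write_file, then todo_write.
	rs := []Receipt{
		{Tool: "complete_step", OK: true},
		{Tool: "write_file", OK: true},
		{Tool: "todo_write", OK: true},
	}
	// Assumption for the sketch: write_file is the only writer tool.
	writers := map[string]bool{"write_file": true}

	fmt.Println(strictAfterWriter(rs, writers)) // false: the writer comes after complete_step
	fmt.Println(anyCompleteStep(rs))            // true: a matching receipt exists
}
```

In this flow the strict check reports a missing `complete_step` even though one succeeded, which is the false positive the change removes.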

2. Add session-fallback command verification in complete_step

The per-turn Ledger is reset each turn, so `complete_step` verification has three blind spots:

| Scenario | Example | Before | After |
| --- | --- | --- | --- |
| Cross-turn command | `bash("grep ...")` in turn 1, cited in turn 2 | ❌ no matching receipt | ✅ scans session transcript |
| Non-bash tool name | `ls` tool (not bash), cited as `command="ls ."` | ❌ no bash receipt | ✅ matches by tool name |
| Truncated command | `find . -type f ...` (real), `find . -type f ...` (cited) | ❌ string mismatch | ✅ normalized + prefix match |

Added `verifyCommandFromSession()`. It normalizes both sides (strips `…`, collapses whitespace), then tries an exact match, a prefix match (minimum 8 characters), and a tool-name match. The receipt match is still the fast path; the session fallback only runs when the receipt lookup misses.
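The matching order described above can be sketched roughly as follows. This is an illustration only: `ToolCall`, `normalize`, and `matchesSession` are hypothetical names, the real `verifyCommandFromSession()` in `internal/tool/builtin/completestep.go` may differ, and stripping an ASCII `...` as well as `…` is an assumption made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// ToolCall is a hypothetical record of one tool invocation in the transcript.
type ToolCall struct {
	Tool    string // e.g. "bash" or "ls"
	Command string // command text for bash calls; may be empty for other tools
}

// minPrefixLen is the 8-character minimum from the description (bytes here).
const minPrefixLen = 8

// normalize strips ellipsis markers and collapses runs of whitespace.
// Handling the ASCII "..." form is an assumption of this sketch.
func normalize(s string) string {
	s = strings.ReplaceAll(s, "…", " ")
	s = strings.ReplaceAll(s, "...", " ")
	return strings.Join(strings.Fields(s), " ")
}

// matchesSession tries exact, prefix, then tool-name matching per call.
func matchesSession(cited string, calls []ToolCall) bool {
	c := normalize(cited)
	if c == "" {
		return false
	}
	for _, call := range calls {
		ran := normalize(call.Command)
		if ran != "" {
			if ran == c {
				return true // exact match after normalization
			}
			if len(c) >= minPrefixLen && len(ran) >= minPrefixLen &&
				(strings.HasPrefix(ran, c) || strings.HasPrefix(c, ran)) {
				return true // one side is a truncation of the other
			}
		}
		// Tool-name match: the cited command starts with a non-bash tool's name.
		if call.Tool != "bash" && strings.Fields(c)[0] == call.Tool {
			return true
		}
	}
	return false
}

func main() {
	calls := []ToolCall{
		{Tool: "bash", Command: "find . -type f -name '*.go' -newer x"},
		{Tool: "ls"},
	}
	fmt.Println(matchesSession("find . -type f …", calls)) // true: prefix match
	fmt.Println(matchesSession("ls .", calls))             // true: tool-name match
	fmt.Println(matchesSession("rm -rf build", calls))     // false: nothing matches
}
```

The 8-character floor keeps very short citations such as `ls` from prefix-matching every command that happens to start with those letters.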

3. Remove the "Progress may be stale" indicator

The `todoStale` heuristic in the desktop `TodoPanel` checked four conditions: at least 2 tool calls after the last `todo_write`, a final assistant message, readiness notices, and a 90s timeout. All except the timeout triggered false positives in normal completion flows: a model that finished its work and produced a final answer always lit the "stale" flag. The heuristic is removed entirely. The todo count (`{done}/{total}`) and the transcript itself provide better progress visibility.


Files (13 files, +123 / -82)

| File | Δ | Change |
| --- | --- | --- |
| internal/agent/agent.go | +8 -1 | Remove `HasSuccessfulCompleteStepAfter(writer)` check; add `WithSessionMessages()` injection |
| internal/evidence/evidence.go | +17 -0 | Add `WithSessionMessages()` / `SessionMessagesFromContext()` |
| internal/evidence/readiness_audit.go | -1 | Remove `MissingCompleteStep` from struct |
| internal/tool/builtin/completestep.go | +98 -2 | Add `verifyCommandFromSession()` as fallback |
| internal/cli/run_metrics.go | -4 | Remove `ReadinessMissingCompleteSteps` |
| internal/cli/run_metrics_test.go | -5 | Update assertion |
| internal/agent/evidence_flow_test.go | +2 -2 | Update assertion |
| internal/agent/final_readiness_test.go | +1 -1 | Update assertion |
| desktop/frontend/src/App.tsx | -30 | Remove `todoStale` useMemo |
| desktop/frontend/src/components/TodoPanel.tsx | -8 | Remove `stale` prop |
| desktop/frontend/src/locales/en.ts | -1 | Remove "todo.stale" |
| desktop/frontend/src/locales/zh.ts | -1 | Remove "todo.stale" |
| desktop/frontend/src/styles.css | -15 | Remove `.todobar__stale` |

Testing

ok  internal/agent       5.520s
ok  internal/evidence    2.367s
ok  internal/tool/builtin 6.959s
ok  internal/cli         5.324s

# Cache guard tests all pass (zero cache impact):
ok  TestCacheHitPrefixStable        ✅
ok  TestCacheHitClimbsWithoutCompaction  ✅
ok  TestCompactRewriteVersionFeedsCacheDiagnostics  ✅

Related

…cate finalReadinessCheck, add session fallback for complete_step command matching, and drop stale todo indicator

Three changes that all reduce false signals in the evidence/complete_step/todo
verification system:

1. Remove duplicate complete_step ordering check from finalReadinessCheck.
   The todo_write tool's own guard (verifyTodoCompletionTransitions) already
   rejects marking items completed without a matching complete_step receipt,
   and is order-independent. The finalReadinessCheck duplicate was stricter
   (required complete_step after the latest writer), causing false positives
   when the model correctly completes steps in a different order.

2. Add session-fallback command verification in complete_step.
   The per-turn Ledger is reset each turn, so cross-turn references, non-bash
   tool calls (e.g. the `ls` tool cited as command="ls ."), and truncated
   command strings all fail the strict receipt match. verifyCommandFromSession
   scans the full conversation transcript with normalized matching (strip …,
   collapse whitespace, prefix match with 8-char minimum) and tool-name
   fallback, eliminating three common false negatives.

3. Remove stale "Progress may be stale" indicator from desktop TodoPanel.
   The heuristic (≥2 tools after last todo_write, final assistant message,
   readiness notices) triggered false alerts in normal completion flows.
   The todo count and transcript itself provide better progress visibility.

Fixes esengine#3469
@github-actions (bot) added the v2, desktop, tui, skills, and agent labels on Jun 8, 2026

@esengine esengine left a comment

Owner


Solid — reducing the false readiness/evidence signals with the added flow + readiness tests is a good tightening. Thanks!

@esengine esengine merged commit 4441e6b into esengine:main-v2 Jun 9, 2026
19 checks passed
dorokuma pushed a commit to dorokuma/DeepSeek-Reasonix that referenced this pull request Jun 10, 2026
…cate finalReadinessCheck, add session fallback for complete_step command matching, and drop stale todo indicator (esengine#3587)

Co-authored-by: wufengfan <wufengfan@wufengfandeMacBook-Air.local>
esengine added a commit that referenced this pull request Jun 11, 2026
…ommand-string drift (#3982)

* fix(evidence): match paraphrased verification commands and guide complete_step self-correction

complete_step rejected real verifications whenever the cited command string
was not byte-identical to the bash receipt: a dropped cd prefix (#2917), a
flag or quote-style drift, or a piped tail all failed both the ledger match
and the #3587 session fallback's prefix matching. Local session forensics
show 5 of 18 real complete_step calls rejected this way, each cascading into
todo_write failures and final answers that overclaim.

Match commands by shell segment instead: split cited and ran commands on
&&/||/;/|/newlines, quote-strip and whitespace-normalize tokens, and accept a
cited segment when a ran segment equals it or supersets its tokens under the
same head token. One-token citations still require exact equality, and an
aggregated citation that no single command covers is still rejected. The
session fallback now uses the same matcher and skips calls whose recorded
result is an error or block, closing the false positive where any attempted
command counted as proof.

Rejections now carry recovery context: ran-but-nonzero commands are
distinguished from never-ran (with a '|| true' hint for negative
verification, e.g. proving a file is gone), never-ran rejections list the
turn's actual receipts, and the schema marks command/paths as required for
their kinds instead of advertising them optional.

* fix(evidence): count bash commands naming a path as files receipts

Files created or edited through shell redirection (seq … > file, sed -i)
leave no reader/writer receipt, so files evidence for them was always
rejected and the model had to re-write the file with write_file just to
mint a receipt. A successful bash command whose text names the path now
counts as having touched it.

---------

Co-authored-by: reasonix <reasonix@deepseek.com>
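The segment-based matching described in the #3982 commit message above can be sketched as follows. This is a simplified reading, not the merged implementation: the splitter is not quote-aware, the function names are hypothetical, and treating "an aggregated citation that no single command covers" as "every cited segment must match a segment of the same ran command" is an interpretation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// sepRe splits on &&, ||, ;, | and newlines. Splitting inside quoted
// strings is not handled in this sketch.
var sepRe = regexp.MustCompile(`&&|\|\||[;|\n]`)

// segments splits a command line into token lists, stripping quotes.
func segments(cmd string) [][]string {
	var out [][]string
	for _, part := range sepRe.Split(cmd, -1) {
		toks := strings.Fields(part)
		for i, t := range toks {
			toks[i] = strings.Trim(t, `"'`)
		}
		if len(toks) > 0 {
			out = append(out, toks)
		}
	}
	return out
}

// segmentMatches accepts equality, or a ran segment that shares the head
// token and contains every cited token. One-token citations need equality.
func segmentMatches(cited, ran []string) bool {
	if strings.Join(cited, " ") == strings.Join(ran, " ") {
		return true
	}
	if len(cited) < 2 || len(ran) == 0 || cited[0] != ran[0] {
		return false
	}
	have := make(map[string]bool, len(ran))
	for _, t := range ran {
		have[t] = true
	}
	for _, t := range cited {
		if !have[t] {
			return false
		}
	}
	return true
}

// covers reports whether one ran command accounts for every cited segment.
func covers(cited, ran string) bool {
	cs, rs := segments(cited), segments(ran)
	if len(cs) == 0 {
		return false
	}
	for _, c := range cs {
		found := false
		for _, r := range rs {
			if segmentMatches(c, r) {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	ran := "cd repo && go test ./... -count=1 | tail -5"
	fmt.Println(covers("go test ./...", ran))                   // true: dropped cd prefix, extra flag, piped tail
	fmt.Println(covers("ls", "ls -la"))                         // false: one-token citation needs equality
	fmt.Println(covers("go test ./... && go vet ./...", ran))   // false: go vet never ran
}
```

A caller would additionally skip transcript entries whose recorded result is an error or a block, as the commit message describes; that filtering is omitted here.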
esengine added a commit that referenced this pull request Jun 11, 2026
Per-turn evidence ledger reset made complete_step reject cross-turn citations and let the final gate miss an unfinished plan. diff/files evidence now falls back to the full session (like commands, #3587); the host keeps a canonical todo list (survives turns + compaction) the gate consults; a successful complete_step advances that list so the model no longer batches todo_write (#3909). Real-API A/B confirmed base rejects/blocks where the PR accepts/advances.

Closes #2917

Labels

- agent: Core agent loop (internal/agent, internal/control)
- desktop: Wails desktop app (desktop/**)
- skills: Skill system (internal/skill, internal/tool)
- tui: Terminal UI / CLI (internal/cli, internal/control)
- v2: Go rewrite (1.x), main-v2 branch, active development

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: finalReadinessCheck enforces strict ordering (complete_step must follow writer), causing false positive on correctly ordered tool calls

2 participants