Problem
complete_step currently checks only that the model supplied evidence fields. The host runtime does not confirm that the cited command, file read, or edit actually happened in the current turn.
Proposal
Add a small runtime-only evidence receipt ledger. The agent records tool-call receipts during a turn, and complete_step checks verification, diff, and files evidence against those receipts before accepting a step completion.
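The proposal above could be sketched roughly as follows. This is a minimal illustration, not the intended implementation: the names Receipt, ReceiptLedger, and check_evidence, the receipt kinds ("command", "read", "edit"), and the mapping of the verification, diff, and files fields onto those kinds are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Receipt:
    """One tool call observed by the runtime (hypothetical shape)."""
    kind: str    # assumed kinds: "command", "read", "edit"
    target: str  # command line or file path


@dataclass
class ReceiptLedger:
    """In-memory only: reset at the start of each turn, never persisted."""
    _receipts: set = field(default_factory=set)

    def start_turn(self) -> None:
        self._receipts.clear()

    def record(self, kind: str, target: str) -> None:
        self._receipts.add(Receipt(kind, target))

    def has(self, kind: str, target: str) -> bool:
        return Receipt(kind, target) in self._receipts


def check_evidence(ledger: ReceiptLedger, evidence: dict) -> list:
    """Return a list of problems; an empty list means the step can be accepted.

    Assumed mapping: verification -> a command receipt, diff -> edit receipts,
    files -> a read or edit receipt for each path.
    """
    problems = []
    command = evidence.get("verification")
    if command and not ledger.has("command", command):
        problems.append(f"verification command not run this turn: {command}")
    for path in evidence.get("diff", []):
        if not ledger.has("edit", path):
            problems.append(f"no edit receipt for diff path: {path}")
    for path in evidence.get("files", []):
        if not (ledger.has("read", path) or ledger.has("edit", path)):
            problems.append(f"no read or edit receipt for file: {path}")
    return problems
```

In this sketch the agent loop would call start_turn() when a turn begins and record() after each tool call, and complete_step would reject the completion whenever check_evidence returns a non-empty list. Keeping the ledger as a plain in-memory set matches the non-goal of not persisting receipt data.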
Non-goals
- No UI changes.
- No performance claims or optimization.
- No auto-plan or multi-agent behavior.
- No prompt, tool schema, or tool list changes.
- No persistence of receipt data.
Conflict check
This RFC is intentionally scoped away from the active PRs for auto-plan, worktree agents, goal state, cache diagnostics, and MCP startup/import. The first implementation should touch only the core agent loop, complete_step behavior, and focused tests.
Review evidence plan
PRs will link this RFC. Because the first slice has no UI changes, screenshots are not applicable. Because it makes no performance claim and avoids prompt/tool-schema changes, cache/token metrics are not expected; if scope changes, the PR will include the required data.