Skip to content

fix(agent): scan only new tool results in bg review, skip snapshot history (#14944)#14969

Closed
Bartok9 wants to merge 1 commit into
NousResearch:mainfrom
Bartok9:fix/14944-bg-review-stale-tool-results
Closed

fix(agent): scan only new tool results in bg review, skip snapshot history (#14944)#14969
Bartok9 wants to merge 1 commit into
NousResearch:mainfrom
Bartok9:fix/14944-bg-review-stale-tool-results

Conversation

@Bartok9

@Bartok9 Bartok9 commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Problem

Background review notifications were surfacing old successful tool results from the conversation history as if they had just been performed by the review agent.

For example, after a user created a cron job in an earlier turn, a later unrelated background review would repeat:

💾 Cron job '<reminder name>' created. · User profile updated

...even though the review agent never touched cron — it only updated the profile.

Reported in #14944. Related code path also noted in #9055 and #9696.

Root Cause

_spawn_background_review passes conversation_history=messages_snapshot to the review agent, which populates _session_messages with the full prior history (including old role=tool messages). The post-run scan then iterated over all of _session_messages — treating historical tool results as new review actions.

# Before: scans entire _session_messages including snapshot history
for msg in getattr(review_agent, "_session_messages", []):
    ...

Fix

Record the snapshot length before run_conversation(), then slice to only the messages the review agent actually appended:

snapshot_len = len(messages_snapshot)
all_msgs = getattr(review_agent, "_session_messages", [])
new_msgs = all_msgs[snapshot_len:]  # only messages added by the review agent

for msg in new_msgs:
    ...

This is a minimal, zero-risk change — it only affects which messages are summarised in the notification, not the review agent's behaviour or memory writes.

Tests

Added tests/agent/test_bg_review_stale_tool_results.py with 7 unit tests:

Test Verifies
test_stale_tool_result_not_surfaced Historical tool results ignored
test_new_tool_result_is_surfaced Review agent's own results reported
test_mixed_stale_and_new_only_new_surfaced Mixed history: only new results show
test_empty_snapshot_all_msgs_scanned No snapshot → all messages are new
test_failed_tool_results_excluded Failures never shown
test_deduplication_preserved Duplicate strings collapsed
test_regression_issue_14944 Exact scenario from the bug report

All 7 pass ✅

Closes #14944

…story (NousResearch#14944)

Background review notifications were incorrectly including successful tool
results that were already present in the conversation history *before* the
review agent ran.

Root cause: _spawn_background_review initialises the review sub-agent with
conversation_history=messages_snapshot, which populates _session_messages
with the full history.  The post-run scan then iterated over ALL messages
in _session_messages — including the historical role=tool messages from the
snapshot — treating them as newly performed review actions.

Fix: record snapshot_len before run_conversation(), then slice
_session_messages[snapshot_len:] so the scan covers only messages that the
review agent actually appended during its run.

Adds 7 unit tests covering:
- stale tool results not surfaced (regression for NousResearch#14944)
- new tool results correctly surfaced
- mixed history: only new results reported
- empty snapshot: all messages scanned as new
- failed tool results excluded
- deduplication preserved
- exact reporter scenario from issue description

Related: NousResearch#9055, NousResearch#9696 (same code path)
@Bartok9

Bartok9 commented Apr 24, 2026

Copy link
Copy Markdown
Contributor Author

I rechecked this against current main 6051fba9 — no conflicts, the change is isolated to the _spawn_background_review function in run_agent.py. CI test failures (e2e and check-attribution) are pre-existing on main and unrelated to this PR.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/agent Core agent loop, run_agent.py, prompt builder labels Apr 24, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Competing fix for #14944 alongside #14967 and #9696 — all three PRs address the same stale-tool-result bug in background review. Recommend maintainer pick one.

@teknium1

Copy link
Copy Markdown
Contributor

Thanks for the fix @Bartok9 — same bug was also addressed by #14967 (submitted ~37s before yours) which we just merged via salvage #15057.

Your slice-based approach (_session_messages[len(snapshot):]) was cleaner and smaller, but we went with #14967's ID-based matching because it's robust to any future init-step that might reorder/filter the inherited history (compression, prefix-cache replay, etc.). ID matching survives those changes; slice boundary does not.

Your 216-line test suite was excellent. Really appreciate the detailed root-cause writeup and the PR body table. Please keep contributing — next one where your approach is the clear winner, it gets merged.

@teknium1 teknium1 closed this Apr 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Background review notification includes stale tool results from conversation history

3 participants