fix(hindsight): finalize memory before CLI shutdown by stepanov1975 · Pull Request #34859 · NousResearch/hermes-agent

stepanov1975 · 2026-05-29T20:43:39Z

What does this PR do?

Fixes the Hindsight memory shutdown race for one-shot CLI sessions by finalizing the active session before interpreter teardown, then draining/closing memory resources while Python can still schedule async work.

This addresses the failure mode where Hindsight retain work is submitted during atexit/interpreter shutdown and logs errors such as:

Hindsight retain failed: cannot schedule new futures after interpreter shutdown
Unclosed client session
Unclosed connector

It also fixes a related data-loss edge case for retain_every_n_turns > 1: final partial batches are now flushed on session end or session switch, and modern Hindsight append-mode retains track queued vs confirmed high-water marks so failed append writes can be retried without duplicating successful appends.

Related Issue

Fixes #15497
Fixes #15073

Related: #15512 is an open narrower PR for the provider-side shutdown guard. This PR covers the CLI lifecycle/finalization path, final partial-batch flushing, append-mode duplicate prevention, and retry behavior for failed append retains.

Type of Change

🐛 Bug fix (non-breaking change that fixes an issue)
✨ New feature (non-breaking change that adds functionality)
🔒 Security fix
📝 Documentation update
✅ Tests (adding or improving test coverage)
♻️ Refactor (no behavior change)
🎯 New skill (bundled or hub)

Changes Made

cli.py
- Add explicit active-session finalization before one-shot CLI exit.
- Run one-shot cleanup in try/finally so unexpected exceptions still drain/close memory providers before interpreter shutdown.
- Suppress next-turn memory prefetch for one-shot quiet/human query paths.
run_agent.py
- Honor a one-shot _skip_memory_prefetch_after_turn flag so exit paths do not queue prefetch work for a turn that will never happen.
agent/agent_init.py
- Initialize the prefetch-skip flag on agents.
plugins/memory/hindsight/__init__.py
- Flush partial retained batches on session end and session switch.
- Use append-mode queued/confirmed high-water tracking per document.
- Generate append payloads at writer execution time so backup jobs no-op after earlier success or retry missing suffixes after earlier failure.
- Roll back queued append progress on failed retain jobs when no later queued job covers the same suffix.
- Keep Hindsight cleanup best-effort and idempotent.
Tests
- Add/extend regression coverage for one-shot cleanup, skipped prefetch, partial session-end flushing, session-switch flushing, append-mode delta flushing, immediate-drain append failure rollback, exact-boundary session-end retry, and session-switch fixture coverage for append high-water state.

How to Test

Run targeted regression suite:

scripts/run_tests.sh tests/agent/test_memory_session_switch.py tests/run_agent/test_memory_sync_interrupted.py tests/plugins/memory/test_hindsight_provider.py tests/cli/test_cli_shutdown_memory_messages.py -- -q

Result: 142 tests passed, 0 failed.

Run syntax/lint/safety checks:

git diff --check HEAD~1..HEAD
python -m compileall -q agent/agent_init.py cli.py run_agent.py plugins/memory/hindsight/__init__.py
ruff check tests/agent/test_memory_session_switch.py plugins/memory/hindsight/__init__.py cli.py run_agent.py agent/agent_init.py tests/cli/test_cli_shutdown_memory_messages.py tests/plugins/memory/test_hindsight_provider.py tests/run_agent/test_memory_sync_interrupted.py
python scripts/check-windows-footguns.py --all

Results:

git diff --check: passed
compileall: passed
ruff check: passed; emitted an existing warning about run_agent.py:68 invalid # noqa, but exited successfully
Windows footgun check: passed (No Windows footguns found)

Smoke one-shot CLI exit path:

hermes chat -Q -q 'Reply with exactly PR_READY_REVIEW_OK_4 and nothing else.'

Result: returned PR_READY_REVIEW_OK_4; post-smoke log scan found 0 new matches for the shutdown signatures above in errors.log and agent.log.

Full-suite note:

I also attempted the canonical full scripts/run_tests.sh earlier in this branch. It failed on tests/providers/test_plugin_discovery.py::test_bundled_plugins_discovered because plugins/model-providers/ai-gateway is missing __init__.py. I reproduced that same failure on detached origin/main, so it is a current baseline failure unrelated to this PR.

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
I searched for existing PRs to make sure this isn't a duplicate
My PR contains only changes related to this fix/feature (no unrelated commits)
I've run pytest tests/ -q and all tests pass
I've added tests for my changes (required for bug fixes, strongly encouraged for features)
I've tested on my platform: Ubuntu 24.04 / Linux 6.8.0

Documentation & Housekeeping

I've updated relevant documentation (README, docs/, docstrings) — or N/A
I've updated cli-config.yaml.example if I added/changed config keys — or N/A
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

N/A — this PR does not add a skill.

Screenshots / Logs

Relevant verification output:

=== Summary: 4 files, 142 tests passed, 0 failed ... ===
All checks passed!
✓ No Windows footguns found (537 file(s) scanned).
SMOKE_OUTPUT=PR_READY_REVIEW_OK_4
errors.log: matches_since_cutoff=0
agent.log: matches_since_cutoff=0

Finalize the active CLI session before interpreter teardown so Hindsight can flush buffered retain work while Python is still healthy. Suppress one-shot next-turn memory prefetch during CLI exit and add append-mode delta tracking so final/session-switch flushes do not duplicate already-enqueued turns.

wcctianma · 2026-06-01T16:13:07Z

Thanks for working on this. I hit the same shutdown/data-loss class locally and wanted to add one additional provider-level finding that may be worth folding into this PR or a small follow-up.

I agree that finalizing memory before CLI shutdown is the right lifecycle direction, but I was still able to reproduce short-lived worker/profile loss when the process exits without an explicit provider shutdown path. In that case the ordinary atexit.register(...) drain can still be too late: Python has already started interpreter/thread-pool teardown, and the retain path may fail with variants of:

Hindsight retain failed: cannot schedule new futures after interpreter shutdown
Hindsight retain failed: can't register atexit after shutdown
Hindsight writer did not stop within 10s; abandoning 1 pending retain(s)
Hindsight retain failed: Server disconnected

A provider-side hardening that fixed the short-lived subprocess case for me was:

Prefer CPython threading._register_atexit(self._atexit_shutdown) when available, instead of relying only on normal atexit.register(...).
Before registering that pre-shutdown hook, explicitly import concurrent.futures.thread.
- That causes the executor _python_exit hook to be registered first.
- threading pre-shutdown hooks run LIFO, so the Hindsight drain runs before the thread-pool/executor is marked as shutting down.
- This avoids the retain drain lazily touching asyncio/aiohttp executor machinery after interpreter shutdown has already started.
Keep ordinary atexit.register(...) as the portability fallback for non-CPython / unavailable _register_atexit runtimes.
In shutdown(), join the writer for at least the configured Hindsight request timeout rather than a fixed 10 seconds, so a still-running retain request is not abandoned and then followed immediately by aiohttp client closure.

The local patch was only in:

plugins/memory/hindsight/__init__.py
tests/plugins/memory/test_hindsight_provider.py

and added regression coverage that verifies the provider registers the pre-interpreter-shutdown hook when available, imports concurrent.futures.thread first, and only falls back to normal atexit otherwise.

Validation I ran locally:

python -m pytest tests/plugins/memory/test_hindsight_provider.py::TestShutdownRace::test_registers_pre_interpreter_shutdown_hook_when_available -q -o 'addopts='
# 1 passed

scripts/run_tests.sh tests/plugins/memory/test_hindsight_provider.py
# 102 passed

scripts/run_tests.sh tests/plugins/memory/test_hindsight_provider.py tests/agent/test_memory_provider.py tests/agent/test_memory_session_switch.py
# 190 passed

I also verified it with real short-lived subprocesses for two separate profiles/banks: each subprocess called HindsightMemoryProvider.sync_turn() and then exited without an explicit shutdown(). After the fix, the expected documents appeared in both target Hindsight banks (hermes-repair and hermes-gongwen), so this covered the worker-style short-task case rather than only unit tests.

I did not open a separate PR yet because this issue/PR already tracks the same root failure mode (#15073 / #15497). If useful, I can submit the provider-level piece as a focused follow-up PR, or it could be folded into this PR as defense-in-depth alongside the CLI finalization changes.

stepanov1975 force-pushed the fix/hindsight-one-shot-shutdown branch from b723d28 to 5c81765 Compare May 29, 2026 20:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(hindsight): finalize memory before CLI shutdown#34859

fix(hindsight): finalize memory before CLI shutdown#34859
stepanov1975 wants to merge 1 commit into
NousResearch:mainfrom
stepanov1975:fix/hindsight-one-shot-shutdown

stepanov1975 commented May 29, 2026 •

edited

Loading

Uh oh!

wcctianma commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

stepanov1975 commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Related Issue

Type of Change

Changes Made

How to Test

Checklist

Code

Documentation & Housekeeping

For New Skills

Screenshots / Logs

Uh oh!

wcctianma commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

stepanov1975 commented May 29, 2026 •

edited

Loading