Skip to content

fix(hindsight): finalize memory before CLI shutdown#34859

Open
stepanov1975 wants to merge 1 commit into
NousResearch:mainfrom
stepanov1975:fix/hindsight-one-shot-shutdown
Open

fix(hindsight): finalize memory before CLI shutdown#34859
stepanov1975 wants to merge 1 commit into
NousResearch:mainfrom
stepanov1975:fix/hindsight-one-shot-shutdown

Conversation

@stepanov1975

@stepanov1975 stepanov1975 commented May 29, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes the Hindsight memory shutdown race for one-shot CLI sessions by finalizing the active session before interpreter teardown, then draining/closing memory resources while Python can still schedule async work.

This addresses the failure mode where Hindsight retain work is submitted during atexit/interpreter shutdown and logs errors such as:

Hindsight retain failed: cannot schedule new futures after interpreter shutdown
Unclosed client session
Unclosed connector

It also fixes a related data-loss edge case for retain_every_n_turns > 1: final partial batches are now flushed on session end or session switch, and modern Hindsight append-mode retains track queued vs confirmed high-water marks so failed append writes can be retried without duplicating successful appends.

Related Issue

Fixes #15497
Fixes #15073

Related: #15512 is an open narrower PR for the provider-side shutdown guard. This PR covers the CLI lifecycle/finalization path, final partial-batch flushing, append-mode duplicate prevention, and retry behavior for failed append retains.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • cli.py
    • Add explicit active-session finalization before one-shot CLI exit.
    • Run one-shot cleanup in try/finally so unexpected exceptions still drain/close memory providers before interpreter shutdown.
    • Suppress next-turn memory prefetch for one-shot quiet/human query paths.
  • run_agent.py
    • Honor a one-shot _skip_memory_prefetch_after_turn flag so exit paths do not queue prefetch work for a turn that will never happen.
  • agent/agent_init.py
    • Initialize the prefetch-skip flag on agents.
  • plugins/memory/hindsight/__init__.py
    • Flush partial retained batches on session end and session switch.
    • Use append-mode queued/confirmed high-water tracking per document.
    • Generate append payloads at writer execution time so backup jobs no-op after earlier success or retry missing suffixes after earlier failure.
    • Roll back queued append progress on failed retain jobs when no later queued job covers the same suffix.
    • Keep Hindsight cleanup best-effort and idempotent.
  • Tests
    • Add/extend regression coverage for one-shot cleanup, skipped prefetch, partial session-end flushing, session-switch flushing, append-mode delta flushing, immediate-drain append failure rollback, exact-boundary session-end retry, and session-switch fixture coverage for append high-water state.

How to Test

  1. Run targeted regression suite:
scripts/run_tests.sh tests/agent/test_memory_session_switch.py tests/run_agent/test_memory_sync_interrupted.py tests/plugins/memory/test_hindsight_provider.py tests/cli/test_cli_shutdown_memory_messages.py -- -q

Result: 142 tests passed, 0 failed.

  1. Run syntax/lint/safety checks:
git diff --check HEAD~1..HEAD
python -m compileall -q agent/agent_init.py cli.py run_agent.py plugins/memory/hindsight/__init__.py
ruff check tests/agent/test_memory_session_switch.py plugins/memory/hindsight/__init__.py cli.py run_agent.py agent/agent_init.py tests/cli/test_cli_shutdown_memory_messages.py tests/plugins/memory/test_hindsight_provider.py tests/run_agent/test_memory_sync_interrupted.py
python scripts/check-windows-footguns.py --all

Results:

  • git diff --check: passed
  • compileall: passed
  • ruff check: passed; emitted an existing warning about run_agent.py:68 invalid # noqa, but exited successfully
  • Windows footgun check: passed (No Windows footguns found)
  1. Smoke one-shot CLI exit path:
hermes chat -Q -q 'Reply with exactly PR_READY_REVIEW_OK_4 and nothing else.'

Result: returned PR_READY_REVIEW_OK_4; post-smoke log scan found 0 new matches for the shutdown signatures above in errors.log and agent.log.

  1. Full-suite note:

I also attempted the canonical full scripts/run_tests.sh earlier in this branch. It failed on tests/providers/test_plugin_discovery.py::test_bundled_plugins_discovered because plugins/model-providers/ai-gateway is missing __init__.py. I reproduced that same failure on detached origin/main, so it is a current baseline failure unrelated to this PR.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Ubuntu 24.04 / Linux 6.8.0

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

N/A — this PR does not add a skill.

Screenshots / Logs

Relevant verification output:

=== Summary: 4 files, 142 tests passed, 0 failed ... ===
All checks passed!
✓ No Windows footguns found (537 file(s) scanned).
SMOKE_OUTPUT=PR_READY_REVIEW_OK_4
errors.log: matches_since_cutoff=0
agent.log: matches_since_cutoff=0

Finalize the active CLI session before interpreter teardown so Hindsight can flush buffered retain work while Python is still healthy. Suppress one-shot next-turn memory prefetch during CLI exit and add append-mode delta tracking so final/session-switch flushes do not duplicate already-enqueued turns.
@stepanov1975 stepanov1975 force-pushed the fix/hindsight-one-shot-shutdown branch from b723d28 to 5c81765 Compare May 29, 2026 20:51
@alt-glitch alt-glitch added type/bug Something isn't working P3 Low — cosmetic, nice to have comp/cli CLI entry point, hermes_cli/, setup wizard comp/agent Core agent loop, run_agent.py, prompt builder comp/plugins Plugin system and bundled plugins tool/memory Memory tool and memory providers labels May 29, 2026
@wcctianma

Copy link
Copy Markdown

Thanks for working on this. I hit the same shutdown/data-loss class locally and wanted to add one additional provider-level finding that may be worth folding into this PR or a small follow-up.

I agree that finalizing memory before CLI shutdown is the right lifecycle direction, but I was still able to reproduce short-lived worker/profile loss when the process exits without an explicit provider shutdown path. In that case the ordinary atexit.register(...) drain can still be too late: Python has already started interpreter/thread-pool teardown, and the retain path may fail with variants of:

Hindsight retain failed: cannot schedule new futures after interpreter shutdown
Hindsight retain failed: can't register atexit after shutdown
Hindsight writer did not stop within 10s; abandoning 1 pending retain(s)
Hindsight retain failed: Server disconnected

A provider-side hardening that fixed the short-lived subprocess case for me was:

  1. Prefer CPython threading._register_atexit(self._atexit_shutdown) when available, instead of relying only on normal atexit.register(...).
  2. Before registering that pre-shutdown hook, explicitly import concurrent.futures.thread.
    • That causes the executor _python_exit hook to be registered first.
    • threading pre-shutdown hooks run LIFO, so the Hindsight drain runs before the thread-pool/executor is marked as shutting down.
    • This avoids the retain drain lazily touching asyncio/aiohttp executor machinery after interpreter shutdown has already started.
  3. Keep ordinary atexit.register(...) as the portability fallback for non-CPython / unavailable _register_atexit runtimes.
  4. In shutdown(), join the writer for at least the configured Hindsight request timeout rather than a fixed 10 seconds, so a still-running retain request is not abandoned and then followed immediately by aiohttp client closure.

The local patch was only in:

plugins/memory/hindsight/__init__.py
tests/plugins/memory/test_hindsight_provider.py

and added regression coverage that verifies the provider registers the pre-interpreter-shutdown hook when available, imports concurrent.futures.thread first, and only falls back to normal atexit otherwise.

Validation I ran locally:

python -m pytest tests/plugins/memory/test_hindsight_provider.py::TestShutdownRace::test_registers_pre_interpreter_shutdown_hook_when_available -q -o 'addopts='
# 1 passed

scripts/run_tests.sh tests/plugins/memory/test_hindsight_provider.py
# 102 passed

scripts/run_tests.sh tests/plugins/memory/test_hindsight_provider.py tests/agent/test_memory_provider.py tests/agent/test_memory_session_switch.py
# 190 passed

I also verified it with real short-lived subprocesses for two separate profiles/banks: each subprocess called HindsightMemoryProvider.sync_turn() and then exited without an explicit shutdown(). After the fix, the expected documents appeared in both target Hindsight banks (hermes-repair and hermes-gongwen), so this covered the worker-style short-task case rather than only unit tests.

I did not open a separate PR yet because this issue/PR already tracks the same root failure mode (#15073 / #15497). If useful, I can submit the provider-level piece as a focused follow-up PR, or it could be folded into this PR as defense-in-depth alongside the CLI finalization changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/agent Core agent loop, run_agent.py, prompt builder comp/cli CLI entry point, hermes_cli/, setup wizard comp/plugins Plugin system and bundled plugins P3 Low — cosmetic, nice to have tool/memory Memory tool and memory providers type/bug Something isn't working

Projects

None yet

3 participants