fix(hindsight): finalize memory before CLI shutdown #34859
Finalize the active CLI session before interpreter teardown so Hindsight can flush buffered retain work while Python is still healthy. Suppress one-shot next-turn memory prefetch during CLI exit and add append-mode delta tracking so final/session-switch flushes do not duplicate already-enqueued turns.
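To illustrate the prefetch suppression described above, here is a minimal sketch. Only the flag name `_skip_memory_prefetch_after_turn` comes from this PR; the `Agent` class, `_queue_prefetch`, `run_turn`, and `run_one_shot` are hypothetical stand-ins and do not mirror the real `run_agent.py` code.

```python
class Agent:
    """Hypothetical stand-in for the real agent; only the flag name is from the PR."""

    def __init__(self):
        # When True, the turn that is about to run will not queue a prefetch.
        self._skip_memory_prefetch_after_turn = False
        self.prefetch_requests = []

    def _queue_prefetch(self, user_message):
        # Assumption: a real provider would schedule background recall work here.
        self.prefetch_requests.append(user_message)

    def run_turn(self, user_message):
        reply = f"echo: {user_message}"
        # Prefetch only helps if another turn follows this one.
        if not self._skip_memory_prefetch_after_turn:
            self._queue_prefetch(user_message)
        return reply


def run_one_shot(agent, prompt):
    """Run a single prompt on an exit path where no next turn will happen."""
    agent._skip_memory_prefetch_after_turn = True
    return agent.run_turn(prompt)
```

The point of the flag in this sketch is that the exit path decides up front that no follow-up turn exists, so no background work is left pending when the process starts shutting down.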
b723d28 to 5c81765
Thanks for working on this. I hit the same shutdown/data-loss class locally and wanted to add one additional provider-level finding that may be worth folding into this PR or a small follow-up.
I agree that finalizing memory before CLI shutdown is the right lifecycle direction, but I was still able to reproduce short-lived worker/profile loss when the process exits without an explicit provider shutdown path. In that case the ordinary
A provider-side hardening that fixed the short-lived subprocess case for me was:
The local patch was only in: and added regression coverage that verifies the provider registers the pre-interpreter-shutdown hook when available, imports
Validation I ran locally:
I also verified it with real short-lived subprocesses for two separate profiles/banks: each subprocess called
I did not open a separate PR yet because this issue/PR already tracks the same root failure mode (#15073 / #15497). If useful, I can submit the provider-level piece as a focused follow-up PR, or it could be folded into this PR as defense-in-depth alongside the CLI finalization changes.
What does this PR do?
Fixes the Hindsight memory shutdown race for one-shot CLI sessions by finalizing the active session before interpreter teardown, then draining/closing memory resources while Python can still schedule async work.
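As a rough illustration of the lifecycle ordering described above, the sketch below drains and closes a memory provider inside `try/finally` rather than leaving the flush to interpreter shutdown. `MemoryProvider`, `drain`, `close`, and `run_cli` are hypothetical names chosen for the example; they are assumptions, not the actual `cli.py` or Hindsight plugin API.

```python
import asyncio


class MemoryProvider:
    """Hypothetical provider that buffers retain work until it is drained."""

    def __init__(self):
        self.buffer = []
        self.stored = []
        self.closed = False

    def retain(self, item):
        if self.closed:
            raise RuntimeError("provider already closed")
        self.buffer.append(item)

    async def drain(self):
        # Flush buffered items in order; a real provider would await I/O here.
        while self.buffer:
            await asyncio.sleep(0)
            self.stored.append(self.buffer.pop(0))

    async def close(self):
        await self.drain()
        self.closed = True


def run_cli(provider, turns, fail=False):
    """Run a session and always finalize memory before returning."""
    try:
        for turn in turns:
            provider.retain(turn)
        if fail:
            raise ValueError("simulated unexpected error")
    finally:
        # This runs before interpreter shutdown begins, so asyncio can still
        # create an event loop and schedule the flush.
        asyncio.run(provider.close())
```

In this sketch the `finally` block also covers the unexpected-exception path, so buffered retains are flushed even when the session ends with an error.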
This addresses the failure mode where Hindsight retain work is submitted during atexit/interpreter shutdown and logs errors such as:
It also fixes a related data-loss edge case for retain_every_n_turns > 1: final partial batches are now flushed on session end or session switch, and modern Hindsight append-mode retains track queued vs confirmed high-water marks so failed append writes can be retried without duplicating successful appends.
Related Issue
Fixes #15497
Fixes #15073
Related: #15512 is an open narrower PR for the provider-side shutdown guard. This PR covers the CLI lifecycle/finalization path, final partial-batch flushing, append-mode duplicate prevention, and retry behavior for failed append retains.
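The partial-batch flushing and append-mode delta tracking mentioned above can be sketched roughly as follows. This is a simplified, synchronous model with one append in flight at a time. `AppendRetainTracker`, `send`, `queued`, and `confirmed` are hypothetical names for illustration, and only `retain_every_n_turns` comes from this PR.

```python
class AppendRetainTracker:
    """Hypothetical tracker for append-mode retains (names are illustrative)."""

    def __init__(self, send, retain_every_n_turns=1):
        self.send = send                    # callable(list_of_turns) -> bool
        self.every = retain_every_n_turns
        self.turns = []
        self.queued = 0      # high-water mark of turns handed to send()
        self.confirmed = 0   # high-water mark of turns the backend accepted

    def add_turn(self, turn):
        self.turns.append(turn)
        # Flush only once a full batch of not-yet-queued turns has accumulated.
        if len(self.turns) - self.queued >= self.every:
            self.flush()

    def flush(self):
        """Send only the delta past the queued mark.

        Also called on session end or session switch to flush a partial batch.
        """
        delta = self.turns[self.queued:]
        if not delta:
            return True
        start, self.queued = self.queued, len(self.turns)
        if self.send(delta):
            self.confirmed = self.queued
            return True
        # Roll back so the failed turns are part of the next delta.
        self.queued = start
        return False
```

Calling `flush()` at session end sends whatever partial batch remains, and calling it again is a no-op because the queued mark already covers every turn. A failed append leaves `confirmed` untouched, so the retry resends only turns that were never acknowledged.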
Type of Change
Changes Made
cli.py
try/finally so unexpected exceptions still drain/close memory providers before interpreter shutdown.
run_agent.py
_skip_memory_prefetch_after_turn flag so exit paths do not queue prefetch work for a turn that will never happen.
agent/agent_init.py
plugins/memory/hindsight/__init__.py
How to Test
Result: 142 tests passed, 0 failed.
Results:
git diff --check: passed
compileall: passed
ruff check: passed; emitted an existing warning about run_agent.py:68 invalid # noqa, but exited successfully
No Windows footguns found)
hermes chat -Q -q 'Reply with exactly PR_READY_REVIEW_OK_4 and nothing else.'
Result: returned PR_READY_REVIEW_OK_4; post-smoke log scan found 0 new matches for the shutdown signatures above in errors.log and agent.log.
I also attempted the canonical full scripts/run_tests.sh earlier in this branch. It failed on tests/providers/test_plugin_discovery.py::test_bundled_plugins_discovered because plugins/model-providers/ai-gateway is missing __init__.py. I reproduced that same failure on detached origin/main, so it is a current baseline failure unrelated to this PR.
Checklist
Code
fix(scope):, feat(scope):, etc.)
pytest tests/ -q and all tests pass
Documentation & Housekeeping
docs/, docstrings), or N/A
cli-config.yaml.example if I added/changed config keys, or N/A
CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows, or N/A
For New Skills
N/A — this PR does not add a skill.
Screenshots / Logs
Relevant verification output: