fix(docker): recover from out-of-band container removal in persistent mode#36631
Closed
annguyenNous wants to merge 1 commit into
Closed
Conversation
… mode When a persistent Docker sandbox container is removed out-of-band (idle reaper, docker prune, OOM kill, daemon restart), the gateway keeps issuing 'docker exec' against the dead container ID, returning 'No such container' on every subsequent tool call. The agent is permanently blocked until the gateway process is restarted. This fix adds automatic recovery: when execute() detects a 'No such container' or 'is not running' error in a persistent session, it invalidates the cached container handle, probes for a reusable container via labels, and falls back to creating a fresh one — then re-initializes the session snapshot and retries the original command transparently. Fixes NousResearch#36266
benbarclay
added a commit
that referenced
this pull request
Jun 5, 2026
… mode (salvage #36631) (#39415) Salvage of #36631 (@annguyenNous), rebased onto current main with regression tests added. Fixes #36266. When a persistent Docker sandbox container is removed out-of-band (idle reaper, `docker prune`, OOM kill, daemon restart), the gateway kept issuing `docker exec` against the dead container ID, returning "No such container" on every subsequent tool call — the agent was permanently blocked until the gateway process restarted. DockerEnvironment.execute() now detects the "No such container" / "is not running" error after a non-zero exit (gated on persist_across_processes) and calls _recreate_container(): it tries label-based reuse first, falls back to a fresh container replaying the same image + full all_run_args set, re-runs init_session(), and retries the command once. A genuine non-zero exit is NOT misclassified as container-gone. Differs from #36631 as submitted: adds the tests the original lacked. tests/tools/test_docker_environment.py covers _is_container_gone pattern matching (incl. the negative/control case), the recover-and-retry path, the persist_across_processes=False opt-out (no recovery), and the ordinary-failure passthrough (no spurious recreation). _make_dummy_env now forwards persist_across_processes. Verified: - Unit: 67/67 in test_docker_environment.py (4 new + existing). - Live E2E against the real docker daemon: started a persistent container, `docker rm -f`'d it out-of-band, and the next execute() transparently recreated a fresh container and succeeded; a follow-up command worked in the recovered container; a real `exit N` passed through without triggering recovery. Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>
davidgut1982
pushed a commit
to davidgut1982/hermes-agent
that referenced
this pull request
Jun 5, 2026
… mode (salvage NousResearch#36631) (NousResearch#39415) Salvage of NousResearch#36631 (@annguyenNous), rebased onto current main with regression tests added. Fixes NousResearch#36266. When a persistent Docker sandbox container is removed out-of-band (idle reaper, `docker prune`, OOM kill, daemon restart), the gateway kept issuing `docker exec` against the dead container ID, returning "No such container" on every subsequent tool call — the agent was permanently blocked until the gateway process restarted. DockerEnvironment.execute() now detects the "No such container" / "is not running" error after a non-zero exit (gated on persist_across_processes) and calls _recreate_container(): it tries label-based reuse first, falls back to a fresh container replaying the same image + full all_run_args set, re-runs init_session(), and retries the command once. A genuine non-zero exit is NOT misclassified as container-gone. Differs from NousResearch#36631 as submitted: adds the tests the original lacked. tests/tools/test_docker_environment.py covers _is_container_gone pattern matching (incl. the negative/control case), the recover-and-retry path, the persist_across_processes=False opt-out (no recovery), and the ordinary-failure passthrough (no spurious recreation). _make_dummy_env now forwards persist_across_processes. Verified: - Unit: 67/67 in test_docker_environment.py (4 new + existing). - Live E2E against the real docker daemon: started a persistent container, `docker rm -f`'d it out-of-band, and the next execute() transparently recreated a fresh container and succeeded; a follow-up command worked in the recovered container; a real `exit N` passed through without triggering recovery. Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>
changman
pushed a commit
to changman/hermes-agent
that referenced
this pull request
Jun 10, 2026
… mode (salvage NousResearch#36631) (NousResearch#39415) Salvage of NousResearch#36631 (@annguyenNous), rebased onto current main with regression tests added. Fixes NousResearch#36266. When a persistent Docker sandbox container is removed out-of-band (idle reaper, `docker prune`, OOM kill, daemon restart), the gateway kept issuing `docker exec` against the dead container ID, returning "No such container" on every subsequent tool call — the agent was permanently blocked until the gateway process restarted. DockerEnvironment.execute() now detects the "No such container" / "is not running" error after a non-zero exit (gated on persist_across_processes) and calls _recreate_container(): it tries label-based reuse first, falls back to a fresh container replaying the same image + full all_run_args set, re-runs init_session(), and retries the command once. A genuine non-zero exit is NOT misclassified as container-gone. Differs from NousResearch#36631 as submitted: adds the tests the original lacked. tests/tools/test_docker_environment.py covers _is_container_gone pattern matching (incl. the negative/control case), the recover-and-retry path, the persist_across_processes=False opt-out (no recovery), and the ordinary-failure passthrough (no spurious recreation). _make_dummy_env now forwards persist_across_processes. Verified: - Unit: 67/67 in test_docker_environment.py (4 new + existing). - Live E2E against the real docker daemon: started a persistent container, `docker rm -f`'d it out-of-band, and the next execute() transparently recreated a fresh container and succeeded; a follow-up command worked in the recovered container; a real `exit N` passed through without triggering recovery. Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
When a persistent Docker sandbox container is removed out-of-band (idle reaper,
docker prune, OOM kill, daemon restart), the gateway keeps issuingdocker execagainst the dead container ID, returning "No such container" on every subsequent tool call. The agent is permanently blocked until the gateway process is restarted.Reproduced with:
terminal.backend: docker,container_persistent: truedocker rm -f hermes-XXXXterminal()calls return the same error indefinitelyFix
Added automatic recovery in
DockerEnvironment.execute():execute()checks the output for"No such container"or"is not running"patterns after any non-zero exit_recreate_container():Only activates for
persist_across_processes=Truesessions (one-shot containers don't need recovery).Tests
All 63 existing docker environment tests pass:
Fixes #36266