fix(docker): recover from out-of-band container removal in persistent mode (salvage #36631)#39415
Merged
benbarclay merged 1 commit intoJun 5, 2026
Conversation
… mode (salvage #36631) Salvage of #36631 (@annguyenNous), rebased onto current main with regression tests added. Fixes #36266. When a persistent Docker sandbox container is removed out-of-band (idle reaper, `docker prune`, OOM kill, daemon restart), the gateway kept issuing `docker exec` against the dead container ID, returning "No such container" on every subsequent tool call — the agent was permanently blocked until the gateway process restarted. DockerEnvironment.execute() now detects the "No such container" / "is not running" error after a non-zero exit (gated on persist_across_processes) and calls _recreate_container(): it tries label-based reuse first, falls back to a fresh container replaying the same image + full all_run_args set, re-runs init_session(), and retries the command once. A genuine non-zero exit is NOT misclassified as container-gone. Differs from #36631 as submitted: adds the tests the original lacked. tests/tools/test_docker_environment.py covers _is_container_gone pattern matching (incl. the negative/control case), the recover-and-retry path, the persist_across_processes=False opt-out (no recovery), and the ordinary-failure passthrough (no spurious recreation). _make_dummy_env now forwards persist_across_processes. Verified: - Unit: 67/67 in test_docker_environment.py (4 new + existing). - Live E2E against the real docker daemon: started a persistent container, `docker rm -f`'d it out-of-band, and the next execute() transparently recreated a fresh container and succeeded; a follow-up command worked in the recovered container; a real `exit N` passed through without triggering recovery. Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>
Contributor
🔎 Lint report:
|
NelsonLongxiang
pushed a commit
to NelsonLongxiang/hermes-agent
that referenced
this pull request
Jun 5, 2026
Key upstream changes: - feat(delegation): uncap max_spawn_depth (NousResearch#39772) - feat(discord): voice-channel mixer with ambient idle bed (NousResearch#39659) - fix(terminal): guard os.getcwd() against deleted CWD (NousResearch#39275) - fix(telegram): finalize sealed overflow chunk formatting - fix(gateway): tolerate non-UTF-8 status/pid files - fix(mcp): widen shutdown exception guard to BaseException - fix(vision): detect vision-capable custom providers - fix(docker): recover from out-of-band container removal (NousResearch#39415) - desktop drag sessions, OAuth remote connect fixes - react-router 7.17.0 security bump (GHSA-8x6r-g9mw-2r78) Our patches preserved: - claude_session tool (17 files) - TG pool exhaustion 4-layer defense - SKILL.md v4.5 1 conflict resolved: tools/terminal_tool.py (took upstream _safe_getcwd with TERMINAL_CWD fallback)
davidgut1982
pushed a commit
to davidgut1982/hermes-agent
that referenced
this pull request
Jun 5, 2026
… mode (salvage NousResearch#36631) (NousResearch#39415) Salvage of NousResearch#36631 (@annguyenNous), rebased onto current main with regression tests added. Fixes NousResearch#36266. When a persistent Docker sandbox container is removed out-of-band (idle reaper, `docker prune`, OOM kill, daemon restart), the gateway kept issuing `docker exec` against the dead container ID, returning "No such container" on every subsequent tool call — the agent was permanently blocked until the gateway process restarted. DockerEnvironment.execute() now detects the "No such container" / "is not running" error after a non-zero exit (gated on persist_across_processes) and calls _recreate_container(): it tries label-based reuse first, falls back to a fresh container replaying the same image + full all_run_args set, re-runs init_session(), and retries the command once. A genuine non-zero exit is NOT misclassified as container-gone. Differs from NousResearch#36631 as submitted: adds the tests the original lacked. tests/tools/test_docker_environment.py covers _is_container_gone pattern matching (incl. the negative/control case), the recover-and-retry path, the persist_across_processes=False opt-out (no recovery), and the ordinary-failure passthrough (no spurious recreation). _make_dummy_env now forwards persist_across_processes. Verified: - Unit: 67/67 in test_docker_environment.py (4 new + existing). - Live E2E against the real docker daemon: started a persistent container, `docker rm -f`'d it out-of-band, and the next execute() transparently recreated a fresh container and succeeded; a follow-up command worked in the recovered container; a real `exit N` passed through without triggering recovery. Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>
13 tasks
changman
pushed a commit
to changman/hermes-agent
that referenced
this pull request
Jun 10, 2026
… mode (salvage NousResearch#36631) (NousResearch#39415) Salvage of NousResearch#36631 (@annguyenNous), rebased onto current main with regression tests added. Fixes NousResearch#36266. When a persistent Docker sandbox container is removed out-of-band (idle reaper, `docker prune`, OOM kill, daemon restart), the gateway kept issuing `docker exec` against the dead container ID, returning "No such container" on every subsequent tool call — the agent was permanently blocked until the gateway process restarted. DockerEnvironment.execute() now detects the "No such container" / "is not running" error after a non-zero exit (gated on persist_across_processes) and calls _recreate_container(): it tries label-based reuse first, falls back to a fresh container replaying the same image + full all_run_args set, re-runs init_session(), and retries the command once. A genuine non-zero exit is NOT misclassified as container-gone. Differs from NousResearch#36631 as submitted: adds the tests the original lacked. tests/tools/test_docker_environment.py covers _is_container_gone pattern matching (incl. the negative/control case), the recover-and-retry path, the persist_across_processes=False opt-out (no recovery), and the ordinary-failure passthrough (no spurious recreation). _make_dummy_env now forwards persist_across_processes. Verified: - Unit: 67/67 in test_docker_environment.py (4 new + existing). - Live E2E against the real docker daemon: started a persistent container, `docker rm -f`'d it out-of-band, and the next execute() transparently recreated a fresh container and succeeded; a follow-up command worked in the recovered container; a real `exit N` passed through without triggering recovery. Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Salvage of #36631 (@annguyenNous), rebased onto current main with the regression tests the original lacked. Fixes #36266.
When a persistent Docker sandbox container is removed out-of-band (idle reaper,
docker prune, OOM kill, daemon restart), the gateway kept issuingdocker execagainst the dead container ID, returning "No such container" on every subsequent tool call — the agent was permanently blocked until the gateway process restarted.DockerEnvironment.execute()now detects theNo such container/is not runningerror after a non-zero exit (gated onpersist_across_processes) and calls_recreate_container(): tries label-based reuse first, falls back to a fresh container replaying the same image + fullall_run_argsset, re-runsinit_session(), and retries the command once. A genuine non-zero exit is not misclassified as container-gone.Differences from #36631 as submitted
The original had no tests (contributing guide requires them for bug fixes). This salvage adds
tests/tools/test_docker_environment.pycoverage:_is_container_gonepattern matching, including the negative/control case (an ordinary command failure must not match)persist_across_processes=Falseopt-out (recovery must not run)_make_dummy_envnow forwardspersist_across_processes. The fix itself is @annguyenNous's, unchanged in behavior.Verification
test_docker_environment.py(4 new + all existing).docker rm -f'd it out-of-band, and the nextexecute()transparently recreated a fresh container and succeeded; a follow-up command worked in the recovered container; a realexit Npassed through without triggering recovery.Closes #36631. Fixes #36266.