fix(docker): recover from out-of-band container removal in persistent mode (salvage #36631) by benbarclay · Pull Request #39415 · NousResearch/hermes-agent

benbarclay · 2026-06-05T00:11:03Z

Summary

Salvage of #36631 (@annguyenNous), rebased onto current main with the regression tests the original lacked. Fixes #36266.

When a persistent Docker sandbox container is removed out-of-band (idle reaper, docker prune, OOM kill, daemon restart), the gateway kept issuing docker exec against the dead container ID, returning "No such container" on every subsequent tool call — the agent was permanently blocked until the gateway process restarted.

DockerEnvironment.execute() now detects the No such container / is not running error after a non-zero exit (gated on persist_across_processes) and calls _recreate_container(): tries label-based reuse first, falls back to a fresh container replaying the same image + full all_run_args set, re-runs init_session(), and retries the command once. A genuine non-zero exit is not misclassified as container-gone.

Differences from #36631 as submitted

The original had no tests (contributing guide requires them for bug fixes). This salvage adds tests/tools/test_docker_environment.py coverage:

_is_container_gone pattern matching, including the negative/control case (an ordinary command failure must not match)
the detect → recreate → retry success path
the persist_across_processes=False opt-out (recovery must not run)
the ordinary-failure passthrough (no spurious recreation on a real non-zero exit)

_make_dummy_env now forwards persist_across_processes. The fix itself is @annguyenNous's, unchanged in behavior.

Verification

Unit: 67/67 in test_docker_environment.py (4 new + all existing).
Live E2E against the real docker daemon: started a persistent container, docker rm -f'd it out-of-band, and the next execute() transparently recreated a fresh container and succeeded; a follow-up command worked in the recovered container; a real exit N passed through without triggering recovery.

Closes #36631. Fixes #36266.

@annguyenNous

… mode (salvage #36631) Salvage of #36631 (@annguyenNous), rebased onto current main with regression tests added. Fixes #36266. When a persistent Docker sandbox container is removed out-of-band (idle reaper, `docker prune`, OOM kill, daemon restart), the gateway kept issuing `docker exec` against the dead container ID, returning "No such container" on every subsequent tool call — the agent was permanently blocked until the gateway process restarted. DockerEnvironment.execute() now detects the "No such container" / "is not running" error after a non-zero exit (gated on persist_across_processes) and calls _recreate_container(): it tries label-based reuse first, falls back to a fresh container replaying the same image + full all_run_args set, re-runs init_session(), and retries the command once. A genuine non-zero exit is NOT misclassified as container-gone. Differs from #36631 as submitted: adds the tests the original lacked. tests/tools/test_docker_environment.py covers _is_container_gone pattern matching (incl. the negative/control case), the recover-and-retry path, the persist_across_processes=False opt-out (no recovery), and the ordinary-failure passthrough (no spurious recreation). _make_dummy_env now forwards persist_across_processes. Verified: - Unit: 67/67 in test_docker_environment.py (4 new + existing). - Live E2E against the real docker daemon: started a persistent container, `docker rm -f`'d it out-of-band, and the next execute() transparently recreated a fresh container and succeeded; a follow-up command worked in the recovered container; a real `exit N` passed through without triggering recovery. Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>

github-actions · 2026-06-05T00:11:52Z

🔎 Lint report: `fix/docker-recover-out-of-band-removal-salvage-36631` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9840 on HEAD, 9840 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 5102 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Key upstream changes: - feat(delegation): uncap max_spawn_depth (NousResearch#39772) - feat(discord): voice-channel mixer with ambient idle bed (NousResearch#39659) - fix(terminal): guard os.getcwd() against deleted CWD (NousResearch#39275) - fix(telegram): finalize sealed overflow chunk formatting - fix(gateway): tolerate non-UTF-8 status/pid files - fix(mcp): widen shutdown exception guard to BaseException - fix(vision): detect vision-capable custom providers - fix(docker): recover from out-of-band container removal (NousResearch#39415) - desktop drag sessions, OAuth remote connect fixes - react-router 7.17.0 security bump (GHSA-8x6r-g9mw-2r78) Our patches preserved: - claude_session tool (17 files) - TG pool exhaustion 4-layer defense - SKILL.md v4.5 1 conflict resolved: tools/terminal_tool.py (took upstream _safe_getcwd with TERMINAL_CWD fallback)

@annguyenNous

… mode (salvage NousResearch#36631) (NousResearch#39415) Salvage of NousResearch#36631 (@annguyenNous), rebased onto current main with regression tests added. Fixes NousResearch#36266. When a persistent Docker sandbox container is removed out-of-band (idle reaper, `docker prune`, OOM kill, daemon restart), the gateway kept issuing `docker exec` against the dead container ID, returning "No such container" on every subsequent tool call — the agent was permanently blocked until the gateway process restarted. DockerEnvironment.execute() now detects the "No such container" / "is not running" error after a non-zero exit (gated on persist_across_processes) and calls _recreate_container(): it tries label-based reuse first, falls back to a fresh container replaying the same image + full all_run_args set, re-runs init_session(), and retries the command once. A genuine non-zero exit is NOT misclassified as container-gone. Differs from NousResearch#36631 as submitted: adds the tests the original lacked. tests/tools/test_docker_environment.py covers _is_container_gone pattern matching (incl. the negative/control case), the recover-and-retry path, the persist_across_processes=False opt-out (no recovery), and the ordinary-failure passthrough (no spurious recreation). _make_dummy_env now forwards persist_across_processes. Verified: - Unit: 67/67 in test_docker_environment.py (4 new + existing). - Live E2E against the real docker daemon: started a persistent container, `docker rm -f`'d it out-of-band, and the next execute() transparently recreated a fresh container and succeeded; a follow-up command worked in the recovered container; a real `exit N` passed through without triggering recovery. Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>

@annguyenNous

… mode (salvage NousResearch#36631) (NousResearch#39415) Salvage of NousResearch#36631 (@annguyenNous), rebased onto current main with regression tests added. Fixes NousResearch#36266. When a persistent Docker sandbox container is removed out-of-band (idle reaper, `docker prune`, OOM kill, daemon restart), the gateway kept issuing `docker exec` against the dead container ID, returning "No such container" on every subsequent tool call — the agent was permanently blocked until the gateway process restarted. DockerEnvironment.execute() now detects the "No such container" / "is not running" error after a non-zero exit (gated on persist_across_processes) and calls _recreate_container(): it tries label-based reuse first, falls back to a fresh container replaying the same image + full all_run_args set, re-runs init_session(), and retries the command once. A genuine non-zero exit is NOT misclassified as container-gone. Differs from NousResearch#36631 as submitted: adds the tests the original lacked. tests/tools/test_docker_environment.py covers _is_container_gone pattern matching (incl. the negative/control case), the recover-and-retry path, the persist_across_processes=False opt-out (no recovery), and the ordinary-failure passthrough (no spurious recreation). _make_dummy_env now forwards persist_across_processes. Verified: - Unit: 67/67 in test_docker_environment.py (4 new + existing). - Live E2E against the real docker daemon: started a persistent container, `docker rm -f`'d it out-of-band, and the next execute() transparently recreated a fresh container and succeeded; a follow-up command worked in the recovered container; a real `exit N` passed through without triggering recovery. Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>

benbarclay merged commit 8a88844 into main Jun 5, 2026
23 checks passed

benbarclay deleted the fix/docker-recover-out-of-band-removal-salvage-36631 branch June 5, 2026 00:33

mrparker0980 mentioned this pull request Jun 7, 2026

fix(docker): confirm container is gone before recovery re-runs the command #41378

Open

13 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(docker): recover from out-of-band container removal in persistent mode (salvage #36631)#39415

fix(docker): recover from out-of-band container removal in persistent mode (salvage #36631)#39415
benbarclay merged 1 commit into
mainfrom
fix/docker-recover-out-of-band-removal-salvage-36631

benbarclay commented Jun 5, 2026

Uh oh!

github-actions Bot commented Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

benbarclay commented Jun 5, 2026

Summary

Differences from #36631 as submitted

Verification

Uh oh!

github-actions Bot commented Jun 5, 2026

🔎 Lint report: fix/docker-recover-out-of-band-removal-salvage-36631 vs origin/main

ruff

ty (type checker)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

🔎 Lint report: `fix/docker-recover-out-of-band-removal-salvage-36631` vs `origin/main`