Skip to content

fix(docker): recover from out-of-band container removal in persistent mode (salvage #36631)#39415

Merged
benbarclay merged 1 commit into
mainfrom
fix/docker-recover-out-of-band-removal-salvage-36631
Jun 5, 2026
Merged

fix(docker): recover from out-of-band container removal in persistent mode (salvage #36631)#39415
benbarclay merged 1 commit into
mainfrom
fix/docker-recover-out-of-band-removal-salvage-36631

Conversation

@benbarclay

Copy link
Copy Markdown
Collaborator

Summary

Salvage of #36631 (@annguyenNous), rebased onto current main with the regression tests the original lacked. Fixes #36266.

When a persistent Docker sandbox container is removed out-of-band (idle reaper, docker prune, OOM kill, daemon restart), the gateway kept issuing docker exec against the dead container ID, returning "No such container" on every subsequent tool call — the agent was permanently blocked until the gateway process restarted.

DockerEnvironment.execute() now detects the No such container / is not running error after a non-zero exit (gated on persist_across_processes) and calls _recreate_container(): tries label-based reuse first, falls back to a fresh container replaying the same image + full all_run_args set, re-runs init_session(), and retries the command once. A genuine non-zero exit is not misclassified as container-gone.

Differences from #36631 as submitted

The original had no tests (contributing guide requires them for bug fixes). This salvage adds tests/tools/test_docker_environment.py coverage:

  • _is_container_gone pattern matching, including the negative/control case (an ordinary command failure must not match)
  • the detect → recreate → retry success path
  • the persist_across_processes=False opt-out (recovery must not run)
  • the ordinary-failure passthrough (no spurious recreation on a real non-zero exit)

_make_dummy_env now forwards persist_across_processes. The fix itself is @annguyenNous's, unchanged in behavior.

Verification

  • Unit: 67/67 in test_docker_environment.py (4 new + all existing).
  • Live E2E against the real docker daemon: started a persistent container, docker rm -f'd it out-of-band, and the next execute() transparently recreated a fresh container and succeeded; a follow-up command worked in the recovered container; a real exit N passed through without triggering recovery.

Closes #36631. Fixes #36266.

… mode (salvage #36631)

Salvage of #36631 (@annguyenNous), rebased onto current main with
regression tests added. Fixes #36266.

When a persistent Docker sandbox container is removed out-of-band (idle
reaper, `docker prune`, OOM kill, daemon restart), the gateway kept
issuing `docker exec` against the dead container ID, returning
"No such container" on every subsequent tool call — the agent was
permanently blocked until the gateway process restarted.

DockerEnvironment.execute() now detects the "No such container" /
"is not running" error after a non-zero exit (gated on
persist_across_processes) and calls _recreate_container(): it tries
label-based reuse first, falls back to a fresh container replaying the
same image + full all_run_args set, re-runs init_session(), and retries
the command once. A genuine non-zero exit is NOT misclassified as
container-gone.

Differs from #36631 as submitted: adds the tests the original lacked.
tests/tools/test_docker_environment.py covers _is_container_gone pattern
matching (incl. the negative/control case), the recover-and-retry path,
the persist_across_processes=False opt-out (no recovery), and the
ordinary-failure passthrough (no spurious recreation). _make_dummy_env
now forwards persist_across_processes.

Verified:
- Unit: 67/67 in test_docker_environment.py (4 new + existing).
- Live E2E against the real docker daemon: started a persistent
  container, `docker rm -f`'d it out-of-band, and the next execute()
  transparently recreated a fresh container and succeeded; a follow-up
  command worked in the recovered container; a real `exit N` passed
  through without triggering recovery.

Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

🔎 Lint report: fix/docker-recover-out-of-band-removal-salvage-36631 vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9840 on HEAD, 9840 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 5102 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

@benbarclay benbarclay merged commit 8a88844 into main Jun 5, 2026
23 checks passed
@benbarclay benbarclay deleted the fix/docker-recover-out-of-band-removal-salvage-36631 branch June 5, 2026 00:33
NelsonLongxiang pushed a commit to NelsonLongxiang/hermes-agent that referenced this pull request Jun 5, 2026
Key upstream changes:
- feat(delegation): uncap max_spawn_depth (NousResearch#39772)
- feat(discord): voice-channel mixer with ambient idle bed (NousResearch#39659)
- fix(terminal): guard os.getcwd() against deleted CWD (NousResearch#39275)
- fix(telegram): finalize sealed overflow chunk formatting
- fix(gateway): tolerate non-UTF-8 status/pid files
- fix(mcp): widen shutdown exception guard to BaseException
- fix(vision): detect vision-capable custom providers
- fix(docker): recover from out-of-band container removal (NousResearch#39415)
- desktop drag sessions, OAuth remote connect fixes
- react-router 7.17.0 security bump (GHSA-8x6r-g9mw-2r78)

Our patches preserved:
- claude_session tool (17 files)
- TG pool exhaustion 4-layer defense
- SKILL.md v4.5

1 conflict resolved: tools/terminal_tool.py (took upstream _safe_getcwd with TERMINAL_CWD fallback)
davidgut1982 pushed a commit to davidgut1982/hermes-agent that referenced this pull request Jun 5, 2026
… mode (salvage NousResearch#36631) (NousResearch#39415)

Salvage of NousResearch#36631 (@annguyenNous), rebased onto current main with
regression tests added. Fixes NousResearch#36266.

When a persistent Docker sandbox container is removed out-of-band (idle
reaper, `docker prune`, OOM kill, daemon restart), the gateway kept
issuing `docker exec` against the dead container ID, returning
"No such container" on every subsequent tool call — the agent was
permanently blocked until the gateway process restarted.

DockerEnvironment.execute() now detects the "No such container" /
"is not running" error after a non-zero exit (gated on
persist_across_processes) and calls _recreate_container(): it tries
label-based reuse first, falls back to a fresh container replaying the
same image + full all_run_args set, re-runs init_session(), and retries
the command once. A genuine non-zero exit is NOT misclassified as
container-gone.

Differs from NousResearch#36631 as submitted: adds the tests the original lacked.
tests/tools/test_docker_environment.py covers _is_container_gone pattern
matching (incl. the negative/control case), the recover-and-retry path,
the persist_across_processes=False opt-out (no recovery), and the
ordinary-failure passthrough (no spurious recreation). _make_dummy_env
now forwards persist_across_processes.

Verified:
- Unit: 67/67 in test_docker_environment.py (4 new + existing).
- Live E2E against the real docker daemon: started a persistent
  container, `docker rm -f`'d it out-of-band, and the next execute()
  transparently recreated a fresh container and succeeded; a follow-up
  command worked in the recovered container; a real `exit N` passed
  through without triggering recovery.

Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>
changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
… mode (salvage NousResearch#36631) (NousResearch#39415)

Salvage of NousResearch#36631 (@annguyenNous), rebased onto current main with
regression tests added. Fixes NousResearch#36266.

When a persistent Docker sandbox container is removed out-of-band (idle
reaper, `docker prune`, OOM kill, daemon restart), the gateway kept
issuing `docker exec` against the dead container ID, returning
"No such container" on every subsequent tool call — the agent was
permanently blocked until the gateway process restarted.

DockerEnvironment.execute() now detects the "No such container" /
"is not running" error after a non-zero exit (gated on
persist_across_processes) and calls _recreate_container(): it tries
label-based reuse first, falls back to a fresh container replaying the
same image + full all_run_args set, re-runs init_session(), and retries
the command once. A genuine non-zero exit is NOT misclassified as
container-gone.

Differs from NousResearch#36631 as submitted: adds the tests the original lacked.
tests/tools/test_docker_environment.py covers _is_container_gone pattern
matching (incl. the negative/control case), the recover-and-retry path,
the persist_across_processes=False opt-out (no recovery), and the
ordinary-failure passthrough (no spurious recreation). _make_dummy_env
now forwards persist_across_processes.

Verified:
- Unit: 67/67 in test_docker_environment.py (4 new + existing).
- Live E2E against the real docker daemon: started a persistent
  container, `docker rm -f`'d it out-of-band, and the next execute()
  transparently recreated a fresh container and succeeded; a follow-up
  command worked in the recovered container; a real `exit N` passed
  through without triggering recovery.

Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Persistent docker sandbox: gateway loops on No such container when the pinned container is removed out-of-band (never re-spawns)

1 participant