fix(docker): recover from out-of-band container removal in persistent mode #36631

Closed
annguyenNous wants to merge 1 commit into NousResearch:main from annguyenNous:fix/docker-persistent-container-recovery-clean

Conversation

@annguyenNous

Contributor

Problem

When a persistent Docker sandbox container is removed out-of-band (idle reaper, docker prune, OOM kill, daemon restart), the gateway keeps issuing docker exec against the dead container ID, returning "No such container" on every subsequent tool call. The agent is permanently blocked until the gateway process is restarted.

Reproduced with:

  • terminal.backend: docker, container_persistent: true
  • Container removed externally via docker rm -f hermes-XXXX
  • All subsequent terminal() calls return the same error indefinitely

Fix

Added automatic recovery in DockerEnvironment.execute():

  1. Detection: execute() checks the output for "No such container" or "is not running" patterns after any non-zero exit
  2. Recovery: When detected in a persistent session, _recreate_container():
    • Probes for a reusable container via labels (another process may have recreated it)
    • Falls back to creating a fresh container with the same image and run-args
    • Re-initializes the session snapshot
  3. Retry: The original command is retried transparently

Recovery activates only for persist_across_processes=True sessions; one-shot containers do not need it.
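
The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `FakeDocker` and `SketchEnvironment` are hypothetical stand-ins, the label dictionary is invented for the example, and only the names `execute`, `_recreate_container`, `init_session`, and `persist_across_processes` come from the description (their real signatures are assumed).

```python
from dataclasses import dataclass, field

GONE_PATTERNS = ("No such container", "is not running")


@dataclass
class FakeDocker:
    """In-memory stand-in for the docker CLI (hypothetical, for illustration)."""
    containers: dict = field(default_factory=dict)  # container id -> labels
    counter: int = 0

    def run(self, labels):
        self.counter += 1
        cid = f"c{self.counter}"
        self.containers[cid] = dict(labels)
        return cid

    def find_by_labels(self, labels):
        for cid, have in self.containers.items():
            if all(have.get(k) == v for k, v in labels.items()):
                return cid
        return None

    def exec(self, cid, command):
        if cid not in self.containers:
            # Message modeled on the daemon error quoted in the PR.
            return 1, f"Error response from daemon: No such container: {cid}"
        return 0, f"ran: {command}"


class SketchEnvironment:
    def __init__(self, docker, labels, persist_across_processes=True):
        self.docker = docker
        self.labels = labels
        self.persist_across_processes = persist_across_processes
        self.container_id = docker.run(labels)
        self.session_inits = 0
        self.init_session()

    def init_session(self):
        # Placeholder for re-initializing the session snapshot.
        self.session_inits += 1

    def _recreate_container(self):
        # 1. Reuse a container another process may already have recreated.
        reused = self.docker.find_by_labels(self.labels)
        # 2. Otherwise start a fresh one with the same settings.
        self.container_id = reused or self.docker.run(self.labels)
        # 3. Re-initialize the session snapshot.
        self.init_session()

    def execute(self, command):
        code, output = self.docker.exec(self.container_id, command)
        gone = code != 0 and any(p in output for p in GONE_PATTERNS)
        if gone and self.persist_across_processes:
            self._recreate_container()
            # Retry the original command exactly once.
            code, output = self.docker.exec(self.container_id, command)
        return code, output
```

In this sketch, deleting the entry for `env.container_id` from `FakeDocker.containers` plays the role of `docker rm -f`: the next `execute()` call swaps in a new container ID and succeeds, while an environment built with `persist_across_processes=False` simply returns the error.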

Tests

All 63 existing docker environment tests pass:

tests/tools/test_docker_environment.py — 63 passed in 2.70s

Fixes #36266

… mode

When a persistent Docker sandbox container is removed out-of-band (idle
reaper, docker prune, OOM kill, daemon restart), the gateway keeps
issuing 'docker exec' against the dead container ID, returning 'No such
container' on every subsequent tool call.  The agent is permanently
blocked until the gateway process is restarted.

This fix adds automatic recovery: when execute() detects a 'No such
container' or 'is not running' error in a persistent session, it
invalidates the cached container handle, probes for a reusable container
via labels, and falls back to creating a fresh one — then re-initializes
the session snapshot and retries the original command transparently.

Fixes NousResearch#36266
@alt-glitch added the type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), and backend/docker (Docker container execution) labels on Jun 1, 2026
benbarclay added a commit that referenced this pull request Jun 5, 2026
… mode (salvage #36631) (#39415)

Salvage of #36631 (@annguyenNous), rebased onto current main with
regression tests added. Fixes #36266.

When a persistent Docker sandbox container is removed out-of-band (idle
reaper, `docker prune`, OOM kill, daemon restart), the gateway kept
issuing `docker exec` against the dead container ID, returning
"No such container" on every subsequent tool call — the agent was
permanently blocked until the gateway process restarted.

DockerEnvironment.execute() now detects the "No such container" /
"is not running" error after a non-zero exit (gated on
persist_across_processes) and calls _recreate_container(): it tries
label-based reuse first, falls back to a fresh container replaying the
same image + full all_run_args set, re-runs init_session(), and retries
the command once. A genuine non-zero exit is NOT misclassified as
container-gone.
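
A sketch of the classification rule described above. The function name mirrors `_is_container_gone` from this message, but the signature, the case-insensitive substring match, and the exact pattern tuple are assumptions made for illustration:

```python
_CONTAINER_GONE_PATTERNS = ("no such container", "is not running")


def is_container_gone(returncode: int, output: str) -> bool:
    """Return True only when a failed command looks like a missing or stopped container."""
    if returncode == 0:
        return False  # successful commands are never treated as container-gone
    lowered = output.lower()
    return any(pattern in lowered for pattern in _CONTAINER_GONE_PATTERNS)
```

With a pure substring check like this one, a user command that fails and happens to print one of the phrases would also match, so the real implementation may need to be stricter about where the text comes from.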

Differs from #36631 as submitted: adds the tests the original lacked.
tests/tools/test_docker_environment.py covers _is_container_gone pattern
matching (incl. the negative/control case), the recover-and-retry path,
the persist_across_processes=False opt-out (no recovery), and the
ordinary-failure passthrough (no spurious recreation). _make_dummy_env
now forwards persist_across_processes.
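
The covered cases could be exercised along these lines. This is a hedged sketch using `unittest.mock`, not the tests in tests/tools/test_docker_environment.py; `run_with_recovery` and `looks_gone` are hypothetical helpers standing in for `execute()` and `_is_container_gone`.

```python
from unittest.mock import Mock

GONE = (1, "Error response from daemon: No such container: abc")
OK = (0, "ok")
PLAIN_FAILURE = (3, "")  # e.g. the result of a real `exit 3`


def looks_gone(code, output):
    return code != 0 and ("No such container" in output or "is not running" in output)


def run_with_recovery(exec_fn, recreate_fn, persistent):
    """Run exec_fn; on a container-gone error in persistent mode, recreate and retry once."""
    result = exec_fn()
    if persistent and looks_gone(*result):
        recreate_fn()
        result = exec_fn()
    return result


def check_recover_and_retry():
    exec_fn, recreate = Mock(side_effect=[GONE, OK]), Mock()
    assert run_with_recovery(exec_fn, recreate, persistent=True) == OK
    assert recreate.call_count == 1 and exec_fn.call_count == 2


def check_opt_out():
    exec_fn, recreate = Mock(side_effect=[GONE]), Mock()
    assert run_with_recovery(exec_fn, recreate, persistent=False) == GONE
    recreate.assert_not_called()


def check_ordinary_failure_passthrough():
    exec_fn, recreate = Mock(side_effect=[PLAIN_FAILURE]), Mock()
    assert run_with_recovery(exec_fn, recreate, persistent=True) == PLAIN_FAILURE
    recreate.assert_not_called()
```

Counting calls on the mocks is what distinguishes "recovered and retried once" from "passed the failure through untouched".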

Verified:
- Unit: 67/67 in test_docker_environment.py (4 new + existing).
- Live E2E against the real docker daemon: started a persistent
  container, `docker rm -f`'d it out-of-band, and the next execute()
  transparently recreated a fresh container and succeeded; a follow-up
  command worked in the recovered container; a real `exit N` passed
  through without triggering recovery.

Co-authored-by: annguyenNous <annguyenNous@users.noreply.github.com>
davidgut1982 pushed a commit to davidgut1982/hermes-agent that referenced this pull request Jun 5, 2026
… mode (salvage NousResearch#36631) (NousResearch#39415)

changman pushed a commit to changman/hermes-agent that referenced this pull request Jun 10, 2026
… mode (salvage NousResearch#36631) (NousResearch#39415)

Labels

backend/docker: Docker container execution
P2: Medium, degraded but workaround exists
type/bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

Persistent docker sandbox: gateway loops on No such container when the pinned container is removed out-of-band (never re-spawns)

2 participants