fix(docker): auto-join Docker socket group for docker-in-docker backend by benbarclay · Pull Request #34407 · NousResearch/hermes-agent

benbarclay · 2026-05-29T06:14:59Z

Summary

Fixes #16703. When users configure TERMINAL_ENV=docker and bind-mount /var/run/docker.sock to drive the host's Docker daemon from inside our container ("docker-out-of-docker" / DooD), the supervised hermes user (UID 10000) lacks permission to talk to the socket. Every docker invocation EACCES'es and check_terminal_requirements() returns False.

In messaging-platform mode this has a strictly worse failure mode than the issue title suggests (surfaced by @osheari1 in the issue comments): tools/__init__.py's _check_file_reqs() delegates to check_terminal_requirements(), so when the docker probe fails the entire file toolset (read_file/write_file/patch/search_files/terminal) is silently dropped from the registered tool schema, and the model confidently rationalizes the gaps as a Discord/Telegram sandboxing restriction. model_tools._tool_defs_cache then memoizes that bad result for the rest of the gateway process lifetime.
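The memoization effect described above can be modelled in a few lines. This is a minimal sketch, not the project's code: `check_terminal_requirements_stub`, `tool_defs`, `socket_ok`, and the `web_search` tool are hypothetical stand-ins, and `functools.lru_cache` is assumed here only to imitate whatever `model_tools._tool_defs_cache` actually does.

```python
from functools import lru_cache

# Tool names taken from the PR description; "web_search" is a made-up
# example of a tool that does not depend on the terminal backend.
FILE_TOOLS = ["read_file", "write_file", "patch", "search_files", "terminal"]
OTHER_TOOLS = ["web_search"]

# Stand-in for "can the hermes user open docker.sock right now?"
socket_ok = False


def check_terminal_requirements_stub() -> bool:
    """Hypothetical replacement for the real docker probe."""
    return socket_ok


@lru_cache(maxsize=None)
def tool_defs() -> tuple[str, ...]:
    """Build the tool list once; later calls return the cached tuple."""
    tools = list(OTHER_TOOLS)
    if check_terminal_requirements_stub():
        tools += FILE_TOOLS
    return tuple(tools)


first = tool_defs()        # probe fails, file tools are left out
socket_ok = True           # permissions get fixed while the process runs
second = tool_defs()       # still the cached result from the failed probe
tool_defs.cache_clear()    # only a cache reset (or restart) re-runs the probe
third = tool_defs()

print(first)   # ('web_search',)
print(second)  # ('web_search',)
print(third)   # ('web_search', 'read_file', 'write_file', 'patch', 'search_files', 'terminal')
```

In this model the fix has to land before the first probe, which is consistent with doing the group setup in the boot hook rather than at call time.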

Why the obvious workaround doesn't work

The "just tell users to docker run --group-add <socket-gid>" answer is broken on our image. s6-setuidgid (and gosu, the older shim) calls initgroups() for the target user, which rebuilds the supplementary group list from /etc/group. Without a matching /etc/group entry, the kernel-granted supp group is wiped between PID 1 and the dropped hermes process. Verified empirically against the published image:

--group-add 998 alone:        PID 1 has Groups: 0 998
                              after s6-setuidgid hermes: Groups: 10000   ← 998 gone
With this fix's /etc/group:   id hermes shows 998
                              after s6-setuidgid hermes: Groups: 998 10000   ← survives

The asymmetry hides the bug from people debugging it: docker exec --user hermes doesn't run initgroups(), so an interactive shell shows the group correctly while the supervised agent process has it wiped.
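The wipe can be reproduced on paper without a container. The sketch below is a simplified model, not the s6 or libc source: `supplementary_groups` is a hypothetical helper that mimics what an initgroups()-style lookup derives from /etc/group text, namely the primary GID plus every group whose member list names the user. The group file contents are invented for illustration and only reuse the GIDs from the trace above.

```python
def supplementary_groups(group_file: str, user: str, primary_gid: int) -> list[int]:
    """Model an initgroups()-style lookup over /etc/group text."""
    gids = {primary_gid}
    for line in group_file.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # /etc/group format: name:password:gid:comma,separated,members
        fields = line.split(":")
        if len(fields) < 4:
            continue
        members = [m for m in fields[3].split(",") if m]
        if user in members:
            gids.add(int(fields[2]))
    return sorted(gids)


# Illustrative /etc/group contents (assumed, not copied from the image).
BEFORE_FIX = "root:x:0:\nhermes:x:10000:\n"
AFTER_FIX = BEFORE_FIX + "hostdocker:x:998:hermes\n"

# --group-add 998 gives PID 1 the group, but the lookup for hermes never sees it.
print(supplementary_groups(BEFORE_FIX, "hermes", 10000))  # [10000]
# With a matching /etc/group entry the group survives the privilege drop.
print(supplementary_groups(AFTER_FIX, "hermes", 10000))   # [998, 10000]
```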

Fix

In docker/stage2-hook.sh (runs as root before the privilege drop), after the existing UID/GID remap:

1. `stat -c '%g'` the bind-mounted /var/run/docker.sock (or /run/docker.sock).
2. If a group with that GID already exists in /etc/group, reuse its name; otherwise `groupadd -g <gid> hostdocker`.
3. `usermod -aG <group> hermes`.

Idempotent across container restarts. Silent no-op when no socket is mounted. Non-fatal warnings under rootless containers where groupadd/usermod can fail.
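The decision logic of the hook can be sketched as a pure function. The real change is a shell script in docker/stage2-hook.sh; this Python rendering is only an illustration of the steps listed above. `find_socket_gid` and `plan_socket_group` are hypothetical names, the socket-type check is an assumption, and the case where the socket GID equals the user's primary GID is ignored for brevity.

```python
import os
import stat
from typing import Optional


def find_socket_gid(paths=("/var/run/docker.sock", "/run/docker.sock")) -> Optional[int]:
    """Return the owning GID of the first path that is a socket, else None."""
    for path in paths:
        try:
            st = os.stat(path)
        except OSError:
            continue  # nothing mounted at this path
        if stat.S_ISSOCK(st.st_mode):
            return st.st_gid
    return None


def plan_socket_group(socket_gid: Optional[int], group_file: str,
                      user: str = "hermes",
                      fallback_name: str = "hostdocker") -> list[list[str]]:
    """Return the commands the hook would need to run, as argv lists."""
    if socket_gid is None:
        return []  # no socket mounted: silent no-op
    name, members = None, []
    for line in group_file.splitlines():
        fields = line.strip().split(":")
        if len(fields) >= 4 and fields[2] == str(socket_gid):
            name = fields[0]  # reuse the existing group name for this GID
            members = [m for m in fields[3].split(",") if m]
            break
    commands = []
    if name is None:
        name = fallback_name
        commands.append(["groupadd", "-g", str(socket_gid), name])
    if user not in members:
        commands.append(["usermod", "-aG", name, user])
    return commands


# Illustrative /etc/group contents (assumed).
GROUP = "root:x:0:\ntty:x:5:\nhermes:x:10000:\n"
print(plan_socket_group(959, GROUP))   # groupadd hostdocker, then usermod
print(plan_socket_group(5, GROUP))     # reuse tty, usermod only
print(plan_socket_group(959, GROUP + "hostdocker:x:959:hermes\n"))  # [] on a second boot
```

Returning a plan instead of executing it keeps the sketch testable without root; a real hook would run the commands and downgrade failures to warnings, as the description says for rootless containers.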

Deliberately a NO-OP on the toolset-stripping side: if the user has explicitly selected terminal.backend: docker and we still can't reach the daemon for some other reason, we must surface that as a failure, not silently fall back to a different backend (per maintainer guidance).

Test plan (all run end-to-end against a freshly-built image and the real host Docker daemon, host socket GID 959)

| Scenario | Expected | Result |
| --- | --- | --- |
| Socket mounted, no `--group-add` | `docker version` works as `s6-setuidgid hermes`; `check_terminal_requirements() == True` | ✅ |
| No socket mounted | Hook silent, no warnings, no group changes | ✅ |
| `docker restart` (idempotency) | Second boot logs "hermes already in group 959", no duplicate work | ✅ |
| Socket GID collides with existing container group (e.g. `tty`=5) | Reuses existing group name, no duplicate group | ✅ |
| Socket owned by root (GID 0, some Podman setups) | `getent group 0` → `root`, `usermod -aG root hermes` succeeds | ✅ |

Stage2 hook output on a successful first boot:

[stage2] Created group hostdocker (GID 959) for Docker socket
[stage2] Added hermes to group hostdocker (GID 959) for /var/run/docker.sock

And the actual Python probe inside the running supervised container:

$ docker exec -e TERMINAL_ENV=docker <cid> /command/s6-setuidgid hermes \
    /opt/hermes/.venv/bin/python -c \
    'from tools.terminal_tool import check_terminal_requirements; \
     print(check_terminal_requirements())'
True

Out of scope (intentional, see issue thread)

- Toolset-stripping side of the bug (_check_file_reqs → entire file toolset disappears with no log line): not addressed here because it crosses into tools/__init__.py + model_tools.py (cross-lane) and the user has explicitly directed that we never silently fall back from the docker backend to a different one. Worth a follow-up to at least surface the EACCES in logs without changing the fallback behavior, but that belongs in a separate PR.
- Documentation update: the existing website/docs/user-guide/docker.md already says "bind-mount /var/run/docker.sock to opt in"; with this fix that's now sufficient and no further user-facing setup is required.
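For the follow-up mentioned in the first item, one possible shape is a probe that logs why the socket is unreachable while still returning a plain boolean, so the fallback behavior stays unchanged. This is a sketch under assumptions, not part of this PR: `probe_docker_socket` and the logger name are hypothetical, and the real probe in tools/terminal_tool.py may work differently.

```python
import errno
import logging
import socket

log = logging.getLogger("docker_probe")  # hypothetical logger name


def probe_docker_socket(path: str = "/var/run/docker.sock") -> bool:
    """Return True if the socket accepts a connection; log the reason if not."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        sock.connect(path)
        return True
    except PermissionError:
        # Raised for EACCES/EPERM: the case this PR fixes at boot time.
        log.warning("permission denied opening %s: is the user in the socket's group?", path)
        return False
    except OSError as exc:
        # Missing socket, refused connection, and similar failures.
        log.warning("cannot reach %s: %s", path, errno.errorcode.get(exc.errno, exc))
        return False
    finally:
        sock.close()


print(probe_docker_socket("/nonexistent/docker.sock"))  # False
```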

Fixes #16703

When users bind-mount /var/run/docker.sock to use TERMINAL_ENV=docker from inside the container, the supervised hermes user (UID 10000) lacks permission to talk to the socket: every `docker` invocation EACCES'es and check_terminal_requirements() returns False. In messaging mode this also silently strips the file/terminal toolset from the registered tool list, so the agent rationalizes the missing tools as a platform restriction.

The naive workaround (docker run --group-add <socket-gid>) does NOT work with our s6-setuidgid privilege drop: s6-setuidgid calls initgroups() for the target user, which rebuilds supp groups from /etc/group. Without a matching /etc/group entry the kernel-granted supp group is wiped between PID 1 and the dropped hermes process.

Verified empirically:

--group-add 998 alone: PID 1 Groups: 0 998 → after drop: Groups: 10000
This fix's /etc/group add: id hermes shows 998 → after drop: Groups: 998 10000

Detect the socket's GID at boot in stage2-hook (runs as root before the privilege drop), reuse an existing group name if one matches the GID, otherwise create 'hostdocker'. Idempotent across container restarts. Silent no-op when no socket is mounted.

End-to-end verified by building the image and running the supervised hermes user against the real host Docker daemon: `docker version` succeeds and check_terminal_requirements() returns True.

Fixes #16703

github-actions · 2026-05-29T06:18:01Z

🔎 Lint report: `docker-terminal-did` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9419 on HEAD, 9419 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4890 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

benbarclay merged commit ec7736f into main May 29, 2026
21 checks passed

benbarclay deleted the docker-terminal-did branch May 29, 2026 06:15

alt-glitch added the labels type/bug (Something isn't working), area/docker (Docker image, Compose, packaging), backend/docker (Docker container execution), and P2 (Medium: degraded but workaround exists) May 29, 2026

This was referenced May 29, 2026

[Bug]: Regression in v0.15.1: non-root Docker + group_add (docker.sock GID) causes s6-overlay boot loop #34648 (Closed)

fix(docker): boot non-root containers, skip the s6-setuidgid drop when already unprivileged. #34837 (Merged)

github-actions Bot mentioned this pull request Jun 6, 2026

chore: bump NousResearch/hermes-agent version from v2026.5.29.2 to v2026.6.5 Docker-Hub-sirmark/docker-hermes-agent#9 (Merged)

ashanzzz mentioned this pull request Jun 6, 2026

[Bug] Docker backend generates container-internal paths for sandbox bind mounts when using host Docker socket (DooD on Unraid) #40368 (Open)

benbarclay merged 1 commit into main from docker-terminal-did
