fix(docker): auto-join Docker socket group for docker-in-docker backend #34407

Merged
benbarclay merged 1 commit into main from docker-terminal-did
May 29, 2026
@benbarclay

Collaborator

Summary

Fixes #16703. When users configure TERMINAL_ENV=docker and bind-mount /var/run/docker.sock to drive the host's Docker daemon from inside our container ("docker-out-of-docker" / DooD), the supervised hermes user (UID 10000) lacks permission to talk to the socket. Every docker invocation EACCES'es and check_terminal_requirements() returns False.

In messaging-platform mode the failure is strictly worse than the issue title suggests (surfaced by @osheari1 in the issue comments). tools/__init__.py's _check_file_reqs() delegates to check_terminal_requirements(), so when the docker probe fails, the entire file toolset (read_file/write_file/patch/search_files/terminal) is silently dropped from the registered tool schema. The model then confidently rationalizes the gaps as a Discord/Telegram sandboxing restriction, and model_tools._tool_defs_cache memoizes the bad result for the rest of the gateway process lifetime.
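To illustrate why the memoization makes this sticky, here is a minimal toy model of a cache-once tool registry. It is not the code in model_tools.py: the function name, the cache key, and the always-available "send_message" tool are hypothetical stand-ins, and only the file toolset names come from the description above.

```python
# Toy model only: get_tool_defs, the "defs" key, and "send_message" are
# hypothetical. It shows how caching a failed probe keeps the stripped tool
# list alive even after the probe would succeed.
FILE_TOOLSET = ["read_file", "write_file", "patch", "search_files", "terminal"]


def get_tool_defs(probe, cache):
    """Build the tool list once, then always serve the cached copy."""
    if "defs" not in cache:
        tools = ["send_message"]  # hypothetical tool with no requirements
        if probe():               # stands in for check_terminal_requirements()
            tools += FILE_TOOLSET
        cache["defs"] = tools
    return cache["defs"]


cache = {}
first = get_tool_defs(lambda: False, cache)   # probe fails at startup
second = get_tool_defs(lambda: True, cache)   # probe would pass now, but is never re-run
print(first)
print(second)
```

In this model the second call still returns the stripped list, which matches the described behavior of the bad result persisting until the process restarts.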

Why the obvious workaround doesn't work

The "just tell users to docker run --group-add <socket-gid>" answer is broken on our image. s6-setuidgid (and gosu, the older shim) calls initgroups() for the target user, which rebuilds the supplementary group list from /etc/group. Without a matching /etc/group entry, the kernel-granted supp group is wiped between PID 1 and the dropped hermes process. Verified empirically against the published image:

--group-add 998 alone:        PID 1 has Groups: 0 998
                              after s6-setuidgid hermes: Groups: 10000   ← 998 gone
With this fix's /etc/group:   id hermes shows 998
                              after s6-setuidgid hermes: Groups: 998 10000   ← survives

The asymmetry hides the bug from people debugging it: docker exec --user hermes doesn't run initgroups(), so an interactive shell shows the group correctly while the supervised agent process has it wiped.
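The effect of the group rebuild can be modelled in a few lines. The sketch below is a simplified stand-in for what initgroups(3) computes, assuming a plain /etc/group-style text as the only input; the function name and the sample group file contents are illustrative, not taken from the image. The point it makes is that the result depends only on the group database, so a supplementary GID granted to PID 1 by --group-add never enters the calculation.

```python
def initgroups_from_text(group_file_text, user, primary_gid):
    """Simplified model of initgroups(3): the primary GID plus every group
    whose member list names `user`. Groups held by the calling process are
    deliberately not an input."""
    gids = {primary_gid}
    for line in group_file_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # malformed entry, skip it
        _name, _passwd, gid, members = fields[:4]
        if user in [m for m in members.split(",") if m]:
            gids.add(int(gid))
    return sorted(gids)


# Illustrative group files (assumed contents, not copied from the image).
WITHOUT_FIX = "root:x:0:\nhermes:x:10000:\n"
WITH_FIX = WITHOUT_FIX + "hostdocker:x:998:hermes\n"

before = initgroups_from_text(WITHOUT_FIX, "hermes", 10000)
after = initgroups_from_text(WITH_FIX, "hermes", 10000)
print(before)  # [10000]
print(after)   # [998, 10000]
```

The two results line up with the "Groups:" lines in the verification output above: 998 only survives the privilege drop once /etc/group lists hermes as a member.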

Fix

In docker/stage2-hook.sh (runs as root before the privilege drop), after the existing UID/GID remap:

  1. stat -c '%g' the bind-mounted /var/run/docker.sock (or /run/docker.sock)
  2. If a group with that GID already exists in /etc/group, reuse its name; otherwise groupadd -g <gid> hostdocker
  3. usermod -aG <group> hermes

Idempotent across container restarts. Silent no-op when no socket is mounted. Non-fatal warnings under rootless containers where groupadd/usermod can fail.
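The decision logic in those three steps can be sketched as a pure function. The real hook is a shell script; the Python below is only a model of it, and the function names, the default arguments, and the sample group file are assumptions made for illustration. It returns the commands that would be needed rather than running them, which makes the idempotency and no-op cases easy to see.

```python
import os
import stat


def find_socket_gid(paths=("/var/run/docker.sock", "/run/docker.sock")):
    """Return the owning GID of the first path that is a socket, else None."""
    for path in paths:
        try:
            st = os.stat(path)
        except OSError:
            continue  # not mounted at this path
        if stat.S_ISSOCK(st.st_mode):
            return st.st_gid
    return None


def plan_socket_group(socket_gid, group_file_text, user="hermes",
                      fallback_name="hostdocker"):
    """Return the commands needed so `user` is in the socket's group."""
    if socket_gid is None:
        return []  # no socket mounted: silent no-op
    name, members = None, []
    for line in group_file_text.splitlines():
        fields = line.strip().split(":")
        if len(fields) >= 4 and fields[2] == str(socket_gid):
            name = fields[0]  # reuse the existing group name for this GID
            members = [m for m in fields[3].split(",") if m]
            break
    commands = []
    if name is None:
        name = fallback_name
        commands.append(["groupadd", "-g", str(socket_gid), name])
    if user not in members:
        commands.append(["usermod", "-aG", name, user])
    return commands


# Illustrative group file (assumed contents).
ETC_GROUP = "root:x:0:\ntty:x:5:\nhermes:x:10000:\n"

first_boot = plan_socket_group(959, ETC_GROUP)
gid_collision = plan_socket_group(5, ETC_GROUP)
second_boot = plan_socket_group(959, ETC_GROUP + "hostdocker:x:959:hermes\n")
print(first_boot)
print(gid_collision)
print(second_boot)
```

A caller would pass find_socket_gid() as the first argument and execute the returned commands as root, treating a failing groupadd or usermod as a warning rather than a fatal error.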

Deliberately a NO-OP on the toolset-stripping side: if the user has explicitly selected terminal.backend: docker and we still can't reach the daemon for some other reason, we must surface that as a failure, not silently fall back to a different backend (per maintainer guidance).

Test plan (all run end-to-end against a freshly-built image and the real host Docker daemon, host socket GID 959)

| Scenario | Expected result |
| --- | --- |
| Socket mounted, no --group-add | docker version works as s6-setuidgid hermes; check_terminal_requirements() == True |
| No socket mounted | Hook silent, no warnings, no group changes |
| docker restart (idempotency) | Second boot logs "hermes already in group 959", no duplicate work |
| Socket GID collides with existing container group (e.g. tty=5) | Reuses existing group name, no duplicate group |
| Socket owned by root (GID 0, some Podman setups) | getent group 0 → root, usermod -aG root hermes succeeds |

Stage2 hook output on a successful first boot:

[stage2] Created group hostdocker (GID 959) for Docker socket
[stage2] Added hermes to group hostdocker (GID 959) for /var/run/docker.sock

And the actual Python probe inside the running supervised container:

$ docker exec -e TERMINAL_ENV=docker <cid> /command/s6-setuidgid hermes \
    /opt/hermes/.venv/bin/python -c \
    'from tools.terminal_tool import check_terminal_requirements; \
     print(check_terminal_requirements())'
True

Out of scope (intentional, see issue thread)

  • Toolset-stripping side of the bug (_check_file_reqs → entire file toolset disappears with no log line) — not addressed here because it crosses into tools/__init__.py + model_tools.py (cross-lane) and the user has explicitly directed that we never silently fall back from the docker backend to a different one. Worth a follow-up to at least surface the EACCES in logs without changing the fallback behavior, but separate PR.
  • Documentation update — the existing website/docs/user-guide/docker.md already says "bind-mount /var/run/docker.sock to opt in"; with this fix that's now sufficient and no further user-facing setup is required.

Fixes #16703

When users bind-mount /var/run/docker.sock to use TERMINAL_ENV=docker from
inside the container, the supervised hermes user (UID 10000) lacks
permission to talk to the socket — every `docker` invocation EACCES'es and
check_terminal_requirements() returns False. In messaging mode this also
silently strips the file/terminal toolset from the registered tool list,
so the agent rationalizes the missing tools as a platform restriction.

The naive workaround (docker run --group-add <socket-gid>) does NOT work
with our s6-setuidgid privilege drop: s6-setuidgid calls initgroups() for
the target user, which rebuilds supp groups from /etc/group. Without a
matching /etc/group entry the kernel-granted supp group is wiped between
PID 1 and the dropped hermes process. Verified empirically:

  --group-add 998 alone:    PID 1 Groups: 0 998 → after drop: Groups: 10000
  This fix's /etc/group add: id hermes shows 998 → after drop: Groups: 998 10000

Detect the socket's GID at boot in stage2-hook (runs as root before the
privilege drop), reuse an existing group name if one matches the GID,
otherwise create 'hostdocker'. Idempotent across container restarts.
Silent no-op when no socket is mounted.

End-to-end verified by building the image and running the supervised
hermes user against the real host Docker daemon: `docker version`
succeeds and check_terminal_requirements() returns True.

Fixes #16703
@benbarclay benbarclay merged commit ec7736f into main May 29, 2026
21 checks passed
@benbarclay benbarclay deleted the docker-terminal-did branch May 29, 2026 06:15
@github-actions

Contributor

🔎 Lint report: docker-terminal-did vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 9419 on HEAD, 9419 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 4890 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Labels

  • area/docker: Docker image, Compose, packaging
  • backend/docker: Docker container execution
  • P2: Medium (degraded but workaround exists)
  • type/bug: Something isn't working
