fix(docker): auto-join Docker socket group for docker-in-docker backend #34407
Merged
When users bind-mount /var/run/docker.sock to use TERMINAL_ENV=docker from inside the container, the supervised hermes user (UID 10000) lacks permission to talk to the socket: every `docker` invocation EACCES'es and check_terminal_requirements() returns False. In messaging mode this also silently strips the file/terminal toolset from the registered tool list, so the agent rationalizes the missing tools as a platform restriction.

The naive workaround (docker run --group-add <socket-gid>) does NOT work with our s6-setuidgid privilege drop: s6-setuidgid calls initgroups() for the target user, which rebuilds supp groups from /etc/group. Without a matching /etc/group entry the kernel-granted supp group is wiped between PID 1 and the dropped hermes process.

Verified empirically:

  --group-add 998 alone:       PID 1 Groups: 0 998 → after drop: Groups: 10000
  This fix's /etc/group add:   id hermes shows 998 → after drop: Groups: 998 10000

Detect the socket's GID at boot in stage2-hook (runs as root before the privilege drop), reuse an existing group name if one matches the GID, otherwise create 'hostdocker'. Idempotent across container restarts. Silent no-op when no socket is mounted.

End-to-end verified by building the image and running the supervised hermes user against the real host Docker daemon: `docker version` succeeds and check_terminal_requirements() returns True.

Fixes #16703
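The group wipe and the repair described in this commit message can be modelled in a few lines of Python. This is a minimal sketch under stated assumptions, not the code in stage2-hook: `GroupEntry`, `groups_after_initgroups` and `ensure_socket_group` are hypothetical names, `/etc/group` is represented as an in-memory list, and `initgroups()` is reduced to its documented effect (primary GID plus every group that lists the user as a member).

```python
from dataclasses import dataclass, field


@dataclass
class GroupEntry:
    """One /etc/group record: name, numeric GID, member user names."""
    name: str
    gid: int
    members: list = field(default_factory=list)


def groups_after_initgroups(user, primary_gid, entries):
    """Model initgroups(): primary GID plus groups that list `user`.

    Groups granted only by the kernel (e.g. via --group-add) are never
    consulted here, which is why they do not survive the privilege drop.
    """
    return {primary_gid} | {e.gid for e in entries if user in e.members}


def ensure_socket_group(entries, sock_gid, user, fallback_name="hostdocker"):
    """Reuse the group that already owns sock_gid, else add fallback_name.

    Then add `user` as a member. Calling it twice changes nothing further.
    Returns the name of the group that was used.
    """
    entry = next((e for e in entries if e.gid == sock_gid), None)
    if entry is None:
        entry = GroupEntry(fallback_name, sock_gid)
        entries.append(entry)
    if user not in entry.members:
        entry.members.append(user)
    return entry.name


if __name__ == "__main__":
    etc_group = [GroupEntry("root", 0), GroupEntry("hermes", 10000)]
    # --group-add 998 alone: no /etc/group entry, so 998 is lost.
    print(sorted(groups_after_initgroups("hermes", 10000, etc_group)))  # [10000]
    ensure_socket_group(etc_group, 998, "hermes")
    # With the entry in place, 998 survives the drop.
    print(sorted(groups_after_initgroups("hermes", 10000, etc_group)))  # [998, 10000]
```

The two printed lists correspond to the "after drop" group sets reported above: `10000` alone without the entry, `998 10000` with it.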
Summary
Fixes #16703. When users configure `TERMINAL_ENV=docker` and bind-mount `/var/run/docker.sock` to drive the host's Docker daemon from inside our container ("docker-out-of-docker" / DooD), the supervised `hermes` user (UID 10000) lacks permission to talk to the socket. Every `docker` invocation EACCES'es and `check_terminal_requirements()` returns False.

In messaging-platform mode this has a strictly worse failure mode than the issue title suggests (surfaced by @osheari1 in the issue comments): `tools/__init__.py`'s `_check_file_reqs()` delegates to `check_terminal_requirements()`, so when the docker probe fails the entire `file` toolset (`read_file` / `write_file` / `patch` / `search_files` / `terminal`) is silently dropped from the registered tool schema, and the model confidently rationalizes the gaps as a Discord/Telegram sandboxing restriction. `model_tools._tool_defs_cache` then memoizes that bad result for the rest of the gateway process lifetime.

Why the obvious workaround doesn't work
The "just tell users to `docker run --group-add <socket-gid>`" answer is broken on our image. `s6-setuidgid` (and gosu, the older shim) calls `initgroups()` for the target user, which rebuilds the supplementary group list from `/etc/group`. Without a matching `/etc/group` entry, the kernel-granted supp group is wiped between PID 1 and the dropped `hermes` process. Verified empirically against the published image:

The asymmetry hides the bug from people debugging it: `docker exec --user hermes` doesn't run `initgroups()`, so an interactive shell shows the group correctly while the supervised agent process has it wiped.

Fix
In `docker/stage2-hook.sh` (runs as root before the privilege drop), after the existing UID/GID remap:

1. `stat -c '%g'` the bind-mounted `/var/run/docker.sock` (or `/run/docker.sock`)
2. If that GID already has an entry in `/etc/group`, reuse its name; otherwise `groupadd -g <gid> hostdocker`
3. `usermod -aG <group> hermes`

Idempotent across container restarts. Silent no-op when no socket is mounted. Non-fatal warnings under rootless containers where `groupadd` / `usermod` can fail.

Deliberately a NO-OP on the toolset-stripping side: if the user has explicitly selected `terminal.backend: docker` and we still can't reach the daemon for some other reason, we must surface that as a failure, not silently fall back to a different backend (per maintainer guidance).

Test plan (all run end-to-end against a freshly-built image and the real host Docker daemon, host socket GID 959)
- … `--group-add`
- `docker version` works as `s6-setuidgid hermes`; `check_terminal_requirements() == True`
- `docker restart` (idempotency)
- … `tty`=5)
- … `getent group 0` → `root`, `usermod -aG root hermes` succeeds

Stage2 hook output on a successful first boot:
And the actual Python probe inside the running supervised container:
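A comparable check can be sketched without the project's own helpers. The following is a minimal sketch, not the probe the author ran: it does not call `check_terminal_requirements()`, and `can_connect` and `probe` are hypothetical names. It assumes that connecting to a Unix socket needs write permission on the socket file (the Linux behaviour) and it ignores ACLs and capabilities.

```python
import os
import stat


def can_connect(sock_uid, sock_gid, sock_mode, uid, gids):
    """Classic owner/group/other check for write access to a Unix socket.

    Assumption: write permission on the socket is what connect() needs;
    ACLs and capabilities are not considered.
    """
    if uid == 0:
        return True
    if uid == sock_uid:
        return bool(sock_mode & stat.S_IWUSR)
    if sock_gid in gids:
        return bool(sock_mode & stat.S_IWGRP)
    return bool(sock_mode & stat.S_IWOTH)


def probe(path="/var/run/docker.sock"):
    """Return None if no socket is visible, else True/False for this process."""
    try:
        st = os.stat(path)
    except OSError:
        return None
    # Effective GID plus the supplementary groups the process actually holds.
    gids = {os.getegid(), *os.getgroups()}
    return can_connect(st.st_uid, st.st_gid, st.st_mode, os.geteuid(), gids)


if __name__ == "__main__":
    print(probe())
```

Because `os.getgroups()` reports the groups of the running process, the result differs between a `docker exec --user hermes` shell and the supervised process when the `/etc/group` entry is missing, which makes the sketch useful only when run from inside the supervised process itself.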
Out of scope (intentional, see issue thread)
- Silent toolset stripping (`_check_file_reqs` → entire file toolset disappears with no log line): not addressed here because it crosses into `tools/__init__.py` + `model_tools.py` (cross-lane) and the user has explicitly directed that we never silently fall back from the docker backend to a different one. Worth a follow-up to at least surface the EACCES in logs without changing the fallback behavior, but separate PR.
- `website/docs/user-guide/docker.md` already says "bind-mount `/var/run/docker.sock` to opt in"; with this fix that's now sufficient and no further user-facing setup is required.

Fixes #16703