Skip to content

fix(docker): skip redundant s6 user drop#34684

Closed
0xdany wants to merge 1 commit into
NousResearch:mainfrom
0xdany:codex/fix-docker-group-add-s6-loop
Closed

fix(docker): skip redundant s6 user drop#34684
0xdany wants to merge 1 commit into
NousResearch:mainfrom
0xdany:codex/fix-docker-group-add-s6-loop

Conversation

@0xdany

@0xdany 0xdany commented May 29, 2026

Copy link
Copy Markdown

What does this PR do?

Fixes a Docker image regression where containers started directly as the Hermes UID, for example with Compose user: "10000:10000" plus group_add for /var/run/docker.sock, could hit an s6-overlay restart loop:

s6-applyuidgid: fatal: unable to set supplementary group list: Operation not permitted

The Docker scripts now use a shared run-as-hermes.sh helper. It keeps the existing s6-setuidgid hermes behavior when the process is privileged, but bypasses that extra user/group reset when the current process is already running as the Hermes UID. That preserves Docker-injected supplementary groups and avoids the failing redundant privilege drop.

Related Issue

Fixes #34648

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Added docker/run-as-hermes.sh with run_as_hermes and exec_as_hermes helpers.
  • Updated docker/stage2-hook.sh to use run_as_hermes for boot-time Hermes-owned file/directory setup.
  • Updated docker/main-wrapper.sh, docker/s6-rc.d/dashboard/run, and docker/cont-init.d/02-reconcile-profiles to use exec_as_hermes instead of unconditionally invoking s6-setuidgid hermes.
  • Added tests/tools/test_docker_run_as_hermes.py to cover the already-Hermes UID bypass, the privileged s6 path, and script wiring.

How to Test

  1. Run git diff --check.
  2. Run sh -n docker/run-as-hermes.sh docker/stage2-hook.sh docker/main-wrapper.sh docker/s6-rc.d/dashboard/run docker/cont-init.d/02-reconcile-profiles.
  3. Run uv run --with pytest --with pytest-timeout pytest tests/tools/test_docker_run_as_hermes.py tests/tools/test_stage2_hook_puid_pgid.py -q.
  4. Run scripts/run_tests.sh tests/tools/test_docker_run_as_hermes.py tests/tools/test_stage2_hook_puid_pgid.py.
  5. For manual container validation, start the image with user: "10000:10000", group_add containing the Docker socket GID, and /var/run/docker.sock:/var/run/docker.sock; startup should no longer fail with the s6 supplementary-group error.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS 26.3.1

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

Validation run locally:

  • git diff --check passed
  • sh -n docker/run-as-hermes.sh docker/stage2-hook.sh docker/main-wrapper.sh docker/s6-rc.d/dashboard/run docker/cont-init.d/02-reconcile-profiles passed
  • uv run --with pytest --with pytest-timeout pytest tests/tools/test_docker_run_as_hermes.py tests/tools/test_stage2_hook_puid_pgid.py -q -> 10 passed
  • scripts/run_tests.sh tests/tools/test_docker_run_as_hermes.py tests/tools/test_stage2_hook_puid_pgid.py -> 10 tests passed, 0 failed

Full pytest tests/ -q was not run for this PR. Live Docker boot validation was also not run locally because Docker is installed but the daemon is unavailable on this machine: Cannot connect to the Docker daemon at unix:///Users/danyraihan/.orbstack/run/docker.sock.

@0xdany

0xdany commented May 29, 2026

Copy link
Copy Markdown
Author

@austinpickett I opened this as a focused P1 Docker regression fix for #34648. GitHub would not let me formally request review (RequestReviews permission denied), so tagging here.

Local validation:

  • git diff --check passed
  • sh -n docker/run-as-hermes.sh docker/stage2-hook.sh docker/main-wrapper.sh docker/s6-rc.d/dashboard/run docker/cont-init.d/02-reconcile-profiles passed
  • uv run --with pytest --with pytest-timeout pytest tests/tools/test_docker_run_as_hermes.py tests/tools/test_stage2_hook_puid_pgid.py -q -> 10 passed
  • scripts/run_tests.sh tests/tools/test_docker_run_as_hermes.py tests/tools/test_stage2_hook_puid_pgid.py -> 10 tests passed, 0 failed

I could not run a live Docker boot repro locally because the Docker daemon is unavailable on this machine, but the shell helper is covered for both paths: already-Hermes UID bypasses s6, privileged startup still uses s6.

@alt-glitch alt-glitch added type/bug Something isn't working P1 High — major feature broken, no workaround area/docker Docker image, Compose, packaging backend/docker Docker container execution labels May 29, 2026
@benbarclay

Copy link
Copy Markdown
Collaborator

Superseded by #34837, which merged in 380ce47 and fixes the identical bug — the s6-applyuidgid: fatal: unable to set supplementary group list: Operation not permitted boot loop on non-root / group_add containers, with the same approach (skip the s6-setuidgid drop when the process is already running unprivileged).

#34837 guards each drop site inline ([ "$(id -u)" = 0 ] || exec <cmd>) across 02-reconcile-profiles, main-wrapper.sh, dashboard/run, stage2-hook.sh, and service_manager.py. Your PR solved it via a shared docker/run-as-hermes.sh helper — a clean approach, but since #34837 already landed, this now conflicts against main.

I verified #34837 end-to-end: an old image boots --user 10000 → exit 111 with that exact fatal error; the fixed image boots → running, exit 0, all s6 services up. Thanks for the fix @0xdany — the underlying regression is resolved. Closing as superseded.

@benbarclay benbarclay closed this Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/docker Docker image, Compose, packaging backend/docker Docker container execution P1 High — major feature broken, no workaround type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: Regression in v0.15.1: non-root Docker + group_add (docker.sock GID) causes s6-overlay boot loop

3 participants