fix(docker): boot non-root containers, skip the s6-setuidgid drop when already unprivileged. #34837

Merged: benbarclay merged 1 commit into NousResearch:main from IAvecilla:fix/non-root-s6-boot-loop on Jun 1, 2026

@IAvecilla (Contributor) commented May 29, 2026

What does this PR do?

Fixes the s6-overlay boot loop that hits any container started as a non-root user.
The container drops privileges to the unprivileged hermes user via s6-setuidgid hermes <cmd> in every boot script. s6-setuidgid calls setgroups(), which requires CAP_SETGID. A container started as root has it; a container started non-root does not, so every s6-setuidgid invocation dies with:

s6-applyuidgid: fatal: unable to set supplementary group list: Operation not permitted

The cont-init hooks exit 111 and the supervised services crash-loop, so the container never finishes booting.
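
One way to confirm this diagnosis from inside a container is to check whether the effective capability mask includes CAP_SETGID (capability number 6 on Linux). The sketch below is illustrative only and is not part of this PR; the function name `has_cap_setgid` is hypothetical, and it assumes the mask is the hex string found in the `CapEff` field of `/proc/<pid>/status`.

```sh
#!/bin/sh
# Hypothetical diagnostic, not part of this PR.
# Succeeds (exit 0) if the given hex capability mask has CAP_SETGID set.
# CAP_SETGID is capability number 6 in <linux/capability.h>.
has_cap_setgid() {
    mask=$1
    [ $(( (0x$mask >> 6) & 1 )) -eq 1 ]
}

# Usage (Linux only): read the effective mask of the current shell.
# cap_eff=$(awk '/^CapEff:/ {print $2}' "/proc/$$/status")
# if has_cap_setgid "$cap_eff"; then
#     echo "setgroups() should be permitted"
# else
#     echo "setgroups() will fail with EPERM"
# fi
```

A container started with a non-root `--user` normally has an all-zero effective mask, which is why the setgroups() call inside s6-setuidgid is refused.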

The fix guards each privilege drop: if already non-root, run the command directly; only call s6-setuidgid when we're root (there's something to drop). This restores the v0.14 behavior for non-root containers while leaving root containers byte-for-byte unchanged.

Related Issue

Fixes #34648

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

All changes apply the same guard: [ "$(id -u)" = 0 ] → only drop via s6-setuidgid when root; otherwise exec directly.

  • docker/main-wrapper.sh: add a drop() helper and route the three CMD exec paths through it.
  • docker/stage2-hook.sh: add an as_hermes() helper; route the four inline drops (mkdir / tee / cp / skills-sync) through it.
  • docker/cont-init.d/02-reconcile-profiles: guard the container_boot drop.
  • docker/s6-rc.d/dashboard/run: guard the dashboard drop (command/flags unchanged).
  • hermes_cli/service_manager.py: guard the generated per-profile gateway run and log run scripts (_render_gateway_run / _render_log_run) — these were the second set of drops, emitted at runtime.
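
The guard can be sketched as a small POSIX shell helper. This is a minimal sketch based on the description above, not the PR's actual code: the `RUN_UID` override is a hypothetical addition so that both branches can be exercised without being root, and the real `drop()` / `as_hermes()` helpers may differ in detail.

```sh
#!/bin/sh
# Sketch of the root-only privilege drop described in this PR.
# RUN_UID is an illustrative override (an assumption, not in the PR);
# when unset, the real UID from "id -u" is used.
drop() {
    if [ "${RUN_UID:-$(id -u)}" = 0 ]; then
        # Root: there is a privilege to drop, so go through s6-setuidgid.
        exec s6-setuidgid hermes "$@"
    else
        # Already unprivileged: setgroups() would fail with EPERM,
        # so run the command directly as the current user.
        exec "$@"
    fi
}
```

Because both branches use `exec`, the helper replaces the calling shell, so it is meant to be the last statement of a run script (or to be called from a subshell).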

Security note

This does not skip the privilege drop for root containers. The guard only no-ops the drop when the container is already running as a non-root user, where setgroups() is both impossible (no CAP_SETGID) and unnecessary (we're already unprivileged). There is no path where a root process avoids dropping to hermes. No privilege escalation, no change to network exposure, and the dashboard's auth (OAuth gate, replay-secret check, --insecure default) is untouched. Only the OS user the process runs as changes, and only in the non-root case the operator explicitly opted into.

How to Test

  1. Reproduce (pre-fix): docker run --rm --user 10000:10000 <pre-fix v0.15 image> → boot loops with s6-applyuidgid: ... Operation not permitted.
  2. Non-root (fixed): same --user 10000:10000 run → boots clean, no Operation not permitted, no cont-init exited 111, services start and stay up.
  3. Root (unchanged): default docker run (root) → boots normally and the workload still runs as hermes (UID 10000) via s6-setuidgid.
  4. Unit tests: pytest tests/hermes_cli/test_service_manager.py tests/test_docker_home_override_scripts.py -q — the generated-script assertions still pass (the exec s6-setuidgid hermes … lines are retained as the root branch).
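
Steps 1 and 2 can be checked mechanically by scanning the container logs for the two failure signatures quoted above. The helper below is a hypothetical sketch: the function name is an assumption, and the exact wording of the "exited 111" log line may differ from what s6-overlay prints.

```sh
#!/bin/sh
# Hypothetical helper: classify boot logs read from stdin using the
# failure signatures quoted in this PR.
boot_log_verdict() {
    if grep -q -e 'Operation not permitted' -e 'exited 111'; then
        echo "boot-loop"
    else
        echo "clean"
    fi
}

# Usage, with a placeholder container name:
# docker logs <container> 2>&1 | boot_log_verdict
```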

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits.
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix
  • I've run pytest tests/ -q and all tests pass.
  • I've added tests for my changes
  • I've tested on my platform.

Documentation & Housekeeping

  • I've updated relevant documentation — N/A (inline comments only)
  • I've updated cli-config.yaml.example if I added/changed config keys — N/A
  • I've updated CONTRIBUTING.md / AGENTS.md if I changed architecture or workflows — N/A
  • I've considered cross-platform impact — N/A (container-only; root path unchanged, only adds a non-root fallback)
  • I've updated tool descriptions/schemas if I changed tool behavior — N/A

@alt-glitch added the type/bug, P1, and area/docker labels May 29, 2026

@alt-glitch (Collaborator) commented:

Competing with #34684 (same issue, #34648). Both fix the s6-setuidgid boot loop for non-root containers. This PR has a broader scope: it guards all drop sites, including the scripts generated by service_manager.py. See also the merged #34407, #33078, and #32412 for prior s6 fixes in the same area.

@benbarclay merged commit 380ce47 into NousResearch:main Jun 1, 2026
25 checks passed

benbarclay added a commit that referenced this pull request Jun 4, 2026
…ar guidance (#38579)

`docker run --user $(id -u):$(id -g)` was a tini-era trick to make
container-written files match the host user. Under s6-overlay it no longer
works: the bootstrap (UID remap, volume + build-tree chown, config seeding)
needs root, and the baked image dirs (/opt/data, /opt/hermes/.venv, ui-tui,
node_modules) are owned by the hermes build UID (10000). A pinned arbitrary
UID can't write them, so the runtime fails with EACCES on a bind mount or
hard-crashes on a named volume (Docker inits the volume from the image as
10000; the non-root start can't even `cd /opt/data`, and the profile
reconciler dies with PermissionError on gateway_state.json).

Detect that start early in both the cont-init hook (stage2-hook.sh) and the
CMD wrapper (main-wrapper.sh) and fail fast with actionable guidance pointing
at the supported path: root start + HERMES_UID/HERMES_GID (or the PUID/PGID
aliases), which remaps the hermes user and chowns the volume — the same
host-UID-matching outcome --user was used for, without breaking s6.

The guard fires only when the current UID is neither root NOR the hermes UID.
This preserves the supported non-root start from #34648/#34837 (running with
`--user 10000:10000`, i.e. pinned to the hermes UID itself), which is
unaffected — only the arbitrary-UID variant that #34837 never actually made
writable is rejected.

Verified live across five scenarios (built image, bind + named volume):
arbitrary --user on bind -> rejected with guidance, hermes does not run;
arbitrary --user on named volume -> guidance shown, no raw 'can't cd' crash;
--user 10000:10000 -> boots; root + HERMES_UID=4242 remap -> boots, guard not
tripped; default root start -> boots. Pre-fix control reproduces the raw
PermissionError + 'can't cd' crash with no guidance.
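
The guard condition described in the commit message above (reject only when the current UID is neither root nor the hermes UID) could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from that commit: the function name and the message wording are hypothetical, and the hermes UID is hard-coded to the build UID 10000 mentioned above rather than looked up at runtime.

```sh
#!/bin/sh
# Sketch of the "neither root nor hermes" check from the follow-up commit.
# HERMES_BUILD_UID=10000 comes from the commit message; everything else
# here (names, wording) is illustrative.
HERMES_BUILD_UID=10000

reject_arbitrary_uid() {
    uid=$1
    if [ "$uid" -ne 0 ] && [ "$uid" -ne "$HERMES_BUILD_UID" ]; then
        echo "unsupported --user $uid: start as root and set HERMES_UID/HERMES_GID (or PUID/PGID) instead" >&2
        return 1
    fi
    return 0
}

# Usage in a boot script:
# reject_arbitrary_uid "$(id -u)" || exit 1
```

Root passes because the bootstrap needs it, and UID 10000 passes because it already owns the baked image directories; any other UID fails fast with guidance instead of crashing later with a raw permission error.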

Labels: area/docker (Docker image, Compose, packaging), P1 (High: major feature broken, no workaround), type/bug (Something isn't working)

Linked issue (merging this pull request may close it): [Bug]: Regression in v0.15.1: non-root Docker + group_add (docker.sock GID) causes s6-overlay boot loop