Skip to content

fix(docker): per-container s6-log lock + logger down marker (#34457, #34480)#34473

Open
Bartok9 wants to merge 4 commits into
NousResearch:mainfrom
Bartok9:fix/s6-log-shared-volume-lock-34457
Open

fix(docker): per-container s6-log lock + logger down marker (#34457, #34480)#34473
Bartok9 wants to merge 4 commits into
NousResearch:mainfrom
Bartok9:fix/s6-log-shared-volume-lock-34457

Conversation

@Bartok9

@Bartok9 Bartok9 commented May 29, 2026

Copy link
Copy Markdown
Contributor

Summary

Motivation

Closes #34457.
Closes #34480.

s6-log takes an exclusive lock on its log directory (logs/gateways/<profile>/lock) to guarantee a single writer. In a multi-container Compose stack where gateway, a custom-profile gateway, and dashboard bind-mount the same host data folder (the documented way to keep SQLite state, dynamic skills, and config in sync), two distinct problems combine into one crash loop:

  1. [Bug]: s6-log lock collision loops endlessly in multi-container gateway + dashboard shared volume setups #34457 — collision. Multiple containers run a gateway-default s6-log against the same logs/gateways/default path. The first claims the lock; the others die with "unable to take lock" and s6-supervise restarts them instantly.

  2. [Bug]: s6-log lock collision in multi-container setup with shared /opt/data volume #34480 — un-owned loggers run at all. container_boot._register_service writes a down marker on a profile's service slot when the container doesn't own that profile — but s6 supervises the log/ pipeline child independently of its producer, so the marker does not cascade. Every non-owning container therefore still starts gateway-<profile>/log's s6-log for all profiles. The dashboard container (which runs no gateway at all) is the clearest victim.

The fix addresses both:

  • Per-container nesting ([Bug]: s6-log lock collision loops endlessly in multi-container gateway + dashboard shared volume setups #34457): nest each container's gateway log directory under a per-container token so the lock paths no longer collide, while still persisting on the shared volume. The token is resolved portablylog/run executes under with-contenv sh (dash/execline, not bash, where $HOSTNAME is not guaranteed) — via HERMES_CONTAINER_ID$HOSTNAMEhostname/etc/hostname. Docker populates the latter three with the unique container ID by default. On by default; single-container deployments can opt out with HERMES_GATEWAY_LOG_PER_CONTAINER=0.

  • Logger down marker ([Bug]: s6-log lock collision in multi-container setup with shared /opt/data volume #34480, the issue's Option B): when a profile is registered in the down state, also write a down marker into its log/ sub-service so the un-owned logger does not run at all. This is the source-level fix; the per-container nesting is the safety net for any logger that does start.

This is effectively the issue's Option C (both), which #34480 explicitly recommends.

Verification

  • python3 -m pytest tests/hermes_cli/test_service_manager.py tests/hermes_cli/test_container_boot.py80 passed, including:
    • per-container log nesting on by default + HERMES_GATEWAY_LOG_PER_CONTAINER=0 opt-out flat path
    • down (un-owned) profile also marks log/ down; running (owned) profile leaves the logger free
  • sh -n + shellcheck -s sh on the rendered log/run — clean (POSIX-safe; no SC3028 $HOSTNAME bash-ism).
  • Docker-integration test updated to resolve the rotated current file via find … -name current so it is agnostic to the flat-vs-nested layout.
  • Docs: website/docs/user-guide/docker.md documents the per-container layout, the opt-out flag, and a "Sharing a data volume across containers" section.

Backwards compatibility

  • The base path is unchanged (logs/gateways/<profile>/…); only a per-container segment is appended by default.
  • HERMES_GATEWAY_LOG_PER_CONTAINER=0 restores the exact legacy flat path for single-container users.
  • The logger down marker is only added when the gateway slot itself is registered down (un-owned profile) — owned/started profiles are unaffected, so their logs are captured exactly as before.

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists area/docker Docker image, Compose, packaging comp/gateway Gateway runner, session dispatch, delivery labels May 29, 2026
@Bartok9 Bartok9 force-pushed the fix/s6-log-shared-volume-lock-34457 branch from de2c1bc to c2b0fa6 Compare May 29, 2026 08:28
@Bartok9 Bartok9 changed the title fix(docker): scope per-profile s6-log lock per container (#34457) fix(docker): per-container s6-log lock + logger down marker (#34457, #34480) May 29, 2026
@Bartok9 Bartok9 force-pushed the fix/s6-log-shared-volume-lock-34457 branch 2 times, most recently from b34533c to a1a2a30 Compare May 31, 2026 08:22
@Bartok9

Bartok9 commented May 31, 2026

Copy link
Copy Markdown
Contributor Author

Rebased + CI triage.

Rebased onto current origin/main (32899279a) via clean cherry-pick — the branch had drifted to 284 files / -16590 lines of staleness vs main; it's now back to the focused 6-file change. 80/80 local unit tests pass (tests/hermes_cli/test_service_manager.py + test_container_boot.py).

CI triage of the prior build-amd64 failure — one real failure: tests/docker/test_gateway_run_supervised.py::test_supervised_gateway_stdout_reaches_docker_logs. The ⚕ Hermes Gateway Starting banner did not reach docker logs after the 8s wait, though the gateway booted ("gateway is now running under s6 supervision" was present).

Root-cause notes for reviewers:

  • The exec s6-setuidgid hermes s6-log 1 n10 s1000000 T "$log_dir" line — incl. the 1 stdout-tee directive — is byte-identical to main; this PR only changes $log_dir to a per-container nested path and adds a mkdir -p/chown for the deeper path.
  • The #34480 log/down marker is correctly gated behind if not start:, so it does not affect the owned/running profile the test exercises.
  • Most plausible regression: a startup race — the extra mkdir -p "$base_dir/$container_id" + chown -R on the deeper nested path delays s6-log's tee just enough that the gateway's banner is written before the stdout pipe has a consumer, so it lands in current but not docker logs.

I've re-triggered CI on the clean commit to confirm whether the failure reproduces post-rebase before proposing a fix (candidate: pre-create + chown the nested $log_dir at register time in container_boot/_render_log_run, so s6-log doesn't mkdir on the hot path). Will follow up with the verified fix rather than a blind patch to s6 internals.

@Bartok9

Bartok9 commented May 31, 2026

Copy link
Copy Markdown
Contributor Author

Pushed 7b0fa35f4: the clean rebase confirmed build-amd64 still failed on test_supervised_gateway_stdout_reaches_docker_logs — so this was not staleness, it's the real per-container startup race.

Root cause: the log/run script from _render_log_run runs mkdir -p + recursive chown on the hot path every time s6-supervise (re)starts the logger. On a cold boot the gateway writes its rich-console banner to stdout before the logger pipeline finishes that mkdir/chown and takes over the fd, so the banner is lost and never reaches docker logs.

Fix: _register_service (cont-init.d, runs as root before s6-svscan starts the loggers) now pre-creates + chowns the per-container nested log dir up front via the new _precreate_log_dir / _resolve_container_log_dir helpers. The resolver mirrors _render_log_run's runtime resolution exactly (HERMES_CONTAINER_ID$HOSTNAMEhostname/etc/hostname, gated by HERMES_GATEWAY_LOG_PER_CONTAINER), so the pre-created dir is the same path s6-log locks. With the dir already present and hermes-owned, the script's mkdir -p is a no-op and s6-log forwards stdout immediately.

Pre-creation is best-effort; the script's own mkdir/chown remain as a fallback so behavior never regresses off-container. Added 5 unit tests (resolution: nested / hostname fallback / flat opt-out; pre-creation: direct + end-to-end via reconcile). Local suite: 31 passed, ruff clean. Waiting on build-amd64 to confirm green.

@Bartok9

Bartok9 commented May 31, 2026

Copy link
Copy Markdown
Contributor Author

Found the real root cause by reading the CI docker logs (491662877):

The CI docker logs showed only stderr reached the stream — skill-sync output, s6-rc service messages, the → gateway is now running under s6 supervision line, and Python-logging WARNINGs. The ⚕ Hermes Gateway Starting... banner was absent from both docker logs and the rotated current file (which was empty).

Root cause: under s6 supervision the gateway child's stdout is a pipe to s6-log, not a TTY, so Python block-buffers it. The banner is printed but sits in Python's stdout buffer and is never handed to s6-log's 1 stdout-tee within the window — in practice until unrelated later output happens to fill the ~4-8 KiB buffer. Python's logging defaults to stderr (which is why errors/WARNINGs showed but the stdout banner didn't), confirming the buffering diagnosis.

Fix: sys.stdout.flush() immediately after the banner print()s in gateway.py, so the banner is pushed to s6-log at startup and tee'd to /init's stdout = docker logs.

I also kept the earlier container_boot change (pre-create + chown the per-container log dir at register time) as defense-in-depth — it removes the mkdir/chown from the s6-log hot path — but the flush is the load-bearing fix. Waiting on build-amd64 to confirm green; not claiming fixed until CI says so.

@glensc

glensc commented May 31, 2026

Copy link
Copy Markdown

The underlying problem is weird, as this two-container setup is default in the docker-compose provided by the project:

@hulet

hulet commented Jun 1, 2026

Copy link
Copy Markdown

The two-container setup in the default docker-compose is a bug (see #36235). 5671059 deprecated the two-container setup in favor of the HERMES_DASHBOARD=1 flag, but failed to update the example docker-compose.yml. https://hermes-agent.nousresearch.com/docs/user-guide/docker explicitly states "Running the dashboard as a separate container is not supported: its gateway-liveness detection requires a shared PID namespace with the gateway process." and "Never run two Hermes gateway containers against the same data directory simultaneously — session files and memory stores are not designed for concurrent write access.".

@Bartok9

Bartok9 commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

Rebased clean + the deprecation question, head-on

Rebase: Rebased onto current origin/main via clean cherry-pick of the three focused commits — the branch had ballooned to ~1,300 files of staleness against main; it's now back to the intended 7 files / +390 / −7. mergeable is no longer CONFLICTING. Local unit suite green: tests/hermes_cli/test_service_manager.py + tests/hermes_cli/test_container_boot.py90 passed.

The bigger question @glensc / @hulet raised — and I think you're right. Thank you both for the pointer; it changes my read of this PR.

The two structural changes here (per-container log-dir nesting for #34457, and the log/down marker for #34480) exist to make a gateway + dashboard two-container stack sharing one /opt/data volume stop crash-looping on the s6-log lock. But current main now documents that exact topology as unsupported, in three places in website/docs/user-guide/docker.md:

  • L135: "Running the dashboard as a separate container is not supported: its gateway-liveness detection requires a shared PID namespace with the gateway process."
  • L174: "Never run two Hermes gateway containers against the same data directory simultaneously — session files and memory stores are not designed for concurrent write access."
  • L257: "never point two containers at the same ~/.hermes directory simultaneously … cross-container sharing of a data volume corrupts session files and memory stores."

Commit 5671059 ("feat(docker): launch dashboard as side-process via HERMES_DASHBOARD=1") deprecated the side-car dashboard container and collapsed the two-service Compose example into a single service. Issue #34457's reporter hit the lock loop precisely because they ran the now-deprecated gateway+dashboard-sidecar-on-shared-volume stack.

So I no longer think the structural parts of this PR should land as-is. Making the shared-volume topology log "more safely" works against the project's documented position that the topology corrupts data — it risks signaling that shared volumes are supported when the maintainers have deliberately moved away from them. The clean fix for #34457 is the single-container HERMES_DASHBOARD=1 path that already exists on main, plus (optionally) a doc/Compose-example update so users aren't pointed at the deprecated layout (the example docker-compose.yml still shows two services — that's arguably the real remaining bug, and it's tracked separately).

The one piece that is topology-independent is the sys.stdout.flush() in gateway.py (the startup banner reaching docker logs under s6). In full honesty: CI shows that flush alone does not fix test_supervised_gateway_stdout_reaches_docker_logs — in the latest build-amd64 run the banner reached neither docker logs nor the rotated current file, so there's a deeper buffering/hand-off issue in the supervised gateway run path that this PR does not fully solve. I don't want to claim a fix CI contradicts.

My recommendation: close this PR in favor of the supported single-container HERMES_DASHBOARD=1 path, and (if useful) let me open a small, separate, topology-independent PR for just the docker logs banner issue once I've root-caused why the flush isn't sufficient — plus possibly a docs/Compose-example PR to remove the deprecated two-container example so the next person doesn't fall into #34457.

I've left the branch in a clean, rebased, scoped state so it's easy to review if you'd rather salvage any part — but I'll defer to you on disposition and won't close it myself. Does closing in favor of HERMES_DASHBOARD=1 (and a follow-up docs/Compose cleanup) match how you'd like #34457 / #34480 resolved?

@Bartok9

Bartok9 commented Jun 6, 2026

Copy link
Copy Markdown
Contributor Author

After rebasing this clean onto current main and digging into the CI failure, I'm recommending closure of this PR — and want to lay out why transparently for the maintainers.

The build-amd64 failure (test_supervised_gateway_stdout_reaches_docker_logs) is not fixable at this layer. The startup banner reaches neither docker logs nor the log file even with the flush fix, which points to a deeper s6 hand-off issue rather than the per-container log lock this PR targets.

More fundamentally: the structural parts of this PR (#34457 per-container log nesting + #34480 logger-down marker) are designed around the gateway + dashboard two-container shared-volume topology — which commit 5671059 deprecated in favor of the single-container HERMES_DASHBOARD=1 flag (see docs L135/174/257 and #36235). The setup this PR hardens is on its way out, so investing further here is likely wasted effort.

The topology-independent flush fix is preserved in the diff if any of it is useful elsewhere, but on its own it's CI-proven insufficient.

Recommendation: close this in favor of the single-container path + a separate docs/compose cleanup to remove the deprecated two-container example. Happy to open that follow-up if it'd help. Deferring the actual close to you all rather than doing it unilaterally. 🙏

Bartok9 added a commit to Bartok9/hermes-agent that referenced this pull request Jun 8, 2026
…usResearch#34473)

The supervised gateway's startup banner was not reaching docker logs
because the s6-log run script runs mkdir -p + recursive chown on the
hot path every time s6-supervise (re)starts the logger. On a cold
boot the gateway writes its rich-console banner to stdout before the
logger pipeline finishes that mkdir/chown and takes over the fd, so
the banner is lost (test_supervised_gateway_stdout_reaches_docker_logs).

cont-init.d runs as root before s6-svscan starts the loggers, so
_register_service now pre-creates and chowns the per-container nested
log dir up front. With the dir already present and hermes-owned, the
script's mkdir -p is a no-op and s6-log forwards stdout immediately.

The new _resolve_container_log_dir mirrors _render_log_run's runtime
resolution exactly (HERMES_CONTAINER_ID -> HOSTNAME -> hostname ->
/etc/hostname, gated by HERMES_GATEWAY_LOG_PER_CONTAINER) so the
pre-created dir is the same path s6-log locks. Pre-creation is
best-effort; the script's own mkdir/chown remain as fallback.

Adds 5 tests covering resolution (nested, hostname fallback, flat
opt-out) and pre-creation (direct + end-to-end via reconcile).
Bartok9 added a commit to Bartok9/hermes-agent that referenced this pull request Jun 8, 2026
…NousResearch#34473)

Real root cause of test_supervised_gateway_stdout_reaches_docker_logs:
under s6 supervision the gateway child's stdout is a pipe to s6-log,
not a TTY, so Python BLOCK-buffers it. The '⚕ Hermes Gateway
Starting...' banner is printed but sits in Python's stdout buffer and
is never handed to s6-log's '1' stdout-tee within the test window (or
in practice until unrelated later output happens to fill the 4-8 KiB
buffer). Result: the banner is absent from both 'docker logs' and the
rotated 'current' file.

The CI docker logs confirmed this: only stderr (skill sync, s6-rc
messages, the supervisor 'gateway is now running' line, logging
WARNINGs) reached docker logs; the banner and the s6-log 'current'
file were empty. Python logging defaults to stderr (unbuffered enough
to appear), which is why errors showed but the stdout banner didn't.

Fix: sys.stdout.flush() immediately after printing the banner so it is
pushed to s6-log at startup and tee'd to /init's stdout = docker logs.

(The companion container_boot pre-creation of the per-container log
dir, also in this PR, removes the mkdir/chown from the s6-log hot path
as defense-in-depth, but the flush is the load-bearing fix.)
@Bartok9 Bartok9 force-pushed the fix/s6-log-shared-volume-lock-34457 branch from 3922539 to cb28226 Compare June 8, 2026 20:50
Bartok9 pushed a commit to Bartok9/hermes-agent that referenced this pull request Jun 8, 2026
…earch#34473)

Issue NousResearch#34480 writes a log/down marker whenever a profile gateway is
registered in the down state. s6 supervises the producer and its logger
independently, so s6-svc -u on the parent does not cascade to log/.

The docker run gateway run supervised redirect registers
gateway-default down at reconcile time (fresh volume, no autostart state),
then calls hermes gateway start. With log/down still present the s6-log
pipeline never starts, so the gateway stdout banner never reaches docker
logs or the rotated current file (test_supervised_gateway_stdout_reaches_docker_logs).

Fix: S6ServiceManager.start() and stop() now also dispatch s6-svc to the
log/ sub-service when present. restart is unchanged (logger stays up).
@Bartok9 Bartok9 force-pushed the fix/s6-log-shared-volume-lock-34457 branch from cb28226 to 71522a9 Compare June 8, 2026 21:02
Bartok9 added 2 commits June 10, 2026 23:22
…arch#34457)

Closes NousResearch#34457.
Closes NousResearch#34480.

s6-log takes an exclusive lock on its log directory
(logs/gateways/<profile>/lock). Two complementary problems cause an
endless sub-second s6-log crash loop when a gateway container and a
dashboard/sidecar container bind-mount the same data volume:

1. (NousResearch#34457) Both containers run a gateway-default s6-log against the
   SAME logs/gateways/default path. The first claims the lock; the
   second's s6-log dies with "unable to take lock" and s6-supervise
   restarts it instantly. Fix: by default nest the rendered log/run
   directory under a per-container token (HERMES_CONTAINER_ID, else
   $HOSTNAME / hostname / /etc/hostname) so each container locks a
   distinct path while still persisting under the shared volume:
   logs/gateways/<profile>/<container-id>/current. Resolved portably
   because log/run executes under with-contenv sh (dash/execline),
   where $HOSTNAME is not guaranteed. Opt out for single-container
   deployments with HERMES_GATEWAY_LOG_PER_CONTAINER=0.

2. (NousResearch#34480) container_boot._register_service writes a down marker on a
   profile's service slot when this container does not own the profile,
   but s6 supervises the log/ pipeline child independently of its
   producer, so the marker does not cascade. Every non-owning container
   therefore still starts gateway-<profile>/log's s6-log for ALL
   profiles. Per-container nesting (NousResearch#34457) stops them COLLIDING; this
   change also stops the un-owned loggers from running pointlessly by
   writing a down marker into the log/ sub-service whenever the parent
   gateway is registered down.

Tests:
- tests/hermes_cli/test_service_manager.py: per-container log nesting on
  by default + HERMES_GATEWAY_LOG_PER_CONTAINER=0 opt-out flat path.
- tests/hermes_cli/test_container_boot.py: down (un-owned) profile also
  marks log/ down; running (owned) profile leaves the logger free.
- 80 passed.
- sh -n + shellcheck -s sh on rendered log/run: clean (POSIX-safe).
- Docs: website/docs/user-guide/docker.md per-container layout, opt-out
  flag, and shared-volume section.
…usResearch#34473)

The supervised gateway's startup banner was not reaching docker logs
because the s6-log run script runs mkdir -p + recursive chown on the
hot path every time s6-supervise (re)starts the logger. On a cold
boot the gateway writes its rich-console banner to stdout before the
logger pipeline finishes that mkdir/chown and takes over the fd, so
the banner is lost (test_supervised_gateway_stdout_reaches_docker_logs).

cont-init.d runs as root before s6-svscan starts the loggers, so
_register_service now pre-creates and chowns the per-container nested
log dir up front. With the dir already present and hermes-owned, the
script's mkdir -p is a no-op and s6-log forwards stdout immediately.

The new _resolve_container_log_dir mirrors _render_log_run's runtime
resolution exactly (HERMES_CONTAINER_ID -> HOSTNAME -> hostname ->
/etc/hostname, gated by HERMES_GATEWAY_LOG_PER_CONTAINER) so the
pre-created dir is the same path s6-log locks. Pre-creation is
best-effort; the script's own mkdir/chown remain as fallback.

Adds 5 tests covering resolution (nested, hostname fallback, flat
opt-out) and pre-creation (direct + end-to-end via reconcile).
Bartok9 and others added 2 commits June 10, 2026 23:22
…NousResearch#34473)

Real root cause of test_supervised_gateway_stdout_reaches_docker_logs:
under s6 supervision the gateway child's stdout is a pipe to s6-log,
not a TTY, so Python BLOCK-buffers it. The '⚕ Hermes Gateway
Starting...' banner is printed but sits in Python's stdout buffer and
is never handed to s6-log's '1' stdout-tee within the test window (or
in practice until unrelated later output happens to fill the 4-8 KiB
buffer). Result: the banner is absent from both 'docker logs' and the
rotated 'current' file.

The CI docker logs confirmed this: only stderr (skill sync, s6-rc
messages, the supervisor 'gateway is now running' line, logging
WARNINGs) reached docker logs; the banner and the s6-log 'current'
file were empty. Python logging defaults to stderr (unbuffered enough
to appear), which is why errors showed but the stdout banner didn't.

Fix: sys.stdout.flush() immediately after printing the banner so it is
pushed to s6-log at startup and tee'd to /init's stdout = docker logs.

(The companion container_boot pre-creation of the per-container log
dir, also in this PR, removes the mkdir/chown from the s6-log hot path
as defense-in-depth, but the flush is the load-bearing fix.)
…earch#34473)

Issue NousResearch#34480 writes a log/down marker whenever a profile gateway is
registered in the down state. s6 supervises the producer and its logger
independently, so s6-svc -u on the parent does not cascade to log/.

The docker run gateway run supervised redirect registers
gateway-default down at reconcile time (fresh volume, no autostart state),
then calls hermes gateway start. With log/down still present the s6-log
pipeline never starts, so the gateway stdout banner never reaches docker
logs or the rotated current file (test_supervised_gateway_stdout_reaches_docker_logs).

Fix: S6ServiceManager.start() and stop() now also dispatch s6-svc to the
log/ sub-service when present. restart is unchanged (logger stays up).
@Bartok9 Bartok9 force-pushed the fix/s6-log-shared-volume-lock-34457 branch from 71522a9 to 45d0a43 Compare June 11, 2026 03:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/docker Docker image, Compose, packaging comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

4 participants