fix(docker): per-container s6-log lock + logger down marker (#34457, #34480)#34473
fix(docker): per-container s6-log lock + logger down marker (#34457, #34480)#34473Bartok9 wants to merge 4 commits into
Conversation
de2c1bc to
c2b0fa6
Compare
b34533c to
a1a2a30
Compare
|
Rebased + CI triage. Rebased onto current CI triage of the prior Root-cause notes for reviewers:
I've re-triggered CI on the clean commit to confirm whether the failure reproduces post-rebase before proposing a fix (candidate: pre-create + chown the nested |
|
Pushed Root cause: the Fix: Pre-creation is best-effort; the script's own |
|
Found the real root cause by reading the CI docker logs ( The CI Root cause: under s6 supervision the gateway child's stdout is a pipe to s6-log, not a TTY, so Python block-buffers it. The banner is printed but sits in Python's stdout buffer and is never handed to s6-log's Fix: I also kept the earlier |
|
The underlying problem is weird, as this two-container setup is default in the docker-compose provided by the project: |
|
The two-container setup in the default docker-compose is a bug (see #36235). 5671059 deprecated the two-container setup in favor of the |
4916628 to
3922539
Compare
Rebased clean + the deprecation question, head-onRebase: Rebased onto current The bigger question @glensc / @hulet raised — and I think you're right. Thank you both for the pointer; it changes my read of this PR. The two structural changes here (per-container log-dir nesting for #34457, and the
Commit So I no longer think the structural parts of this PR should land as-is. Making the shared-volume topology log "more safely" works against the project's documented position that the topology corrupts data — it risks signaling that shared volumes are supported when the maintainers have deliberately moved away from them. The clean fix for #34457 is the single-container The one piece that is topology-independent is the My recommendation: close this PR in favor of the supported single-container I've left the branch in a clean, rebased, scoped state so it's easy to review if you'd rather salvage any part — but I'll defer to you on disposition and won't close it myself. Does closing in favor of |
|
After rebasing this clean onto current main and digging into the CI failure, I'm recommending closure of this PR — and want to lay out why transparently for the maintainers. The build-amd64 failure ( More fundamentally: the structural parts of this PR (#34457 per-container log nesting + #34480 logger-down marker) are designed around the gateway + dashboard two-container shared-volume topology — which commit The topology-independent flush fix is preserved in the diff if any of it is useful elsewhere, but on its own it's CI-proven insufficient. Recommendation: close this in favor of the single-container path + a separate docs/compose cleanup to remove the deprecated two-container example. Happy to open that follow-up if it'd help. Deferring the actual close to you all rather than doing it unilaterally. 🙏 |
…usResearch#34473) The supervised gateway's startup banner was not reaching docker logs because the s6-log run script runs mkdir -p + recursive chown on the hot path every time s6-supervise (re)starts the logger. On a cold boot the gateway writes its rich-console banner to stdout before the logger pipeline finishes that mkdir/chown and takes over the fd, so the banner is lost (test_supervised_gateway_stdout_reaches_docker_logs). cont-init.d runs as root before s6-svscan starts the loggers, so _register_service now pre-creates and chowns the per-container nested log dir up front. With the dir already present and hermes-owned, the script's mkdir -p is a no-op and s6-log forwards stdout immediately. The new _resolve_container_log_dir mirrors _render_log_run's runtime resolution exactly (HERMES_CONTAINER_ID -> HOSTNAME -> hostname -> /etc/hostname, gated by HERMES_GATEWAY_LOG_PER_CONTAINER) so the pre-created dir is the same path s6-log locks. Pre-creation is best-effort; the script's own mkdir/chown remain as fallback. Adds 5 tests covering resolution (nested, hostname fallback, flat opt-out) and pre-creation (direct + end-to-end via reconcile).
…NousResearch#34473) Real root cause of test_supervised_gateway_stdout_reaches_docker_logs: under s6 supervision the gateway child's stdout is a pipe to s6-log, not a TTY, so Python BLOCK-buffers it. The '⚕ Hermes Gateway Starting...' banner is printed but sits in Python's stdout buffer and is never handed to s6-log's '1' stdout-tee within the test window (or in practice until unrelated later output happens to fill the 4-8 KiB buffer). Result: the banner is absent from both 'docker logs' and the rotated 'current' file. The CI docker logs confirmed this: only stderr (skill sync, s6-rc messages, the supervisor 'gateway is now running' line, logging WARNINGs) reached docker logs; the banner and the s6-log 'current' file were empty. Python logging defaults to stderr (unbuffered enough to appear), which is why errors showed but the stdout banner didn't. Fix: sys.stdout.flush() immediately after printing the banner so it is pushed to s6-log at startup and tee'd to /init's stdout = docker logs. (The companion container_boot pre-creation of the per-container log dir, also in this PR, removes the mkdir/chown from the s6-log hot path as defense-in-depth, but the flush is the load-bearing fix.)
3922539 to
cb28226
Compare
…earch#34473) Issue NousResearch#34480 writes a log/down marker whenever a profile gateway is registered in the down state. s6 supervises the producer and its logger independently, so s6-svc -u on the parent does not cascade to log/. The docker run gateway run supervised redirect registers gateway-default down at reconcile time (fresh volume, no autostart state), then calls hermes gateway start. With log/down still present the s6-log pipeline never starts, so the gateway stdout banner never reaches docker logs or the rotated current file (test_supervised_gateway_stdout_reaches_docker_logs). Fix: S6ServiceManager.start() and stop() now also dispatch s6-svc to the log/ sub-service when present. restart is unchanged (logger stays up).
cb28226 to
71522a9
Compare
…arch#34457) Closes NousResearch#34457. Closes NousResearch#34480. s6-log takes an exclusive lock on its log directory (logs/gateways/<profile>/lock). Two complementary problems cause an endless sub-second s6-log crash loop when a gateway container and a dashboard/sidecar container bind-mount the same data volume: 1. (NousResearch#34457) Both containers run a gateway-default s6-log against the SAME logs/gateways/default path. The first claims the lock; the second's s6-log dies with "unable to take lock" and s6-supervise restarts it instantly. Fix: by default nest the rendered log/run directory under a per-container token (HERMES_CONTAINER_ID, else $HOSTNAME / hostname / /etc/hostname) so each container locks a distinct path while still persisting under the shared volume: logs/gateways/<profile>/<container-id>/current. Resolved portably because log/run executes under with-contenv sh (dash/execline), where $HOSTNAME is not guaranteed. Opt out for single-container deployments with HERMES_GATEWAY_LOG_PER_CONTAINER=0. 2. (NousResearch#34480) container_boot._register_service writes a down marker on a profile's service slot when this container does not own the profile, but s6 supervises the log/ pipeline child independently of its producer, so the marker does not cascade. Every non-owning container therefore still starts gateway-<profile>/log's s6-log for ALL profiles. Per-container nesting (NousResearch#34457) stops them COLLIDING; this change also stops the un-owned loggers from running pointlessly by writing a down marker into the log/ sub-service whenever the parent gateway is registered down. Tests: - tests/hermes_cli/test_service_manager.py: per-container log nesting on by default + HERMES_GATEWAY_LOG_PER_CONTAINER=0 opt-out flat path. - tests/hermes_cli/test_container_boot.py: down (un-owned) profile also marks log/ down; running (owned) profile leaves the logger free. - 80 passed. - sh -n + shellcheck -s sh on rendered log/run: clean (POSIX-safe). - Docs: website/docs/user-guide/docker.md per-container layout, opt-out flag, and shared-volume section.
…usResearch#34473) The supervised gateway's startup banner was not reaching docker logs because the s6-log run script runs mkdir -p + recursive chown on the hot path every time s6-supervise (re)starts the logger. On a cold boot the gateway writes its rich-console banner to stdout before the logger pipeline finishes that mkdir/chown and takes over the fd, so the banner is lost (test_supervised_gateway_stdout_reaches_docker_logs). cont-init.d runs as root before s6-svscan starts the loggers, so _register_service now pre-creates and chowns the per-container nested log dir up front. With the dir already present and hermes-owned, the script's mkdir -p is a no-op and s6-log forwards stdout immediately. The new _resolve_container_log_dir mirrors _render_log_run's runtime resolution exactly (HERMES_CONTAINER_ID -> HOSTNAME -> hostname -> /etc/hostname, gated by HERMES_GATEWAY_LOG_PER_CONTAINER) so the pre-created dir is the same path s6-log locks. Pre-creation is best-effort; the script's own mkdir/chown remain as fallback. Adds 5 tests covering resolution (nested, hostname fallback, flat opt-out) and pre-creation (direct + end-to-end via reconcile).
…NousResearch#34473) Real root cause of test_supervised_gateway_stdout_reaches_docker_logs: under s6 supervision the gateway child's stdout is a pipe to s6-log, not a TTY, so Python BLOCK-buffers it. The '⚕ Hermes Gateway Starting...' banner is printed but sits in Python's stdout buffer and is never handed to s6-log's '1' stdout-tee within the test window (or in practice until unrelated later output happens to fill the 4-8 KiB buffer). Result: the banner is absent from both 'docker logs' and the rotated 'current' file. The CI docker logs confirmed this: only stderr (skill sync, s6-rc messages, the supervisor 'gateway is now running' line, logging WARNINGs) reached docker logs; the banner and the s6-log 'current' file were empty. Python logging defaults to stderr (unbuffered enough to appear), which is why errors showed but the stdout banner didn't. Fix: sys.stdout.flush() immediately after printing the banner so it is pushed to s6-log at startup and tee'd to /init's stdout = docker logs. (The companion container_boot pre-creation of the per-container log dir, also in this PR, removes the mkdir/chown from the s6-log hot path as defense-in-depth, but the flush is the load-bearing fix.)
…earch#34473) Issue NousResearch#34480 writes a log/down marker whenever a profile gateway is registered in the down state. s6 supervises the producer and its logger independently, so s6-svc -u on the parent does not cascade to log/. The docker run gateway run supervised redirect registers gateway-default down at reconcile time (fresh volume, no autostart state), then calls hermes gateway start. With log/down still present the s6-log pipeline never starts, so the gateway stdout banner never reaches docker logs or the rotated current file (test_supervised_gateway_stdout_reaches_docker_logs). Fix: S6ServiceManager.start() and stop() now also dispatch s6-svc to the log/ sub-service when present. restart is unchanged (logger stays up).
71522a9 to
45d0a43
Compare
Summary
s6-lognow writes to a per-container subdirectory (logs/gateways/<profile>/<container-id>/current) so each container locks a distinct directory ([Bug]: s6-log lock collision loops endlessly in multi-container gateway + dashboard shared volume setups #34457).log/sub-service now also gets adownmarker so the un-owned logger never starts in the first place ([Bug]: s6-log lock collision in multi-container setup with shared /opt/data volume #34480).s6-logcrash loop in multi-container stacks that bind-mount the same data volume.Motivation
Closes #34457.
Closes #34480.
s6-logtakes an exclusive lock on its log directory (logs/gateways/<profile>/lock) to guarantee a single writer. In a multi-container Compose stack wheregateway, a custom-profile gateway, anddashboardbind-mount the same host data folder (the documented way to keep SQLite state, dynamic skills, and config in sync), two distinct problems combine into one crash loop:[Bug]: s6-log lock collision loops endlessly in multi-container gateway + dashboard shared volume setups #34457 — collision. Multiple containers run a
gateway-defaults6-logagainst the samelogs/gateways/defaultpath. The first claims the lock; the others die with "unable to take lock" ands6-superviserestarts them instantly.[Bug]: s6-log lock collision in multi-container setup with shared /opt/data volume #34480 — un-owned loggers run at all.
container_boot._register_servicewrites adownmarker on a profile's service slot when the container doesn't own that profile — but s6 supervises thelog/pipeline child independently of its producer, so the marker does not cascade. Every non-owning container therefore still startsgateway-<profile>/log'ss6-logfor all profiles. The dashboard container (which runs no gateway at all) is the clearest victim.The fix addresses both:
Per-container nesting ([Bug]: s6-log lock collision loops endlessly in multi-container gateway + dashboard shared volume setups #34457): nest each container's gateway log directory under a per-container token so the lock paths no longer collide, while still persisting on the shared volume. The token is resolved portably —
log/runexecutes underwith-contenv sh(dash/execline, not bash, where$HOSTNAMEis not guaranteed) — viaHERMES_CONTAINER_ID→$HOSTNAME→hostname→/etc/hostname. Docker populates the latter three with the unique container ID by default. On by default; single-container deployments can opt out withHERMES_GATEWAY_LOG_PER_CONTAINER=0.Logger down marker ([Bug]: s6-log lock collision in multi-container setup with shared /opt/data volume #34480, the issue's Option B): when a profile is registered in the
downstate, also write adownmarker into itslog/sub-service so the un-owned logger does not run at all. This is the source-level fix; the per-container nesting is the safety net for any logger that does start.This is effectively the issue's Option C (both), which #34480 explicitly recommends.
Verification
python3 -m pytest tests/hermes_cli/test_service_manager.py tests/hermes_cli/test_container_boot.py— 80 passed, including:HERMES_GATEWAY_LOG_PER_CONTAINER=0opt-out flat pathlog/down; running (owned) profile leaves the logger freesh -n+shellcheck -s shon the renderedlog/run— clean (POSIX-safe; no SC3028$HOSTNAMEbash-ism).currentfile viafind … -name currentso it is agnostic to the flat-vs-nested layout.website/docs/user-guide/docker.mddocuments the per-container layout, the opt-out flag, and a "Sharing a data volume across containers" section.Backwards compatibility
logs/gateways/<profile>/…); only a per-container segment is appended by default.HERMES_GATEWAY_LOG_PER_CONTAINER=0restores the exact legacy flat path for single-container users.downmarker is only added when the gateway slot itself is registered down (un-owned profile) — owned/started profiles are unaffected, so their logs are captured exactly as before.