fix(gateway): fall back to HERMES_HOME when ~/.local/state is unwritable #16559

Open
0xsir0000 wants to merge 1 commit into NousResearch:main from 0xsir0000:fix/gateway-lock-dir-podman-fallback

What does this PR do?

In containers where the runtime user is non-root and the image `WORKDIR` (typically `/opt/hermes` under the Dockerfile) is owned by root, `Path.home()` resolves to a directory the process cannot write to. The gateway lock directory then fails to materialize:

```
PermissionError: [Errno 13] Permission denied: '/opt/hermes/.local'
File "/opt/hermes/gateway/status.py", line 471, in acquire_scoped_lock
lock_path.parent.mkdir(parents=True, exist_ok=True)
```

The reproduction in the issue was a podman quadlet with `UserNS=keep-id` + `User=%U:%G`, where podman runs the container under the host UID without adjusting the container-side `/etc/passwd` entry, so `$HOME` ends up at the `WORKDIR` rather than the `hermes` user's home.
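To make the failure mode concrete, here is a minimal sketch of a check for whether the resolved home directory is usable. The helper name `home_is_writable` is hypothetical and not part of this PR; it relies only on the fact that, on POSIX, `Path.home()` follows `$HOME` when that variable is set.

```python
import os
from pathlib import Path


def home_is_writable() -> bool:
    """Return True if the process may create entries directly under Path.home().

    Hypothetical helper for illustration only; it is not part of gateway/status.py.
    """
    home = Path.home()  # on POSIX this follows $HOME when it is set
    # W_OK alone is not enough for a directory: creating entries also needs X_OK.
    return home.is_dir() and os.access(home, os.W_OK | os.X_OK)
```

In the reported setup `Path.home()` would be the root-owned `WORKDIR`, so a check like this would be expected to return `False` for the non-root runtime user.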

This PR adds a fallback step to `_get_lock_dir()`:

| # | Source | When |
|---|--------|------|
| 1 | `HERMES_GATEWAY_LOCK_DIR` | explicit override (existing) |
| 2 | `XDG_STATE_HOME/hermes/gateway-locks` | XDG spec env (existing) |
| 3 | `~/.local/state/hermes/gateway-locks` | XDG default (probed) |
| 4 | `HERMES_HOME/gateway-locks` | new fallback when (3) is unwritable |

`HERMES_HOME` is the documented runtime data root and is always intended to be writable (that's where the container volume mounts). It's the natural backstop for state files when `$HOME` is locked down.

The probe result is memoized in `_resolve_default_lock_dir` so repeated `acquire_scoped_lock` calls don't re-issue `mkdir` syscalls.

Related Issue

Fixes #16550

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • `gateway/status.py`: split `_get_lock_dir()`. The env-driven branches stay uncached for testability; the default-path probe lives in `_resolve_default_lock_dir` with `functools.lru_cache(maxsize=1)`. Probe failures fall back to `HERMES_HOME / "gateway-locks"` and emit a warning naming both `HERMES_GATEWAY_LOCK_DIR` and `XDG_STATE_HOME` as available overrides.
  • `tests/gateway/test_status.py`: 5 new tests under `TestGetLockDir` covering explicit override priority, `XDG_STATE_HOME` being honored, the default-path probe under a writable HOME, fallback to HERMES_HOME under a read-only HOME (the container scenario), and probe memoization.
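As an illustration of what a memoization test can look like, here is a self-contained sketch. It is not the test from this PR: `probe_lock_dir` is a hypothetical stand-in for `_resolve_default_lock_dir`, and `Path.mkdir` is patched so nothing is created on disk. Clearing the cache before and after keeps the cached value from leaking between tests.

```python
import functools
from pathlib import Path
from unittest import mock


@functools.lru_cache(maxsize=1)
def probe_lock_dir() -> Path:
    # Hypothetical stand-in for _resolve_default_lock_dir (happy path only).
    target = Path.home() / ".local" / "state" / "hermes" / "gateway-locks"
    target.mkdir(parents=True, exist_ok=True)
    return target


def test_probe_is_memoized() -> None:
    probe_lock_dir.cache_clear()  # isolate from results cached by earlier tests
    try:
        with mock.patch.object(Path, "mkdir") as fake_mkdir:
            first = probe_lock_dir()
            second = probe_lock_dir()
        assert first == second
        # The second call is served from the cache, so mkdir runs only once.
        assert fake_mkdir.call_count == 1
    finally:
        probe_lock_dir.cache_clear()  # do not leak the cached path to later tests
```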

How to Test

Reproduce the original failure (no fix):

```bash
docker run --rm -u 1000:1000 -e HERMES_HOME=/opt/data -v /tmp/data:/opt/data \
  ghcr.io/nousresearch/hermes-agent gateway run

# → PermissionError: '/opt/hermes/.local'
```

After this PR, the same command falls back to `/opt/data/gateway-locks` and logs a warning.

Automated:
```bash
pytest tests/gateway/test_status.py

# 38 passed (5 new in TestGetLockDir)
```

Wider `tests/gateway/` suite: 1006 passed (4 pre-existing flaky failures in Discord / approve-deny tests unrelated to this change — verified by running them on `upstream/main` without this PR).

Notes

  • Existing deployments with `~/.local/state` writable see no behavior change — the probe succeeds and returns the same path as before.
  • Operators who want the lock directory in a specific location can still set `HERMES_GATEWAY_LOCK_DIR` or `XDG_STATE_HOME` explicitly; both short-circuit ahead of the probe.
  • A complementary one-line change in `docker/entrypoint.sh` (`export HERMES_GATEWAY_LOCK_DIR="$HERMES_HOME/gateway-locks"`) would activate the explicit-override path in the official image. It is left out of this PR to keep the change focused on the runtime fallback; happy to follow up.

…ble (NousResearch#16550)

In containers where the runtime user is non-root and the image WORKDIR
(typically `/opt/hermes` under our Dockerfile) is owned by root,
`Path.home()` resolves to a directory the process cannot write to. The
gateway lock directory then fails to materialize:

    PermissionError: [Errno 13] Permission denied: '/opt/hermes/.local'

The reported reproduction was a podman quadlet with `UserNS=keep-id` +
`User=%U:%G`, where podman runs the container under the host UID without
adjusting the container-side passwd entry, so `$HOME` ends up at the
WORKDIR.

`_get_lock_dir()` now probes `~/.local/state/hermes/gateway-locks` once
and falls back to `HERMES_HOME/gateway-locks` (which is always intended
to be writable — that's where the container volume is mounted) when the
probe fails. Explicit overrides via `HERMES_GATEWAY_LOCK_DIR` and
`XDG_STATE_HOME` continue to short-circuit ahead of the probe.

Probe result is memoized in `_resolve_default_lock_dir` so repeated lock
acquires don't re-issue mkdir.

Fixes NousResearch#16550

Development

Successfully merging this pull request may close these issues.

[Bug]: Podman-deployed Hermes gives error
