fix(gateway): fall back to HERMES_HOME when ~/.local/state is unwritable #16559
Open
0xsir0000 wants to merge 1 commit into
…ble (NousResearch#16550)
In containers where the runtime user is non-root and the image WORKDIR (typically `/opt/hermes` under our Dockerfile) is owned by root, `Path.home()` resolves to a directory the process cannot write to. The gateway lock directory then fails to materialize:
    PermissionError: [Errno 13] Permission denied: '/opt/hermes/.local'
The reported reproduction was a podman quadlet with `UserNS=keep-id` + `User=%U:%G`, where podman runs the container under the host UID without adjusting the container-side passwd entry, so `$HOME` ends up at the WORKDIR.
`_get_lock_dir()` now probes `~/.local/state/hermes/gateway-locks` once and falls back to `HERMES_HOME/gateway-locks` when the probe fails. `HERMES_HOME` is always intended to be writable, since that's where the container volume is mounted. Explicit overrides via `HERMES_GATEWAY_LOCK_DIR` and `XDG_STATE_HOME` continue to short-circuit ahead of the probe.
Probe result is memoized in `_resolve_default_lock_dir` so repeated lock acquires don't re-issue mkdir.
Fixes NousResearch#16550
What does this PR do?
In containers where the runtime user is non-root and the image `WORKDIR` (typically `/opt/hermes` under the Dockerfile) is owned by root, `Path.home()` resolves to a directory the process cannot write to. The gateway lock directory then fails to materialize:
```
PermissionError: [Errno 13] Permission denied: '/opt/hermes/.local'
File "/opt/hermes/gateway/status.py", line 471, in acquire_scoped_lock
lock_path.parent.mkdir(parents=True, exist_ok=True)
```
The reproduction in the issue was a podman quadlet with `UserNS=keep-id` + `User=%U:%G`, where podman runs the container under the host UID without adjusting the container-side `/etc/passwd` entry, so `$HOME` ends up at the `WORKDIR` rather than the `hermes` user's home.
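To make the failure path concrete, here is a minimal sketch of how `Path.home()` follows `$HOME` on POSIX systems. The helper name `home_with_env` is hypothetical and only exists for this illustration. It assumes `$HOME` is set to the `WORKDIR` as described above.

```python
import os
from pathlib import Path
from unittest import mock


def home_with_env(home_value: str) -> Path:
    """Return what Path.home() resolves to when $HOME is `home_value` (POSIX)."""
    with mock.patch.dict(os.environ, {"HOME": home_value}):
        return Path.home()


# With $HOME pointing at the root-owned WORKDIR, the default lock directory
# ends up under a tree the runtime user cannot create.
lock_dir = home_with_env("/opt/hermes") / ".local" / "state" / "hermes" / "gateway-locks"
print(lock_dir)
```

On a POSIX system the resulting path sits under `/opt/hermes/.local`, which is the directory named in the `PermissionError` above.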
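This PR adds a fallback step to `_get_lock_dir()`: it probes `~/.local/state/hermes/gateway-locks` once and, if that directory cannot be created, falls back to `HERMES_HOME/gateway-locks`. Explicit overrides via `HERMES_GATEWAY_LOCK_DIR` and `XDG_STATE_HOME` still short-circuit ahead of the probe.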
`HERMES_HOME` is the documented runtime data root and is always intended to be writable (that's where the container volume mounts). It's the natural backstop for state files when `$HOME` is locked down.
The probe result is memoized in `_resolve_default_lock_dir` so repeated `acquire_scoped_lock` calls don't re-issue `mkdir` syscalls.
Related Issue
Fixes #16550
Type of Change
Changes Made
How to Test
Reproduce the original failure (no fix):
```bash
docker run --rm -u 1000:1000 -e HERMES_HOME=/opt/data -v /tmp/data:/opt/data \
  ghcr.io/nousresearch/hermes-agent gateway run
# → PermissionError: '/opt/hermes/.local'
```
After this PR, the same command falls back to `/opt/data/gateway-locks` and logs a warning.
Automated:
```bash
pytest tests/gateway/test_status.py
# 38 passed (5 new in TestGetLockDir)
```
Wider `tests/gateway/` suite: 1006 passed (4 pre-existing flaky failures in Discord / approve-deny tests unrelated to this change — verified by running them on `upstream/main` without this PR).
Notes