
fix(docker): respect HERMES_HOME env var in entrypoint #8115

Closed
malaiwah wants to merge 129 commits into NousResearch:main from malaiwah:fix/entrypoint-hermes-home-env

Conversation

@malaiwah

Contributor

Summary

The Docker entrypoint hardcodes HERMES_HOME="/opt/data", ignoring the environment variable set by the Dockerfile (ENV HERMES_HOME=/opt/data) or overridden at container runtime (-e HERMES_HOME=/custom/path).

This one-line fix changes it to HERMES_HOME="${HERMES_HOME:-/opt/data}" so the entrypoint respects the env var when set, falling back to /opt/data as before. Existing deployments are unaffected.

Use case: Running multiple hermes-agent instances on the same host with different data directories (e.g. a crash-test instance alongside production). Without this fix, the entrypoint creates template files and syncs skills into /opt/data regardless of where the actual data volume is mounted.

Changes

  • docker/entrypoint.sh line 12: HERMES_HOME="/opt/data" → HERMES_HOME="${HERMES_HOME:-/opt/data}"

Test plan

  • Existing deployment with no HERMES_HOME override: entrypoint uses /opt/data (unchanged behavior)
  • podman run -e HERMES_HOME=/opt/data-test ...: entrypoint bootstraps into /opt/data-test
  • Skills sync, .env template, config.yaml template all land in the correct directory
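
As a rough illustration of the fallback semantics these test cases exercise, the following Python sketch models the shell expansion ${HERMES_HOME:-/opt/data}. The helper name resolve_hermes_home is hypothetical and not part of the repository; the point is only that the ":-" form falls back when the variable is unset or empty.

```python
import os
from typing import Mapping, Optional

DEFAULT_HERMES_HOME = "/opt/data"


def resolve_hermes_home(env: Optional[Mapping[str, str]] = None) -> str:
    """Model the shell expansion ${HERMES_HOME:-/opt/data}.

    Hypothetical helper for illustration only. The ':-' form falls back
    to the default when the variable is unset OR set to an empty string.
    """
    if env is None:
        env = os.environ
    value = env.get("HERMES_HOME", "")
    return value if value else DEFAULT_HERMES_HOME
```

Passing a plain dict instead of os.environ mirrors the three cases above: no override, an explicit override, and (as a side effect of ":-") an empty override.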

Hermes Agent (angelos) and others added 30 commits April 8, 2026 03:05
- Add DEFAULT_ALLOWED_TOOLSETS including 'mcp' to enable MCP tools for subagents
- Make BLOCKED_TOOLSET_NAMES configurable (was hardcoded)
- Subagents now inherit MCP access from parent when available
- Fixes subagent limitation where only terminal+process were available
- Allows subagents to use SearXNG and Crawl4AI MCP servers
- Reorder imports to top of file (E402)
- Add noqa comment for registry import (circular import requirement)
- Re-add missing constants: MAX_DEPTH, MAX_CONCURRENT_CHILDREN, DEFAULT_MAX_ITERATIONS
- All ruff checks now passing
- Remove 'memory' from BLOCKED_TOOLSET_NAMES in delegate_tool.py
- Add 'memory' to DEFAULT_ALLOWED_TOOLSETS for subagent access
- Add DEFAULT_SUBAGENT_MEMORY_MODE = 'read_only' configuration
- Modify memory_tool() to accept is_subagent and subagent_memory_mode params
- Enforce memory write blocking in read_only mode for subagents
- Support three modes: 'read_only', 'full', 'none'
- Add comprehensive tests for subagent memory access
- Maintains backward compatibility with existing memory tests

Benefits:
✅ Subagents can now query Honcho observations dialectically
✅ Subagents can read MEMORY.md and USER.md for context
✅ Subagents blocked from writing (prevents memory pollution)
✅ Parent agent remains sole writer of memory
✅ Enables orchestrator/coordinator pattern with long-lived subagents
✅ Configurable per-subagent or global default
- docker.py: remove --pids-limit (unavailable without cgroup delegation),
  add _cgroup_limits_available() probe for --cpus/--memory
- delegate_tool.py: add "browser" to DEFAULT_ALLOWED_TOOLSETS
- Dockerfile: build from source, add podman-remote shim, wait-for-honcho
- docker/wait-for-honcho.sh: poll Honcho API before starting gateway
…tation)

- delegate_tool.py: Set skip_memory=False and pass subagent_memory_mode to child agents
- delegate_tool.py: Add _is_subagent=True flag for memory tool access control
- run_agent.py: Pass is_subagent and subagent_memory_mode to memory_tool calls
- memory_tool.py: Enforce read-only mode for subagents (blocks add/replace/remove)
- memory_tool.py: Support three modes: 'read_only' (default), 'full', 'none'

Benefits:
✅ Subagents can read MEMORY.md and USER.md for context
✅ Subagents can query Honcho observations dialectically
✅ Subagents blocked from memory writes (prevents pollution)
✅ Parent agent remains sole writer of memory
✅ Configurable per-subagent via delegation config
✅ Enables orchestrator/coordinator pattern with long-lived subagents

Tests:
✅ test_memory_subagent_readonly.py (4/4 passed)
✅ test_mcp_subagent_access.py (4/4 passed)
✅ Existing memory tests (33/33 passed)
✅ Existing delegate tests (5/5 passed)
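
The three subagent memory modes described above could be enforced along these lines. This is a minimal sketch, not the code in memory_tool.py: the function name memory_action_allowed, the "read" action, and the interpretation of 'none' as "no memory access at all" are assumptions.

```python
WRITE_ACTIONS = {"add", "replace", "remove"}
VALID_MODES = {"read_only", "full", "none"}


def memory_action_allowed(action: str,
                          is_subagent: bool = False,
                          subagent_memory_mode: str = "read_only") -> bool:
    """Return True if the memory action may proceed.

    Hypothetical sketch: the real memory_tool() signature is not shown
    in the commit message, only the parameter names and the modes.
    """
    if not is_subagent:
        return True  # the parent agent remains the sole writer
    if subagent_memory_mode not in VALID_MODES:
        raise ValueError(f"unknown subagent_memory_mode: {subagent_memory_mode!r}")
    if subagent_memory_mode == "none":
        return False  # assumption: 'none' disables memory access entirely
    if subagent_memory_mode == "full":
        return True
    # read_only: reads pass, writes (add/replace/remove) are blocked
    return action not in WRITE_ACTIONS
```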
- Added _cleanup_orphaned_containers() function to gateway/run.py
- Automatically removes exited/dead/created hermes-* containers on startup
- Prevents container accumulation from crashes, OOM kills, or manual stops
- Logs cleanup activity with INFO level for visibility
- Added comprehensive documentation in docs/CONTAINER_CLEANUP.md
- Includes manual cleanup commands and CLI design proposal

Benefits:
✅ No more manual container cleanup needed
✅ Recovers gracefully from crashes
✅ Reduces disk space usage from stale containers
✅ Improves system hygiene automatically
✅ Safe - only removes non-running containers

Manual cleanup (if needed before deploying):
  podman ps -a --filter 'name=^hermes-' --filter 'status=exited' -q | xargs -r podman rm -f
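
For illustration, the manual command above can be generalized to the three stale statuses the commit mentions. The sketch below only builds argument lists and does not run anything; the function names are hypothetical, and whether a given runtime accepts every status value (for example "dead") is an assumption to verify against that runtime.

```python
from typing import List, Sequence

STALE_STATUSES = ("exited", "dead", "created")


def list_stale_container_cmds(binary: str = "podman",
                              prefix: str = "hermes-") -> List[List[str]]:
    """Build one 'ps' command per stale status (nothing is executed here)."""
    return [
        [binary, "ps", "-a",
         "--filter", f"name=^{prefix}",
         "--filter", f"status={status}",
         "-q"]
        for status in STALE_STATUSES
    ]


def build_remove_cmd(binary: str, container_ids: Sequence[str]) -> List[str]:
    """Build the 'rm -f' command, or an empty list when there is nothing to remove."""
    if not container_ids:
        return []
    return [binary, "rm", "-f", *container_ids]
```

Issuing one query per status sidesteps any question of how a runtime combines repeated status filters; the IDs from each query would then be passed to build_remove_cmd.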
Remove GitHub Actions workflows (deploy-site, docker-publish, tests, nix,
supply-chain-audit, docs-site-checks) — these are for upstream's GitHub CI.

Add .gitea/workflows/build-push.yml: builds the container image and pushes
to Gitea's container registry on every push to main.
- run_agent.py: Add _shared_memory_store parameter to AIAgent.__init__()
- run_agent.py: Use shared memory store when provided (subagents)
- run_agent.py: Add _is_subagent flag for access control
- delegate_tool.py: Pass parent's _memory_store to subagents
- Subagents now share parent's memory store (read-only enforced)
- Memory writes still blocked for subagents via memory_tool.py enforcement

Benefits:
✅ Subagents can READ from MEMORY.md and USER.md
✅ Subagents can query Honcho observations
✅ Subagents blocked from memory writes (add/replace/remove)
✅ Parent remains sole writer of memory
✅ Shared store prevents duplicate memory initialization
✅ Enables orchestrator pattern with long-lived subagents

Technical:
- Subagents share parent's MemoryStore instance
- Read-only enforcement in memory_tool.py still active
- No duplicate memory loading for subagents
- Memory stays consistent across parent + subagents
delegate_tool.py passes subagent_memory_mode and _is_subagent to
AIAgent() but AIAgent.__init__ did not declare subagent_memory_mode,
causing an unexpected keyword argument crash on every delegate_task call.

Added the parameter to the signature and stored both self._is_subagent
and self.subagent_memory_mode as instance attributes (previously only
accessed via getattr with defaults, never stored).
…ra_body config

- terminal_tool.py: add docker_forward_env to container_config dict passed to
  _create_environment — it was read from config but never propagated, so
  docker exec calls were built with no -e flags and credentials were never
  forwarded into sandbox containers

- run_agent.py: add subagent_memory_mode param to AIAgent.__init__ — delegate_tool
  was passing it but AIAgent didn't accept it, crashing every delegate_task call;
  also store self._is_subagent and self._config_extra_body; add model.extra_body
  config support so extra_body fields (e.g. enable_thinking: false) are merged
  into every API call

- tools/environments/docker.py: log exact docker exec command at WARNING level
  (secrets masked) for debugging env forwarding

- Dockerfile: add logging to podman-remote shim — appends full command to
  /opt/data/logs/shim.log for introspection

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…g to debug

- gateway/run.py: move _cleanup_orphaned_containers() before main() so it
  is defined before it is called; call it before asyncio.run() so cleanup
  happens at startup, not after the gateway exits
- gateway/run.py: replace hardcoded "podman" with find_docker() so the
  function respects the configured docker/shim executable
- docker.py: downgrade exec command log from WARNING to DEBUG (too noisy
  for normal operation; shim.log already captures all podman-remote calls)
- tools/: delete leftover delegate_tool.py.patch artifact

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Allows config.yaml to specify which Docker/Podman network sandbox
containers join via `terminal.docker_network`. When set, passes
`--network <name>` to docker run so containers can resolve hostnames
of other services on that network (e.g. hermes-litellm on hermes-net).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When pricing/cost estimation calls fetch_endpoint_model_metadata without
an api_key (e.g. from insights._get_pricing), the function made
unauthenticated requests to the /models endpoint causing repeated 401
errors every 5 minutes (cache TTL). Now falls back to LITELLM_KEY env
var so requests to the proxy are authenticated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docker_network was added to DockerEnvironment (docker.py) and terminal_tool.py
but the config.yaml → TERMINAL_DOCKER_NETWORK env var bridge was missing in
both code paths:
- cli.py: used by `hermes chat`, `hermes model`, etc.
- gateway/run.py: used by `hermes gateway run`

Also add TERMINAL_DOCKER_NETWORK reading to _get_env_config() in
terminal_tool.py, and add docker_network to the container_config dict
that's built before calling _create_environment().

Without this, `docker_network: "hermes-net"` in config.yaml had no effect —
sandbox containers were always created on the default podman network and could
not resolve hermes-net hostnames like hermes-litellm.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ge config and invalid workdir bypass

- PR NousResearch#4350: Load config.yaml terminal block as fallback before hardcoded defaults
  - Fixes docker_image in config.yaml not being loaded
  - Adds cfg.get() fallbacks for all terminal config options

- PR NousResearch#4673: Don't clobber already-resolved absolute TERMINAL_CWD
  - Fixes invalid workdir bypassing terminal.cwd config
  - Skips config override when env var already has absolute path
- Install gnupg in system dependencies
- Enables GPG-signed emails from angelos-hermes@mailbox.org
- Supports git commit signing with GPG keys
- Rename Dockerfile to Containerfile (Podman convention)

Signed-off-by: Angelos <angelos-hermes@mailbox.org>
Previous run NousResearch#46 failed due to transient network issue.
This is a no-op commit to re-trigger the build pipeline.

Related: feat: Add gnupg package for GPG email signing
Previous commit had double-escaped backslashes (\\) which broke
the Dockerfile syntax. This fixes the RUN instructions to use
proper single backslashes for line continuation.

Fixes build failure in run NousResearch#47.
- GPG support moved to hermes-sandbox-image (where it belongs)
- Sandbox image is the correct location for agent tooling
- Reduces production container attack surface
- Follows separation of concerns:
  * hermes-agent: Production service container
  * hermes-sandbox: Ephemeral agent execution environment

Related: angelos/hermes-sandbox-image@4294f7f
angelos and others added 18 commits April 10, 2026 03:08
Two merge fallout fixes:
1. delegate_tool.py uses tool_error() 6 times but never imported it
   from tools.registry — every error path crashed with NameError.
2. test_batch_capped_at_3 expected the old silent-truncation behavior;
   updated to test the new error-on-excess behavior.
…d lifetime

Two changes to how sandbox containers are spawned:

1. Add --init to docker run. This uses tini as PID 1, which
   automatically reaps zombie child processes. Previously PID 1 was
   sleep(1) which doesn't call wait() — every background process that
   exited became a zombie, and the process tool reported them as
   "running" because zombie PIDs still exist in the process table.
   Fixes NousResearch#6908 (upstream).

2. Replace 'sleep 2h' with 'sleep infinity'. The fixed 2-hour lifetime
   was arbitrary and sometimes too short for long agent sessions. The
   idle reaper (terminal.lifetime_seconds, default 300s, configurable
   via config.yaml) already handles cleanup based on last activity —
   there's no reason for the container itself to have a fixed death
   timer. With sleep infinity, the container lives until the idle
   reaper kills it or the task ends.

Both changes are one line each. No config changes needed — the
existing terminal.lifetime_seconds config controls idle timeout.
The upstream merge (PR NousResearch#14) auto-resolved gateway/run.py's 1598-line
diff and silently dropped the entire self-nudge system (~55 lines in
gateway/run.py, ~37 in run_agent.py, ~5 in model_tools.py). This
broke notify_on_complete for background processes — the mechanism that
fires a hidden turn when a background process exits was gone.

Cherry-picked b8737bc ("feat(gateway): add one-shot self nudge tool")
on top of the merged state, resolving 3 conflicts in gateway/run.py
(media_message_callback + self_nudge_callback coexistence) and 1 in
tests/test_model_tools.py (kept both clarify + self_nudge tests).

The self-nudge system provides:
- _arm_self_nudge / _cancel_self_nudge / _fire_self_nudge in gateway
- self_nudge_callback on AIAgent for tool-initiated delayed turns
- self_nudge tool exposed on gateway platforms (not CLI)
- _pending_hidden_turns for injecting hidden messages on next turn
- Wired to notify_on_complete in terminal_tool.py via
  process_registry.pending_watchers
gateway/run.py uses uuid.uuid4() at lines 1139 (restart-resume) and
1263 (self-nudge) but uuid was never imported at module level — upstream
uses local 'import uuid as _uuid' inside functions, and our fork's
additions used bare uuid without adding the import. Self-nudge fired
but crashed with NameError: name 'uuid' is not defined.
The self-nudge path called _handle_message_with_agent directly without
sending a typing indicator first. The user saw the agent silently
process and respond with no 'typing...' feedback in Telegram.
_run_self_nudge_entry called _handle_message_with_agent but discarded
its return value. When streaming doesn't deliver (typical for self-nudge
turns since there's no user message to stream-edit), the agent's
response vanished — the model did the work but the user never saw the
result. Now captures the return and explicitly sends via adapter.send
if streaming didn't already deliver it.
Adds logger.info when a turn is routed to the cheap model, showing the
route label and first 80 chars of the user message. Added to both
gateway and CLI paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds an optional 'model' parameter to delegate_task so the agent can
route subagents to a smaller/faster model for simple tasks (e.g.
summarization, formatting, lookups) while keeping the primary model
for complex reasoning.

Works at both levels:
- Top-level 'model' param for single-task delegation
- Per-task 'model' field in batch tasks array

The per-call model overrides delegation.model from config, which in
turn overrides inheriting the parent's model. Per-task model takes
precedence over top-level model in batch mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds tiered model selection for subagent delegation:
- Agent can pass model='small'/'medium'/'large' or a direct model name
- Tiers configured via delegation.model_tiers in config.yaml
- New list_models tool returns available models with tier assignments

Use cases:
- Delegate file exploration to a small/fast model
- Escalate to a large model when stuck on complex reasoning
- Spin up a peer review subagent on a stronger model
- Mixed batch: simple tasks on small, complex on default

Precedence: per-task model > top-level model param > delegation.model
config > inherit parent model. Tier names resolve to configured model
names; unknown tiers fall back to default.
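
The precedence chain above might be sketched as follows. The function name and parameters are hypothetical, and "default" for an unconfigured tier is assumed here to mean delegation.model when it is a concrete model name, otherwise the parent's model.

```python
from typing import Mapping, Optional

TIER_NAMES = ("small", "medium", "large")


def resolve_subagent_model(task_model: Optional[str],
                           top_level_model: Optional[str],
                           config_model: Optional[str],
                           parent_model: str,
                           model_tiers: Mapping[str, str]) -> str:
    """Pick a model: per-task > top-level param > delegation.model > parent."""
    requested = task_model or top_level_model or config_model or parent_model
    if requested in TIER_NAMES:
        # Assumed fallback for a tier with no configured model.
        if config_model and config_model not in TIER_NAMES:
            fallback = config_model
        else:
            fallback = parent_model
        return model_tiers.get(requested) or fallback
    return requested  # a direct model name is used as given
```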

Not yet upstreamed — local fork only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
list_models was only handled in the sequential dispatch path.
The concurrent path (used when multiple tools are called in one turn)
fell through to the default handler, causing the tool call to fail.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
list_models was only in the 'delegation' toolset but composite toolsets
(hermes-cli, hermes-telegram, etc.) list tools directly without including
the delegation toolset. Added list_models alongside delegate_task in all
composite toolsets.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…oint

is_local_endpoint only matched IPs and localhost, missing Docker/Podman
DNS names like hermes-litellm. This caused stale stream timeouts (180s)
to fire on local LLM proxies instead of being auto-disabled.

Two fixes:
1. model.local_endpoints config: list of hostnames to treat as local
2. DNS resolution fallback: resolve hostname to IP, check if private

Not yet upstreamed — local fork only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hostnames without dots (e.g. hermes-litellm, ollama) are always on the
local network — Docker/Podman DNS, mDNS, or /etc/hosts. No need to
configure them explicitly.
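
Taken together, the local-endpoint checks from this commit and the previous one could look roughly like the sketch below. It is not the project's is_local_endpoint; the name is_local_endpoint_sketch, the resolve switch, and the order of the checks are assumptions, and the URL is assumed to carry a scheme.

```python
import ipaddress
import socket
from typing import Iterable
from urllib.parse import urlparse


def is_local_endpoint_sketch(base_url: str,
                             local_endpoints: Iterable[str] = (),
                             resolve: bool = True) -> bool:
    """Heuristically decide whether base_url points at a local service."""
    host = urlparse(base_url).hostname
    if not host:
        return False
    host = host.lower()
    # Explicit names: localhost plus the configured model.local_endpoints list.
    if host == "localhost" or host in {h.lower() for h in local_endpoints}:
        return True
    try:
        ip = ipaddress.ip_address(host)
        return ip.is_private or ip.is_loopback
    except ValueError:
        pass  # not an IP literal, treat it as a hostname
    if "." not in host:
        return True  # dotless names: container DNS, mDNS, or /etc/hosts
    if resolve:
        # DNS fallback: resolve the name and check whether the address is private.
        try:
            resolved = ipaddress.ip_address(socket.gethostbyname(host))
        except (OSError, ValueError):
            return False
        return resolved.is_private or resolved.is_loopback
    return False
```

The resolve flag exists only so the DNS step can be switched off, for example in tests that must not touch the network.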

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds max_context_tokens guard to smart routing: if conversation context
exceeds the threshold, stay on the primary model instead of routing to
the cheap model. The cheap model is meant to be fast — sending it a
large context defeats the purpose.

Changes:
- choose_cheap_model_route accepts context_tokens parameter
- Gateway estimates context from cached agent's session_prompt_tokens
  or from history length (4 chars ≈ 1 token)
- CLI estimates from conversation_history
- Log line now includes context token count

Config: smart_model_routing.max_context_tokens (default: 0 = disabled)

Not yet upstreamed — local fork only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When trim_context is enabled and context exceeds max_context_tokens,
trim conversation history from the head (keep most recent messages)
and still route to the cheap model — instead of falling back to the
primary model entirely.

For a simple "thanks!" in a 48K-token session, only the last ~32K
tokens of history are sent to the cheap model. The model can still
respond appropriately with recent context.

Config:
  smart_model_routing:
    max_context_tokens: 32000
    trim_context: true   # default: false (safe — skip route entirely)
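
A minimal sketch of how the guard and the trim option might interact, assuming the message has already qualified for the cheap route on other grounds. The helper names, the message shape (dicts with a "content" key), and the per-message trimming granularity are assumptions; only the 4-chars-per-token estimate and the two config keys come from the commits.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def estimate_tokens(history: List[Message]) -> int:
    """Rough estimate using the 4 chars per token heuristic."""
    return sum(len(m.get("content", "")) for m in history) // 4


def trim_from_head(history: List[Message], max_tokens: int) -> List[Message]:
    """Drop the oldest messages, keeping the most recent ones that fit."""
    budget_chars = max_tokens * 4
    kept: List[Message] = []
    used = 0
    for msg in reversed(history):
        size = len(msg.get("content", ""))
        if used + size > budget_chars:
            break
        kept.append(msg)
        used += size
    kept.reverse()
    return kept


def plan_cheap_route(history: List[Message],
                     max_context_tokens: int = 0,
                     trim_context: bool = False) -> Tuple[bool, List[Message]]:
    """Return (use_cheap_model, history_to_send)."""
    if max_context_tokens <= 0:
        return True, history  # 0 means the guard is disabled
    if estimate_tokens(history) <= max_context_tokens:
        return True, history
    if trim_context:
        return True, trim_from_head(history, max_context_tokens)
    return False, history  # stay on the primary model
```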

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reassigning the `history` parameter inside the trim block made Python
treat it as a local variable, causing UnboundLocalError on earlier
reads. Use _trimmed_history + _effective_history instead.
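
One way this class of error can arise is sketched below, assuming the trim block sits in a nested function that reads history from the enclosing scope; the actual code structure is not shown in this thread, and the function names are hypothetical. Any assignment to a name inside a function makes that name local for the whole function body, so a read that runs before the assignment fails.

```python
def broken(history):
    def route():
        count = len(history)  # raises: 'history' is local to route() here
        if count > 2:
            history = history[-2:]  # this assignment makes the name local
        return history
    return route()


def fixed(history):
    def route():
        count = len(history)  # reads the enclosing variable, never rebinds it
        trimmed_history = history[-2:] if count > 2 else None
        effective_history = trimmed_history if trimmed_history is not None else history
        return effective_history
    return route()
```

Binding the trimmed list to new names, as the fix does with _trimmed_history and _effective_history, leaves the original name untouched.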

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The entrypoint hardcoded HERMES_HOME="/opt/data", ignoring the
environment variable set by the Dockerfile or container runtime.
This made it impossible to run multiple instances with different
data directories (e.g. a crash-test instance alongside production).

Change to ${HERMES_HOME:-/opt/data} so the entrypoint respects
the env var when set, falling back to /opt/data as before.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
teknium1 added a commit that referenced this pull request Apr 15, 2026
- find_docker() now checks HERMES_DOCKER_BINARY env var first, then
  docker on PATH, then podman on PATH, then macOS known locations
- Entrypoint respects HERMES_HOME env var (was hardcoded to /opt/data)
- Entrypoint uses groupmod -o to tolerate non-unique GIDs (fixes macOS
  GID 20 conflict with Debian's dialout group)
- Entrypoint makes chown best-effort so rootless Podman continues
  instead of failing with 'Operation not permitted'
- 5 new tests covering env var override, podman fallback, precedence

Based on work by alanjds (PR #3996) and malaiwah (PR #8115).
Closes #4084.
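
The lookup order described for find_docker() could be sketched like this. It is not the merged implementation: the name find_docker_sketch, the injectable which/exists parameters, and returning the override without validation are assumptions, and the macOS candidate list is left empty because the commit message does not enumerate those locations.

```python
import os
import shutil
from typing import Callable, Mapping, Optional, Sequence

# Placeholder: the commit does not list the macOS locations it checks.
MACOS_CANDIDATES: Sequence[str] = ()


def find_docker_sketch(env: Optional[Mapping[str, str]] = None,
                       which: Callable[[str], Optional[str]] = shutil.which,
                       exists: Callable[[str], bool] = os.path.isfile,
                       macos_candidates: Sequence[str] = MACOS_CANDIDATES) -> Optional[str]:
    """Resolve a container binary: env override, docker, podman, then known paths."""
    if env is None:
        env = os.environ
    override = env.get("HERMES_DOCKER_BINARY")
    if override:
        return override  # assumption: the override is trusted as given
    for name in ("docker", "podman"):
        found = which(name)
        if found:
            return found
    for path in macos_candidates:
        if exists(path):
            return path
    return None
```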
teknium1 added a commit that referenced this pull request Apr 15, 2026
#10066)

@teknium1

Contributor

Thanks for the contribution! The specific fix proposed here — HERMES_HOME="${HERMES_HOME:-/opt/data}" in docker/entrypoint.sh — was already merged into main as part of PR #10066 (commit 8548893d1, "feat: entry-level Podman support — find_docker() + rootless entrypoint", April 14 2026). The commit message even calls it out explicitly: "Entrypoint respects HERMES_HOME env var (was hardcoded to /opt/data)".

Closing as implemented on main. This is an automated hermes-sweeper review.

@teknium1 teknium1 closed this Apr 27, 2026
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
NousResearch#10066)

aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
NousResearch#10066)

02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
NousResearch#10066)

gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
NousResearch#10066)

Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
NousResearch#10066)

3 participants