[Bug]: dashboard /api/status reports gateway 'stopped' on Docker (PID-1) deployment — companion to #4776 #26181

@aliu-ronin

Bug Description

The dashboard's /api/status endpoint reports gateway_running: false, gateway_state: "stopped", and gateway_pid: null when Hermes runs as the container entrypoint in Docker/Kubernetes (the PID-1 pattern), even though the gateway is fully functional: the hermes status CLI correctly reports ✓ running with manager docker (foreground), cron jobs deliver, platform handlers (Feishu, Telegram, etc.) connect, and request handling works.

This is the dashboard counterpart to #4776 (the CLI status path). PR #4792 was auto-closed by hermes-sweeper on the grounds that the CLI path had already been refactored upstream into hermes_cli.gateway.get_gateway_runtime_snapshot() with is_container() detection. That is correct for the CLI, but the dashboard's /api/status handler in hermes_cli/web_server.py takes a different code path: it still calls gateway.status.get_running_pid() (lock/pidfile-based), which the refactor did not touch. The sweeper review missed this case.

Steps to Reproduce

  1. Run Hermes via Docker with the standard image (v0.13.0 / 2026.5.7):
    command: ["gateway", "run"]
    environment:
      HERMES_DASHBOARD: "1"
      HERMES_DASHBOARD_HOST: 0.0.0.0
      HERMES_DASHBOARD_PORT: "9119"
  2. Wait for the container to become healthy.
  3. Confirm gateway is alive:
    $ docker exec hermes-agent hermes status
    ◆ Gateway Service
      Status:       ✓ running
      Manager:      docker (foreground)
      PID(s):       7
    
  4. Hit dashboard:
    $ curl http://127.0.0.1:9119/api/status
    {"version":"0.13.0","gateway_running":false,"gateway_pid":null,
     "gateway_state":"stopped","gateway_platforms":{},...}
    

Expected Behavior

/api/status should report gateway_running: true and gateway_state: "running" whenever a hermes gateway run process is alive in the container, consistent with what the hermes status CLI reports.
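The expectation above can be written down as a small check, which may be useful as a regression test once a fix lands. This is a minimal sketch: status_inconsistencies is a hypothetical helper name, not part of Hermes, and it only inspects the two fields named in the expected behavior. The sample payload is the one from the reproduction steps, with the elided trailing fields left out.

```python
import json
from typing import Any, Dict, List


def status_inconsistencies(payload: Dict[str, Any]) -> List[str]:
    """List the ways a /api/status payload contradicts a live gateway.

    Hypothetical helper for illustration; it checks only the two fields
    named in the expected behavior.
    """
    problems: List[str] = []
    if payload.get("gateway_running") is not True:
        problems.append("gateway_running is not true")
    if payload.get("gateway_state") != "running":
        problems.append('gateway_state is not "running"')
    return problems


# Payload observed in the reproduction steps (elided fields omitted).
observed = json.loads(
    '{"version":"0.13.0","gateway_running":false,"gateway_pid":null,'
    '"gateway_state":"stopped","gateway_platforms":{}}'
)
print(status_inconsistencies(observed))
```

Against a fixed dashboard, the same function should return an empty list for a container in which hermes status reports the gateway as running.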

Actual Behavior

/api/status always reports the gateway as stopped. This affects every dashboard consumer: TUI dashboard widgets, the status badge, and anything else polling /api/status.

Root Cause (verified)

The /api/status handler in hermes_cli/web_server.py (lines 537-545):

gateway_pid = get_running_pid()       # from gateway.status — depends on pid/lock files
gateway_running = gateway_pid is not None

gateway.status.get_running_pid() first checks is_gateway_runtime_lock_active(), which depends on gateway.pid and gateway.lock being present. In the PID-1 entrypoint pattern these files are not reliably present: the fcntl lock fd is released after startup, and the pidfile is then cleaned up by _cleanup_invalid_pid_path. As a result, get_running_pid() returns None and the handler reports the gateway as stopped.

Affected Component

Web server (hermes_cli/web_server.py, line ~537)

Environment

  • Hermes v0.13.0 (2026.5.7)
  • Linux container (Debian 13.x base)
  • Python 3.13.5
  • cap_drop: [ALL] + selective adds (representative of locked-down deployments)

Proposed Fix

Mirror the pgrep fallback approach from PR #4792, applied to the dashboard handler. Add an is_container()-gated _scan_gateway_pid_in_container() helper that is invoked when the local pid/lock check returns None. The helper uses pgrep -f "hermes gateway run" to build the candidate list, then re-validates each PID against the argv tokens in /proc/<pid>/cmdline: gateway and run must both appear as independent tokens. This guards against pgrep -f's substring matching accidentally hitting python -c debug invocations.
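A rough sketch of what such a helper could look like follows. It is an illustration under stated assumptions, not the code from PR #4792: the function names (parse_cmdline, read_cmdline, argv_is_gateway_run, pgrep_candidates, scan_gateway_pid) are hypothetical, the is_container() gate is left to the caller because its import path is not confirmed here, and the 2-second pgrep timeout is an arbitrary choice.

```python
import subprocess
from typing import Callable, Iterable, List, Optional


def parse_cmdline(raw: bytes) -> List[str]:
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into argv tokens."""
    return [tok.decode("utf-8", "replace") for tok in raw.split(b"\0") if tok]


def read_cmdline(pid: int) -> List[str]:
    """Return argv for pid, or [] if the process is gone or unreadable."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as fh:
            return parse_cmdline(fh.read())
    except OSError:
        return []


def argv_is_gateway_run(argv: List[str]) -> bool:
    """Require 'gateway' and 'run' as standalone argv tokens.

    A command such as `python -c "hermes gateway run"` carries the phrase
    inside a single token, so it is rejected here even though pgrep -f
    matches it.
    """
    return "gateway" in argv and "run" in argv


def pgrep_candidates(pattern: str = "hermes gateway run") -> List[int]:
    """Return PIDs whose full command line matches pattern (may be empty)."""
    try:
        proc = subprocess.run(
            ["pgrep", "-f", pattern],
            capture_output=True,
            text=True,
            timeout=2,  # assumed bound so a status request cannot hang
            check=False,  # pgrep exits non-zero when nothing matches
        )
    except (OSError, subprocess.SubprocessError):
        # pgrep is missing or timed out: behave as if nothing was found.
        return []
    return [int(tok) for tok in proc.stdout.split() if tok.isdigit()]


def scan_gateway_pid(
    candidates: Optional[Iterable[int]] = None,
    read: Callable[[int], List[str]] = read_cmdline,
) -> Optional[int]:
    """Return the first candidate PID whose argv passes re-validation."""
    if candidates is None:
        candidates = pgrep_candidates()
    for pid in candidates:
        if argv_is_gateway_run(read(pid)):
            return pid
    return None
```

In the handler, the fallback would presumably run only when get_running_pid() returns None and is_container() is true. The candidates and read parameters exist so the validation logic can be unit-tested without a real process table.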

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR

Labels

  • P2 (Medium: degraded but workaround exists)
  • area/docker (Docker image, Compose, packaging)
  • comp/cli (CLI entry point, hermes_cli/, setup wizard)
  • type/bug (Something isn't working)
