fix(dashboard): detect gateway running as PID 1 in Docker/Kubernetes #26182

Open
aliu-ronin wants to merge 2 commits into NousResearch:main from aliu-ronin:fix/dashboard-gateway-stopped-docker

Conversation

@aliu-ronin aliu-ronin commented May 15, 2026

What does this PR do?

Fixes the dashboard /api/status gateway liveness detection for Docker/Kubernetes PID-1 deployments.

When the gateway runs as the container entrypoint, the dashboard can report gateway_running: false / gateway_state: "stopped" even though the gateway is alive and hermes status reports it as running. The dashboard still used gateway.status.get_running_pid(), which depends on pid/lock files that are not reliable in this container pattern.

This PR adds a container-only fallback for the dashboard status handler. When the pid/lock check returns None, /api/status now reuses the canonical hermes_cli.gateway.find_gateway_pids() scanner instead of duplicating a narrow pgrep substring search. This keeps dashboard status detection aligned with the CLI path and covers supported invocations such as:

  • hermes gateway run
  • python -m hermes_cli.main gateway run
  • python /path/hermes_cli/main.py gateway run
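
The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the code in hermes_cli/web_server.py: the function name `resolve_gateway_pid` is hypothetical, and the three callables are injected stand-ins for `get_running_pid()`, `is_container()`, and `find_gateway_pids()`, whose real signatures and return types are assumed here. Treating a scanner exception as "no gateway found" is also an assumption.

```python
import os
from typing import Callable, Iterable, Optional


def resolve_gateway_pid(
    get_running_pid: Callable[[], Optional[int]],
    is_container: Callable[[], bool],
    find_gateway_pids: Callable[[], Iterable[int]],
    self_pid: Optional[int] = None,
) -> Optional[int]:
    """Hypothetical helper: pid/lock check first, process scan only in containers."""
    # 1. Existing path: trust the pid/lock files when they yield a PID.
    pid = get_running_pid()
    if pid is not None:
        return pid

    # 2. Non-container hosts keep their current behavior (no scan).
    if not is_container():
        return None

    # 3. Container fallback: delegate to the shared scanner.
    #    Assumption: a scanner failure is reported as "not running".
    own_pid = os.getpid() if self_pid is None else self_pid
    try:
        candidates = list(find_gateway_pids())
    except Exception:
        return None

    # Skip our own PID so the dashboard never reports itself as the gateway.
    for candidate in candidates:
        if candidate != own_pid:
            return candidate
    return None
```

Passing the three checks in as callables keeps the sketch self-contained and makes each branch (pid file hit, non-container host, empty scan, scanner error, self-PID) easy to exercise in isolation.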

Related Issue

Fixes #26181

Companion to #4776 / the CLI status-path container fix.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • hermes_cli/web_server.py

    • Adds a container-only fallback in /api/status when get_running_pid() returns None.
    • Reuses hermes_cli.gateway.find_gateway_pids() for gateway process detection instead of a direct pgrep -f "hermes gateway run" search.
    • Leaves behavior unchanged for non-container hosts, the existing pid/lock success path, and the remote GATEWAY_HEALTH_URL probe.
  • tests/hermes_cli/test_web_server.py

    • Adds handler-level coverage for the dashboard container fallback.
    • Adds helper coverage for outside-container short-circuit, canonical scanner delegation, self-PID skipping, empty scanner results, and scanner exceptions.
    • Preserves the existing remote gateway health-probe tests by explicitly disabling the new container fallback in those tests.

How to Test

  1. Run the focused status tests:
    scripts/run_tests.sh tests/hermes_cli/test_web_server.py -k 'status or ScanGateway'
  2. Run the full web server test file:
    scripts/run_tests.sh tests/hermes_cli/test_web_server.py
  3. In a Docker/Kubernetes PID-1 gateway deployment, compare dashboard status before and after the change:
    curl http://127.0.0.1:9119/api/status
    Expected after this change: gateway_running: true, a non-null gateway_pid, and gateway_state: "running" when the gateway process is alive.
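
The expected fields in step 3 can also be checked with a small script. This is a sketch that assumes /api/status returns a JSON object containing the three fields named above; the helper name `status_problems` is hypothetical, and the sample bodies mirror the before/after values reported in this PR rather than a captured response.

```python
import json
from typing import Any, Dict, List


def status_problems(payload: Dict[str, Any]) -> List[str]:
    """Return a list of mismatches against the expected post-fix status."""
    problems = []
    if payload.get("gateway_running") is not True:
        problems.append("gateway_running is not true")
    if payload.get("gateway_pid") is None:
        problems.append("gateway_pid is null")
    if payload.get("gateway_state") != "running":
        problems.append("gateway_state is not 'running'")
    return problems


# Sample bodies (assumed shape) for the pre-fix and post-fix cases.
before = json.loads(
    '{"gateway_running": false, "gateway_pid": null, "gateway_state": "stopped"}'
)
after = json.loads(
    '{"gateway_running": true, "gateway_pid": 7, "gateway_state": "running"}'
)

print(status_problems(before))
print(status_problems(after))
```

For a live deployment, the body returned by the curl command can be parsed with `json.loads` and passed to the same function.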

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
    • Targeted verification completed instead:
      • scripts/run_tests.sh tests/hermes_cli/test_web_server.py -k 'status or ScanGateway' → 16 passed
      • scripts/run_tests.sh tests/hermes_cli/test_web_server.py → 153 passed
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS / Python 3.13.11; Docker PID-1 behavior verified in the issue reproduction environment

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — container-only fallback is gated on is_container()
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

N/A

Screenshots / Logs

Focused tests:

scripts/run_tests.sh tests/hermes_cli/test_web_server.py -k 'status or ScanGateway'
16 passed

Full web server test file:

scripts/run_tests.sh tests/hermes_cli/test_web_server.py
153 passed, 5 warnings

Container end-to-end verification from the original reproduction:

Before: /api/status => gateway_running=false, gateway_pid=null, gateway_state=stopped
After:  /api/status => gateway_running=true, gateway_pid=7, gateway_state=running

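On Docker/Kubernetes deployments, the dashboard /api/status always reports
gateway_state=stopped, even when the `hermes gateway run` process is alive and
the CLI `hermes status` already reports running correctly.

Root cause: the handler calls gateway.status.get_running_pid(), which depends on
the gateway.pid / gateway.lock files. Under the container entrypoint pattern
these two files are never written, so gateway_pid is always None →
gateway_running=False.

The fix follows the same approach as PR NousResearch#4792 (the CLI status
refactor already merged into main): fall back to pgrep when is_container() is
True. Two safeguards are added to avoid false positives:

1. Self-PID exclusion.
2. /proc/<pid>/cmdline argv token validation (gateway and run must be separate
   argv elements, not merely appear as a substring), so that pgrep -f does not
   falsely match a python -c process whose cmdline contains the literal string.

Tests: all 16 unit tests pass (3 fallback handler + 6 helper + 4 remote probe
no-regression + 3 other status); container end-to-end verification shows the
dashboard /api/status changing from stopped to running.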

Companion of NousResearch#4776 (CLI status path).
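
The argv-token check described in this commit message (later replaced by the find_gateway_pids() scanner) can be sketched as below. It is an illustration rather than the committed code: all three function names are hypothetical, and the check implements only what the message states, namely that `gateway` and `run` must each appear as a standalone argv element.

```python
from typing import List


def parse_cmdline(raw: bytes) -> List[str]:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into argv."""
    return [
        part.decode("utf-8", errors="replace")
        for part in raw.split(b"\0")
        if part
    ]


def looks_like_gateway_run(raw: bytes) -> bool:
    """True only when 'gateway' and 'run' are standalone argv elements."""
    argv = parse_cmdline(raw)
    return "gateway" in argv and "run" in argv


def read_cmdline(pid: int) -> bytes:
    """Read /proc/<pid>/cmdline (Linux only); return b'' if unreadable."""
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as fh:
            return fh.read()
    except OSError:
        return b""


# A real gateway invocation versus a decoy that only contains the substring.
real = b"hermes\0gateway\0run\0"
decoy = b"python\0-c\0print('hermes gateway run')\0"
print(looks_like_gateway_run(real), looks_like_gateway_run(decoy))
```

Because the decoy carries the phrase inside a single argv element, a substring match such as `pgrep -f` would accept it while the token check rejects it.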

@alt-glitch added labels on May 15, 2026: type/bug (Something isn't working); comp/cli (CLI entry point, hermes_cli/, setup wizard); area/docker (Docker image, Compose, packaging); P2 (Medium, degraded but workaround exists)
NishantEC

This comment was marked as outdated.

@austinpickett austinpickett requested a review from Copilot May 18, 2026 14:35

@austinpickett (Collaborator)

Please use PULL_REQUEST_TEMPLATE.md

Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the dashboard /api/status gateway liveness detection so Docker/Kubernetes PID-1 gateway deployments can be reported as running when PID/lock files are absent.

Changes:

  • Adds container-only process scanning helpers in hermes_cli/web_server.py.
  • Falls back to the scanner when get_running_pid() returns None.
  • Adds tests for status fallback behavior and process-scan edge cases.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| hermes_cli/web_server.py | Adds /proc cmdline validation and container fallback PID scanning for /api/status. |
| tests/hermes_cli/test_web_server.py | Adds unit and handler tests for the new fallback path and preserves remote-probe test intent. |

Comment thread hermes_cli/web_server.py Outdated
try:
    import subprocess
    result = subprocess.run(
        ["pgrep", "-f", "hermes gateway run"],

Use the canonical gateway process scanner for Docker/Kubernetes dashboard
status fallback instead of a direct pgrep substring search. This keeps
/api/status aligned with CLI gateway status detection and covers supported
entrypoints such as python -m hermes_cli.main gateway run.

Tests:
- scripts/run_tests.sh tests/hermes_cli/test_web_server.py -k 'status or ScanGateway'
- scripts/run_tests.sh tests/hermes_cli/test_web_server.py

@aliu-ronin (Author)

Updated in 0850e36:

  • Replaced the direct pgrep -f "hermes gateway run" dashboard fallback with the canonical hermes_cli.gateway.find_gateway_pids() scanner.
  • This addresses the review concern about missing valid invocations such as python -m hermes_cli.main gateway run and python /path/hermes_cli/main.py gateway run.
  • Updated tests cover scanner delegation, empty results, scanner exceptions, and self-PID skipping.
  • Updated the PR description to use PULL_REQUEST_TEMPLATE.md.

Verification:

  • scripts/run_tests.sh tests/hermes_cli/test_web_server.py -k 'status or ScanGateway' → 16 passed
  • scripts/run_tests.sh tests/hermes_cli/test_web_server.py → 153 passed

Labels

  • area/docker: Docker image, Compose, packaging
  • comp/cli: CLI entry point, hermes_cli/, setup wizard
  • P2: Medium, degraded but workaround exists
  • type/bug: Something isn't working

Development

Successfully merging this pull request may close these issues.

[Bug]: dashboard /api/status reports gateway 'stopped' on Docker (PID-1) deployment — companion to #4776

5 participants