fix(dashboard): detect gateway running as PID 1 in Docker/Kubernetes #26182
Open
aliu-ronin wants to merge 2 commits into
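The dashboard's /api/status always reports gateway_state=stopped in Docker/Kubernetes deployments, even when the `hermes gateway run` process is alive and the CLI `hermes status` already reports running correctly.

Root cause: the handler calls gateway.status.get_running_pid(), which depends on the gateway.pid / gateway.lock files. Under the container entrypoint pattern these two files are never written, so gateway_pid is always None → gateway_running=False.

The fix follows the same approach as PR NousResearch#4792 (the CLI status refactor already merged into main): when is_container() is True, fall back to pgrep. Two guards are added to avoid false positives:
1. Self-PID exclusion.
2. /proc/<pid>/cmdline argv token validation (gateway and run must be separate argv elements, not merely substrings), so that pgrep -f does not match a python -c process whose cmdline happens to contain the literal string.

Tests: all 16 unit tests pass (3 fallback handler + 6 helper + 4 remote probe no-regression + 3 other status). Container end-to-end verification shows dashboard /api/status changing from stopped to running.

Companion of NousResearch#4776 (CLI status path).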
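A minimal sketch of the two guards named above (self-PID exclusion and argv token validation), not the code from this PR. The helper names `parse_cmdline`, `looks_like_gateway_run`, and `is_gateway_pid` are hypothetical, and requiring `gateway` and `run` to be adjacent tokens is an assumption; the description only says they must be separate argv elements.

```python
import os
from typing import List, Optional, Sequence


def parse_cmdline(raw: bytes) -> List[str]:
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into argv tokens."""
    return [tok.decode("utf-8", errors="replace") for tok in raw.split(b"\0") if tok]


def looks_like_gateway_run(argv: Sequence[str]) -> bool:
    """True only if 'gateway' and 'run' are standalone, adjacent argv tokens.

    Adjacency is an assumption made for this sketch.
    """
    return any(a == "gateway" and b == "run" for a, b in zip(argv, argv[1:]))


def is_gateway_pid(pid: int, self_pid: Optional[int] = None, proc_root: str = "/proc") -> bool:
    """Validate a candidate PID (for example one returned by pgrep -f)."""
    # Guard 1: never report the calling process itself.
    if pid == (os.getpid() if self_pid is None else self_pid):
        return False
    # Guard 2: inspect real argv tokens instead of trusting a substring match.
    try:
        with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as fh:
            raw = fh.read()
    except OSError:
        # Process exited, is not readable, or /proc is unavailable.
        return False
    return looks_like_gateway_run(parse_cmdline(raw))
```

With token matching, a process started as `python -c "print('hermes gateway run')"` is rejected, because the phrase lives inside a single argv element rather than as separate tokens.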
Collaborator
Please use PULL_REQUEST_TEMPLATE.md
Contributor
Pull request overview
This PR updates the dashboard /api/status gateway liveness detection so Docker/Kubernetes PID-1 gateway deployments can be reported as running when PID/lock files are absent.
Changes:
- Adds container-only process scanning helpers in `hermes_cli/web_server.py`.
- Falls back to the scanner when `get_running_pid()` returns `None`.
- Adds tests for status fallback behavior and process-scan edge cases.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| hermes_cli/web_server.py | Adds /proc cmdline validation and container fallback PID scanning for /api/status. |
| tests/hermes_cli/test_web_server.py | Adds unit and handler tests for the new fallback path and preserves remote-probe test intent. |
    try:
        import subprocess
        result = subprocess.run(
            ["pgrep", "-f", "hermes gateway run"],
Use the canonical gateway process scanner for the Docker/Kubernetes dashboard status fallback instead of a direct `pgrep` substring search. This keeps `/api/status` aligned with CLI gateway status detection and covers supported entrypoints such as `python -m hermes_cli.main gateway run`.

Tests:
- `scripts/run_tests.sh tests/hermes_cli/test_web_server.py -k 'status or ScanGateway'`
- `scripts/run_tests.sh tests/hermes_cli/test_web_server.py`
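As a rough illustration of the fallback order discussed in this PR (pid/lock check first, process scanner only inside a container), here is a minimal sketch. The three callables stand in for `get_running_pid()`, `is_container()`, and `find_gateway_pids()`; their real signatures and return types are not shown in this thread, so the ones used here are assumptions. The helper names `resolve_gateway_pid` and `status_fields`, and the choice to report the first PID the scanner returns, are also hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional


def resolve_gateway_pid(
    get_running_pid: Callable[[], Optional[int]],
    is_container: Callable[[], bool],
    find_gateway_pids: Callable[[], Iterable[int]],
) -> Optional[int]:
    """Return the gateway PID, using the process scanner only as a container fallback."""
    pid = get_running_pid()
    if pid is not None:
        # The pid/lock files were usable; keep the existing behavior.
        return pid
    if not is_container():
        # Outside containers a missing pid file still means "stopped".
        return None
    try:
        pids = list(find_gateway_pids())
    except Exception:
        # A failing scan must not break the status endpoint.
        return None
    # Assumption: report the first PID found (this may legitimately be PID 1).
    return pids[0] if pids else None


def status_fields(pid: Optional[int]) -> Dict[str, object]:
    """Build the gateway-related fields named in this PR from a resolved PID."""
    running = pid is not None  # compare to None explicitly; do not rely on truthiness
    return {
        "gateway_running": running,
        "gateway_pid": pid,
        "gateway_state": "running" if running else "stopped",
    }
```

Passing the dependencies in as callables keeps the sketch testable without a container; for example, `status_fields(resolve_gateway_pid(lambda: None, lambda: True, lambda: [1]))` yields a running state with `gateway_pid` equal to 1.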
Author
Updated in
Verification:
What does this PR do?
Fixes the dashboard `/api/status` gateway liveness detection for Docker/Kubernetes PID-1 deployments.

When the gateway runs as the container entrypoint, the dashboard can report `gateway_running: false` / `gateway_state: "stopped"` even though the gateway is alive and `hermes status` reports it as running. The dashboard still used `gateway.status.get_running_pid()`, which depends on pid/lock files that are not reliable in this container pattern.

This PR adds a container-only fallback for the dashboard status handler. When the pid/lock check returns `None`, `/api/status` now reuses the canonical `hermes_cli.gateway.find_gateway_pids()` scanner instead of duplicating a narrow `pgrep` substring search. This keeps dashboard status detection aligned with the CLI path and covers supported invocations such as:
- `hermes gateway run`
- `python -m hermes_cli.main gateway run`
- `python /path/hermes_cli/main.py gateway run`

Related Issue
Fixes #26181
Companion to #4776 / the CLI status-path container fix.
Type of Change
Changes Made
`hermes_cli/web_server.py`
- Adds a container-only fallback to `/api/status` when `get_running_pid()` returns `None`.
- Reuses `hermes_cli.gateway.find_gateway_pids()` for gateway process detection instead of a direct `pgrep -f "hermes gateway run"` search.
- Leaves `GATEWAY_HEALTH_URL` probe behavior unchanged.

`tests/hermes_cli/test_web_server.py`

How to Test
1. Run `scripts/run_tests.sh tests/hermes_cli/test_web_server.py -k 'status or ScanGateway'`.
2. Check the dashboard `/api/status` response in a container deployment: it should report `gateway_running: true`, a non-null `gateway_pid`, and `gateway_state: "running"` when the gateway process is alive.

Checklist
Code
- Commit message format (`fix(scope):`, `feat(scope):`, etc.)
- `pytest tests/ -q` and all tests pass
  - `scripts/run_tests.sh tests/hermes_cli/test_web_server.py -k 'status or ScanGateway'` → 16 passed
  - `scripts/run_tests.sh tests/hermes_cli/test_web_server.py` → 153 passed

Documentation & Housekeeping
- Documentation (`docs/`, docstrings), or N/A
- `cli-config.yaml.example` if I added/changed config keys, or N/A
- `CONTRIBUTING.md` or `AGENTS.md` if I changed architecture or workflows, or N/A
- `is_container()`

For New Skills
N/A
Screenshots / Logs
Focused tests:
Full web server test file:
Container end-to-end verification from the original reproduction: