Skip to content

fix(gateway): recover stale pid and planned restart state#14179

Closed
helix4u wants to merge 1 commit into
NousResearch:mainfrom
helix4u:fix/gateway-runtime-lock-stale-pid
Closed

fix(gateway): recover stale pid and planned restart state#14179
helix4u wants to merge 1 commit into
NousResearch:mainfrom
helix4u:fix/gateway-runtime-lock-stale-pid

Conversation

@helix4u

@helix4u helix4u commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

This hardens gateway liveness tracking and makes service restarts recover cleanly when Hermes asks the service manager to restart it.

There were two related failure modes in this area:

  • stale ~/.hermes/gateway.pid metadata could block a real restart even when the old gateway was gone
  • planned service restarts could wedge into a confusing systemd state where Hermes exited with code 75, systemd entered its restart window or failed/rate-limited state, and hermes gateway status did not explain what was happening clearly

The first half of this PR separates live ownership from stale metadata by adding a real runtime lock (gateway.lock) that is held by the live gateway process for its entire lifetime. That makes startup answer the important question cleanly:

  • if the runtime lock is held, there is still a live owner
  • if the PID file exists but the runtime lock is not held, the PID state is stale and can be cleaned up

The second half makes hermes gateway restart more idempotent for systemd-managed gateways. After a self-requested restart, the CLI now clears stale failed state, actively kicks the unit back into motion, and waits for the replacement process instead of leaving the operator in a silent dead window. hermes gateway status also now surfaces planned-restart states directly, including the common exit 75 failure case, and supports -l/--full so the same command can mirror the untruncated systemctl / journalctl output users are already being told to inspect.

Related Issue

N/A

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • Added a cross-process runtime gateway lock in gateway/status.py using OS file locking
  • Made get_running_pid() consult the runtime lock before trusting gateway.pid
  • Added fallback from stale gateway.pid metadata to live gateway.lock metadata when the lock is still held
  • Released the runtime lock during normal gateway shutdown in gateway/run.py
  • Claimed the runtime lock before writing gateway.pid during startup in gateway/run.py
  • Tightened stop_profile_gateway() in hermes_cli/gateway.py so it only removes the PID file after the process is actually gone
  • Added systemd unit-property probing in hermes_cli/gateway.py for restart/status recovery logic
  • Made systemd_restart() clear failed state and explicitly start the unit when a planned restart is pending or when the service is in the post-exit handoff window
  • Made systemd_status() explain pending auto-restart and stuck planned-restart states instead of just showing a generic failed service
  • Added hermes gateway status -l / --full to expose untruncated service and journal output
  • Added regression tests for lock lifecycle, stale PID-without-lock cleanup, live lock fallback, planned restart recovery, and the new status parser/status output paths

How to Test

  1. Run source venv/bin/activate
  2. Run scripts/run_tests.sh tests/gateway/test_status.py tests/hermes_cli/test_gateway.py tests/hermes_cli/test_gateway_service.py tests/hermes_cli/test_gateway_runtime_health.py -n 4
  3. Run scripts/run_tests.sh tests/gateway/test_runner_startup_failures.py tests/gateway/test_gateway_shutdown.py tests/gateway/test_clean_shutdown_marker.py tests/hermes_cli/test_update_gateway_restart.py -n 4
  4. Run scripts/run_tests.sh tests/hermes_cli/test_gateway_service.py tests/hermes_cli/test_gateway_runtime_health.py tests/hermes_cli/test_gateway.py -n 4

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: Ubuntu 24.04 / Linux

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Screenshots / Logs

Focused test runs passed:

  • 148 passed across status/service/runtime-health coverage
  • 60 passed across startup/shutdown/restart coverage
  • 122 passed across the gateway service/runtime-health/status slice after the restart-status follow-up changes

@alt-glitch alt-glitch added type/bug Something isn't working P2 Medium — degraded but workaround exists comp/gateway Gateway runner, session dispatch, delivery comp/cli CLI entry point, hermes_cli/, setup wizard labels Apr 22, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related: #14002 (stale gateway PID recovery + systemd cleanup) and #11156 (OSError from scoped-lock PID probe) — overlapping gateway liveness improvements.

1 similar comment
@alt-glitch

Copy link
Copy Markdown
Collaborator

Related: #14002 (stale gateway PID recovery + systemd cleanup) and #11156 (OSError from scoped-lock PID probe) — overlapping gateway liveness improvements.

@helix4u

helix4u commented Apr 22, 2026

Copy link
Copy Markdown
Contributor Author

updated. should allow for detecting stale pids and systemd rate limits.

@helix4u helix4u changed the title fix(gateway): add runtime lock for stale pid recovery fix(gateway): recover stale pid and planned restart state Apr 22, 2026
@helix4u helix4u force-pushed the fix/gateway-runtime-lock-stale-pid branch from 34965cb to 1c5d088 Compare April 22, 2026 23:17
@helix4u helix4u marked this pull request as ready for review April 22, 2026 23:21
teknium1 added a commit that referenced this pull request Apr 22, 2026
Follow-up for salvaged PR #14179.

`_cleanup_invalid_pid_path` previously called `remove_pid_file()` for the
default PID path, but that helper defensively refuses to delete a PID file
whose pid field differs from `os.getpid()` (to protect --replace handoffs).
Every realistic stale-PID scenario is exactly that case: a crashed/Ctrl+C'd
gateway left behind a PID file owned by a now-dead foreign PID.

Once `get_running_pid()` has confirmed the runtime lock is inactive, the
on-disk metadata is known to belong to a dead process, so we can force-unlink
both the PID file and the sibling `gateway.lock` directly instead of going
through the defensive helper.

Also adds a regression test with a dead foreign PID that would have failed
against the previous cleanup logic.
@teknium1

Copy link
Copy Markdown
Contributor

Merged via PR #14200 — your commit (1c5d088) was cherry-picked onto current main with authorship preserved via rebase-merge, plus a small follow-up fix so that _cleanup_invalid_pid_path actually unlinks stale gateway.pid / gateway.lock files left behind by a crashed foreign PID (the defensive remove_pid_file() guard was swallowing those). Thanks for the fix and the detailed diagnosis in Discord! #14200

@teknium1 teknium1 closed this Apr 22, 2026
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request May 1, 2026
Follow-up for salvaged PR NousResearch#14179.

`_cleanup_invalid_pid_path` previously called `remove_pid_file()` for the
default PID path, but that helper defensively refuses to delete a PID file
whose pid field differs from `os.getpid()` (to protect --replace handoffs).
Every realistic stale-PID scenario is exactly that case: a crashed/Ctrl+C'd
gateway left behind a PID file owned by a now-dead foreign PID.

Once `get_running_pid()` has confirmed the runtime lock is inactive, the
on-disk metadata is known to belong to a dead process, so we can force-unlink
both the PID file and the sibling `gateway.lock` directly instead of going
through the defensive helper.

Also adds a regression test with a dead foreign PID that would have failed
against the previous cleanup logic.
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request May 1, 2026
Follow-up for salvaged PR NousResearch#14179.

`_cleanup_invalid_pid_path` previously called `remove_pid_file()` for the
default PID path, but that helper defensively refuses to delete a PID file
whose pid field differs from `os.getpid()` (to protect --replace handoffs).
Every realistic stale-PID scenario is exactly that case: a crashed/Ctrl+C'd
gateway left behind a PID file owned by a now-dead foreign PID.

Once `get_running_pid()` has confirmed the runtime lock is inactive, the
on-disk metadata is known to belong to a dead process, so we can force-unlink
both the PID file and the sibling `gateway.lock` directly instead of going
through the defensive helper.

Also adds a regression test with a dead foreign PID that would have failed
against the previous cleanup logic.
innocarpe pushed a commit to innocarpe/hermes-agent that referenced this pull request May 9, 2026
Follow-up for salvaged PR NousResearch#14179.

`_cleanup_invalid_pid_path` previously called `remove_pid_file()` for the
default PID path, but that helper defensively refuses to delete a PID file
whose pid field differs from `os.getpid()` (to protect --replace handoffs).
Every realistic stale-PID scenario is exactly that case: a crashed/Ctrl+C'd
gateway left behind a PID file owned by a now-dead foreign PID.

Once `get_running_pid()` has confirmed the runtime lock is inactive, the
on-disk metadata is known to belong to a dead process, so we can force-unlink
both the PID file and the sibling `gateway.lock` directly instead of going
through the defensive helper.

Also adds a regression test with a dead foreign PID that would have failed
against the previous cleanup logic.
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026
Follow-up for salvaged PR NousResearch#14179.

`_cleanup_invalid_pid_path` previously called `remove_pid_file()` for the
default PID path, but that helper defensively refuses to delete a PID file
whose pid field differs from `os.getpid()` (to protect --replace handoffs).
Every realistic stale-PID scenario is exactly that case: a crashed/Ctrl+C'd
gateway left behind a PID file owned by a now-dead foreign PID.

Once `get_running_pid()` has confirmed the runtime lock is inactive, the
on-disk metadata is known to belong to a dead process, so we can force-unlink
both the PID file and the sibling `gateway.lock` directly instead of going
through the defensive helper.

Also adds a regression test with a dead foreign PID that would have failed
against the previous cleanup logic.
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request Jun 2, 2026
Follow-up for salvaged PR NousResearch#14179.

`_cleanup_invalid_pid_path` previously called `remove_pid_file()` for the
default PID path, but that helper defensively refuses to delete a PID file
whose pid field differs from `os.getpid()` (to protect --replace handoffs).
Every realistic stale-PID scenario is exactly that case: a crashed/Ctrl+C'd
gateway left behind a PID file owned by a now-dead foreign PID.

Once `get_running_pid()` has confirmed the runtime lock is inactive, the
on-disk metadata is known to belong to a dead process, so we can force-unlink
both the PID file and the sibling `gateway.lock` directly instead of going
through the defensive helper.

Also adds a regression test with a dead foreign PID that would have failed
against the previous cleanup logic.
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026
Follow-up for salvaged PR NousResearch#14179.

`_cleanup_invalid_pid_path` previously called `remove_pid_file()` for the
default PID path, but that helper defensively refuses to delete a PID file
whose pid field differs from `os.getpid()` (to protect --replace handoffs).
Every realistic stale-PID scenario is exactly that case: a crashed/Ctrl+C'd
gateway left behind a PID file owned by a now-dead foreign PID.

Once `get_running_pid()` has confirmed the runtime lock is inactive, the
on-disk metadata is known to belong to a dead process, so we can force-unlink
both the PID file and the sibling `gateway.lock` directly instead of going
through the defensive helper.

Also adds a regression test with a dead foreign PID that would have failed
against the previous cleanup logic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/cli CLI entry point, hermes_cli/, setup wizard comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants