fix(gateway): recover stale pid and planned restart state #14179
Closed
helix4u wants to merge 1 commit into
Contributor
Author
updated. should allow for detecting stale pids and systemd rate limits.
34965cb to 1c5d088
teknium1 added a commit that referenced this pull request on Apr 22, 2026
Follow-up for salvaged PR #14179. `_cleanup_invalid_pid_path` previously called `remove_pid_file()` for the default PID path, but that helper defensively refuses to delete a PID file whose pid field differs from `os.getpid()` (to protect --replace handoffs). Every realistic stale-PID scenario is exactly that case: a crashed/Ctrl+C'd gateway left behind a PID file owned by a now-dead foreign PID. Once `get_running_pid()` has confirmed the runtime lock is inactive, the on-disk metadata is known to belong to a dead process, so we can force-unlink both the PID file and the sibling `gateway.lock` directly instead of going through the defensive helper. Also adds a regression test with a dead foreign PID that would have failed against the previous cleanup logic.
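To make the distinction in this commit message concrete, here is a minimal sketch of the two removal strategies. It is not the code from the PR: the helper names (`lock_is_held`, `remove_own_pid_file`, `force_cleanup_stale_files`), the JSON layout with a `pid` key, and the use of `fcntl.flock` are all assumptions made for illustration, and the sketch is Unix-only.

```python
import fcntl
import json
import os
from pathlib import Path


def lock_is_held(lock_path: Path) -> bool:
    """Return True if a live holder has an flock on lock_path (hypothetical helper)."""
    if not lock_path.exists():
        return False
    with open(lock_path, "a+") as fh:
        try:
            # Non-blocking probe: fails immediately if someone else holds the lock.
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            # Conservative choice: any locking error is treated as "held".
            return True
        fcntl.flock(fh, fcntl.LOCK_UN)
    return False


def remove_own_pid_file(pid_path: Path) -> bool:
    """Defensive removal: delete only if the file records our own PID."""
    try:
        data = json.loads(pid_path.read_text())
    except (OSError, ValueError):
        return False
    recorded = data.get("pid") if isinstance(data, dict) else None
    if recorded != os.getpid():
        return False  # foreign PID: refuse, e.g. during a --replace handoff
    pid_path.unlink(missing_ok=True)
    return True


def force_cleanup_stale_files(pid_path: Path, lock_path: Path) -> bool:
    """Unlink both files, but only once no process holds the runtime lock."""
    if lock_is_held(lock_path):
        return False
    pid_path.unlink(missing_ok=True)
    lock_path.unlink(missing_ok=True)
    return True
```

With a PID file left by a dead foreign process, `remove_own_pid_file` returns `False` and leaves the file in place, while `force_cleanup_stale_files` removes both files once the lock probe succeeds. The sketch does not close the window between the probe and the unlink, so a real implementation would need to consider a gateway starting in between.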
Contributor
Merged via PR #14200: your commit (1c5d088) was cherry-picked onto current main with authorship preserved via rebase-merge, plus a small follow-up fix so that [...]
This was referenced Apr 27, 2026
ulasbilgen pushed a commit to ulasbilgen/hermes-adhd-agent that referenced this pull request on May 1, 2026
aj-nt pushed a commit to aj-nt/hermes-agent that referenced this pull request on May 1, 2026
innocarpe pushed a commit to innocarpe/hermes-agent that referenced this pull request on May 9, 2026
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request on May 14, 2026
gweeteve pushed a commit to gweeteve/hermes-agent that referenced this pull request on Jun 2, 2026
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request on Jun 10, 2026
What does this PR do?
This hardens gateway liveness tracking and makes service restarts recover cleanly when Hermes asks the service manager to restart it.
There were two related failure modes in this area:
- `~/.hermes/gateway.pid` metadata could block a real restart even when the old gateway was gone
- [...] `75`, systemd entered its restart window or failed/rate-limited state, and `hermes gateway status` did not explain what was happening clearly
The first half of this PR separates live ownership from stale metadata by adding a real runtime lock (`gateway.lock`) that is held by the live gateway process for its entire lifetime. That makes startup answer the important question cleanly: [...]
The second half makes `hermes gateway restart` more idempotent for systemd-managed gateways. After a self-requested restart, the CLI now clears stale failed state, actively kicks the unit back into motion, and waits for the replacement process instead of leaving the operator in a silent dead window. `hermes gateway status` also now surfaces planned-restart states directly, including the common `exit 75` failure case, and supports `-l/--full` so the same command can mirror the untruncated `systemctl`/`journalctl` output users are already being told to inspect.
Related Issue
N/A
Type of Change
Changes Made
- [...] `gateway/status.py` using OS file locking
- [...] `get_running_pid()` consult the runtime lock before trusting `gateway.pid`
- [...] `gateway.pid` metadata to live `gateway.lock` metadata when the lock is still held
- [...] `gateway/run.py`
- [...] `gateway.pid` during startup in `gateway/run.py`
- [...] `stop_profile_gateway()` in `hermes_cli/gateway.py` so it only removes the PID file after the process is actually gone
- [...] `hermes_cli/gateway.py` for restart/status recovery logic
- [...] `systemd_restart()` clear failed state and explicitly start the unit when a planned restart is pending or when the service is in the post-exit handoff window
- [...] `systemd_status()` explain pending auto-restart and stuck planned-restart states instead of just showing a generic failed service
- [...] `hermes gateway status -l/--full` to expose untruncated service and journal output
How to Test
- `source venv/bin/activate`
- `scripts/run_tests.sh tests/gateway/test_status.py tests/hermes_cli/test_gateway.py tests/hermes_cli/test_gateway_service.py tests/hermes_cli/test_gateway_runtime_health.py -n 4`
- `scripts/run_tests.sh tests/gateway/test_runner_startup_failures.py tests/gateway/test_gateway_shutdown.py tests/gateway/test_clean_shutdown_marker.py tests/hermes_cli/test_update_gateway_restart.py -n 4`
- `scripts/run_tests.sh tests/hermes_cli/test_gateway_service.py tests/hermes_cli/test_gateway_runtime_health.py tests/hermes_cli/test_gateway.py -n 4`
Checklist
Code
- [...] `fix(scope):`, `feat(scope):`, etc.)
- [...] `pytest tests/ -q` and all tests pass
Documentation & Housekeeping
- [...] `docs/`, docstrings), or N/A
- [...] `cli-config.yaml.example` if I added/changed config keys, or N/A
- [...] `CONTRIBUTING.md` or `AGENTS.md` if I changed architecture or workflows, or N/A
Screenshots / Logs
Focused test runs passed:
- `148 passed` across status/service/runtime-health coverage
- `60 passed` across startup/shutdown/restart coverage
- `122 passed` across the gateway service/runtime-health/status slice after the restart-status follow-up changes
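For readers who want a picture of the restart recovery described under "What does this PR do?" (clear stale failed state, then explicitly start the unit), the following is a rough sketch only. It is not the `systemd_restart()` from this PR: the function names, the injectable `run` callable, and the unit name `hermes-gateway.service` are hypothetical, and the sketch does not distinguish user units from system units. The `systemctl` subcommands used (`is-active`, `reset-failed`, `start`) are standard.

```python
import subprocess
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], int]


def run_systemctl(args: Sequence[str]) -> int:
    """Run `systemctl <args>` and return its exit status."""
    return subprocess.run(["systemctl", *args], check=False).returncode


def unit_state(unit: str) -> str:
    """Return the state word printed by `systemctl is-active` (e.g. 'failed')."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip()


def recover_unit(unit: str, state: str, run: Runner = run_systemctl) -> List[List[str]]:
    """Issue the commands needed to get `unit` running; return what was issued."""
    issued: List[List[str]] = []
    if state == "failed":
        # Clear the failed marker first so the start request is not rejected.
        issued.append(["reset-failed", unit])
    if state != "active":
        # Explicitly start instead of waiting for an automatic restart.
        issued.append(["start", unit])
    for args in issued:
        run(args)
    return issued
```

A caller would typically pass `unit_state("hermes-gateway.service")` as `state`. Taking the runner as a parameter keeps the decision logic testable without a running systemd.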