fix(gateway): recover stale pid and planned restart state (salvage #14179) #14200

Merged
teknium1 merged 2 commits into main from hermes/hermes-9a24d127 on Apr 22, 2026

@teknium1

Contributor

Salvages #14179 by @helix4u onto current main plus a follow-up cleanup fix.

What this PR does

Adds a real OS-held gateway.lock (fcntl on POSIX, msvcrt on Windows) as the source of truth for "is a gateway alive?" — so a stale gateway.pid from a Ctrl+C'd or crashed gateway no longer blocks the next hermes gateway start. The OS releases the lock automatically when the process dies, so the next startup cleanly detects the old instance is gone.
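A minimal sketch of the OS-held lock idea described above, not the code from this PR. The function name `try_acquire_lock` is hypothetical, and the choice of `fcntl.flock` is an assumption (the description only says "fcntl on POSIX"). The point it illustrates: the lock lives on an open file handle, so the kernel drops it when the holder dies, however it dies.

```python
import sys

if sys.platform == "win32":
    import msvcrt
else:
    import fcntl


def try_acquire_lock(lock_path):
    """Return an open file holding the lock, or None if another holder exists.

    Hypothetical helper for illustration; the caller must keep the returned
    file object open for as long as it wants to hold the lock.
    """
    # Append mode creates the file if missing and never truncates it.
    handle = open(lock_path, "a+")
    try:
        if sys.platform == "win32":
            # Lock one byte at offset 0; LK_NBLCK fails at once if it is taken.
            handle.seek(0)
            msvcrt.locking(handle.fileno(), msvcrt.LK_NBLCK, 1)
        else:
            # LOCK_NB makes flock raise instead of blocking when the lock is held.
            fcntl.flock(handle.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle
```

With this shape, "is a gateway alive?" reduces to "can I take the lock?", and a leftover gateway.pid carries no authority on its own.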

Also hardens hermes gateway restart for systemd-managed gateways: clears failed state, issues reset-failed + start, and waits for the replacement process instead of leaving operators in a silent dead window after a planned exit 75 restart.
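The restart sequence described above could look roughly like the following sketch. It is an illustration under stated assumptions, not the PR's implementation: `restart_unit`, its parameters, and the polling strategy are hypothetical, and the unit name is supplied by the caller. Only standard `systemctl` subcommands are used (`reset-failed`, `start`, and `show --property=MainPID --value`); a user-level unit would additionally need `--user`.

```python
import subprocess
import time


def restart_unit(unit, run=subprocess.run, timeout=30.0, interval=0.5,
                 sleep=time.sleep):
    """Clear failed state, start the unit, then wait for a live MainPID.

    Returns the replacement PID, or None if none appeared within `timeout`.
    `run` and `sleep` are injectable so the flow can be exercised without systemd.
    """
    # reset-failed may fail harmlessly if the unit is not in a failed state.
    run(["systemctl", "reset-failed", unit], check=False)
    run(["systemctl", "start", unit], check=True)

    waited = 0.0
    while waited <= timeout:
        result = run(
            ["systemctl", "show", "--property=MainPID", "--value", unit],
            check=False, capture_output=True, text=True,
        )
        pid = (result.stdout or "").strip()
        # systemd reports MainPID=0 while no main process is running.
        if pid.isdigit() and int(pid) > 0:
            return int(pid)
        sleep(interval)
        waited += interval
    return None
```

Returning `None` on timeout lets the caller report the failure explicitly rather than leaving the operator with a silent dead window.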

Closes the Discord thread where a user hit Ctrl+C, closed the terminal before the gateway finished shutting down, and couldn't start the gateway again — matching helix4u's original report.

Follow-up commit (ours, on top of the salvage)

_cleanup_invalid_pid_path originally called remove_pid_file() for the default PID path, but that helper defensively refuses to delete a PID file whose pid differs from os.getpid() (to protect --replace handoffs). Every realistic stale-PID scenario is exactly that case — a crashed gateway's PID file belongs to a now-dead foreign PID.

Fix: once get_running_pid() has confirmed the runtime lock is inactive, the on-disk metadata is known stale, so force-unlink both gateway.pid and gateway.lock directly. Added a regression test with a dead foreign PID.
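To make the difference concrete, here is a small sketch contrasting the two cleanup behaviours. Both functions are hypothetical stand-ins, not the repository's `remove_pid_file()` or `_cleanup_invalid_pid_path`, and the JSON PID-file layout with a `pid` key is an assumption made for illustration.

```python
import json
import os
from pathlib import Path


def defensive_remove(pid_path):
    """Remove the PID file only if it records this process's own PID."""
    path = Path(pid_path)
    try:
        recorded = json.loads(path.read_text()).get("pid")
    except (OSError, ValueError, AttributeError):
        return False
    if recorded != os.getpid():
        # Foreign PID: refuse. This protects handoffs, but it also refuses
        # every stale file left behind by a crashed process.
        return False
    path.unlink(missing_ok=True)
    return True


def force_unlink_stale(pid_path, lock_path, lock_is_active):
    """Unlink both files, but only once the runtime lock is known inactive."""
    if lock_is_active:
        return False
    for path in (Path(pid_path), Path(lock_path)):
        path.unlink(missing_ok=True)
    return True
```

The safety of the forced path rests entirely on the lock check: if no process holds the lock, nothing alive can own the metadata being deleted.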

Validation

  • 213 tests passed across the status, service, runtime-health, startup, shutdown, and restart slices
  • Verified end to end: a child process acquires the lock and writes its PID, then exits abnormally via os._exit; the parent sees the lock released and both files unlinked on its next get_running_pid() call
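The end-to-end check above can be approximated with a short POSIX-only sketch. It is not the test from this PR: `lock_released_after_crash` is a hypothetical name, the lock is taken with `fcntl.flock` as an assumption, and the final cleanup performed by `get_running_pid()` is not reproduced. It only demonstrates the property the fix relies on, that an abnormal exit leaves the PID file behind but releases the lock.

```python
import fcntl
import os
import subprocess
import sys

# Child: take the lock, write a PID file, then die without any cleanup.
CHILD = """
import fcntl, os, sys
lock = open(sys.argv[1], "a+")
fcntl.flock(lock.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
with open(sys.argv[2], "w") as f:
    f.write(str(os.getpid()))
os._exit(1)  # abnormal exit: no atexit handlers, no finally blocks
"""


def lock_released_after_crash(workdir):
    """Return (stale_pid_file_left, lock_is_free) after the child has died."""
    lock_path = os.path.join(workdir, "gateway.lock")
    pid_path = os.path.join(workdir, "gateway.pid")
    subprocess.run([sys.executable, "-c", CHILD, lock_path, pid_path],
                   check=False)

    stale_pid_file_left = os.path.exists(pid_path)
    with open(lock_path, "a+") as probe:
        try:
            # Succeeds only if no live process still holds the lock.
            fcntl.flock(probe.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return stale_pid_file_left, False
    return stale_pid_file_left, True
```

A stale PID file combined with a free lock is exactly the state in which the next startup may treat the on-disk metadata as belonging to a dead process.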

Credit

Authored by @helix4u (commit 1c5d088 preserved via rebase-merge). Follow-up cleanup commit by us.

helix4u and others added 2 commits April 22, 2026 16:32
Follow-up for salvaged PR #14179.

`_cleanup_invalid_pid_path` previously called `remove_pid_file()` for the
default PID path, but that helper defensively refuses to delete a PID file
whose pid field differs from `os.getpid()` (to protect --replace handoffs).
Every realistic stale-PID scenario is exactly that case: a crashed/Ctrl+C'd
gateway left behind a PID file owned by a now-dead foreign PID.

Once `get_running_pid()` has confirmed the runtime lock is inactive, the
on-disk metadata is known to belong to a dead process, so we can force-unlink
both the PID file and the sibling `gateway.lock` directly instead of going
through the defensive helper.

Also adds a regression test with a dead foreign PID that would have failed
against the previous cleanup logic.
@teknium1 merged commit 402d048 into main on Apr 22, 2026
10 of 11 checks passed
@teknium1 deleted the hermes/hermes-9a24d127 branch on April 22, 2026 23:33
@alt-glitch added labels on Apr 22, 2026:
  • type/bug: Something isn't working
  • P1: High, major feature broken, no workaround
  • comp/gateway: Gateway runner, session dispatch, delivery
  • comp/cli: CLI entry point, hermes_cli/, setup wizard
3 participants