[codex] Fix gateway update restart race #13713

Closed
tommy29tmar wants to merge 1 commit into NousResearch:main from tommy29tmar:codex/fix-gateway-restart-race

@tommy29tmar

Summary

This fixes a gateway restart race seen during hermes update when multiple hermes-gateway* systemd units are active. The change is not related to STT, Whisper, CUDA, or any local audio configuration.

Root Cause

get_running_pid() already determines when a PID file is stale, but cleanup for the current HERMES_HOME path delegated to remove_pid_file(). That helper intentionally refuses to delete a PID file owned by another process. When the file held a stale PID record from a previous gateway process, the helper left gateway.pid in place, and each subsequent systemd start failed with "PID file race lost to another gateway instance".
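The difference between the two cleanup paths can be illustrated with a minimal sketch. It is not the code from gateway/status.py: the function names below are hypothetical, the PID file is assumed to contain a bare integer, and liveness is probed with the POSIX signal-0 idiom (so this is not suitable for Windows).

```python
import os
from pathlib import Path


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX signal-0 probe)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def remove_pid_file_if_owner(pid_path: Path) -> bool:
    """Owner-guarded removal: delete the file only if it records our own PID."""
    try:
        recorded = int(pid_path.read_text().strip())
    except (FileNotFoundError, ValueError):
        return False
    if recorded != os.getpid():
        return False  # refuses: the record belongs to another process
    pid_path.unlink(missing_ok=True)
    return True


def cleanup_stale_pid_file(pid_path: Path) -> bool:
    """Unlink the file directly once the recorded process is proven gone."""
    try:
        recorded = int(pid_path.read_text().strip())
    except FileNotFoundError:
        return False
    except ValueError:
        # Assumption: an unparseable record is treated as stale.
        pid_path.unlink(missing_ok=True)
        return True
    if pid_is_alive(recorded):
        return False  # a live process still owns the file; leave it alone
    pid_path.unlink(missing_ok=True)
    return True
```

With a record left by a dead process, the owner-guarded helper returns False and leaves the file behind, which matches the failure described above; the direct unlink removes it only after the liveness check fails.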

Separately, hermes update discovered active hermes-gateway* units in systemctl output order. If a profile gateway restarted before the default gateway, the default profile could hit stale PID state while another profile was already running.
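One way to make the restart order deterministic is to sort with an explicit key that pins the default unit first. The sketch below is an illustration only: the function name is hypothetical, and it assumes unit names may arrive with or without the .service suffix.

```python
DEFAULT_UNIT = "hermes-gateway"


def gateway_restart_order(units: list[str]) -> list[str]:
    """Return units with the default gateway first, then profiles by name."""

    def key(unit: str) -> tuple[bool, str]:
        name = unit.removesuffix(".service")  # Python 3.9+
        # False sorts before True, so the default unit always comes first.
        return (name != DEFAULT_UNIT, name)

    return sorted(units, key=key)


units = ["hermes-gateway-scout.service", "hermes-gateway.service"]
print(gateway_restart_order(units))
# ['hermes-gateway.service', 'hermes-gateway-scout.service']
```

An explicit key matters here because a plain lexical sort of suffixed names puts hermes-gateway-scout.service ahead of hermes-gateway.service: "-" sorts before "." in ASCII.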

Changes

  • Unlink stale PID paths directly once stale cleanup has proven the recorded process is gone.
  • Sort systemd gateway units so hermes-gateway restarts before profile units such as hermes-gateway-scout.
  • Add a short pause between gateway service restarts to reduce same-update races.
  • Add regression tests for stale PID cleanup and multi-profile restart ordering.
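The pause between restarts could look roughly like the sketch below. The function names and the 2-second default are hypothetical (the PR only says "a short pause"), and the plain systemctl restart call is an assumption: user-scoped units would need the --user flag. The restart and sleep callables are injectable so the sequencing can be tested without touching systemd.

```python
import subprocess
import time
from typing import Callable, Iterable


def systemctl_restart(unit: str) -> None:
    """Restart one unit. Assumption: system scope; add --user for user units."""
    subprocess.run(["systemctl", "restart", unit], check=True)


def restart_gateways(
    units: Iterable[str],
    restart: Callable[[str], None] = systemctl_restart,
    sleep: Callable[[float], None] = time.sleep,
    pause_seconds: float = 2.0,  # hypothetical value for "a short pause"
) -> list[str]:
    """Restart units in the given order, pausing between consecutive restarts."""
    restarted: list[str] = []
    for index, unit in enumerate(units):
        if index > 0:
            # Give the previous gateway time to settle its PID file.
            sleep(pause_seconds)
        restart(unit)
        restarted.append(unit)
    return restarted
```

The pause runs only between restarts, never before the first or after the last, so a single-gateway install sees no added delay.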

Validation

  • PYTHONPATH=/tmp/hermes-agent-upstream-pr /home/tommaso/.hermes/hermes-agent/venv/bin/python -m pytest /tmp/hermes-agent-upstream-pr/tests/gateway/test_status.py /tmp/hermes-agent-upstream-pr/tests/hermes_cli/test_update_gateway_restart.py -q
  • Result: 68 passed in 3.50s

@alt-glitch added the type/bug (Something isn't working), comp/gateway (Gateway runner, session dispatch, delivery), and comp/cli (CLI entry point, hermes_cli/, setup wizard) labels on Apr 21, 2026
@alt-glitch

Collaborator

Related to #13689 (systemd restart hardening) and #13559 / #13709 (stale PID cleanup). These are overlapping but distinct code paths.

@teknium1

Contributor

Closing as superseded by #14200.

Triage notes (medium confidence):
Merged PR #14200 (commit b52123e 'fix(gateway): recover stale pid and planned restart state') rewrote _cleanup_invalid_pid_path in gateway/status.py to force-unlink stale gateway.pid and lock files, which covers the same fix area as this PR.

Thanks for the contribution — the underlying problem this PR addresses has been resolved by the linked PR on current main. If you believe this was closed in error, please comment and we'll reopen.

(Bulk-closed during a CLI PR triage sweep.)

@teknium1 closed this on May 24, 2026