fix: hermes gateway restart waits for service to come back up (#8260) #9945

Merged
teknium1 merged 1 commit into main from hermes/hermes-adbcc843 on Apr 15, 2026
@teknium1

Summary

Fixes #8260: hermes gateway restart returned immediately after sending SIGUSR1, while the gateway was still draining and restarting. Users saw "restart requested", but the service was then down for 30-60 seconds with no feedback.

Before

$ hermes gateway restart
✓ User service restart requested     ← returns immediately, service still dying
$ hermes gateway status
✗ User gateway service is stopped    ← surprise

After

$ hermes gateway restart
⏳ User service draining active work...
⏳ Waiting for hermes-gateway to restart...
✓ User service restarted (PID 12345)    ← blocks until actually up

Or on timeout:

⚠ User service did not become active within 60s.
  Check status: hermes gateway status
  Check logs: journalctl --user -u hermes-gateway --since '2 min ago'

Implementation

Two-phase wait after sending SIGUSR1:

  1. Phase 1 (up to 90s): Poll os.kill(pid, 0) until old process is dead
  2. Phase 2 (up to 60s): Poll systemctl is-active + verify new PID via get_running_pid()

The reload-or-restart fallback path is already synchronous (systemctl blocks), so no changes needed there.
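
For readers who want to see the shape of the two-phase wait, here is a minimal sketch, not the code merged in this PR. The names wait_for_restart, pid_alive, and unit_is_active are hypothetical. The get_pid argument stands in for get_running_pid(), whose real signature is not shown here. The 0.5s poll interval is an assumption, and so is the choice to fall through to phase 2 when phase 1 times out.

```python
import os
import subprocess
import time


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID still exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 probes for existence, delivers nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def unit_is_active(unit: str) -> bool:
    """Ask systemd whether a user unit is active (exit code 0 means active)."""
    result = subprocess.run(
        ["systemctl", "--user", "is-active", "--quiet", unit],
        check=False,
    )
    return result.returncode == 0


def wait_for_restart(old_pid, is_active, get_pid, alive=pid_alive,
                     drain_timeout=90.0, start_timeout=60.0, interval=0.5,
                     sleep=time.sleep, clock=time.monotonic):
    """Block until a new process replaces old_pid; return its PID or None."""
    # Phase 1: wait for the old process to finish draining and exit.
    deadline = clock() + drain_timeout
    while alive(old_pid) and clock() < deadline:
        sleep(interval)

    # Phase 2: wait for the unit to report active under a different PID.
    # Assumption: if phase 1 timed out we still try phase 2, since the
    # PID comparison below rejects the old process anyway.
    deadline = clock() + start_timeout
    while clock() < deadline:
        if is_active():
            new_pid = get_pid()
            if new_pid is not None and new_pid != old_pid:
                return new_pid
        sleep(interval)
    return None
```

A caller might wire this up as wait_for_restart(old_pid, lambda: unit_is_active("hermes-gateway"), get_running_pid) and print the timeout hints when it returns None. Passing sleep and clock as parameters keeps the loop testable without real delays.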

Test plan

  • Updated test to verify wait-for-restart behavior with mocked process lifecycle
  • All 118 gateway CLI tests pass
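
As an illustration of what a "mocked process lifecycle" can look like, the sketch below patches os.kill so that the first probes succeed and a later one raises ProcessLookupError. It is not the test from this PR. The helper wait_for_exit and both test names are hypothetical, and the attempt-count loop is a simplification of a time-based timeout.

```python
import os
from unittest import mock


def wait_for_exit(pid, attempts=10, sleep=lambda seconds: None):
    """Poll os.kill(pid, 0) up to `attempts` times; True once the PID is gone."""
    for _ in range(attempts):
        try:
            os.kill(pid, 0)
        except ProcessLookupError:
            return True
        sleep(0.5)
    return False


def test_wait_for_exit_sees_process_die():
    # Two probes find the process alive, the third finds it gone.
    lifecycle = [None, None, ProcessLookupError()]
    with mock.patch("os.kill", side_effect=lifecycle) as fake_kill:
        assert wait_for_exit(4242) is True
    assert fake_kill.call_count == 3
    fake_kill.assert_called_with(4242, 0)


def test_wait_for_exit_times_out():
    # The process never exits, so every probe succeeds and we give up.
    with mock.patch("os.kill", return_value=None) as fake_kill:
        assert wait_for_exit(4242, attempts=4) is False
    assert fake_kill.call_count == 4
```

Because os.kill is replaced for the duration of each test, no real signal is sent and the PID 4242 never needs to exist.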

Previously, systemd_restart() sent SIGUSR1 to the gateway, printed
'restart requested', and returned immediately. The gateway still
needed to drain active agents, exit with code 75, wait for systemd's
RestartSec=30, and start the new process. The user saw 'success' but
the gateway was actually down for 30-60 seconds.

Now the SIGUSR1 path blocks with progress feedback:

Phase 1 — wait for old process to die:
  ⏳ User service draining active work...
  Polls os.kill(pid, 0) until ProcessLookupError (up to 90s)

Phase 2 — wait for new process to become active:
  ⏳ Waiting for hermes-gateway to restart...
  Polls systemctl is-active + verifies new PID (up to 60s)

Success:
  ✓ User service restarted (PID 12345)

Timeout:
  ⚠ User service did not become active within 60s.
    Check status: hermes gateway status
    Check logs: journalctl --user -u hermes-gateway --since '2 min ago'

The reload-or-restart fallback path (line 1189) already blocks because
systemctl reload-or-restart is synchronous.

Test plan:
- Updated test to verify wait-for-restart behavior
- All 118 gateway CLI tests pass

Development

Successfully merging this pull request may close these issues.

hermes gateway restart returns before restart completes