fix: hermes gateway restart waits for service to come back up (#8260)#9945
Merged
Conversation
Previously, systemd_restart() sent SIGUSR1 to the gateway, printed
'restart requested', and returned immediately. The gateway still
needed to drain active agents, exit with code 75, wait for systemd's
RestartSec=30, and start the new process. The user saw 'success' but
the gateway was actually down for 30-60 seconds.
Now the SIGUSR1 path blocks with progress feedback:
Phase 1 — wait for old process to die:
⏳ User service draining active work...
Polls os.kill(pid, 0) until ProcessLookupError (up to 90s)
Phase 2 — wait for new process to become active:
⏳ Waiting for hermes-gateway to restart...
Polls systemctl is-active + verifies new PID (up to 60s)
Success:
✓ User service restarted (PID 12345)
Timeout:
⚠ User service did not become active within 60s.
Check status: hermes gateway status
Check logs: journalctl --user -u hermes-gateway --since '2 min ago'
The reload-or-restart fallback path (line 1189) already blocks because
systemctl reload-or-restart is synchronous.
Test plan:
- Updated test to verify wait-for-restart behavior
- All 118 gateway CLI tests pass
This was referenced Apr 15, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #8260 —
hermes gateway restartreturned immediately after sending SIGUSR1, while the gateway was still draining/restarting. Users saw "restart requested" but the service was down for 30-60 seconds with no feedback.Before
After
Or on timeout:
Implementation
Two-phase wait after sending SIGUSR1:
os.kill(pid, 0)until old process is deadsystemctl is-active+ verify new PID viaget_running_pid()The
reload-or-restartfallback path is already synchronous (systemctl blocks), so no changes needed there.Test plan