fix(gateway): exit non-zero on /restart so launchd revives the gateway #43498
liuhao1024 wants to merge 1 commit
When a /restart command is issued on a launchd-managed gateway (macOS), the process exits 0. launchd's default plist sets KeepAlive.SuccessfulExit=false, which treats a clean exit as intentional and does NOT revive the gateway, so the agent silently becomes unreachable.
Add a fallthrough branch after the systemd _restart_via_service check: when _restart_requested is True but no explicit service-manager shortcut was taken, return False (→ sys.exit(1)) so any service manager that uses KeepAlive / Restart=on-failure can restart the process.
Fixes NousResearch#43475
austinpickett left a comment
✅ APPROVE
Bug confirmed. Before the fix, restart_signal_handler (and the end of start_gateway) had this decision table:
| _signal_initiated_shutdown | _restart_via_service | _restart_requested | exit |
|---|---|---|---|
| True | False | False | 1 (non-zero) ✅ |
| any | True | any | 75 (SystemExit) ✅ |
| False | False | False | 0 (clean) ✅ |
| False | False | True | 0 ← BUG |
A /restart that doesn't flow through the systemd shortcut (e.g. on macOS launchd, plain systemd without the service-restart hook, or bare process supervisors) set _restart_requested=True but hit return True → sys.exit(0). launchd's default plist uses KeepAlive.SuccessfulExit=false, meaning exit 0 is treated as intentional and the gateway stays dead.
Fix is correct. A new if runner._restart_requested branch explicitly returns False (→ sys.exit(1)) when a restart was requested but no service-manager shortcut was used. The check is correctly ordered after the _restart_via_service branch (which raises SystemExit(75)), so systemd still gets its special code.
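The ordering described here can be sketched as two pure functions over the three flags. This is a minimal model of the decision table, not the code in gateway/run.py: the names `exit_code_before` and `exit_code_after` are hypothetical, and the relative order of the signal check and the new branch is an assumption (both yield exit 1, so the table cannot distinguish them).

```python
from itertools import product

RESTART_VIA_SERVICE_EXIT = 75  # exit code the review attributes to the systemd shortcut


def exit_code_before(signal_initiated: bool, via_service: bool, restart_requested: bool) -> int:
    """Hypothetical model of the pre-fix decision table."""
    if via_service:
        return RESTART_VIA_SERVICE_EXIT
    if signal_initiated:
        return 1
    # restart_requested is never consulted, so a plain /restart exits 0.
    return 0


def exit_code_after(signal_initiated: bool, via_service: bool, restart_requested: bool) -> int:
    """Hypothetical model of the post-fix decision table."""
    if via_service:
        return RESTART_VIA_SERVICE_EXIT
    if signal_initiated:
        return 1
    if restart_requested:
        # New fallthrough branch: report failure so the supervisor restarts us.
        return 1
    return 0


# Flag order in each tuple: (signal_initiated, via_service, restart_requested).
changed = [
    flags
    for flags in product([False, True], repeat=3)
    if exit_code_before(*flags) != exit_code_after(*flags)
]
print(changed)  # [(False, False, True)]
```

Under these assumptions, the only flag combination whose exit code changes is the row marked BUG in the table; every other combination, including the systemd path, is untouched.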
Clean-shutdown path unaffected. test_clean_shutdown_exits_zero confirms _restart_requested=False, _signal_initiated_shutdown=False → return True → exit 0. No regression.
launchd semantics. Exiting non-zero on /restart is correct: the service is expected to come back up, so it must signal failure to the supervisor. This is consistent with how systemd's Restart=on-failure and launchd's KeepAlive.SuccessfulExit=false both interpret the exit code.
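To make the launchd side concrete, the snippet below builds a job definition with Python's standard plistlib and models how the SuccessfulExit key interprets an exit code. It is a sketch under stated assumptions: the label and command line are placeholders rather than the project's real plist, and `would_restart` is a hypothetical helper that ignores launchd's other restart conditions such as throttling.

```python
import plistlib

job = {
    "Label": "com.example.gateway",  # hypothetical label
    "ProgramArguments": ["/usr/bin/env", "python3", "-m", "gateway.run"],  # hypothetical command
    "KeepAlive": {"SuccessfulExit": False},
}

# Serialize to the XML form launchd reads; the boolean becomes <false/>.
plist_xml = plistlib.dumps(job).decode("utf-8")


def would_restart(exit_code: int, successful_exit: bool) -> bool:
    """Simplified model: SuccessfulExit=true restarts after exit 0,
    SuccessfulExit=false restarts after any non-zero exit."""
    exited_cleanly = exit_code == 0
    return exited_cleanly == successful_exit


setting = job["KeepAlive"]["SuccessfulExit"]
print(would_restart(0, setting))  # False: a clean exit leaves the job stopped
print(would_restart(1, setting))  # True: a non-zero exit triggers a restart
```

With SuccessfulExit=false, exit 0 reads as "stopped on purpose", which is why the restart path has to report a non-zero status.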
Tests. Five scenarios are covered via inline logic replication of the decision block. They correctly verify all four rows of the table plus the SIGUSR1+restart edge case. The tests are unit-level (no live process), which is appropriate for this exit-code path.
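A table-driven sketch of what such inline-logic tests might look like follows. It does not reproduce tests/gateway/test_restart_exit_code.py: `decide_exit_code`, the scenario names, and the exact flag values chosen for each scenario are assumptions made for illustration, derived from the decision table and the five scenarios named in the PR description.

```python
def decide_exit_code(signal_initiated: bool, via_service: bool, restart_requested: bool) -> int:
    """Hypothetical replica of the post-fix exit-decision block."""
    if via_service:
        return 75
    if signal_initiated:
        return 1
    if restart_requested:
        return 1
    return 0


# (name, signal_initiated, via_service, restart_requested, expected exit code)
SCENARIOS = [
    ("restart_without_service", False, False, True, 1),
    ("restart_via_service", False, True, True, 75),
    ("signal_without_restart", True, False, False, 1),
    ("clean_shutdown", False, False, False, 0),
    ("signal_with_restart", True, False, True, 1),
]


def failing_scenarios() -> list:
    """Return the names of scenarios whose exit code differs from the expectation."""
    return [
        name
        for name, sig, svc, req, expected in SCENARIOS
        if decide_exit_code(sig, svc, req) != expected
    ]


print(failing_scenarios())  # [] when every scenario matches
```

Keeping the scenarios in one table makes it easy to check that each row of the decision table has a matching case.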
What does this PR do?
Fixes a bug where /restart on a launchd-managed gateway (macOS) exits 0, causing the gateway to stay dead because launchd's KeepAlive.SuccessfulExit=false treats a clean exit as intentional.
Related Issue
Fixes #43475
Type of Change
Changes Made
- gateway/run.py: Add a fallthrough branch in start_gateway() exit logic: when _restart_requested is True but no explicit service-manager shortcut was taken (neither systemd's _restart_via_service nor a signal-initiated shutdown), return False (→ sys.exit(1)) so any service manager using KeepAlive / Restart=on-failure can restart the process.
- tests/gateway/test_restart_exit_code.py: 5 unit tests covering all exit-decision branches: restart-without-service (exit 1), restart-via-service (exit 75), signal-without-restart (exit 1), clean-shutdown (exit 0), and signal-with-restart (exit 1).
How to Test
- /restart from any platform (Discord, Telegram, etc.).
- pytest tests/gateway/test_restart_exit_code.py -v: all 5 tests should pass.
Checklist
Code
- fix(scope):, feat(scope):, etc.)
- pytest tests/ -q and all tests pass
Documentation & Housekeeping
- docs/, docstrings), or N/A
- cli-config.yaml.example if I added/changed config keys, or N/A
- CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows, or N/A
Code Intelligence
- gateway/run.py:start_gateway exit-decision block (lines 15966–15995)
- _restart_requested is True and _restart_via_service is False