fix(gateway): exit non-zero on /restart so launchd revives the gateway #43498

Open
liuhao1024 wants to merge 1 commit into NousResearch:main from liuhao1024:fix/launchd-restart-nonzero-exit
@liuhao1024

Contributor

What does this PR do?

Fixes a bug where /restart on a launchd-managed gateway (macOS) exits 0, causing the gateway to stay dead because launchd's KeepAlive.SuccessfulExit=false treats a clean exit as intentional.
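
For readers unfamiliar with the launchd key involved, here is a minimal sketch that builds a job definition with Python's standard `plistlib`. Only the `KeepAlive` / `SuccessfulExit` setting comes from this PR; the label and program arguments are hypothetical placeholders, not the gateway's actual plist.

```python
import plistlib

# Hypothetical launchd job. Only KeepAlive.SuccessfulExit=false is taken
# from the PR description; Label and ProgramArguments are placeholders.
job = {
    "Label": "com.example.gateway",
    "ProgramArguments": ["/usr/bin/env", "python3", "-m", "gateway"],
    # SuccessfulExit=false: launchd restarts the job only when it exits
    # with a non-zero status. An exit status of 0 leaves it stopped.
    "KeepAlive": {"SuccessfulExit": False},
}

# Serialize to the XML property-list format that launchd reads.
plist_xml = plistlib.dumps(job).decode("utf-8")
print(plist_xml)
```

With this setting, a process that exits 0 on /restart is not relaunched, which is the failure this PR addresses.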

Related Issue

Fixes #43475

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

  • gateway/run.py: Add a fallthrough branch in start_gateway() exit logic — when _restart_requested is True but no explicit service-manager shortcut was taken (neither systemd's _restart_via_service nor a signal-initiated shutdown), return False (→ sys.exit(1)) so any service manager using KeepAlive / Restart=on-failure can restart the process.
  • tests/gateway/test_restart_exit_code.py: 5 unit tests covering all exit-decision branches: restart-without-service (exit 1), restart-via-service (exit 75), signal-without-restart (exit 1), clean-shutdown (exit 0), and signal-with-restart (exit 1).
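
The exit decision described above can be sketched as a pure function. This is an illustration under assumptions, not the code in gateway/run.py: `resolve_exit_code` is a hypothetical name, the flags are passed as plain booleans rather than read from the runner, and the relative order of the service and signal checks is assumed from the five listed test scenarios.

```python
def resolve_exit_code(
    signal_initiated: bool,
    restart_via_service: bool,
    restart_requested: bool,
) -> int:
    """Hypothetical model of the exit-decision branches after the fix."""
    if restart_via_service:
        # systemd shortcut: the real code raises SystemExit(75).
        return 75
    if signal_initiated:
        # Signal-initiated shutdown exits non-zero.
        return 1
    if restart_requested:
        # New fallthrough branch: a restart was requested but no
        # service-manager shortcut ran, so exit non-zero.
        return 1
    # Clean shutdown.
    return 0


scenarios = {
    "restart-without-service": resolve_exit_code(False, False, True),
    "restart-via-service": resolve_exit_code(False, True, True),
    "signal-without-restart": resolve_exit_code(True, False, False),
    "clean-shutdown": resolve_exit_code(False, False, False),
    "signal-with-restart": resolve_exit_code(True, False, True),
}
for name, code in scenarios.items():
    print(f"{name}: exit {code}")
```

Keeping the decision as a side-effect-free function is one way to test every branch without starting a live gateway process.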

How to Test

  1. Configure a launchd-managed gateway on macOS (the default install).
  2. Send /restart from any platform (Discord, Telegram, etc.).
  3. Verify the gateway restarts automatically (launchd revives it because exit code is now 1 instead of 0).
  4. Run pytest tests/gateway/test_restart_exit_code.py -v — all 5 tests should pass.

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

Code Intelligence

  • Analyzed: gateway/run.py:start_gateway exit-decision block (lines 15966–15995)
  • Blast radius: LOW — only affects the exit-code path when _restart_requested is True and _restart_via_service is False
  • Related patterns: signal-initiated shutdown (exit 1), systemd service restart (exit 75), planned-stop marker system

When a /restart command is issued on a launchd-managed gateway (macOS),
the process exited 0.  launchd's default plist sets
KeepAlive.SuccessfulExit=false, which treats a clean exit as intentional
and does NOT revive the gateway.  The agent goes silently unreachable.

Add a fallthrough branch after the systemd _restart_via_service check:
when _restart_requested is True but no explicit service-manager shortcut
was taken, return False (→ sys.exit(1)) so any service manager that uses
KeepAlive / Restart=on-failure can restart the process.

Fixes NousResearch#43475
@alt-glitch added the type/bug (Something isn't working), P2 (Medium, degraded but workaround exists), and comp/gateway (Gateway runner, session dispatch, delivery) labels on Jun 10, 2026

@austinpickett austinpickett left a comment

Collaborator

✅ APPROVE

Bug confirmed. Before the fix, restart_signal_handler (and the end of start_gateway) had this decision table:

| _signal_initiated_shutdown | _restart_via_service | _restart_requested | exit |
|---|---|---|---|
| True | False | False | 1 (non-zero) ✅ |
| any | True | any | 75 (SystemExit) ✅ |
| False | False | False | 0 (clean) ✅ |
| False | False | True | 0 ← BUG |

A /restart that doesn't flow through the systemd shortcut (e.g. on macOS launchd, plain systemd without the service-restart hook, or bare process supervisors) set _restart_requested=True but hit return True (→ sys.exit(0)). launchd's default plist uses KeepAlive.SuccessfulExit=false, meaning exit 0 is treated as intentional and the gateway stays dead.

Fix is correct. A new if runner._restart_requested branch explicitly returns False (→ sys.exit(1)) when a restart was requested but no service-manager shortcut was used. The check is correctly ordered after the _restart_via_service branch (which raises SystemExit(75)) so systemd still gets its special code.

Clean-shutdown path unaffected. test_clean_shutdown_exits_zero confirms _restart_requested=False, _signal_initiated=False → return True → exit 0. No regression.

launchd semantics. Exiting non-zero on /restart is correct: the service is expected to come back up, so it must signal failure to the supervisor. This is consistent with how systemd's Restart=on-failure and launchd's KeepAlive.SuccessfulExit=false both interpret the exit code.
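
The supervisor's side of this contract can be modelled roughly as follows. This is a simplified sketch with hypothetical names (`RestartPolicy`, `supervisor_restarts`); it considers only the exit code and ignores signals, timeouts, throttling, and systemd options such as SuccessExitStatus.

```python
from enum import Enum


class RestartPolicy(Enum):
    """Hypothetical labels for the two policies named in the review."""
    LAUNCHD_SUCCESSFUL_EXIT_FALSE = "launchd KeepAlive.SuccessfulExit=false"
    SYSTEMD_ON_FAILURE = "systemd Restart=on-failure"


def supervisor_restarts(policy: RestartPolicy, exit_code: int) -> bool:
    """Return True if the supervisor would relaunch the process.

    Simplified: both policies are treated as "restart only on a
    non-zero exit code".
    """
    if policy in (
        RestartPolicy.LAUNCHD_SUCCESSFUL_EXIT_FALSE,
        RestartPolicy.SYSTEMD_ON_FAILURE,
    ):
        return exit_code != 0
    raise ValueError(f"unknown policy: {policy}")


for policy in RestartPolicy:
    for code in (0, 1, 75):
        print(policy.value, code, supervisor_restarts(policy, code))
```

Under this model, exit 0 leaves the process stopped under both policies, while exit 1 and exit 75 both trigger a relaunch.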

Tests. Five scenarios are covered via inline logic replication of the decision block. They correctly verify all four rows of the table plus the SIGUSR1+restart edge case. The tests are unit-level (no live process), which is appropriate for this exit-code path.

Development

Successfully merging this pull request may close these issues.

/restart bricks a launchd-managed gateway on macOS — exits 0, KeepAlive.SuccessfulExit=false won't revive it

3 participants