fix(gateway): raise macOS launchd fd ceiling (8192/16384) + self-heal existing installs#35627
Open
banditburai wants to merge 3 commits into
Open
fix(gateway): raise macOS launchd fd ceiling (8192/16384) + self-heal existing installs#35627banditburai wants to merge 3 commits into
banditburai wants to merge 3 commits into
Conversation
Collaborator
tonydwb
approved these changes
May 31, 2026
tonydwb
left a comment
There was a problem hiding this comment.
Code Review Summary
Verdict: Approved ✅
What it does
Adds SoftResourceLimits (8192) and HardResourceLimits (16384) for NumberOfFiles to the generated macOS launchd plist, and adds a restart=False mode to refresh_launchd_plist_if_needed() so the gateway can self-heal the plist on the next boot without bootout/bootstrap-ing itself. Adds 6 tests covering plist generation, staleness detection, self-heal (no bootout), bootout with restart flag, and no-op when current.
✅ Looks Good
- Root cause addressed: macOS launchd defaults to 256 fd soft cap — dashboard's slash/PTY children exhaust it (#24775). The fix raises it to practical limits (8192/16384)
- Self-heal without self-termination: The
restart=Falseparameter is a well-designed mirror of systemd'sdaemon-reloadsemantics — refresh on disk, apply on next boot, don't SIGTERM yourself - Comprehensive tests: 6 tests covering all edge cases — plist content, staleness detection, no-restart mode, restart mode, no-op case
normalize_launchd_plist_for_comparisonhandles fd limits: The normalization function already strips whitespace/variable content before comparing, so the new SoftResourceLimits/HardResourceLimits keys are correctly compared
💡 Suggestions (non-blocking)
- The
except Exception: passblocks inrun_gateway()for both systemd and launchd are intentional (best-effort). Consider loggingexc_info=Trueat DEBUG level so operators can trace failed self-heal attempts from logs
Reviewed by Hermes Agent
tonydwb
approved these changes
May 31, 2026
tonydwb
left a comment
There was a problem hiding this comment.
Code Review Summary
Verdict: Approved ✅
Reviewed Changes
Raises macOS launchd fd ceiling from 256 to 8192/16384 soft/hard. Adds self-heal that refreshes existing installs on next gateway boot without self-bootout.
✅ Looks Good
- Correctness: Correct plist keys (SoftResourceLimits/HardResourceLimits → NumberOfFiles). The
restart=Falseparameter is essential for self-heal from inside the gateway process — avoids SIGTERM mid-startup. - Self-heal design: Well-thought-out — plist updated on disk; next restart applies it. Explicit-CLI callers keep
restart=Truefor immediate application. - Test coverage: 6 tests covering plist generation, stale detection, self-heal (no bootout), bootout path, and no-op when current.
- No security concerns: No credential exposure.
Reviewed by Hermes Agent
macOS launchd defaults to a 256 fd soft cap; leaked slash/PTY children under the dashboard exhaust it (NousResearch#24775). Children inherit the limit from the single ai.hermes.gateway plist. Old limit-less plists now read as stale so the self-heal rewrites them. Task: swl-qrf.14
…s existing installs Mirrors the systemd on-boot refresh: an exit-75/`/restart` respawn skips `hermes gateway restart`, so without this the plist fd-limit bump wouldn't reach macOS installs until a manual restart or `hermes update`. Task: swl-qrf.15
…launchd plist run_gateway() (the launchd-supervised process) called refresh_launchd_plist_if_needed(), which ran 'launchctl bootout <gateway>' + bootstrap on a stale plist. bootout SIGTERMs the very process executing it: best case an extra restart on upgrade, worst case the process dies before the bootstrap line runs and the job is left unloaded. A user-customized plist also got clobbered + restarted on every boot. Add restart=False for the in-process path so it mirrors systemd's daemon-reload semantics: refresh the plist on disk and let the next restart apply it, never bootout/bootstrap ourselves. The explicit-CLI callers (install/start) keep restart=True (separate process, safe). Replace the test that mocked refresh_launchd_plist_if_needed away with real stale-plist coverage: restart=False rewrites without launchctl, restart=True still bootout/bootstraps, and current plist is a no-op. Refs NousResearch#24775, NousResearch#32377
304c3b0 to
f666e80
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
macOS launchd jobs inherit a 256 file-descriptor soft limit by default (no
*ResourceLimitskeys), so a long-running gateway hitsEMFILE("Too many open files") under sustained worker/socket load. This raises the ceiling and delivers the bump to already-installed agents without a manual restart. This is OS hardening — it does not fix the leak. The leak cause is fixed by the sibling PRfix/tui-gateway-slash-worker-leak(Fixes #24775); PR1 stops the bleed, PR2 widens the margin.What changed
hermes_cli/gateway.py— addSoftResourceLimits/HardResourceLimits→NumberOfFiles8192 / 16384 to the generated launchd plist;run_gatewaynow callsrefresh_launchd_plist_if_needed(restart=False)so the bump reaches existing installs on the next gateway boot (an exit-75 //restartrespawn skipshermes gateway restart, so without this the change wouldn't land until a manual restart orhermes update).tests/test_gateway.py— +6 tests; replaces a test that mocked the refresh function away.Self-heal without self-bootout (
restart=False)run_gatewayruns inside the launchd-supervised process, so it must not bootout/bootstrap itself.restart=Falserewrites the plist on disk only and skipslaunchctl bootout/bootstrap, mirroring systemd'sdaemon-reloadsemantics (refresh the unit definition; let the next restart apply it). Withrestart=Truehere, thebootoutwould send SIGTERM to the live gateway mid-startup — worst case the process dies before thebootstrapline runs, leaving the job unloaded. The explicit-CLI callers (hermes gateway install/start) keep the defaultrestart=True; they run in a separate process where bootout/bootstrap is correct and applies the change immediately.Scope
Raises the fd ceiling; it does not reduce fd consumption or fix the leak. A fast-enough leak still exceeds 16384 — prevention lives in PR1.
Refs #24775— mitigation (higher ceiling), not closure; the leak cause is fixed in the sibling PR.Addresses #32377— fd headroom softens the blast radius of accumulated half-open-socket leaks; detection + reap live in PR1.Tests (6 passing)
test_launchd_plist_declares_fd_resource_limits(plist carries 8192/16384) ·test_launchd_plist_without_fd_limits_reads_as_stale(old plist detected stale) ·test_run_gateway_self_heals_launchd_plist_on_macos(run_gatewaypassesrestart=False) ·test_refresh_launchd_no_restart_rewrites_without_bootout(rewrites file, no launchctl call) ·test_refresh_launchd_with_restart_does_bootout(restart=Truedoes bootout + bootstrap) ·test_refresh_launchd_noop_when_current(current plist → no-op). The self-heal and no-bootout paths were previously untested (the prior test mocked the function away).