Skip to content

fix(gateway): raise macOS launchd fd ceiling (8192/16384) + self-heal existing installs#35627

Open
banditburai wants to merge 3 commits into
NousResearch:mainfrom
banditburai:fix/gateway-fd-limit-hardening
Open

fix(gateway): raise macOS launchd fd ceiling (8192/16384) + self-heal existing installs#35627
banditburai wants to merge 3 commits into
NousResearch:mainfrom
banditburai:fix/gateway-fd-limit-hardening

Conversation

@banditburai

@banditburai banditburai commented May 31, 2026

Copy link
Copy Markdown
Contributor

macOS launchd jobs inherit a 256 file-descriptor soft limit by default (no *ResourceLimits keys), so a long-running gateway hits EMFILE ("Too many open files") under sustained worker/socket load. This raises the ceiling and delivers the bump to already-installed agents without a manual restart. This is OS hardening — it does not fix the leak. The leak cause is fixed by the sibling PR fix/tui-gateway-slash-worker-leak (Fixes #24775); PR1 stops the bleed, PR2 widens the margin.

What changed

  • hermes_cli/gateway.py — add SoftResourceLimits/HardResourceLimitsNumberOfFiles 8192 / 16384 to the generated launchd plist; run_gateway now calls refresh_launchd_plist_if_needed(restart=False) so the bump reaches existing installs on the next gateway boot (an exit-75 / /restart respawn skips hermes gateway restart, so without this the change wouldn't land until a manual restart or hermes update).
  • tests/test_gateway.py — +6 tests; replaces a test that mocked the refresh function away.

Self-heal without self-bootout (restart=False)

run_gateway runs inside the launchd-supervised process, so it must not bootout/bootstrap itself. restart=False rewrites the plist on disk only and skips launchctl bootout/bootstrap, mirroring systemd's daemon-reload semantics (refresh the unit definition; let the next restart apply it). With restart=True here, the bootout would send SIGTERM to the live gateway mid-startup — worst case the process dies before the bootstrap line runs, leaving the job unloaded. The explicit-CLI callers (hermes gateway install/start) keep the default restart=True; they run in a separate process where bootout/bootstrap is correct and applies the change immediately.

Scope

Raises the fd ceiling; it does not reduce fd consumption or fix the leak. A fast-enough leak still exceeds 16384 — prevention lives in PR1.

  • Refs #24775 — mitigation (higher ceiling), not closure; the leak cause is fixed in the sibling PR.
  • Addresses #32377 — fd headroom softens the blast radius of accumulated half-open-socket leaks; detection + reap live in PR1.

Tests (6 passing)

test_launchd_plist_declares_fd_resource_limits (plist carries 8192/16384) · test_launchd_plist_without_fd_limits_reads_as_stale (old plist detected stale) · test_run_gateway_self_heals_launchd_plist_on_macos (run_gateway passes restart=False) · test_refresh_launchd_no_restart_rewrites_without_bootout (rewrites file, no launchctl call) · test_refresh_launchd_with_restart_does_bootout (restart=True does bootout + bootstrap) · test_refresh_launchd_noop_when_current (current plist → no-op). The self-heal and no-bootout paths were previously untested (the prior test mocked the function away).

@alt-glitch alt-glitch added type/bug Something isn't working comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists labels May 31, 2026
@alt-glitch

Copy link
Copy Markdown
Collaborator

Supersedes #26536 (which raised fd limit to 4096). This PR raises to 8192/16384 and adds self-heal for existing installs. Sibling of #35626 (fixes the leak cause). Refs #24775.

@tonydwb tonydwb left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Verdict: Approved

What it does

Adds SoftResourceLimits (8192) and HardResourceLimits (16384) for NumberOfFiles to the generated macOS launchd plist, and adds a restart=False mode to refresh_launchd_plist_if_needed() so the gateway can self-heal the plist on the next boot without bootout/bootstrap-ing itself. Adds 6 tests covering plist generation, staleness detection, self-heal (no bootout), bootout with restart flag, and no-op when current.

✅ Looks Good

  • Root cause addressed: macOS launchd defaults to 256 fd soft cap — dashboard's slash/PTY children exhaust it (#24775). The fix raises it to practical limits (8192/16384)
  • Self-heal without self-termination: The restart=False parameter is a well-designed mirror of systemd's daemon-reload semantics — refresh on disk, apply on next boot, don't SIGTERM yourself
  • Comprehensive tests: 6 tests covering all edge cases — plist content, staleness detection, no-restart mode, restart mode, no-op case
  • normalize_launchd_plist_for_comparison handles fd limits: The normalization function already strips whitespace/variable content before comparing, so the new SoftResourceLimits/HardResourceLimits keys are correctly compared

💡 Suggestions (non-blocking)

  • The except Exception: pass blocks in run_gateway() for both systemd and launchd are intentional (best-effort). Consider logging exc_info=True at DEBUG level so operators can trace failed self-heal attempts from logs

Reviewed by Hermes Agent

@tonydwb tonydwb left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

Verdict: Approved

Reviewed Changes

Raises macOS launchd fd ceiling from 256 to 8192/16384 soft/hard. Adds self-heal that refreshes existing installs on next gateway boot without self-bootout.

✅ Looks Good

  • Correctness: Correct plist keys (SoftResourceLimits/HardResourceLimits → NumberOfFiles). The restart=False parameter is essential for self-heal from inside the gateway process — avoids SIGTERM mid-startup.
  • Self-heal design: Well-thought-out — plist updated on disk; next restart applies it. Explicit-CLI callers keep restart=True for immediate application.
  • Test coverage: 6 tests covering plist generation, stale detection, self-heal (no bootout), bootout path, and no-op when current.
  • No security concerns: No credential exposure.

Reviewed by Hermes Agent

macOS launchd defaults to a 256 fd soft cap; leaked slash/PTY children under
the dashboard exhaust it (NousResearch#24775). Children inherit the limit from the single
ai.hermes.gateway plist. Old limit-less plists now read as stale so the
self-heal rewrites them.

Task: swl-qrf.14
…s existing installs

Mirrors the systemd on-boot refresh: an exit-75/`/restart` respawn skips
`hermes gateway restart`, so without this the plist fd-limit bump wouldn't reach
macOS installs until a manual restart or `hermes update`.

Task: swl-qrf.15
…launchd plist

run_gateway() (the launchd-supervised process) called
refresh_launchd_plist_if_needed(), which ran 'launchctl bootout <gateway>'
+ bootstrap on a stale plist. bootout SIGTERMs the very process executing
it: best case an extra restart on upgrade, worst case the process dies
before the bootstrap line runs and the job is left unloaded. A
user-customized plist also got clobbered + restarted on every boot.

Add restart=False for the in-process path so it mirrors systemd's
daemon-reload semantics: refresh the plist on disk and let the next
restart apply it, never bootout/bootstrap ourselves. The explicit-CLI
callers (install/start) keep restart=True (separate process, safe).

Replace the test that mocked refresh_launchd_plist_if_needed away with
real stale-plist coverage: restart=False rewrites without launchctl,
restart=True still bootout/bootstraps, and current plist is a no-op.

Refs NousResearch#24775, NousResearch#32377
@banditburai banditburai force-pushed the fix/gateway-fd-limit-hardening branch from 304c3b0 to f666e80 Compare June 1, 2026 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp/gateway Gateway runner, session dispatch, delivery P2 Medium — degraded but workaround exists type/bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants