fix(gateway): retryable platform failures stay alive instead of restart-looping by kainotomic-inc · Pull Request #11241 · NousResearch/hermes-agent

kainotomic-inc · 2026-04-16T21:26:49Z

Problem

In GatewayRunner._handle_adapter_fatal_error (gateway/run.py), when a platform adapter with fatal_error_retryable=True fires a fatal error and no other adapters are currently connected, the second elif branch currently calls self.stop() "so systemd Restart=on-failure can restart the process."

Under systemd this is fine. Under container runtimes with a restart policy (docker-compose restart: always, k8s Always), it produces an endless restart loop: SIGTERM → container exits → runtime restarts it → same transient failure fires → exit → repeat. The _failed_platforms retry queue populated three lines earlier never gets a chance to reconnect.

Reproduction

1. Deploy a gateway with WHATSAPP_ENABLED=true in a container with an expired WhatsApp session file.
2. Bridge starts, generates QR codes for ~60s, nobody scans, bridge exits with code -15.
3. whatsapp.py calls _set_fatal_error("whatsapp_bridge_exited", retryable=True).
4. _handle_adapter_fatal_error runs; self.adapters is empty after the adapter is popped; self._failed_platforms now contains WhatsApp.
5. The elif branch fires, self._exit_with_failure = True, self.stop() called, container exits.
6. Docker restarts it. Goto 1.

On a gateway with WhatsApp as the only enabled platform this is infinite. On one with Telegram + WhatsApp, Telegram never has time to connect before WhatsApp takes everyone down.

Fix

Drop the retryable-exit branch. "Retryable" means the retry loop (_reconnect_failed_platforms, 30s cadence) can recover, so the gateway should stay alive and let that run. The existing else of that elif was already unreachable — non-retryable failures aren't queued into _failed_platforms above, so a non-retryable error reaching this branch with an empty self.adapters was impossible. Collapsing both into the warning simplifies the code path.
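
To illustrate why staying alive is enough for a retryable failure, here is a minimal sketch of a periodic reconnect loop. It is not the gateway's actual _reconnect_failed_platforms; the function name, the `connect` callable, and the `max_rounds` parameter are assumptions made up for this example. Only the 30s cadence comes from the description above.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


async def reconnect_failed_platforms(
    failed: Dict[str, int],
    connect: Callable[[str], Awaitable[bool]],
    interval: float = 30.0,
    max_rounds: Optional[int] = None,
) -> None:
    """Retry every queued platform once per interval until the queue is empty.

    `failed` maps a platform name to its failed reconnect attempts so far.
    `connect` is a hypothetical coroutine returning True on success.
    """
    rounds = 0
    while failed and (max_rounds is None or rounds < max_rounds):
        await asyncio.sleep(interval)
        # Iterate over a copy because entries are removed on success.
        for name in list(failed):
            if await connect(name):
                del failed[name]
            else:
                failed[name] += 1
        rounds += 1
```

The loop only makes progress if the process survives the fatal event, which is the point of dropping the exit branch.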

Non-retryable failures are unaffected: they still hit the first if not self.adapters and not self._failed_platforms branch and exit cleanly. If every retryable platform ends up permanently broken, _failed_platforms drains and the first if-branch catches that case on the next fatal error.
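
The resulting decision logic can be modelled roughly as follows. This is a simplified sketch, not the code in gateway/run.py: the GatewayModel class, its fields, and the returned strings are hypothetical, and the real handler logs and calls self.stop() rather than returning a label.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class GatewayModel:
    """Toy model of the post-fix branches in the fatal-error handler."""

    adapters: Set[str] = field(default_factory=set)
    failed_platforms: Set[str] = field(default_factory=set)

    def handle_fatal(self, platform: str, retryable: bool) -> str:
        # The failing adapter is removed from the connected set.
        self.adapters.discard(platform)
        # Only retryable failures are queued for reconnection.
        if retryable:
            self.failed_platforms.add(platform)

        # Nothing connected and nothing left to retry: exit.
        if not self.adapters and not self.failed_platforms:
            return "exit"
        # Nothing connected, but a retry is queued: stay alive and warn.
        if not self.adapters:
            return "wait_for_reconnect"
        # Other platforms are still connected and keep serving.
        return "keep_serving"
```

For example, a WhatsApp-only gateway with a retryable failure yields "wait_for_reconnect" in this model, while the same failure marked non-retryable yields "exit".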

Test

Reproduced on a dockerised gateway with WhatsApp + stale bridge session; confirmed restart loop. Applied the patch; confirmed:

fatal_error_retryable=True still fires the error log.
_failed_platforms still gets the platform queued for reconnection.
Process PID 1 stays alive across the fatal event (container RestartCount stays at 0 over 2+ minutes).
Other platforms (Telegram) continue serving throughout.
Healthcheck endpoint stays 200.

Verified across 6 production gateways after rollout.
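
The RestartCount check above could be scripted along these lines. This is a sketch that assumes the Docker CLI is on PATH and that `docker inspect` prints a JSON array whose objects carry a top-level RestartCount field; the helper names are invented for this example.

```python
import json
import subprocess


def parse_restart_count(inspect_output: str) -> int:
    """Extract RestartCount from the JSON printed by `docker inspect`."""
    containers = json.loads(inspect_output)
    if not containers:
        raise ValueError("docker inspect returned no containers")
    return int(containers[0]["RestartCount"])


def container_restart_count(container: str) -> int:
    """Run `docker inspect <container>` and return its restart count."""
    result = subprocess.run(
        ["docker", "inspect", container],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_restart_count(result.stdout)
```

Polling container_restart_count for a couple of minutes after triggering the fatal event and asserting that it stays at 0 would mirror the manual check.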

…rt-looping

When a platform adapter with fatal_error_retryable=True fires a fatal error and no other adapters are currently connected, the elif branch in _handle_adapter_fatal_error calls self.stop() "so systemd Restart=on-failure can restart the process." In systemd this is fine. In Docker (docker-compose restart: always, k8s Always restart policy) it produces a restart loop: SIGTERM -> container exits -> runtime restarts it -> same transient failure fires again -> exit -> repeat. The _failed_platforms retry queue populated three lines earlier becomes dead code.

Observed symptom: a WhatsApp bridge with an expired session times out its QR pairing flow after ~60s, exits with code -15, and fatal_error_retryable=True. The enclosing gateway process dies. Container restart policy brings it back. Same thing happens on the next boot. Other platforms (Telegram, API server) never have time to come up. The customer's dashboard says "Telegram connected" because the channel config is present, but no inbound message ever reaches the agent.

Drop the retryable-exit branch. "Retryable" means the retry loop can recover, so the gateway should stay alive and let _reconnect_failed_platforms do its job every 30s. Non-retryable failures don't reach this branch because the preceding block only queues retryable ones into _failed_platforms; non-retryable failures still hit the first if-branch and exit cleanly. Cron keeps running. Platforms that DO connect keep serving. If all platforms end up permanently broken the first if-branch still catches that case once _failed_platforms drains.

The existing else-branch of this elif was unreachable code (non-retryable failures can't reach it), so the only live code path was the exit-with-failure one; this commit collapses both into the warning.

futureworld678 · 2026-04-24T17:36:44Z

Hit the same root cause from a different entry point: WSL2 suspend/poweroff sends SIGTERM to the WhatsApp bridge child, and whatsapp.py::_check_managed_bridge_exit currently flags any non-None returncode as a fatal error — including -15 (SIGTERM), -2 (SIGINT), -9 (SIGKILL). Combined with the behavior this PR addresses, it produces the exact restart-loop symptom on WSL2 every time Windows suspends/resumes:

Apr 24 18:14:20  systemd-resolved: Clock change detected. Flushing caches.       # resume
Apr 24 18:14:36  systemd-logind:   The system will power off now!                 # suspend
Apr 24 18:14:37  gateway.platforms.whatsapp: ERROR WhatsApp bridge process exited unexpectedly (code -15).
Apr 24 18:14:37  gateway.run: ERROR Fatal whatsapp adapter error (whatsapp_bridge_exited)
Apr 24 18:14:38  systemd: hermes-gateway.service: Main process exited, status=1/FAILURE
...  [user resumes Windows] ...
Apr 24 18:28:22  gateway.run: ERROR Fatal whatsapp adapter error (whatsapp_bridge_exited)   # cycle repeats
Apr 24 18:28:23  systemd: Main process exited, status=75/TEMPFAIL
Apr 24 18:28:26  systemd: Started hermes-gateway.service.                              # ~14s later, back up

Not a big deal in practice (systemd Restart=always bounces the gateway within ~15s), but the logs make it look like a real crash and the extra restart adds startup latency.

Suggested complementary tweak, orthogonal to the run.py change in this PR — treat negative returncodes as a normal shutdown path in platforms/whatsapp.py:

async def _check_managed_bridge_exit(self) -> Optional[str]:
    if self._bridge_process is None:
        return None
    returncode = self._bridge_process.poll()
    if returncode is None:
        return None

    # Negative returncode = killed by signal. -15/-2/-9 are normal
    # shutdown paths (WSL2 suspend, systemctl stop, OS poweroff) —
    # not failures. Don't fire fatal_error for them.
    if returncode in (-15, -2, -9):
        logger.info(
            "[%s] WhatsApp bridge received signal %d (normal shutdown path)",
            self.name, -returncode,
        )
        self._close_bridge_log()
        return None

    message = f"WhatsApp bridge process exited unexpectedly (code {returncode})."
    # ... existing fatal_error path unchanged

This PR fixes the runtime-level symptom; the snippet above avoids firing the error in the first place on the signal path. Happy to send a separate PR with this if it'd help — or leave it here for whoever picks this up.
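
As a variation on the snippet above, the signal numbers can be taken from the standard signal module instead of being hard-coded. This is a minimal sketch with hypothetical helper names; it relies on the POSIX convention that subprocess reports a child killed by signal N with returncode -N, and it keeps the comment's assumption that SIGTERM, SIGINT, and SIGKILL count as normal shutdown paths.

```python
import signal
from typing import Optional

# Signals treated as a normal shutdown path, per the suggestion above.
SHUTDOWN_SIGNALS = frozenset({signal.SIGTERM, signal.SIGINT, signal.SIGKILL})


def terminating_signal(returncode: Optional[int]) -> Optional[signal.Signals]:
    """Return the signal that killed the child, or None if it was not signalled."""
    if returncode is None or returncode >= 0:
        return None
    try:
        return signal.Signals(-returncode)
    except ValueError:
        # Negative code that does not map to a known signal on this platform.
        return None


def is_shutdown_signal_exit(returncode: Optional[int]) -> bool:
    """True when the child exited because of one of SHUTDOWN_SIGNALS."""
    return terminating_signal(returncode) in SHUTDOWN_SIGNALS
```

With this, the log line could print terminating_signal(returncode).name (for example SIGTERM) rather than the bare number.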

denhubr mentioned this pull request on Apr 20, 2026, in "fix(gateway): keep retryable platform reconnects queued" #13197 (Open, 19 tasks)

alt-glitch added labels on Apr 24, 2026: type/bug (Something isn't working), P2 (Medium: degraded but workaround exists), comp/gateway (Gateway runner, session dispatch, delivery), platform/whatsapp (WhatsApp Business adapter)

kainotomic-inc wants to merge 1 commit into NousResearch:main from kainotomic-inc:upstream-pr/gateway-retryable-non-fatal
