fix(gateway): retryable platform failures stay alive instead of restart-looping #11241

Open
kainotomic-inc wants to merge 1 commit into NousResearch:main from kainotomic-inc:upstream-pr/gateway-retryable-non-fatal

@kainotomic-inc

Problem

In GatewayRunner._handle_adapter_fatal_error (gateway/run.py), when a platform adapter with fatal_error_retryable=True fires a fatal error and no other adapters are currently connected, the second elif branch currently calls self.stop() "so systemd Restart=on-failure can restart the process."

Under systemd this is fine. Under container runtimes with a restart policy (docker-compose restart: always, k8s Always), it produces an endless restart loop: SIGTERM → container exits → runtime restarts it → same transient failure fires → exit → repeat. The _failed_platforms retry queue populated three lines earlier never gets a chance to reconnect.

Reproduction

  1. Deploy a gateway with WHATSAPP_ENABLED=true in a container with an expired WhatsApp session file.
  2. Bridge starts, generates QR codes for ~60s, nobody scans, bridge exits with code -15.
  3. whatsapp.py calls _set_fatal_error("whatsapp_bridge_exited", retryable=True).
  4. _handle_adapter_fatal_error runs; self.adapters is empty after the adapter is popped; self._failed_platforms now contains WhatsApp.
  5. The elif branch fires, self._exit_with_failure = True, self.stop() called, container exits.
  6. Docker restarts the container and the sequence repeats from step 2.

On a gateway with WhatsApp as the only enabled platform, the loop never ends. On one with Telegram + WhatsApp, Telegram never has time to connect before the WhatsApp failure stops the whole process.

Fix

Drop the retryable-exit branch. "Retryable" means the retry loop (_reconnect_failed_platforms, 30s cadence) can recover, so the gateway should stay alive and let that run. The existing else of that elif was already unreachable — non-retryable failures aren't queued into _failed_platforms above, so a non-retryable error reaching this branch with an empty self.adapters was impossible. Collapsing both into the warning simplifies the code path.
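For readers unfamiliar with the retry queue, the following is a minimal sketch of a background reconnect loop of the kind described above. It is not the code of _reconnect_failed_platforms, which this PR does not show; the function name reconnect_failed, the dict-of-callables queue, and the max_rounds parameter are assumptions made for illustration. Only the 30s cadence comes from the description.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional


async def reconnect_failed(
    failed: Dict[str, Callable[[], Awaitable[bool]]],
    interval: float = 30.0,
    max_rounds: Optional[int] = None,
) -> None:
    """Retry every queued platform once per interval until the queue is empty.

    `failed` maps a platform name to a hypothetical async connect callable
    that returns True on success. `max_rounds` exists only to bound the sketch.
    """
    rounds = 0
    while failed and (max_rounds is None or rounds < max_rounds):
        await asyncio.sleep(interval)
        for name in list(failed):          # copy keys: we delete while iterating
            try:
                connected = await failed[name]()
            except Exception:
                connected = False          # a failed attempt stays in the queue
            if connected:
                del failed[name]           # recovered, stop retrying this one
        rounds += 1
```

The point of the sketch is that recovery happens inside the running process. If the process exits on the first retryable error, a loop like this never reaches its first iteration.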

Non-retryable failures are unaffected: they still hit the first if not self.adapters and not self._failed_platforms branch and exit cleanly. If every retryable platform ends up permanently broken, _failed_platforms drains and the first if-branch catches that case on the next fatal.
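The branch logic before and after the patch can be modelled roughly as follows. This is a toy reconstruction from the description in this PR, not the contents of gateway/run.py: GatewayModel, handle_fatal, and the patched flag are hypothetical names, and the real handler logs and calls self.stop() rather than returning a string.

```python
from dataclasses import dataclass, field


@dataclass
class GatewayModel:
    """Toy stand-in for the runner state; not the real GatewayRunner."""
    adapters: set = field(default_factory=set)
    failed_platforms: set = field(default_factory=set)


def handle_fatal(gw: GatewayModel, platform: str, retryable: bool, patched: bool) -> str:
    """Return "exit" or "stay" for one fatal adapter error."""
    gw.adapters.discard(platform)          # the failing adapter is popped
    if retryable:
        gw.failed_platforms.add(platform)  # queued for background reconnect

    if not gw.adapters and not gw.failed_platforms:
        return "exit"                      # nothing connected, nothing to retry
    if not gw.adapters and gw.failed_platforms:
        if retryable and not patched:
            return "exit"                  # pre-patch: stop and rely on the supervisor
        return "stay"                      # post-patch: warn, let the retry loop run
    return "stay"                          # other adapters are still serving
```

With WhatsApp as the only adapter and a retryable error, the model returns "exit" with patched=False and "stay" with patched=True. A non-retryable error on the last remaining adapter returns "exit" in both cases.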

Test

Reproduced on a dockerised gateway with WhatsApp + stale bridge session; confirmed restart loop. Applied the patch; confirmed:

  • fatal_error_retryable=True still fires the error log.
  • _failed_platforms still gets the platform queued for reconnection.
  • Process PID 1 stays alive across the fatal event (container RestartCount stays at 0 over 2+ minutes).
  • Other platforms (Telegram) continue serving throughout.
  • Healthcheck endpoint stays 200.

Verified across 6 production gateways after rollout.

…rt-looping

When a platform adapter with fatal_error_retryable=True fires a fatal
error and no other adapters are currently connected, the elif branch
in _handle_adapter_fatal_error calls self.stop() "so systemd
Restart=on-failure can restart the process." In systemd this is fine.
In Docker (docker-compose restart: always, k8s Always restart policy)
it produces a restart loop: SIGTERM -> container exits -> runtime
restarts it -> same transient failure fires again -> exit -> repeat.
The _failed_platforms retry queue populated three lines earlier
becomes dead code.

Observed symptom: a WhatsApp bridge with an expired session times
out its QR pairing flow after ~60s, exits with code -15, and
fatal_error_retryable=True. The enclosing gateway process dies.
Container restart policy brings it back. Same thing happens on the
next boot. Other platforms (Telegram, API server) never have time to
come up. The customer's dashboard says "Telegram connected" because
the channel config is present, but no inbound message ever reaches
the agent.

Drop the retryable-exit branch. "Retryable" means the retry loop can
recover, so the gateway should stay alive and let _reconnect_failed_platforms
do its job every 30s. Non-retryable failures don't reach this branch
because the preceding block only queues retryable ones into
_failed_platforms; non-retryable failures still hit the first if-branch
and exit cleanly. Cron keeps running. Platforms that DO connect keep
serving. If all platforms end up permanently broken the first if-branch
still catches that case once _failed_platforms drains.

The existing else-branch of this elif was unreachable code (non-retryable
failures can't reach it), so the only live code path was the
exit-with-failure one; this commit collapses both into the warning.
@futureworld678

Hit the same root cause from a different entry point: WSL2 suspend/poweroff sends SIGTERM to the WhatsApp bridge child, and whatsapp.py::_check_managed_bridge_exit currently flags any non-None returncode as a fatal error — including -15 (SIGTERM), -2 (SIGINT), -9 (SIGKILL). Combined with the behavior this PR addresses, it produces the exact restart-loop symptom on WSL2 every time Windows suspends/resumes:

Apr 24 18:14:20  systemd-resolved: Clock change detected. Flushing caches.       # resume
Apr 24 18:14:36  systemd-logind:   The system will power off now!                 # suspend
Apr 24 18:14:37  gateway.platforms.whatsapp: ERROR WhatsApp bridge process exited unexpectedly (code -15).
Apr 24 18:14:37  gateway.run: ERROR Fatal whatsapp adapter error (whatsapp_bridge_exited)
Apr 24 18:14:38  systemd: hermes-gateway.service: Main process exited, status=1/FAILURE
...  [user resumes Windows] ...
Apr 24 18:28:22  gateway.run: ERROR Fatal whatsapp adapter error (whatsapp_bridge_exited)   # cycle repeats
Apr 24 18:28:23  systemd: Main process exited, status=75/TEMPFAIL
Apr 24 18:28:26  systemd: Started hermes-gateway.service.                              # ~14s later, back up

Not a big deal in practice (systemd Restart=always bounces the gateway within ~15s), but the logs make it look like a real crash and the extra restart adds startup latency.

Suggested complementary tweak, orthogonal to the run.py change in this PR — treat negative returncodes as a normal shutdown path in platforms/whatsapp.py:

async def _check_managed_bridge_exit(self) -> Optional[str]:
    if self._bridge_process is None:
        return None
    returncode = self._bridge_process.poll()
    if returncode is None:
        return None

    # Negative returncode = killed by signal. -15/-2/-9 are normal
    # shutdown paths (WSL2 suspend, systemctl stop, OS poweroff) —
    # not failures. Don't fire fatal_error for them.
    if returncode in (-15, -2, -9):
        logger.info(
            "[%s] WhatsApp bridge received signal %d (normal shutdown path)",
            self.name, -returncode,
        )
        self._close_bridge_log()
        return None

    message = f"WhatsApp bridge process exited unexpectedly (code {returncode})."
    # ... existing fatal_error path unchanged

This PR fixes the runtime-level symptom; the snippet above avoids firing the error in the first place on the signal path. Happy to send a separate PR with this if it'd help — or leave it here for whoever picks this up.
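As a possible refinement of the snippet above, the magic numbers could be replaced with names from the standard signal module. This is a sketch only: signal_from_returncode, is_normal_shutdown, and SHUTDOWN_SIGNALS are hypothetical helpers, not part of whatsapp.py. It relies on the documented subprocess behaviour that, on POSIX, a returncode of -N means the child was terminated by signal N.

```python
import signal
from typing import Optional

# Signals the comment above treats as ordinary shutdown paths.
# getattr keeps the sketch importable on platforms that lack a given signal.
SHUTDOWN_SIGNALS = {
    sig
    for sig in (getattr(signal, name, None) for name in ("SIGTERM", "SIGINT", "SIGKILL"))
    if sig is not None
}


def signal_from_returncode(returncode: Optional[int]) -> Optional[signal.Signals]:
    """Map a subprocess returncode to the signal that killed it, if any."""
    if returncode is None or returncode >= 0:
        return None                      # still running, or exited on its own
    try:
        return signal.Signals(-returncode)
    except ValueError:
        return None                      # negative, but not a known signal here


def is_normal_shutdown(returncode: Optional[int]) -> bool:
    """True when the child was killed by one of the shutdown signals."""
    return signal_from_returncode(returncode) in SHUTDOWN_SIGNALS
```

The check in _check_managed_bridge_exit would then read `if is_normal_shutdown(returncode):`, and the log line could print the signal name instead of its number. Whether SIGKILL belongs in the set is a judgment call, since it can also come from the kernel OOM killer.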

alt-glitch added the type/bug, P2, comp/gateway, and platform/whatsapp labels on Apr 24, 2026