fix(gateway): retryable platform failures stay alive instead of restart-looping #11241
…rt-looping

When a platform adapter with fatal_error_retryable=True fires a fatal error and no other adapters are currently connected, the elif branch in _handle_adapter_fatal_error calls self.stop() "so systemd Restart=on-failure can restart the process."

In systemd this is fine. In Docker (docker-compose restart: always, k8s Always restart policy) it produces a restart loop: SIGTERM -> container exits -> runtime restarts it -> same transient failure fires again -> exit -> repeat. The _failed_platforms retry queue populated three lines earlier becomes dead code.

Observed symptom: a WhatsApp bridge with an expired session times out its QR pairing flow after ~60s and exits with code -15, and the adapter reports the failure with fatal_error_retryable=True. The enclosing gateway process dies. Container restart policy brings it back. Same thing happens on the next boot. Other platforms (Telegram, API server) never have time to come up. The customer's dashboard says "Telegram connected" because the channel config is present, but no inbound message ever reaches the agent.

Drop the retryable-exit branch. "Retryable" means the retry loop can recover, so the gateway should stay alive and let _reconnect_failed_platforms do its job every 30s. Non-retryable failures don't reach this branch because the preceding block only queues retryable ones into _failed_platforms; non-retryable failures still hit the first if-branch and exit cleanly.

Cron keeps running. Platforms that DO connect keep serving. If all platforms end up permanently broken, the first if-branch still catches that case once _failed_platforms drains.

The existing else-branch of this elif was unreachable code (non-retryable failures can't reach it), so the only live code path was the exit-with-failure one; this commit collapses both into the warning.
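The branch structure described in the commit message can be illustrated with a minimal, self-contained sketch. This is not the code from gateway/run.py: the class MiniGateway, its field names, and the returned action strings are hypothetical stand-ins chosen for illustration, and the real handler logs and schedules work rather than returning a string.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class MiniGateway:
    """Hypothetical stand-in for the gateway runner (illustration only)."""

    adapters: Dict[str, object] = field(default_factory=dict)  # connected platforms
    failed_platforms: Set[str] = field(default_factory=set)    # queued for retry
    stopped: bool = False

    def handle_adapter_fatal_error(self, platform: str, retryable: bool) -> str:
        # The failing adapter is removed from the connected set.
        self.adapters.pop(platform, None)

        # Only retryable failures are queued for the periodic reconnect loop.
        if retryable:
            self.failed_platforms.add(platform)

        # Nothing connected and nothing left to retry: exit.
        if not self.adapters and not self.failed_platforms:
            self.stopped = True
            return "exit"

        # Nothing connected, but the retry queue is non-empty. After the fix
        # this path only warns; the process stays up so the queue can drain.
        if not self.adapters:
            return "stay_alive_and_retry"

        # Other platforms are still connected and keep serving.
        return "keep_serving"
```

With WhatsApp as the only platform, a retryable failure yields "stay_alive_and_retry" and leaves stopped as False, whereas a non-retryable failure on the same setup yields "exit".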
Hit the same root cause from a different entry point: WSL2 suspend/poweroff sends SIGTERM to the WhatsApp bridge child, and

Not a big deal in practice (systemd

Suggested complementary tweak, orthogonal to the

async def _check_managed_bridge_exit(self) -> Optional[str]:
    if self._bridge_process is None:
        return None
    returncode = self._bridge_process.poll()
    if returncode is None:
        return None
    # Negative returncode = killed by signal. -15/-2/-9 are normal
    # shutdown paths (WSL2 suspend, systemctl stop, OS poweroff),
    # not failures. Don't fire fatal_error for them.
    if returncode in (-15, -2, -9):
        logger.info(
            "[%s] WhatsApp bridge received signal %d (normal shutdown path)",
            self.name, -returncode,
        )
        self._close_bridge_log()
        return None
    message = f"WhatsApp bridge process exited unexpectedly (code {returncode})."
    # ... existing fatal_error path unchanged

This PR fixes the runtime-level symptom; the snippet above avoids firing the error in the first place on the signal path. Happy to send a separate PR with this if it'd help, or leave it here for whoever picks this up.
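As a standalone illustration of the return-code check in the snippet above, here is a small sketch that classifies a subprocess return code. The helper name shutdown_signal_name and the constant NORMAL_SHUTDOWN_SIGNALS are hypothetical, not part of the adapter. It assumes POSIX semantics, where Popen.returncode is -N when the child was terminated by signal N.

```python
import signal
from typing import Optional

# Signal numbers treated as "normal shutdown" in the snippet above:
# 2 = SIGINT, 9 = SIGKILL, 15 = SIGTERM.
NORMAL_SHUTDOWN_SIGNALS = frozenset({2, 9, 15})


def shutdown_signal_name(returncode: Optional[int]) -> Optional[str]:
    """Return the signal name if returncode denotes a normal-shutdown signal.

    Returns None when the process is still running (returncode is None),
    exited with an ordinary status (>= 0), or was killed by a signal that
    is not in NORMAL_SHUTDOWN_SIGNALS.
    """
    if returncode is None or returncode >= 0:
        return None
    signum = -returncode
    if signum not in NORMAL_SHUTDOWN_SIGNALS:
        return None
    try:
        return signal.Signals(signum).name
    except ValueError:
        # The platform does not define this signal number; fall back to a label.
        return f"signal {signum}"
```

A caller could log the returned name and skip the fatal-error path whenever the result is not None. Whether SIGKILL (-9) belongs in the "normal" set is a judgment call, since it is also what an out-of-memory kill looks like.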
Problem
In GatewayRunner._handle_adapter_fatal_error (gateway/run.py), when a platform adapter with fatal_error_retryable=True fires a fatal error and no other adapters are currently connected, the second elif branch currently calls self.stop() "so systemd Restart=on-failure can restart the process."

Under systemd this is fine. Under container runtimes with a restart policy (docker-compose restart: always, k8s Always), it produces an endless restart loop: SIGTERM → container exits → runtime restarts it → same transient failure fires → exit → repeat. The _failed_platforms retry queue populated three lines earlier never gets a chance to reconnect.

Reproduction
1. WHATSAPP_ENABLED=true in a container with an expired WhatsApp session file.
2. whatsapp.py calls _set_fatal_error("whatsapp_bridge_exited", retryable=True).
3. _handle_adapter_fatal_error runs; self.adapters is empty after the adapter is popped; self._failed_platforms now contains WhatsApp.
4. self._exit_with_failure = True, self.stop() called, container exits.

On a gateway with WhatsApp as the only enabled platform, the loop never ends. On one with Telegram + WhatsApp, Telegram never has time to connect before WhatsApp takes the whole process down.
Fix
Drop the retryable-exit branch. "Retryable" means the retry loop (_reconnect_failed_platforms, 30s cadence) can recover, so the gateway should stay alive and let that run. The existing else of that elif was already unreachable: non-retryable failures aren't queued into _failed_platforms above, so a non-retryable error reaching this branch with an empty self.adapters was impossible. Collapsing both into the warning simplifies the code path.

Non-retryable failures are unaffected: they still hit the first if not self.adapters and not self._failed_platforms branch and exit cleanly. If every retryable platform ends up permanently broken, _failed_platforms drains and the first if-branch catches that case on the next fatal error.

Test
Reproduced on a dockerised gateway with WhatsApp + stale bridge session; confirmed restart loop. Applied the patch; confirmed:
- fatal_error_retryable=True still fires the error log.
- _failed_platforms still gets the platform queued for reconnection.
- RestartCount stays at 0 over 2+ minutes).

Verified across 6 production gateways after rollout.
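For readers unfamiliar with the retry loop the Fix section relies on, the following is a rough asyncio sketch of a periodic reconnect pass. It is an assumption-laden illustration, not the gateway's _reconnect_failed_platforms: the function name, the dict-of-callables queue, the max_rounds cap, and the make_flaky helper are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict

ConnectFn = Callable[[], Awaitable[bool]]


async def reconnect_failed_platforms(
    failed: Dict[str, ConnectFn],
    connected: Dict[str, ConnectFn],
    interval: float = 30.0,
    max_rounds: int = 10,
) -> int:
    """Retry each queued platform once per round; return the rounds used."""
    rounds = 0
    while failed and rounds < max_rounds:
        rounds += 1
        for name in list(failed):  # copy keys: the dict is mutated below
            try:
                ok = await failed[name]()
            except Exception:
                ok = False  # a raised error counts as a failed attempt
            if ok:
                connected[name] = failed.pop(name)
        if failed:
            await asyncio.sleep(interval)  # wait before the next round
    return rounds


def make_flaky(failures: int) -> ConnectFn:
    """Build a connect function that fails `failures` times, then succeeds."""
    remaining = {"n": failures}

    async def connect() -> bool:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            return False
        return True

    return connect


def demo() -> tuple:
    failed = {"whatsapp": make_flaky(2)}
    connected: Dict[str, ConnectFn] = {}
    rounds = asyncio.run(
        reconnect_failed_platforms(failed, connected, interval=0.01)
    )
    return rounds, sorted(failed), sorted(connected)
```

In demo(), the simulated WhatsApp connect fails twice and succeeds on the third round, which is only possible because the process stays alive between attempts.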