fix(gateway): isolate channel startup failures to prevent cascade#54215
obviyus merged 3 commits into openclaw:main
Greptile Summary
This PR fixes a cascading startup failure where a single channel's initialization error (e.g., WhatsApp missing
Key changes:
Confidence Score: 4/5
Reviews (2): Last reviewed commit: "fix(gateway): isolate channel startup fa..."
7fd765e to 13159c4
@greptile review
13159c4 to faf9aaa
Pushed a simplification to this branch. Kept Also added a regression test covering the actual bug: one channel startup throws, a later channel still starts. Commit:
When one channel (e.g., WhatsApp) fails to start due to missing runtime modules, it should not block other channels (e.g., Discord) from starting.

Changes:
- Use Promise.allSettled to start channels concurrently
- Catch individual channel startup errors without affecting others
- Add startup summary logging for observability

Before: Sequential await startChannel() - if one throws, subsequent channels never start.
After: Concurrent startup with per-channel error handling - all channels attempt to start, failures are logged but don't cascade.

Fixes: P0 - WhatsApp runtime exception no longer blocks Discord startup
faf9aaa to 3a60bb2
Landed on main. Thanks @JonathanJing.
…hanJing)

* fix(gateway): isolate channel startup failures to prevent cascade

  When one channel (e.g., WhatsApp) fails to start due to missing runtime modules, it should not block other channels (e.g., Discord) from starting.

  Changes:
  - Use Promise.allSettled to start channels concurrently
  - Catch individual channel startup errors without affecting others
  - Add startup summary logging for observability

  Before: Sequential await startChannel() - if one throws, subsequent channels never start.
  After: Concurrent startup with per-channel error handling - all channels attempt to start, failures are logged but don't cascade.

  Fixes: P0 - WhatsApp runtime exception no longer blocks Discord startup

* fix(gateway): keep channel startup isolation sequential

* fix: isolate channel startup failures (openclaw#54215) (thanks @JonathanJing)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
Wrap individual channel startup in try-catch so one broken channel (e.g. WhatsApp missing runtime) doesn't block other channels (e.g. Discord) from starting.

Changes:
- Add try-catch in startChannels loop
- Log per-channel startup errors
- Continue starting remaining channels after failure

Upstream commit:
- 30e80fb: fix: isolate channel startup failures (openclaw#54215)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
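The try-catch loop described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the gateway's actual code: the `Channel` interface, the `startChannelsIsolated` name, the injectable `log` parameter, and the returned failure list are all assumptions made for the example.

```typescript
// Hypothetical channel shape; the real gateway types may differ.
interface Channel {
  id: string;
  start(): Promise<void>;
}

// Hypothetical record of a failed startup, kept for later reporting.
interface StartupFailure {
  id: string;
  error: unknown;
}

// Start channels one at a time. A failure is logged and recorded,
// then the loop continues with the next channel instead of rethrowing.
async function startChannelsIsolated(
  channels: Channel[],
  log: (msg: string) => void = console.error,
): Promise<StartupFailure[]> {
  const failures: StartupFailure[] = [];
  for (const channel of channels) {
    try {
      await channel.start();
    } catch (error) {
      const reason = error instanceof Error ? error.message : String(error);
      log(`[${channel.id}] channel startup failed: ${reason}`);
      failures.push({ id: channel.id, error });
    }
  }
  return failures;
}
```

Keeping the loop sequential preserves the original startup order; the only behavioral change is that a thrown error no longer aborts the remaining iterations.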
* port: before_dispatch hook delivery semantics (upstream a10d587, b497f3c)

  - Add before_dispatch hook allowing plugins to intercept messages before model dispatch
  - Extract sendFinalPayload helper to unify TTS + routing logic
  - Preserve delivery semantics (TTS, routed delivery) when hook handles message
  - Use canonical hook metadata (normalized conversation id, sender id, channel)

  Upstream commits:
  - a10d587: fix: preserve before_dispatch delivery semantics (openclaw#50444)
  - b497f3c: fix: normalize before_dispatch conversation id

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* port: gateway restart sentinel wake + delivery context (upstream 1c9f62f partial)

  - Add deliveryContext field to SystemEvent for routing preservation
  - Wake interrupted session via heartbeat after restart (requestHeartbeatNow)
  - Add retry logic for outbound delivery (2 attempts with 750ms delay)
  - Preserve threadId routing through wake path
  - Always enqueue wake even when delivery fails

  Partial port of upstream 1c9f62f:
  - Core: system-events deliveryContext, restart sentinel wake
  - Deferred: heartbeat-runner turnSource integration, targets.ts routing updates (complex changes, requires more analysis)

  Upstream commit:
  - 1c9f62f: fix(gateway): restart sentinel wakes session after restart (openclaw#53940)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* port: prefer freshest duplicate session rows (upstream f48571b, 40f820f)

  - Add resolveFreshestSessionStoreMatchFromStoreKeys to prefer newest updatedAt
  - Add resolveFreshestSessionEntryFromStoreKeys wrapper
  - Use in sessions.preview for duplicate row handling

  Handles case-insensitive and legacy alias keys by sorting duplicates by updatedAt timestamp and returning the freshest entry.

  Upstream commits:
  - f48571b: fix: prefer freshest duplicate rows in session loads
  - 40f820f: fix: prefer freshest duplicate session rows in reads

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: remove delivery-queue dependency (not yet ported)

  Remove enqueueDelivery/ackDelivery/failDelivery usage from server-restart-sentinel.ts as delivery-queue.ts is not yet ported to our codebase. Keep retry logic but use agentCommand directly.

  Resolves build failure.

* port: isolate channel startup failures (upstream 30e80fb)

  Wrap individual channel startup in try-catch so one broken channel (e.g. WhatsApp missing runtime) doesn't block other channels (e.g. Discord) from starting.

  Changes:
  - Add try-catch in startChannels loop
  - Log per-channel startup errors
  - Continue starting remaining channels after failure

  Upstream commit:
  - 30e80fb: fix: isolate channel startup failures (openclaw#54215)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: implement runBeforeDispatch without runClaimingHook

  - Implement runBeforeDispatch directly using getHooksForName
  - Remove GroupName check (not in our MsgContext)
  - Fix TypeScript build errors

  Resolves build failures after before_dispatch hook port.

* port: per-model cooldown scope + stepped backoff (upstream 8440122)

  Scope rate-limit cooldowns per model so one 429 no longer blocks every model on the same auth profile. Replace exponential 1min→1h escalation with stepped 30s/1min/5min ladder.

  Key changes:
  - Add cooldownReason and cooldownModel to ProfileUsageStats type
  - Implement stepped backoff: 30s → 1min → 5min (was: 1min → 5min → 25min → 60min)
  - Add model-aware cooldown bypass in isProfileInCooldown()
  - Track model scope when marking auth profile failures
  - Pass modelId through markAuthProfileFailure and related calls
  - Update isProfileInCooldown calls in model-fallback and pi-embedded-runner to pass forModel

  Upstream ref: 8440122

* port: surface mid-turn 429 rate limits (upstream 4ae4d1f partial)

  Surface rate limit and overload errors that occur mid-turn (after tool calls) instead of silently returning an empty response. Only applies when the assistant produced no valid (non-error) reply text, so tool-level rate-limit messages don't override a successful turn.

  Changes:
  - Add isReasoning field to EmbeddedPiRunResult payload type
  - Detect mid-turn rate limits in agent-runner-execution.ts when there's no valid content (checking for text/media, excluding errors/reasoning)
  - Import isRateLimitErrorMessage and isOverloadedErrorMessage
  - Replace empty responses with user-facing rate limit message

  Note: Skipped upstream's incomplete turn detection in run.ts (detecting stopReason=toolUse with no payloads) as it requires deep understanding of our specific agent loop structure and could cause false positives. The agent-runner-execution.ts check catches the issue at final output.

  Upstream ref: 4ae4d1f (partial port)

* port: isolate session:patch hook payload (upstream 765182d, 3e2e9bc partial)

  Changes:
  - Add hasInternalHookListeners() to check for listeners before cloning
  - Add session:patch hook with structuredClone to isolate payload
  - Add SessionPatchHookEvent and isSessionPatchEvent() type guard
  - Only clone and dispatch when listeners are registered (performance)

  Why structuredClone: Fire-and-forget hooks cannot mutate objects used by the response path. Deep cloning sessionEntry, patch, and cfg prevents plugin corruption.

  Skip model default reasoning guards (6c04ce3, b91374e): Those patches require agentEntry.reasoningDefault and modelState.resolveDefaultReasoningLevel() which haven't been ported yet.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
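The stepped backoff and model-scoped cooldown bypass mentioned in the port list above could look something like the sketch below. Only the 30s/1min/5min ladder and the `cooldownModel` field name come from the commit message; `steppedCooldownMs`, `isInCooldown`, `CooldownState`, and `cooldownUntil` are hypothetical names chosen for illustration, and the real `isProfileInCooldown()` may be structured differently.

```typescript
// Stepped ladder from the commit message: 30s, then 1min, then 5min.
const COOLDOWN_STEPS_MS = [30_000, 60_000, 300_000];

// Hypothetical helper: cooldown for the Nth consecutive failure (1-based).
// Counts beyond the ladder stay on the last step; counts below 1 use the first.
function steppedCooldownMs(failureCount: number): number {
  const step = Math.min(Math.max(failureCount, 1), COOLDOWN_STEPS_MS.length);
  return COOLDOWN_STEPS_MS[step - 1];
}

// Assumed subset of the profile stats; cooldownUntil is a made-up field name.
interface CooldownState {
  cooldownUntil?: number; // epoch milliseconds
  cooldownModel?: string; // model that triggered the cooldown, if scoped
}

// Returns true when the profile is cooling down for the requested model.
function isInCooldown(state: CooldownState, now: number, forModel?: string): boolean {
  if (state.cooldownUntil === undefined || now >= state.cooldownUntil) return false;
  // Model-scoped cooldown: a request for a different model bypasses it.
  if (state.cooldownModel && forModel && state.cooldownModel !== forModel) return false;
  return true;
}
```

With this shape, a 429 on one model leaves other models on the same auth profile usable, which is the behavior the commit message describes.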
Problem
When one channel (e.g., WhatsApp) fails to start due to missing runtime modules (like light-runtime-api), it blocks all subsequent channels (e.g., Discord) from starting. This creates a cascading failure where a single channel issue appears as a total system outage.

Symptoms
channel startup failed: WhatsApp plugin runtime is unavailable: missing light-runtime-api

Solution
Use Promise.allSettled to start channels concurrently with per-channel error isolation:

Before:
After:
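A minimal sketch of the before/after contrast this description draws, assuming a hypothetical `startChannel(id)` callback and an injectable `log` function; the function names and log wording are illustrative only. Note that a later commit on this branch kept startup sequential, so this shows the approach as originally proposed rather than what necessarily landed. `Promise.allSettled` requires an ES2020 or newer target.

```typescript
// Hypothetical signature for starting a single channel by id.
type StartFn = (id: string) => Promise<void>;

// Before: the first rejection escapes the loop, so later channels never start.
async function startChannelsBefore(ids: string[], startChannel: StartFn): Promise<void> {
  for (const id of ids) {
    await startChannel(id);
  }
}

// After: every channel is attempted; rejections are logged and summarized.
async function startChannelsAfter(
  ids: string[],
  startChannel: StartFn,
  log: (msg: string) => void = console.log,
): Promise<void> {
  // The async wrapper turns synchronous throws into rejections as well.
  const results = await Promise.allSettled(ids.map(async (id) => startChannel(id)));
  let failed = 0;
  results.forEach((result, i) => {
    if (result.status === "rejected") {
      failed += 1;
      log(`[${ids[i]}] channel startup failed: ${String(result.reason)}`);
    }
  });
  // Startup summary for observability.
  log(`channel startup: ${ids.length - failed} started, ${failed} failed`);
}
```

Both variants return `Promise<void>`, so callers would not need to change; the second simply never rejects because of a single channel.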
Changes
Testing
Related Issues
Backwards Compatibility
(startChannels() still returns Promise<void>)