fix(gateway): isolate channel startup failures to prevent cascade #54215

Merged
obviyus merged 3 commits into openclaw:main from JonathanJing:pr/channel-fault-isolation
Mar 25, 2026

Conversation

@JonathanJing
Contributor

Problem

When one channel (e.g., WhatsApp) fails to start due to missing runtime modules (like light-runtime-api), it blocks all subsequent channels (e.g., Discord) from starting. This creates a cascading failure where a single channel issue appears as a total system outage.

Symptoms

  • channel startup failed: WhatsApp plugin runtime is unavailable: missing light-runtime-api
  • Discord shows as "stopped/disconnected" even though it was never given a chance to start
  • Users misdiagnose this as a Discord configuration problem

Solution

Use Promise.allSettled to start channels concurrently with per-channel error isolation:

Before:

for (const plugin of listChannelPlugins()) {
  await startChannel(plugin.id); // If one throws, loop exits
}

After:

await Promise.allSettled(
  listChannelPlugins().map(async (plugin) => {
    try {
      await startChannel(plugin.id);
    } catch (err) {
      // Log error but continue with other channels
    }
  })
);

Changes

  • Start channels concurrently instead of sequentially
  • Catch individual channel startup errors without affecting others
  • Add startup summary logging for observability
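
As a rough illustration of the approach proposed in this description, the sketch below starts every channel concurrently, records failures per channel, and logs a summary at the end. The `ChannelPlugin` and `StartChannel` shapes, the `StartupSummary` type, the function name, and the log wording are assumptions made for the example, not the gateway's actual code.

```typescript
// Hypothetical shapes: the PR does not show the real plugin or startChannel types.
type ChannelPlugin = { id: string };
type StartChannel = (id: string) => Promise<void>;

interface StartupSummary {
  started: string[];
  failed: { id: string; reason: string }[];
}

async function startChannelsConcurrently(
  plugins: ChannelPlugin[],
  startChannel: StartChannel,
  log: (message: string) => void = console.log,
): Promise<StartupSummary> {
  const summary: StartupSummary = { started: [], failed: [] };

  // Each channel runs in its own async wrapper, so one rejection cannot
  // prevent the remaining channels from being attempted.
  await Promise.allSettled(
    plugins.map(async (plugin) => {
      try {
        await startChannel(plugin.id);
        summary.started.push(plugin.id);
      } catch (err) {
        const reason = err instanceof Error ? err.message : String(err);
        summary.failed.push({ id: plugin.id, reason });
        log(`channel startup failed: ${plugin.id}: ${reason}`);
      }
    }),
  );

  // Startup summary for observability (wording is illustrative).
  log(`channels started: ${summary.started.length}/${plugins.length}`);
  return summary;
}
```

With the inner try/catch in place, `Promise.allSettled` mostly acts as a safety net: it only matters if the catch block itself throws, for example when the logger fails.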

Testing

  • Configure WhatsApp with missing runtime
  • Verify Discord still starts successfully
  • Check logs show both failure and success

Related Issues

  • P0: WhatsApp runtime exception blocks Discord startup
  • P1: Channel status inconsistency (partial fix)

Backwards Compatibility

  • Interface unchanged (startChannels() still returns Promise<void>)
  • Existing error handling in callers continues to work
  • Only behavioral change: more channels start successfully when some fail

openclaw-barnacle bot added the "gateway" (Gateway runtime) and "size: XS" labels Mar 25, 2026
@greptile-apps
Contributor

greptile-apps Bot commented Mar 25, 2026

Greptile Summary

This PR fixes a cascading startup failure where a single channel's initialization error (e.g., WhatsApp missing light-runtime-api) would prevent all subsequent channels from starting. The fix replaces sequential for-await iteration with Promise.allSettled, adds per-channel error isolation via a nested try/catch, and emits a startup summary log via the first successfully-started channel's logger.

Key changes:

  • Channel startup is now concurrent rather than sequential — each channel's failure is caught independently and tracked in failedPlugins
  • A nested inner try/catch guards against secondary failures during error formatting or logging, preventing those from escaping the outer catch
  • The startup summary log correctly selects a succeeded channel's logger (falling back to the first plugin), addressing the previous concern about the summary appearing under a failed channel's log stream
  • Previous reviewer concerns (Promise.allSettled vs Promise.all, logger attribution for summary) are both addressed in this revision
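
The two defensive pieces named in this summary, the inner try/catch around error reporting and the choice of logger for the summary line, might look roughly like the sketch below. The `Logger` and `ChannelPlugin` shapes and both function names are hypothetical; only `failedPlugins` is taken from the review text.

```typescript
// Hypothetical shapes for illustration; the gateway's real logger and
// plugin types are not shown in this thread.
type Logger = { info(msg: string): void; error(msg: string): void };
type ChannelPlugin = { id: string; log: Logger };

// Record a failed channel without letting a broken logger or an
// unformattable error value throw a second time.
function recordStartupFailure(
  plugin: ChannelPlugin,
  err: unknown,
  failedPlugins: string[],
): void {
  failedPlugins.push(plugin.id);
  try {
    const reason = err instanceof Error ? err.message : String(err);
    plugin.log.error(`channel startup failed: ${reason}`);
  } catch {
    // Secondary failure while formatting or logging: swallow it so the
    // caller can carry on with the next channel.
  }
}

// Prefer the logger of a channel that actually started; fall back to the
// first plugin so a summary can still be emitted when every channel failed.
function pickSummaryLogger(
  plugins: ChannelPlugin[],
  failedPlugins: string[],
): Logger | undefined {
  const succeeded = plugins.find((p) => !failedPlugins.includes(p.id));
  return (succeeded ?? plugins[0])?.log;
}
```

Selecting a succeeded channel's logger keeps the summary out of the failed channel's log stream, which is the attribution concern the review mentions.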

Confidence Score: 4/5

  • This PR is safe to merge — the primary failure isolation goal is achieved and both prior review concerns have been addressed in this revision.
  • The implementation correctly uses Promise.allSettled, adds a defensive inner try/catch for secondary logging failures, and incorporates the logger-selection fix suggested in the previous review thread. No new critical bugs are introduced. The one behavioral change (concurrent vs sequential startup) is explicitly intended and documented in the PR description.
  • No files require special attention.

Reviews (2): Last reviewed commit: "fix(gateway): isolate channel startup fa..."

Comment thread src/gateway/server-channels.ts Outdated
Comment thread src/gateway/server-channels.ts Outdated
JonathanJing force-pushed the pr/channel-fault-isolation branch from 7fd765e to 13159c4 on March 25, 2026 04:03
@JonathanJing
Contributor Author

@greptile review

@obviyus
Contributor

obviyus commented Mar 25, 2026

Pushed a simplification to this branch.

Kept startChannels() sequential and isolated failures with a local per-channel try/catch instead of switching gateway startup to concurrent Promise.allSettled(...).

Also added a regression test covering the actual bug: one channel startup throws, a later channel still starts.

Commit: faf9aaa69e
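
A minimal sketch of the sequential variant described in this comment, with a regression-style check alongside it. The function names, the `onError` callback, and the `ChannelPlugin` shape are assumptions for illustration; the actual change and its test live in the commit referenced above.

```typescript
// Hypothetical shapes; only startChannel(plugin.id) and the Promise<void>
// return type are taken from this PR.
type ChannelPlugin = { id: string };

async function startChannelsSequentially(
  plugins: ChannelPlugin[],
  startChannel: (id: string) => Promise<void>,
  onError: (id: string, err: unknown) => void,
): Promise<void> {
  for (const plugin of plugins) {
    try {
      // Still awaited one at a time, so startup order is preserved.
      await startChannel(plugin.id);
    } catch (err) {
      // The failure stays local to this iteration; the loop continues.
      onError(plugin.id, err);
    }
  }
}

// Regression-style check: the first channel throws, the later one still starts.
async function checkLaterChannelStillStarts(): Promise<boolean> {
  const started: string[] = [];
  const failed: string[] = [];
  await startChannelsSequentially(
    [{ id: "whatsapp" }, { id: "discord" }],
    async (id) => {
      if (id === "whatsapp") {
        throw new Error("WhatsApp plugin runtime is unavailable");
      }
      started.push(id);
    },
    (id) => {
      failed.push(id);
    },
  );
  return started.join() === "discord" && failed.join() === "whatsapp";
}
```

Compared with the concurrent version, this changes only what happens after a throw: channels are still started in order, one after another.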

obviyus self-assigned this Mar 25, 2026
JonathanJing and others added 3 commits March 25, 2026 10:19
When one channel (e.g., WhatsApp) fails to start due to missing runtime
modules, it should not block other channels (e.g., Discord) from starting.

Changes:
- Use Promise.allSettled to start channels concurrently
- Catch individual channel startup errors without affecting others
- Add startup summary logging for observability

Before: Sequential await startChannel() - if one throws, subsequent
channels never start.

After: Concurrent startup with per-channel error handling - all channels
attempt to start, failures are logged but don't cascade.

Fixes: P0 - WhatsApp runtime exception no longer blocks Discord startup
obviyus force-pushed the pr/channel-fault-isolation branch from faf9aaa to 3a60bb2 on March 25, 2026 04:51
obviyus merged commit 30e80fb into openclaw:main Mar 25, 2026
21 checks passed
@obviyus
Contributor

obviyus commented Mar 25, 2026

Landed on main.

Thanks @JonathanJing.

netandreus pushed a commit to netandreus/openclaw that referenced this pull request Mar 25, 2026
…hanJing)

* fix(gateway): isolate channel startup failures to prevent cascade

When one channel (e.g., WhatsApp) fails to start due to missing runtime
modules, it should not block other channels (e.g., Discord) from starting.

Changes:
- Use Promise.allSettled to start channels concurrently
- Catch individual channel startup errors without affecting others
- Add startup summary logging for observability

Before: Sequential await startChannel() - if one throws, subsequent
channels never start.

After: Concurrent startup with per-channel error handling - all channels
attempt to start, failures are logged but don't cascade.

Fixes: P0 - WhatsApp runtime exception no longer blocks Discord startup

* fix(gateway): keep channel startup isolation sequential

* fix: isolate channel startup failures (openclaw#54215) (thanks @JonathanJing)

---------

Co-authored-by: Ayaan Zaidi <hi@obviy.us>
npmisantosh pushed a commit to npmisantosh/openclaw that referenced this pull request Mar 25, 2026
fuller-stack-dev pushed a commit to fuller-stack-dev/openclaw that referenced this pull request Mar 25, 2026
jacobtomlinson pushed a commit to jacobtomlinson/openclaw that referenced this pull request Mar 25, 2026
planfit-alan added a commit to planfit/openclaw that referenced this pull request Mar 26, 2026
Wrap individual channel startup in try-catch so one broken channel
(e.g. WhatsApp missing runtime) doesn't block other channels
(e.g. Discord) from starting.

Changes:
- Add try-catch in startChannels loop
- Log per-channel startup errors
- Continue starting remaining channels after failure

Upstream commit:
- 30e80fb: fix: isolate channel startup failures (openclaw#54215)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
planfit-alan added a commit to planfit/openclaw that referenced this pull request Mar 26, 2026
* port: before_dispatch hook delivery semantics (upstream a10d587, b497f3c)

- Add before_dispatch hook allowing plugins to intercept messages before model dispatch
- Extract sendFinalPayload helper to unify TTS + routing logic
- Preserve delivery semantics (TTS, routed delivery) when hook handles message
- Use canonical hook metadata (normalized conversation id, sender id, channel)

Upstream commits:
- a10d587: fix: preserve before_dispatch delivery semantics (openclaw#50444)
- b497f3c: fix: normalize before_dispatch conversation id

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* port: gateway restart sentinel wake + delivery context (upstream 1c9f62f partial)

- Add deliveryContext field to SystemEvent for routing preservation
- Wake interrupted session via heartbeat after restart (requestHeartbeatNow)
- Add retry logic for outbound delivery (2 attempts with 750ms delay)
- Preserve threadId routing through wake path
- Always enqueue wake even when delivery fails

Partial port of upstream 1c9f62f:
- Core: system-events deliveryContext, restart sentinel wake
- Deferred: heartbeat-runner turnSource integration, targets.ts routing updates
  (complex changes, requires more analysis)

Upstream commit:
- 1c9f62f: fix(gateway): restart sentinel wakes session after restart (openclaw#53940)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* port: prefer freshest duplicate session rows (upstream f48571b, 40f820f)

- Add resolveFreshestSessionStoreMatchFromStoreKeys to prefer newest updatedAt
- Add resolveFreshestSessionEntryFromStoreKeys wrapper
- Use in sessions.preview for duplicate row handling

Handles case-insensitive and legacy alias keys by sorting duplicates
by updatedAt timestamp and returning the freshest entry.

Upstream commits:
- f48571b: fix: prefer freshest duplicate rows in session loads
- 40f820f: fix: prefer freshest duplicate session rows in reads

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: remove delivery-queue dependency (not yet ported)

Remove enqueueDelivery/ackDelivery/failDelivery usage from
server-restart-sentinel.ts as delivery-queue.ts is not yet ported
to our codebase. Keep retry logic but use agentCommand directly.

Resolves build failure.

* port: isolate channel startup failures (upstream 30e80fb)

Wrap individual channel startup in try-catch so one broken channel
(e.g. WhatsApp missing runtime) doesn't block other channels
(e.g. Discord) from starting.

Changes:
- Add try-catch in startChannels loop
- Log per-channel startup errors
- Continue starting remaining channels after failure

Upstream commit:
- 30e80fb: fix: isolate channel startup failures (openclaw#54215)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* fix: implement runBeforeDispatch without runClaimingHook

- Implement runBeforeDispatch directly using getHooksForName
- Remove GroupName check (not in our MsgContext)
- Fix TypeScript build errors

Resolves build failures after before_dispatch hook port.

* port: per-model cooldown scope + stepped backoff (upstream 8440122)

Scope rate-limit cooldowns per model so one 429 no longer blocks every
model on the same auth profile. Replace exponential 1min→1h escalation
with stepped 30s/1min/5min ladder.

Key changes:
- Add cooldownReason and cooldownModel to ProfileUsageStats type
- Implement stepped backoff: 30s → 1min → 5min (was: 1min → 5min → 25min → 60min)
- Add model-aware cooldown bypass in isProfileInCooldown()
- Track model scope when marking auth profile failures
- Pass modelId through markAuthProfileFailure and related calls
- Update isProfileInCooldown calls in model-fallback and pi-embedded-runner to pass forModel

Upstream ref: 8440122

* port: surface mid-turn 429 rate limits (upstream 4ae4d1f partial)

Surface rate limit and overload errors that occur mid-turn (after tool
calls) instead of silently returning an empty response.

Only applies when the assistant produced no valid (non-error) reply text,
so tool-level rate-limit messages don't override a successful turn.

Changes:
- Add isReasoning field to EmbeddedPiRunResult payload type
- Detect mid-turn rate limits in agent-runner-execution.ts when there's
  no valid content (checking for text/media, excluding errors/reasoning)
- Import isRateLimitErrorMessage and isOverloadedErrorMessage
- Replace empty responses with user-facing rate limit message

Note: Skipped upstream's incomplete turn detection in run.ts (detecting
stopReason=toolUse with no payloads) as it requires deep understanding
of our specific agent loop structure and could cause false positives.
The agent-runner-execution.ts check catches the issue at final output.

Upstream ref: 4ae4d1f (partial port)

* port: isolate session:patch hook payload (upstream 765182d, 3e2e9bc partial)

Changes:
- Add hasInternalHookListeners() to check for listeners before cloning
- Add session:patch hook with structuredClone to isolate payload
- Add SessionPatchHookEvent and isSessionPatchEvent() type guard
- Only clone and dispatch when listeners are registered (performance)

Why structuredClone:
Fire-and-forget hooks cannot mutate objects used by the response path.
Deep cloning sessionEntry, patch, and cfg prevents plugin corruption.

Skip model default reasoning guards (6c04ce3, b91374e):
Those patches require agentEntry.reasoningDefault and
modelState.resolveDefaultReasoningLevel() which haven't been ported yet.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
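
For the per-model cooldown port listed in the commit above, the stepped ladder and the model-aware bypass could be sketched as follows. This is an illustration under stated assumptions: `cooldownModel`, `isProfileInCooldown`, and `forModel` are named in the commit message, while `cooldownUntil`, `steppedCooldownMs`, the argument order, and the exact bypass rule are hypothetical.

```typescript
// Ladder from the commit message: 30s -> 1min -> 5min.
const COOLDOWN_STEPS_MS: readonly number[] = [30_000, 60_000, 300_000];

// failureCount is assumed to be 1 for the first consecutive rate-limit
// failure; counts past the end of the ladder stay on the last step.
function steppedCooldownMs(failureCount: number): number {
  const step = Math.min(
    Math.max(Math.floor(failureCount), 1),
    COOLDOWN_STEPS_MS.length,
  );
  return COOLDOWN_STEPS_MS[step - 1];
}

// cooldownModel is named in the commit; cooldownUntil is a hypothetical field.
interface ProfileUsageStats {
  cooldownUntil?: number;
  cooldownModel?: string;
}

function isProfileInCooldown(
  stats: ProfileUsageStats,
  now: number,
  forModel?: string,
): boolean {
  if (stats.cooldownUntil === undefined || now >= stats.cooldownUntil) {
    return false;
  }
  // A cooldown scoped to a different model does not block this request.
  if (
    stats.cooldownModel !== undefined &&
    forModel !== undefined &&
    stats.cooldownModel !== forModel
  ) {
    return false;
  }
  return true;
}
```

Under this reading, a 429 on one model puts only that model into cooldown, and repeated failures wait at most five minutes.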
godlin-gh pushed a commit to YouMindInc/openclaw that referenced this pull request Mar 27, 2026
lovewanwan pushed a commit to lovewanwan/openclaw that referenced this pull request Apr 28, 2026
ogt-redknie pushed a commit to ogt-redknie/OPENX that referenced this pull request May 2, 2026
Labels

gateway (Gateway runtime), size: S

2 participants