Skip to content

Discord: multi-account gateway startup hangs at 'awaiting gateway readiness' after Carbon reconcile change (v2026.3.22) #53132

@Skeptomenos

Description

@Skeptomenos

Summary

Running 4 Discord bot accounts on a single gateway causes 2–4 bots to hang indefinitely at client initialized as <id> (<name>); awaiting gateway readiness on every restart. The issue appeared after upgrading from v2026.3.13 to v2026.3.22 with the same configuration and bot tokens. v2026.3.13 starts all 4 bots reliably.

Environment

  • OpenClaw version: 2026.3.22 (upgraded from 2026.3.13)
  • Carbon version: @buape/carbon 0.0.0-beta-20260317045421
  • OS: macOS (Apple Silicon Mac Mini)
  • Node: 25.8.1
  • Service: launchd (ai.openclaw.gateway)
  • Discord accounts: 4 separate bot applications in the same guild, each bound to a different agent
  • Commands per bot: 94 native slash commands (identical set)

Reproduction

  1. Configure 4 Discord bot accounts under channels.discord.accounts (each with its own token, guild, and channel bindings)
  2. Start the gateway with openclaw gateway restart
  3. Observe logs: 2–4 bots log client initialized as <id> (<name>); awaiting gateway readiness and never transition to logged in to discord as <id> (<name>)
  4. The specific bots that get stuck vary across restarts (non-deterministic)
  5. Restart again — a different combination hangs

Tested across 6+ restarts (both launchctl kickstart -k and openclaw gateway restart). Same behavior every time. openclaw doctor reports Discord: ok and active sessions exist for bots that did connect.

Expected

All 4 bots should reach logged in to discord within ~15 seconds, as they did reliably on v2026.3.13 with the same configuration.

Actual

Only 0–2 of 4 bots reach logged in. The rest hang at awaiting gateway readiness, then enter the 15s READY timeout → force reconnect → another 15s timeout → fatal throw → crash-loop restart cycle (5s→5min exponential backoff).

Sample log (typical restart)

[discord] native commands using Carbon reconcile path
[discord] native commands using Carbon reconcile path
[discord] native commands using Carbon reconcile path
[discord] native commands using Carbon reconcile path
[discord] client initialized as 1481040852020236500 (Gilfoyle); awaiting gateway readiness
[discord] client initialized as 1470175784990671045 (Donna); awaiting gateway readiness
[discord] logged in to discord as 1470071510071902353 (Alfred)
[discord] logged in to discord as 1470144062978785342 (Krause)

Next restart — Alfred and Donna succeed, Krause and Gilfoyle hang. Completely non-deterministic.

Root Cause Analysis

Traced through the source code. Three factors compound:

1. All accounts start in parallel with no stagger

src/gateway/server-channels.ts:258:

await Promise.all(
  accountIds.map(async (id) => {
    // ...each account starts concurrently
  }),
);

All 4 bots simultaneously: open WebSocket connections, deploy 94 commands each via REST, and fetch bot identity. Discord's max_concurrency: 1 for small bots means IDENTIFY payloads must be sequenced, but the gateway fires all 4 at once.

2. Command deploy blocks the startup pipeline while the READY clock ticks

extensions/discord/src/monitor/provider.ts:783-833:

Client constructor (starts gateway WebSocket immediately)
  → await deployDiscordCommands() ← BLOCKS (REST API, up to 3 retry attempts)
    → await fetchUser("@me") ← BLOCKS (REST API)
      → log status message (checks isConnected)
        → runDiscordGatewayLifecycle() (READY wait starts here)

The Carbon Client constructor starts the WebSocket connection at line 783, but await deployDiscordCommands() at line 833 blocks the startup pipeline with REST API calls. If deploy is slow or rate-limited (4 bots × 94 commands simultaneously), the gateway READY event may arrive and expire before the lifecycle READY wait even begins.

3. READY timeout is only 15 seconds with a hard fail after 30s

extensions/discord/src/monitor/provider.lifecycle.ts:18:

const DISCORD_GATEWAY_READY_TIMEOUT_MS = 15_000;

At line 330–380: if the gateway isn't connected after 15s, it force-disconnects and reconnects. After another 15s, it throws a fatal error, causing the channel manager to enter a crash-loop restart cycle with exponential backoff (5s initial, up to 5min, max 10 restarts).

Why this worked in v2026.3.13

The Carbon reconcile change (#46597) switched from OpenClaw's local command deploy to @buape/carbon's commandDeploymentMode: "reconcile". The Carbon beta bump likely changed internal timing of the REST deploy path, making it slower or more susceptible to rate limiting when multiple accounts deploy simultaneously. The old local deploy path was faster or non-blocking.

Suggested Fixes

Fix 1: Stagger account connections (low risk, high impact)

Replace parallel account startup with sequential + delay in src/gateway/server-channels.ts:

// BEFORE (line 258):
await Promise.all(
  accountIds.map(async (id) => {
    // ...start account...
  }),
);

// AFTER — sequential with stagger:
for (let i = 0; i < accountIds.length; i++) {
  const id = accountIds[i];
  if (store.tasks.has(id)) continue;
  const existingStart = store.starting.get(id);
  if (existingStart) {
    await existingStart;
    continue;
  }

  // ...existing startGate + start logic (lines 269-rest)...

  // Stagger: wait 3s between accounts to avoid simultaneous IDENTIFY + REST storms
  if (accountIds.length > 1 && i < accountIds.length - 1) {
    await new Promise<void>((resolve) => {
      const t = setTimeout(resolve, 3000);
      t.unref?.();
    });
  }
}

Risk: Very low — only affects multi-account timing. Single-account setups are unchanged. The 3s delay ensures each bot has time to complete its IDENTIFY handshake before the next one starts.

Fix 2: Increase READY timeout (near-zero risk)

In extensions/discord/src/monitor/provider.lifecycle.ts:18:

// BEFORE:
const DISCORD_GATEWAY_READY_TIMEOUT_MS = 15_000;

// AFTER:
const DISCORD_GATEWAY_READY_TIMEOUT_MS = 30_000;

Risk: Near zero. This only affects how long the gateway waits for Discord's READY event before force-reconnecting. The 15s timeout is too aggressive when multiple bots compete for the same IP's connection budget. Worst case: slower error detection when Discord is truly unreachable.

Longer-term consideration

Move deployDiscordCommands() to fire after the lifecycle READY wait confirms the gateway is connected, rather than blocking the startup pipeline before it. Commands persist across restarts, so there's no availability gap. This would decouple command deploy timing from gateway connection health.

Additional Context

  • openclaw doctor shows Discord: ok after the bots that DID connect are online
  • TCP connections to Discord (162.159.130.234:443) are ESTABLISHED for all 4 bots (verified via lsof) — the hang is above the TCP layer
  • Setting commands.native: false globally did not prevent the Carbon reconcile path from running
  • The anthropic-vertex built-in provider (new in v2026.3.22) works correctly once bots are connected — this issue is solely about the Discord startup race
  • Config changes (allowBots, removing custom providers) have no effect on the hang
  • Health monitor detects stale sockets and restarts affected bots, but the same race recurs

Workaround

Rolling back to v2026.3.13: openclaw update --tag v2026.3.13

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions