
fix(discord): surface stalled transport health#76327

Merged
joshavant merged 3 commits into main from harden-discord-stall-status on May 3, 2026


Conversation

@joshavant
Contributor

Summary

  • Problem: Discord can appear running while the transport is degraded or the Gateway event loop is starved, which makes intermittent socket resets look like healthy channel state.
  • Why it matters: issue #75346 ("Discord inbound messages not reaching agent after gateway reconnects") needed end-to-end evidence for Discord stall/reset diagnostics, including Dave's follow-up symptoms where a stalled runtime hid useful health signals.
  • What changed: propagate Gateway event-loop health into channel/status summaries, annotate channel account health, surface Discord degraded transport issues, and add fetch-timeout timer-delay diagnostics.
  • What did NOT change (scope boundary): no Discord auth, delivery routing, reconnect policy, or provider behavior changes.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

Root Cause (if applicable)

  • Root cause: channels.status and status --deep reported process/account running state without carrying the existing Gateway event-loop health or per-account channel health classification through the RPC and CLI surfaces.
  • Missing detection / guardrail: fetch timeout logs did not distinguish nominal request timeout from a timer firing late because the Node event loop was starved.
  • Contributing context (if known): live Discord validation found that actual transport stalls can leave the user-facing status path looking healthier than the runtime behavior.
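
A minimal sketch of the late-timer check described under "Missing detection / guardrail", assuming the diagnostic compares how long the timeout actually took to fire against its configured budget. The names `diagnoseTimeout` and `TimeoutDiagnosis` and the 1000 ms tolerance are hypothetical and are not taken from src/utils/fetch-timeout.ts.

```typescript
// Hypothetical result shape; the PR's real log fields may differ.
interface TimeoutDiagnosis {
  kind: "nominal-timeout" | "late-timer";
  timerDelayMs: number;
  hint?: string;
}

// Assumed tolerance: a timer that fires more than this far past its budget
// is treated as evidence that the event loop was starved.
const DEFAULT_LATE_TIMER_TOLERANCE_MS = 1_000;

function diagnoseTimeout(
  budgetMs: number,
  startedAtMs: number,
  firedAtMs: number,
  toleranceMs: number = DEFAULT_LATE_TIMER_TOLERANCE_MS,
): TimeoutDiagnosis {
  const elapsedMs = firedAtMs - startedAtMs;
  // How much later than the budget the timer callback actually ran.
  const timerDelayMs = Math.max(0, elapsedMs - budgetMs);
  if (timerDelayMs > toleranceMs) {
    return {
      kind: "late-timer",
      timerDelayMs,
      hint: `timeout timer fired ${timerDelayMs}ms late; the event loop may have been starved`,
    };
  }
  return { kind: "nominal-timeout", timerDelayMs };
}
```

With a 10 s budget, a timer that fires after 10.005 s would be classified as a nominal request timeout, while one that fires after 25 s would be flagged as a late timer with a 15 s delay.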

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/gateway/server-methods/channels.status.test.ts, src/commands/channels.surfaces-signal-runtime-errors-channels-status-output.test.ts, src/commands/status.command-sections.test.ts, extensions/discord/src/status-issues.test.ts, src/utils/fetch-timeout.test.ts.
  • Scenario the test should lock in: degraded event-loop and channel health are visible in status outputs, Discord status issues include stale/stuck transport states, and late timeout timers log timer-delay hints.
  • Why this is the smallest reliable guardrail: the tests cover the RPC payload seams, CLI rendering, Discord issue collector, and fetch-timeout diagnostics without requiring live Discord for every run.
  • Existing test that already covers this (if any): existing channel status tests covered running/configured rows, but not degraded event-loop or account health propagation.
  • If no new test is added, why not: N/A.

User-visible / Behavior Changes

openclaw channels status, openclaw status --deep, Discord status issue output, and fetch-timeout logs now include degraded transport/event-loop starvation hints when the Gateway has the data.

Diagram (if applicable)

Before:
Discord transport stall -> channel account still shown as running -> operator lacks stall clue

After:
Discord transport stall -> health/event-loop state flows into status -> operator sees degraded/stale/stuck hint
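
The "After" flow can be sketched as a renderer that appends hints to the account line when health data is present. This is an illustration only: `ChannelAccountStatus`, `EventLoopHealth`, `renderAccountLine`, and the output wording are hypothetical and simplified, not the CLI's actual types or format.

```typescript
// Hypothetical, simplified shapes; the real status payload carries more fields.
interface EventLoopHealth {
  degraded: boolean;
}

interface ChannelAccountStatus {
  accountId: string;
  running: boolean;
  connected?: boolean;
  health?: "healthy" | "stale" | "stuck" | "disconnected" | "not-running";
}

function renderAccountLine(
  account: ChannelAccountStatus,
  eventLoop?: EventLoopHealth,
): string {
  const state = account.running ? "running" : "stopped";
  const hints: string[] = [];
  // Only add hints when the Gateway actually supplied degraded health data.
  if (account.health !== undefined && account.health !== "healthy") {
    hints.push(`transport ${account.health}`);
  }
  if (account.running && account.connected === false) {
    hints.push("running but not connected");
  }
  if (eventLoop?.degraded) {
    hints.push("gateway event loop degraded");
  }
  const suffix = hints.length > 0 ? ` (${hints.join("; ")})` : "";
  return `discord/${account.accountId}: ${state}${suffix}`;
}
```

An account with no health data renders exactly as before, which matches the "Before" line; the hints only appear when the optional fields are populated.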

Security Impact (required)

  • New permissions/capabilities? (Yes/No) No
  • Secrets/tokens handling changed? (Yes/No) No
  • New/changed network calls? (Yes/No) No
  • Command/tool execution surface changed? (Yes/No) No
  • Data access scope changed? (Yes/No) No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS host plus Docker live reproduction; Testbox Linux CI-parity changed gate.
  • Runtime/container: Docker image openclaw-75346-hardened-status for patched live validation; current branch rebased on latest origin/main.
  • Model/provider: OpenAI key available for live path where needed; no model response changes in this PR.
  • Integration/channel (if any): live Discord bot/guild/channel validation with secrets redacted.
  • Relevant config (redacted): Discord bot token, guild ID, channel ID, user ID, and OpenAI API key supplied outside git.

Steps

  1. Reproduced the current-branch issue behavior before patch with live Discord/Docker validation.
  2. Applied status hardening and regression tests.
  3. Rebuilt and ran patched Docker live validation with a forced event-loop stall.
  4. Verified CLI/RPC status output and fetch-timeout logs surfaced degraded event-loop and transport health.

Expected

  • Status paths identify degraded Discord transport and event-loop starvation instead of reporting only healthy running state.

Actual

  • Patched status returned after the forced stall with eventLoop.degraded: true, and the CLI/status issue paths expose the degraded health signals.
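
One way a flag like `eventLoop.degraded` could be derived is by sampling how late a repeating timer fires. The sketch below is an assumption about the general technique, not the Gateway's implementation in src/gateway/server/event-loop-health.ts; `createEventLoopLagTracker`, its thresholds, and the snapshot shape are hypothetical.

```typescript
interface EventLoopSnapshot {
  degraded: boolean;
  maxLagMs: number;
}

// Hypothetical tracker: call sample() from a repeating timer with the current time.
function createEventLoopLagTracker(intervalMs: number, degradedThresholdMs: number) {
  let lastSampleAtMs: number | null = null;
  let maxLagMs = 0;
  return {
    // Records how much later than the expected interval this sample arrived.
    sample(nowMs: number): void {
      if (lastSampleAtMs !== null) {
        const lagMs = Math.max(0, nowMs - lastSampleAtMs - intervalMs);
        maxLagMs = Math.max(maxLagMs, lagMs);
      }
      lastSampleAtMs = nowMs;
    },
    // Reports degraded once any observed lag reaches the threshold.
    snapshot(): EventLoopSnapshot {
      return { degraded: maxLagMs >= degradedThresholdMs, maxLagMs };
    },
  };
}
```

In a live process the tracker would be driven by something like `setInterval(() => tracker.sample(Date.now()), intervalMs)`; a forced stall that blocks the thread delays the next callback, and that delay shows up as lag.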

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Validation run after rebase:

  • git diff --check
  • pnpm exec oxfmt --check --threads=1 CHANGELOG.md
  • pnpm check:changelog-attributions
  • pnpm test src/gateway/server-methods/channels.status.test.ts extensions/discord/src/status-issues.test.ts src/commands/channels.surfaces-signal-runtime-errors-channels-status-output.test.ts src/commands/status.command-sections.test.ts src/utils/fetch-timeout.test.ts
  • Testbox OPENCLAW_TESTBOX=1 pnpm check:changed on tbx_01kqnjpj34468hzpshgp60hz8p

Human Verification (required)

  • Verified scenarios: live Discord current-branch reproduction, patched Docker event-loop-stall validation, CLI/status rendering, RPC payload propagation, Discord issue collection, fetch-timeout late-timer logging.
  • Edge cases checked: degraded Gateway event loop, account health states including stale/stuck/disconnected/not-running, running-but-not-connected Discord account state, late timeout timer delay hints.
  • What you did not verify: Dave's exact Linux x64 / Node 22.22.2 machine; validation used available Docker/Linux-equivalent and maintainer live Discord credentials.
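
The account health states listed under "Edge cases checked" could be produced by a classifier along the following lines. The PR text does not define what separates stale from stuck, so the definitions here are assumptions: "stale" means no transport event within a window, and "stuck" means pending work older than a window. `AccountRuntime`, `classifyAccountHealth`, and both thresholds are hypothetical.

```typescript
type AccountHealth = "healthy" | "not-running" | "disconnected" | "stale" | "stuck";

// Hypothetical runtime snapshot for one channel account.
interface AccountRuntime {
  running: boolean;
  connected: boolean;
  lastTransportEventAtMs: number | null;
  oldestPendingWorkAtMs: number | null;
}

const STALE_AFTER_MS = 5 * 60_000; // assumed window with no transport events
const STUCK_AFTER_MS = 2 * 60_000; // assumed age limit for pending work

function classifyAccountHealth(rt: AccountRuntime, nowMs: number): AccountHealth {
  if (!rt.running) return "not-running";
  // Covers the running-but-not-connected case.
  if (!rt.connected) return "disconnected";
  if (rt.oldestPendingWorkAtMs !== null && nowMs - rt.oldestPendingWorkAtMs > STUCK_AFTER_MS) {
    return "stuck";
  }
  if (rt.lastTransportEventAtMs !== null && nowMs - rt.lastTransportEventAtMs > STALE_AFTER_MS) {
    return "stale";
  }
  return "healthy";
}
```

Checking `running` and `connected` first means the time-based states are only reported for accounts that otherwise look healthy, which is the case the status output previously missed.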

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

Compatibility / Migration

  • Backward compatible? (Yes/No) Yes
  • Config/env changes? (Yes/No) No
  • Migration needed? (Yes/No) No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: Status output gains new optional health fields that callers might not expect.
    • Mitigation: fields are additive/optional, and existing summaries remain intact.
  • Risk: Event-loop starvation diagnostics could be noisy in severe host overload.
    • Mitigation: warnings are only added when existing health state or timeout elapsed-vs-budget evidence indicates degradation.
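
For callers worried about the first risk, a defensive read of the new field might look like the sketch below. Only the `eventLoop.degraded` path comes from this PR; the helper name `readEventLoopDegraded` is hypothetical, and the payload is treated as untyped JSON so that older Gateways that omit the field are handled.

```typescript
// Returns true/false when the payload carries eventLoop.degraded,
// and undefined when the field is absent (for example, an older Gateway).
function readEventLoopDegraded(payload: unknown): boolean | undefined {
  if (typeof payload !== "object" || payload === null) return undefined;
  const eventLoop = (payload as { eventLoop?: unknown }).eventLoop;
  if (typeof eventLoop !== "object" || eventLoop === null) return undefined;
  const degraded = (eventLoop as { degraded?: unknown }).degraded;
  return typeof degraded === "boolean" ? degraded : undefined;
}
```

Treating "absent" as a third state, rather than as "not degraded", lets a caller distinguish a healthy Gateway from one that simply does not report event-loop health.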

@openclaw-barnacle Bot added the channel: discord, gateway, commands, size: M, and maintainer labels on May 3, 2026
@joshavant force-pushed the harden-discord-stall-status branch from 26d58e3 to 5fd0a22 on May 3, 2026 00:14
@clawsweeper
Contributor

clawsweeper Bot commented May 3, 2026

Codex review: needs changes before merge.

Summary
The PR adds Gateway event-loop and channel health signals to channels/status surfaces, Discord runtime status issues, fetch-timeout late-timer diagnostics, tests, protocol schema coverage, and a changelog entry.

Reproducibility: yes. Source inspection of current main shows the status payload lacks event-loop health propagation, and the PR's remaining blocker is reproduced by exact-head CI plus generated Swift models that still omit eventLoop.

Next step before merge
The remaining blocker is a narrow mechanical protocol-generation repair on an automerge-opted PR branch.

Security
Cleared: No concrete security or supply-chain concern was found; the diff is limited to additive diagnostics, status rendering, protocol schema/tests, and changelog text.

Review findings

  • [P2] Regenerate protocol artifacts — src/gateway/protocol/schema/channels.ts:320
Review details

Best possible solution:

Ship the additive diagnostics with the protocol schema and generated client artifacts aligned, then let exact-head checks gate automerge.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection of current main shows the status payload lacks event-loop health propagation, and the PR's remaining blocker is reproduced by exact-head CI plus generated Swift models that still omit eventLoop.

Is this the best way to solve the issue?

No, not yet. The implementation direction is maintainable, but the protocol schema change must be paired with regenerated protocol artifacts before merge.

Full review comments:

  • [P2] Regenerate protocol artifacts — src/gateway/protocol/schema/channels.ts:320
    Adding eventLoop to ChannelsStatusResultSchema changes the generated protocol contract, but the PR does not update the generated Swift gateway models, and exact-head checks-fast-protocol is failing. Run pnpm protocol:check and commit the regenerated protocol outputs so schema consumers stay in sync.
    Confidence: 0.91

Overall correctness: patch is incorrect
Overall confidence: 0.91

Acceptance criteria:

  • pnpm protocol:check
  • pnpm test src/gateway/protocol/channels.schema.test.ts src/gateway/server-methods/channels.status.test.ts src/gateway/server-methods/server-methods.test.ts src/commands/channels.surfaces-signal-runtime-errors-channels-status-output.test.ts src/commands/status.command-sections.test.ts extensions/discord/src/status-issues.test.ts src/utils/fetch-timeout.test.ts
  • pnpm check:changed

What I checked:

Likely related people:

  • steipete: Recent history shows repeated maintenance of channel health policy, channels.status, health summaries, and the gateway protocol schema that this PR extends. (role: recent maintainer and Gateway/status contract owner; confidence: high; commits: eb02161bbe95, 9fcae8458e91, d8d0380297f4; files: src/gateway/channel-health-policy.ts, src/gateway/server-methods/channels.ts, src/gateway/protocol/schema/channels.ts)
  • vincentkoc: History shows work introducing Gateway event-loop readiness health and preserving runtime-backed health state, both directly adjacent to this PR's propagation path. (role: adjacent event-loop and runtime-health contributor; confidence: medium; commits: 75ba8398f939, be6263da4f51; files: src/gateway/server/event-loop-health.ts, src/commands/health.ts)
  • scoootscooob: History shows the Discord channel implementation was moved into the extension area, which is the owner boundary for the Discord status issue collector touched here. (role: Discord extension history owner; confidence: low; commits: 5682ec37fada; files: extensions/discord/src/status-issues.ts)

Remaining risk / open question:

  • Exact-head CI is not merge-ready: checks-fast-protocol failed, and several broader checks were still in progress at inspection time.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 31ed93ff58db.

@joshavant force-pushed the harden-discord-stall-status branch 2 times, most recently from 5ec6e29 to 587fb0d, on May 3, 2026 00:31
@joshavant
Contributor Author

@clawsweeper automerge

@clawsweeper Bot added the clawsweeper:automerge label (Maintainer opted this PR into bounded ClawSweeper-reviewed automerge) on May 3, 2026
@clawsweeper
Contributor

clawsweeper Bot commented May 3, 2026

ClawSweeper 🐠 automerge status

ClawSweeper took another look; no safe branch change was available on this pass.

Executor outcome: source PR branch changed while the repair worker was preparing its push; requeue against the latest head.
Worker summary: This PR is the canonical same-repo, writable, automerge-opted repair candidate for Discord stalled transport health. It is not merge-ready because ClawSweeper review and exact-head CI identify one narrow blocker: the protocol artifacts must be regenerated so that the additive channels.status event-loop field is reflected in the generated client models. No close or merge action is allowed by this job.

Worker actions:

  • fix_needed on this PR: planned - Canonical PR is useful and branch-writable, but relevant CI and ClawSweeper review block automerge until generated protocol artifacts are repaired.
  • keep_related on #75346: planned - The issue is linked to the canonical repair path but still carries live-verification scope and cannot be closed in this job.
  • build_fix_artifact on cluster:automerge-openclaw-openclaw-76327: planned - The executor needs a deterministic repair artifact for this PR before automerge can continue.

No branch push, rebase, replacement PR, merge, or ClawSweeper re-review was started on this pass.

fish notes: model gpt-5.5, reasoning high.

Automerge progress:

  • 2026-05-03 00:33:43 UTC review queued [`587fb0ddeefb`](https://github.com/openclaw/openclaw/commit/587fb0ddeefbdb68c31770672938c6f939873f33) (queued)
  • 2026-05-03 00:37:50 UTC review requested repair [`587fb0ddeefb`](https://github.com/openclaw/openclaw/commit/587fb0ddeefbdb68c31770672938c6f939873f33) (structured ClawSweeper marker: fix-required (finding=review-feedback sha=587fb0...)
  • 2026-05-03 00:38:06 UTC repair queued [`587fb0ddeefb`](https://github.com/openclaw/openclaw/commit/587fb0ddeefbdb68c31770672938c6f939873f33) (autonomous) Run: https://github.com/openclaw/clawsweeper/actions/runs/25265786688
  • 2026-05-03 00:49:49 UTC repair completed [`265ac3e6178d`](https://github.com/openclaw/openclaw/commit/265ac3e6178db2314da59c09a1e6171f08ec9362) (branch updated) in 8m 50s Run: https://github.com/openclaw/clawsweeper/actions/runs/25265786688 initial automerge rebase is delegated to Codex repair
  • 2026-05-03 00:49:48 UTC review queued [`265ac3e6178d`](https://github.com/openclaw/openclaw/commit/265ac3e6178db2314da59c09a1e6171f08ec9362) (after repair)
  • 2026-05-03 00:53:33 UTC review passed [`265ac3e6178d`](https://github.com/openclaw/openclaw/commit/265ac3e6178db2314da59c09a1e6171f08ec9362) (structured ClawSweeper verdict: pass (sha=265ac3e6178db2314da59c09a1e6171f08ec9...)
  • 2026-05-03 00:53:47 UTC repair queued [`265ac3e6178d`](https://github.com/openclaw/openclaw/commit/265ac3e6178db2314da59c09a1e6171f08ec9362) (autonomous) Run: https://github.com/openclaw/clawsweeper/actions/runs/25266063284
  • 2026-05-03 01:30:28 UTC repair completed (no branch change) in 34m 8s Run: https://github.com/openclaw/clawsweeper/actions/runs/25266063284 validation command failed (pnpm check:changed): command timed out after 30000ms: pnpm check:changed [check:changed] lanes=core, coreTests, extensions, extensio...
  • 2026-05-03 03:06:52 UTC review requested repair [`ff35223b2da0`](https://github.com/openclaw/openclaw/commit/ff35223b2da009cfb6f4efc5855054aea65a9b19) (structured ClawSweeper marker: fix-required (finding=review-feedback sha=ff3522...)
  • 2026-05-03 03:07:10 UTC repair queued [`ff35223b2da0`](https://github.com/openclaw/openclaw/commit/ff35223b2da009cfb6f4efc5855054aea65a9b19) (autonomous) Run: https://github.com/openclaw/clawsweeper/actions/runs/25268411455
  • 2026-05-03 03:18:10 UTC review queued [`ff35223b2da0`](https://github.com/openclaw/openclaw/commit/ff35223b2da009cfb6f4efc5855054aea65a9b19) (queued)
  • 2026-05-03 03:19:20 UTC repair completed [`5b2e8ac15ff6`](https://github.com/openclaw/openclaw/commit/5b2e8ac15ff60f9938dc160a68f8039cab913562) (branch updated) in 9m 43s Run: https://github.com/openclaw/clawsweeper/actions/runs/25268411455 initial automerge rebase is delegated to Codex repair
  • 2026-05-03 03:19:19 UTC review queued [`5b2e8ac15ff6`](https://github.com/openclaw/openclaw/commit/5b2e8ac15ff60f9938dc160a68f8039cab913562) (after repair)
  • 2026-05-03 03:23:16 UTC review requested repair [`5b2e8ac15ff6`](https://github.com/openclaw/openclaw/commit/5b2e8ac15ff60f9938dc160a68f8039cab913562) (structured ClawSweeper marker: fix-required (finding=review-feedback sha=5b2e8a...)
  • 2026-05-03 03:23:30 UTC repair queued [`5b2e8ac15ff6`](https://github.com/openclaw/openclaw/commit/5b2e8ac15ff60f9938dc160a68f8039cab913562) (autonomous) Run: https://github.com/openclaw/clawsweeper/actions/runs/25268708640
  • 2026-05-03 03:56:05 UTC repair completed (no branch change) in 29m 42s Run: https://github.com/openclaw/clawsweeper/actions/runs/25268708640 source PR branch changed while the repair worker was preparing its push; requeue against the latest head

@clawsweeper Bot force-pushed the harden-discord-stall-status branch from 587fb0d to 265ac3e on May 3, 2026 00:49
@joshavant force-pushed the harden-discord-stall-status branch from 265ac3e to ff35223 on May 3, 2026 03:02
@joshavant
Contributor Author

@clawsweeper automerge

@clawsweeper Bot force-pushed the harden-discord-stall-status branch from ff35223 to 5b2e8ac on May 3, 2026 03:19
@openclaw-barnacle Bot added the app: web-ui label on May 3, 2026
@joshavant merged commit ba31afb into main on May 3, 2026
103 of 104 checks passed
@joshavant deleted the harden-discord-stall-status branch on May 3, 2026 03:33
arieldiego73 pushed a commit to arieldiego73/openclaw that referenced this pull request May 5, 2026
* fix(discord): surface stalled transport health

* fix(discord): surface stalled transport health

* fix(discord): surface stalled transport health

---------

Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
lxe pushed a commit to lxe/openclaw that referenced this pull request May 6, 2026
* fix(discord): surface stalled transport health

* fix(discord): surface stalled transport health

* fix(discord): surface stalled transport health

---------

Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>
github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
* fix(discord): surface stalled transport health

* fix(discord): surface stalled transport health

* fix(discord): surface stalled transport health

---------

Co-authored-by: clawsweeper <274271284+clawsweeper[bot]@users.noreply.github.com>

Labels

  • app: web-ui (App: web-ui)
  • channel: discord (Channel integration: discord)
  • clawsweeper:automerge (Maintainer opted this PR into bounded ClawSweeper-reviewed automerge)
  • commands (Command implementations)
  • gateway (Gateway runtime)
  • maintainer (Maintainer-authored PR)
  • size: M


Development

Successfully merging this pull request may close these issues.

Discord inbound messages not reaching agent after gateway reconnects

1 participant