fix: recover Slack channel restart after stop timeout #77686

Merged
kevinslin merged 2 commits into openclaw:main from kevinslin:dev/kevinlin/slack-recovery-stop-timeout
May 5, 2026

Conversation

@kevinslin
Contributor

@kevinslin kevinslin commented May 5, 2026

Summary

  • Treat channel health-monitor restarts as recovery stops, not manual stops, so a timed-out stop does not poison future auto-reconnect state.
  • Preserve manual-stop timeout behavior for user-initiated stops while marking recovery stop timeouts as restart-pending.
  • Add regression coverage for the timed-out recovery stop path and the manual-stop-during-recovery-backoff path.
  • Update the changelog.

Fixes #77651.

Real behavior proof

  • Behavior or issue addressed: A recovery restart that times out while stopping a stuck channel task should remain eligible for auto-restart when the old task finally settles, while later manual stops should still cancel recovery backoff.
  • Real environment tested: Local OpenClaw checkout on macOS, running the actual gateway createChannelManager implementation through Node/tsx with a temporary channel plugin installed in the runtime registry.
  • Exact steps or command run after this patch: From repo root, ran a node --import tsx --input-type=module runtime harness that imports src/gateway/server-channels.ts, starts a channel account whose first task ignores abort until explicitly released, calls stopChannel('discord', 'default', { manual: false }), immediately requests startChannel, releases the old task, waits for the manager's real auto-restart path, then performs a manual cleanup stop.
  • Evidence after fix: Terminal output from the local runtime harness:
after start: startCalls=1
[recovery-proof] [default] channel stop exceeded 5000ms after abort; continuing shutdown
after recovery stop timeout: elapsedMs=5023 running=false restartPending=true manuallyStopped=false lastError=channel stop timed out after 5000ms
after immediate restart request while old task is stuck: startCalls=1
[recovery-proof] [default] auto-restart attempt 1/10 in 5s
after old task settled: startCalls=2 running=true restartPending=false reconnectAttempts=1 lastError=null
events=startAccount(default) #1 | first task received abort but stays stuck until release | first task settled after simulated stuck stop | startAccount(default) #2
cleanup manual stop: running=false restartPending=false manuallyStopped=true
  • Observed result after fix: The recovery stop timed out without setting the account as manually stopped, the immediate restart request did not double-start while the old task was still stuck, and after the old task settled the manager auto-restarted the account (startCalls=2, running=true, lastError=null). A later manual stop left restartPending=false and manuallyStopped=true.
  • What was not tested: Live Slack credentials/socket reconnection. The proof uses the real gateway channel manager and a temporary local channel plugin to reproduce the stuck-task lifecycle deterministically.
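
The stuck-task trick the harness relies on can be sketched as follows: a fake channel task that observes abort but refuses to settle until explicitly released, which makes the stop-timeout path deterministic. makeStuckTask and its return shape are hypothetical, not the real plugin API:

```typescript
// A fake channel task that records the abort signal but stays "stuck"
// until release() is called, reproducing a stop timeout deterministically.
// This is an illustrative sketch, not the harness's actual plugin code.
function makeStuckTask() {
  let release!: () => void;
  const released = new Promise<void>((resolve) => (release = resolve));
  const run = async (signal: AbortSignal): Promise<string> => {
    let sawAbort = false;
    signal.addEventListener("abort", () => { sawAbort = true; }, { once: true });
    await released; // ignore abort until the test releases the task
    return sawAbort ? "settled-after-abort" : "settled-without-abort";
  };
  return { run, release };
}
```

A harness can abort the signal, let the stop timeout elapse, then call release() to simulate the old task finally settling and trigger the manager's auto-restart path.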

Verification

  • pnpm exec oxfmt --check --threads=1 src/gateway/server-channels.ts src/gateway/server-channels.test.ts src/gateway/channel-health-monitor.ts src/gateway/channel-health-monitor.test.ts
  • git diff --check HEAD~2..HEAD
  • pnpm changed:lanes --json
  • pnpm test src/gateway/server-channels.test.ts src/gateway/channel-health-monitor.test.ts
  • Local Node/tsx runtime harness output copied in Real behavior proof above.

Notes

  • pnpm changed:lanes --json selected broad lanes after the rebase because the branch had been rewritten on top of current main; the touched-surface checks above are the local targeted proof for this PR.
  • Testbox/Crabbox broad validation was not run from this environment because neither blacksmith nor crabbox is installed here.

@openclaw-barnacle openclaw-barnacle Bot added the gateway (Gateway runtime), size: S, and maintainer (Maintainer-authored PR) labels May 5, 2026
@kevinslin kevinslin force-pushed the dev/kevinlin/slack-recovery-stop-timeout branch 2 times, most recently from 48b7ff4 to ca57928 Compare May 5, 2026 04:36

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ca5792855e


Comment thread on src/gateway/server-channels.ts (outdated)
@clawsweeper
Contributor

clawsweeper Bot commented May 5, 2026

Codex review: needs maintainer review before merge.

Summary
The PR adds a non-manual recovery-stop mode for health-monitor channel restarts, preserves manual-stop timeout behavior, adds regression tests, and updates the changelog.

Reproducibility: yes, at the source level: current main routes health-monitor restarts through the manual stop path, and the timeout branch leaves manuallyStopped set alongside non-restarting runtime state. I did not run a live Slack event-loop-starvation reproduction in this read-only review.

Real behavior proof
Sufficient (terminal): The PR body includes after-fix terminal output from a real local Gateway manager harness showing the recovery timeout path auto-restarts and later manual stop still cancels recovery.

Next step before merge
The protected maintainer label keeps this in human review/merge flow, and no narrow automated repair remains after the latest patch.

Security
Cleared: The patch changes Gateway lifecycle logic, tests, and changelog text without introducing a concrete security or supply-chain regression.

Review details

Best possible solution:

Land the narrow Gateway lifecycle fix after maintainer review and required checks, keeping manual stops suppressing restarts while recovery stops can reconnect after stop timeouts.

Do we have a high-confidence way to reproduce the issue?

Yes, at the source level: current main routes health-monitor restarts through the manual stop path, and the timeout branch leaves manuallyStopped set alongside non-restarting runtime state. I did not run a live Slack event-loop-starvation reproduction in this read-only review.

Is this the best way to solve the issue?

Yes: splitting manual and recovery stops at the ChannelManager boundary is the narrowest maintainable fix, and the latest patch keeps manual stops during recovery backoff cancellable with a separate abort signal.
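
A minimal sketch of that cancellable backoff, assuming a fresh AbortController per recovery attempt; sleepWithAbort is an illustrative helper, not the PR's actual code:

```typescript
// Cancellable recovery-backoff sleep. Resolves true when the backoff
// elapses (proceed with restart) and false when the signal aborts
// (manual stop cancelled the pending restart). Illustrative sketch only.
function sleepWithAbort(ms: number, signal: AbortSignal): Promise<boolean> {
  return new Promise((resolve) => {
    if (signal.aborted) {
      resolve(false);
      return;
    }
    const timer = setTimeout(() => resolve(true), ms);
    signal.addEventListener(
      "abort",
      () => {
        clearTimeout(timer);
        resolve(false); // backoff cancelled: skip the restart attempt
      },
      { once: true },
    );
  });
}
```

A recovery loop would await sleepWithAbort(backoffMs, controller.signal) and only restart the account when it resolves true; a manual stop during backoff calls controller.abort(), which cancels the attempt instead of leaving it scheduled.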

Acceptance criteria:

  • pnpm test src/gateway/server-channels.test.ts src/gateway/channel-health-monitor.test.ts
  • pnpm exec oxfmt --check --threads=1 src/gateway/server-channels.ts src/gateway/server-channels.test.ts src/gateway/channel-health-monitor.ts src/gateway/channel-health-monitor.test.ts
  • git diff --check
  • pnpm check:changed

What I checked:

  • Current main routes health restarts through manual stop: On current main, the health monitor restarts unhealthy running accounts by calling stopChannel(channelId, accountId) with no recovery/manual distinction. (src/gateway/channel-health-monitor.ts:166, 557c5bf70521)
  • Current main timeout leaves poisoned state: Current stopChannel adds the account to manuallyStopped before aborting, and the timeout branch leaves running: true with restartPending: false, which matches the linked Slack dead-socket failure chain. (src/gateway/server-channels.ts:659, 557c5bf70521)
  • PR marks health-monitor stops as recovery stops: The latest patch changes the health-monitor restart path to call stopChannel(..., { manual: false }), while existing callers keep the default manual behavior. (src/gateway/channel-health-monitor.ts:166, b77aec55c2bc)
  • PR keeps recovery backoff cancellable: The latest patch adds recovery-stop timeout tracking and uses a fresh AbortController for the recovery restart sleep, addressing the prior review finding that manual stops during recovery backoff must remain cancellable. (src/gateway/server-channels.ts:573, b77aec55c2bc)
  • Regression coverage added: The PR adds coverage for a recovery stop timeout not poisoning auto-restart state and for manual stop cancellation during the recovery backoff path. (src/gateway/server-channels.test.ts:330, b77aec55c2bc)
  • Real behavior proof supplied: The PR body now includes terminal output from a local Node/tsx harness using the real createChannelManager, showing restartPending=true after the recovery timeout, startCalls=2 after the old task settles, and later manual cleanup setting manuallyStopped=true. (b77aec55c2bc)

Likely related people:

  • steipete: Current-main blame attributes the relevant server-channels.ts manual-stop/restart loop and channel-health-monitor.ts restart call path to Peter's current-main snapshot, and nearby history shows Peter refactored channel health policy. (role: recent maintainer and current gateway lifecycle owner; confidence: high; commits: 2f3a9629d837, 2f6718b8e749; files: src/gateway/server-channels.ts, src/gateway/channel-health-monitor.ts, src/gateway/channel-health-policy.ts)
  • Dennis Rankin: The stale Slack Socket Mode detection and health-monitor auto-restart behavior central to this bug appears to date to commit a28a4b1, which added stale socket restart handling. (role: introduced related Slack stale-socket restart behavior; confidence: medium; commits: a28a4b1b619a; files: src/gateway/channel-health-monitor.ts, src/gateway/channel-health-policy.ts)
  • Takhoffman: Recent adjacent work hardened stale-socket restarts before the first event and centralized connect-time liveness tracking, which shares the same health-monitor surface. (role: adjacent stale-socket restart maintainer; confidence: medium; commits: 8873e13f1e6f; files: src/gateway/channel-health-monitor.ts, src/gateway/channel-health-policy.ts)
  • vincentkoc: Recent history shows Vincent maintaining channel-backed readiness probes that consume restartPending and Slack Socket Mode documentation, adjacent to the observable state changed here. (role: adjacent channel readiness and Slack docs maintainer; confidence: medium; commits: ab5fcfcc0128, 04b7e4894d72; files: src/gateway/server/readiness.ts, src/gateway/server/readiness.test.ts, docs/channels/slack.md)

Remaining risk / open question:

  • I did not execute the PR tests in this read-only review; validation relies on static diff inspection and the contributor's reported commands.
  • Live Slack Socket Mode reconnection with real credentials is still unproved; the supplied proof uses the real Gateway channel manager with a temporary local channel plugin.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 557c5bf70521.

@openclaw-barnacle openclaw-barnacle Bot added the channel: whatsapp-web (Channel integration: whatsapp-web) label May 5, 2026
@kevinslin kevinslin force-pushed the dev/kevinlin/slack-recovery-stop-timeout branch from 0b8c713 to b77aec5 Compare May 5, 2026 05:01
@openclaw-barnacle openclaw-barnacle Bot removed the channel: whatsapp-web (Channel integration: whatsapp-web) label May 5, 2026
@kevinslin kevinslin merged commit 5a8ccb6 into openclaw:main May 5, 2026
100 of 103 checks passed
@kevinslin kevinslin deleted the dev/kevinlin/slack-recovery-stop-timeout branch May 5, 2026 05:49
@aiZKP

aiZKP commented May 6, 2026

Hi @kevinslin I hope you are doing well. I'd like to discuss something regarding contribution. If you're free, can we connect on Telegram?

github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
* fix: recover Slack channel restart after stop timeout

* fix: keep recovery restart cancellable

Labels

gateway (Gateway runtime) · maintainer (Maintainer-authored PR) · size: S


Development

Successfully merging this pull request may close these issues.

[Bug]: Slack socket permanently dead after event-loop starvation — manuallyStopped suppresses auto-reconnect

2 participants