fix: recover Slack channel restart after stop timeout #77686

Merged
kevinslin merged 2 commits into openclaw:main from kevinslin:dev/kevinlin/slack-recovery-stop-timeout
May 5, 2026

Conversation

@kevinslin
Contributor

@kevinslin kevinslin commented May 5, 2026

Summary

  • Treat channel health-monitor restarts as recovery stops, not manual stops, so a timed-out stop does not poison future auto-reconnect state.
  • Preserve manual-stop timeout behavior for user-initiated stops while marking recovery stop timeouts as restart-pending.
  • Add regression coverage for the timed-out recovery stop path and the manual-stop-during-recovery-backoff path.
  • Update the changelog.

Fixes #77651.

Real behavior proof

  • Behavior or issue addressed: A recovery restart that times out while stopping a stuck channel task should remain eligible for auto-restart when the old task finally settles, while later manual stops should still cancel recovery backoff.
  • Real environment tested: Local OpenClaw checkout on macOS, running the actual gateway createChannelManager implementation through Node/tsx with a temporary channel plugin installed in the runtime registry.
  • Exact steps or command run after this patch: From repo root, ran a node --import tsx --input-type=module runtime harness that imports src/gateway/server-channels.ts, starts a channel account whose first task ignores abort until explicitly released, calls stopChannel('discord', 'default', { manual: false }), immediately requests startChannel, releases the old task, waits for the manager's real auto-restart path, then performs a manual cleanup stop.
  • Evidence after fix: Terminal output from the local runtime harness:
after start: startCalls=1
[recovery-proof] [default] channel stop exceeded 5000ms after abort; continuing shutdown
after recovery stop timeout: elapsedMs=5023 running=false restartPending=true manuallyStopped=false lastError=channel stop timed out after 5000ms
after immediate restart request while old task is stuck: startCalls=1
[recovery-proof] [default] auto-restart attempt 1/10 in 5s
after old task settled: startCalls=2 running=true restartPending=false reconnectAttempts=1 lastError=null
events=startAccount(default) #1 | first task received abort but stays stuck until release | first task settled after simulated stuck stop | startAccount(default) #2
cleanup manual stop: running=false restartPending=false manuallyStopped=true
  • Observed result after fix: The recovery stop timed out without setting the account as manually stopped, the immediate restart request did not double-start while the old task was still stuck, and after the old task settled the manager auto-restarted the account (startCalls=2, running=true, lastError=null). A later manual stop left restartPending=false and manuallyStopped=true.
  • What was not tested: Live Slack credentials/socket reconnection. The proof uses the real gateway channel manager and a temporary local channel plugin to reproduce the stuck-task lifecycle deterministically.
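
The stuck-task trick the harness relies on can be sketched as follows: a fake channel task that observes abort but refuses to settle until explicitly released, which makes the stop-timeout path deterministic. makeStuckTask and its return shape are hypothetical, not the real plugin API:

```typescript
// A fake channel task that records the abort signal but stays "stuck"
// until release() is called, reproducing a stop timeout deterministically.
// This is an illustrative sketch, not the harness's actual plugin code.
function makeStuckTask() {
  let release!: () => void;
  const released = new Promise<void>((resolve) => (release = resolve));
  const run = async (signal: AbortSignal): Promise<string> => {
    let sawAbort = false;
    signal.addEventListener("abort", () => { sawAbort = true; }, { once: true });
    await released; // ignore abort until the test releases the task
    return sawAbort ? "settled-after-abort" : "settled-without-abort";
  };
  return { run, release };
}
```

A harness can abort the signal, let the stop timeout elapse, then call release() to simulate the old task finally settling and trigger the manager's auto-restart path.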

Verification

  • pnpm exec oxfmt --check --threads=1 src/gateway/server-channels.ts src/gateway/server-channels.test.ts src/gateway/channel-health-monitor.ts src/gateway/channel-health-monitor.test.ts
  • git diff --check HEAD~2..HEAD
  • pnpm changed:lanes --json
  • pnpm test src/gateway/server-channels.test.ts src/gateway/channel-health-monitor.test.ts
  • Local Node/tsx runtime harness output copied in Real behavior proof above.

Notes

  • pnpm changed:lanes --json selected broad lanes after the rebase because the branch had been rewritten on top of current main; the touched-surface checks above are the local targeted proof for this PR.
  • Testbox/Crabbox broad validation was not run from this environment because neither blacksmith nor crabbox is installed here.

@openclaw-barnacle openclaw-barnacle Bot added the gateway (Gateway runtime), size: S, and maintainer (Maintainer-authored PR) labels May 5, 2026
@kevinslin kevinslin force-pushed the dev/kevinlin/slack-recovery-stop-timeout branch 2 times, most recently from 48b7ff4 to ca57928 Compare May 5, 2026 04:36

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ca5792855e


Comment thread on src/gateway/server-channels.ts (outdated)
@clawsweeper
Contributor

clawsweeper Bot commented May 5, 2026

Codex review: needs maintainer review before merge.

Summary
The PR adds a non-manual recovery-stop mode for health-monitor channel restarts, preserves manual-stop timeout behavior, adds regression tests, and updates the changelog.

Reproducibility: yes, at the source level: current main routes health-monitor restarts through the manual stop path, and the timeout branch leaves manuallyStopped set alongside non-restarting runtime state. I did not run a live Slack event-loop-starvation reproduction in this read-only review.

Real behavior proof
Sufficient (terminal): The PR body includes after-fix terminal output from a real local Gateway manager harness showing the recovery timeout path auto-restarts and later manual stop still cancels recovery.

Next step before merge
The protected maintainer label keeps this in human review/merge flow, and no narrow automated repair remains after the latest patch.

Security
Cleared: The patch changes Gateway lifecycle logic, tests, and changelog text without introducing a concrete security or supply-chain regression.

Review details

Best possible solution:

Land the narrow Gateway lifecycle fix after maintainer review and required checks, keeping manual stops suppressing restarts while recovery stops can reconnect after stop timeouts.

Do we have a high-confidence way to reproduce the issue?

Yes, at the source level: current main routes health-monitor restarts through the manual stop path, and the timeout branch leaves manuallyStopped set alongside non-restarting runtime state. I did not run a live Slack event-loop-starvation reproduction in this read-only review.

Is this the best way to solve the issue?

Yes: splitting manual and recovery stops at the ChannelManager boundary is the narrowest maintainable fix, and the latest patch keeps manual stops during recovery backoff cancellable with a separate abort signal.
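
A minimal sketch of that cancellable backoff, assuming a fresh AbortController per recovery attempt; sleepWithAbort is an illustrative helper, not the PR's actual code:

```typescript
// Cancellable recovery-backoff sleep. Resolves true when the backoff
// elapses (proceed with restart) and false when the signal aborts
// (manual stop cancelled the pending restart). Illustrative sketch only.
function sleepWithAbort(ms: number, signal: AbortSignal): Promise<boolean> {
  return new Promise((resolve) => {
    if (signal.aborted) {
      resolve(false);
      return;
    }
    const timer = setTimeout(() => resolve(true), ms);
    signal.addEventListener(
      "abort",
      () => {
        clearTimeout(timer);
        resolve(false); // backoff cancelled: skip the restart attempt
      },
      { once: true },
    );
  });
}
```

A recovery loop would await sleepWithAbort(backoffMs, controller.signal) and only restart the account when it resolves true; a manual stop during backoff calls controller.abort(), which cancels the attempt instead of leaving it scheduled.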

Acceptance criteria:

  • pnpm test src/gateway/server-channels.test.ts src/gateway/channel-health-monitor.test.ts
  • pnpm exec oxfmt --check --threads=1 src/gateway/server-channels.ts src/gateway/server-channels.test.ts src/gateway/channel-health-monitor.ts src/gateway/channel-health-monitor.test.ts
  • git diff --check
  • pnpm check:changed

What I checked:

  • Current main routes health restarts through manual stop: On current main, the health monitor restarts unhealthy running accounts by calling stopChannel(channelId, accountId) with no recovery/manual distinction. (src/gateway/channel-health-monitor.ts:166, 557c5bf70521)
  • Current main timeout leaves poisoned state: Current stopChannel adds the account to manuallyStopped before aborting, and the timeout branch leaves running: true with restartPending: false, which matches the linked Slack dead-socket failure chain. (src/gateway/server-channels.ts:659, 557c5bf70521)
  • PR marks health-monitor stops as recovery stops: The latest patch changes the health-monitor restart path to call stopChannel(..., { manual: false }), while existing callers keep the default manual behavior. (src/gateway/channel-health-monitor.ts:166, b77aec55c2bc)
  • PR keeps recovery backoff cancellable: The latest patch adds recovery-stop timeout tracking and uses a fresh AbortController for the recovery restart sleep, addressing the prior review finding that manual stops during recovery backoff must remain cancellable. (src/gateway/server-channels.ts:573, b77aec55c2bc)
  • Regression coverage added: The PR adds coverage for a recovery stop timeout not poisoning auto-restart state and for manual stop cancellation during the recovery backoff path. (src/gateway/server-channels.test.ts:330, b77aec55c2bc)
  • Real behavior proof supplied: The PR body now includes terminal output from a local Node/tsx harness using the real createChannelManager, showing restartPending=true after the recovery timeout, startCalls=2 after the old task settles, and later manual cleanup setting manuallyStopped=true. (b77aec55c2bc)

Likely related people:

  • steipete: Current-main blame attributes the relevant server-channels.ts manual-stop/restart loop and channel-health-monitor.ts restart call path to Peter's current-main snapshot, and nearby history shows Peter refactored channel health policy. (role: recent maintainer and current gateway lifecycle owner; confidence: high; commits: 2f3a9629d837, 2f6718b8e749; files: src/gateway/server-channels.ts, src/gateway/channel-health-monitor.ts, src/gateway/channel-health-policy.ts)
  • Dennis Rankin: The stale Slack Socket Mode detection and health-monitor auto-restart behavior central to this bug appears to date to commit a28a4b1, which added stale socket restart handling. (role: introduced related Slack stale-socket restart behavior; confidence: medium; commits: a28a4b1b619a; files: src/gateway/channel-health-monitor.ts, src/gateway/channel-health-policy.ts)
  • Takhoffman: Recent adjacent work hardened stale-socket restarts before the first event and centralized connect-time liveness tracking, which shares the same health-monitor surface. (role: adjacent stale-socket restart maintainer; confidence: medium; commits: 8873e13f1e6f; files: src/gateway/channel-health-monitor.ts, src/gateway/channel-health-policy.ts)
  • vincentkoc: Recent history shows Vincent maintaining channel-backed readiness probes that consume restartPending and Slack Socket Mode documentation, adjacent to the observable state changed here. (role: adjacent channel readiness and Slack docs maintainer; confidence: medium; commits: ab5fcfcc0128, 04b7e4894d72; files: src/gateway/server/readiness.ts, src/gateway/server/readiness.test.ts, docs/channels/slack.md)

Remaining risk / open question:

  • I did not execute the PR tests in this read-only review; validation relies on static diff inspection and the contributor's reported commands.
  • Live Slack Socket Mode reconnection with real credentials is still unproved; the supplied proof uses the real Gateway channel manager with a temporary local channel plugin.

Codex review notes: model gpt-5.5, reasoning high; reviewed against 557c5bf70521.

@openclaw-barnacle openclaw-barnacle Bot added the channel: whatsapp-web (Channel integration: whatsapp-web) label May 5, 2026
@kevinslin kevinslin force-pushed the dev/kevinlin/slack-recovery-stop-timeout branch from 0b8c713 to b77aec5 Compare May 5, 2026 05:01
@openclaw-barnacle openclaw-barnacle Bot removed the channel: whatsapp-web (Channel integration: whatsapp-web) label May 5, 2026
@kevinslin kevinslin merged commit 5a8ccb6 into openclaw:main May 5, 2026
100 of 103 checks passed
@kevinslin kevinslin deleted the dev/kevinlin/slack-recovery-stop-timeout branch May 5, 2026 05:49
@aiZKP

aiZKP commented May 6, 2026

Hi @kevinslin I hope you are doing well. I'd like to discuss something regarding contribution. If you're free, can we connect on Telegram?

github-actions Bot pushed a commit to Desicool/openclaw that referenced this pull request May 9, 2026
* fix: recover Slack channel restart after stop timeout

* fix: keep recovery restart cancellable

Labels

gateway (Gateway runtime) · maintainer (Maintainer-authored PR) · size: S


Development

Successfully merging this pull request may close these issues.

[Bug]: Slack socket permanently dead after event-loop starvation — manuallyStopped suppresses auto-reconnect

2 participants