Skip to content

fix(slack): fail fast when socket mode disconnects#24283

Closed
frankekn wants to merge 2 commits intoopenclaw:mainfrom
frankekn:fix/slack-socket-disconnect-failfast
Closed

fix(slack): fail fast when socket mode disconnects#24283
frankekn wants to merge 2 commits intoopenclaw:mainfrom
frankekn:fix/slack-socket-disconnect-failfast

Conversation

@frankekn
Copy link
Contributor

@frankekn frankekn commented Feb 23, 2026

What

  • watch Slack socket lifecycle disconnect events in monitor startup
  • fail fast by surfacing a provider error when socket mode disconnects
  • add regression test to verify monitor exits on disconnect

Why

Socket mode could silently stop delivering events while the process stayed alive. Exiting fast lets the supervisor restart the provider.

Validation

  • pnpm vitest run src/slack/monitor.tool-result.test.ts src/slack/monitor/provider.group-policy.test.ts
  • pnpm check

Greptile Summary

This PR adds fail-fast behavior for Slack socket mode by monitoring the underlying SocketModeClient lifecycle events (disconnected, unable_to_socket_mode_start). When either event fires, the provider surfaces an error and exits, allowing the supervisor to restart it. Previously, a silent socket disconnect would leave the process alive but unable to receive events.

  • Adds resolveSlackSocketLifecycleClient() to access the receiver.client from the Bolt app, with defensive null/type checks
  • Introduces detachSocketLifecycleListener() with fallback support for both off and removeListener APIs
  • Wires up a Promise.race between the existing abort-wait and a new disconnect-wait, so either graceful shutdown or unexpected disconnect ends the provider
  • Includes proper cleanup (idempotent listener detachment) in both the error path and the finally block
  • Adds a regression test that verifies the monitor rejects when a disconnected event is emitted
  • Extends test helpers with socket client mock (on/off/emit) and handler tracking via a Map<string, Set<handler>>

Confidence Score: 5/5

  • This PR is safe to merge — it adds defensive socket lifecycle monitoring with proper cleanup, guards against race conditions with the abort signal, and includes a focused regression test.
  • The changes are well-scoped and focused on a single concern (socket disconnect detection). The implementation handles edge cases correctly: idempotent cleanup, abort-signal guard to prevent spurious rejections during graceful shutdown, defensive null checks when resolving the lifecycle client, and fallback listener removal APIs. The test properly validates the core disconnect scenario. No existing behavior is altered for HTTP mode (guarded by slackMode === "socket"). The code follows the repo's conventions (strict typing with unknown, no any, under LOC guideline).
  • No files require special attention

Last reviewed commit: 956b807

Context used:

  • Context from dashboard - CLAUDE.md (source)
  • Context from dashboard - AGENTS.md (source)

@openclaw-barnacle openclaw-barnacle bot added channel: slack Channel integration: slack size: S labels Feb 23, 2026
Copy link
Contributor

@markshields-tl markshields-tl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Simplest approach of the three PRs targeting #17847 — fail fast on disconnect and let the supervisor restart.

Pros:

  • Minimal code change, low risk
  • Defensive: if you can't fix reconnect, at least don't pretend to be alive
  • Good test coverage

Concern:

  • Relies on an external supervisor (systemd, launchd, etc.) to restart the gateway. Not all deployments have this — e.g., openclaw gateway start in a terminal session. A silent exit is better than silent death, but auto-reconnect (#27232) is strictly more useful.
  • Doesn't address the "no disconnect event fires at all" failure mode documented in our production data (see my comment on #17847). Sometimes the socket just stops delivering with no event.

Verdict: Good safety net, but #27232 (reconnect loop) + #27241 (staleness watchdog) together would be the complete solution.

— Mort (AI assistant reviewing on behalf of @markshields-tl)

@frankekn
Copy link
Contributor Author

Thanks. One clarification re the supervisor concern: this provider intentionally throws on socket lifecycle disconnect so the Gateway channel manager can auto-restart it (even when running openclaw gateway start in a terminal session). It does not require systemd/launchd.

Follow-up commit 7cb6c13 also avoids a possible unhandled rejection when the disconnect waiter triggers during startup by resolving the waiter and throwing after the race.

Agreed the silent-stall / no-disconnect mode needs a watchdog + reconnect loop (see #27241 / #27232).

@Takhoffman Takhoffman added the close:superseded PR close reason label Mar 1, 2026
@Takhoffman
Copy link
Contributor

Closing with Tak approval: superseded

@Takhoffman Takhoffman closed this Mar 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: slack Channel integration: slack close:superseded PR close reason size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants