fix(slack): reconnect socket mode after disconnect#27232
fix(slack): reconnect socket mode after disconnect#27232Takhoffman merged 3 commits intoopenclaw:mainfrom
Conversation
Greptile SummaryAdded automatic reconnection logic for Slack socket mode with exponential backoff and bounded retry attempts. The implementation watches for disconnect, error, and
Confidence Score: 3/5
Last reviewed commit: 366de2d |
markshields-tl
left a comment
There was a problem hiding this comment.
Review
This is the most complete of the three PRs targeting the silent socket death bug (#17847). Reviewing against the other two approaches (#24283 fail-fast, #27241 DNS watchdog):
What I like:
- Watches both
disconnectanderrorevents on theSocketModeClient— covers the full failure surface - Reconnect with backoff is the right pattern (vs. fail-fast which relies on an external supervisor that may not exist)
- Bounded retry count prevents infinite loops
- Tests cover the disconnect watcher behavior
Suggestions:
-
Consider also incorporating the staleness watchdog timer from #27241. The key insight there is that sometimes there is NO disconnect event at all — the SDK's internal reconnect silently swallows errors and the socket just goes dead without emitting anything. A 3-minute timer that checks
lastInboundAt(which already exists in the codebase at the provider level) would catch this "silent death" case. -
The
lastInboundAttimestamp is already being tracked (set on every inbound event). A watchdog could periodically check: "have we received ANY inbound event in the last N minutes?" If not, and the socket claims to be connected, force a reconnect. This would catch the failure mode where no disconnect/error event fires at all. -
From production data: we see ~2-6 hour silent gaps with zero log output between the last successful inbound and discovery. There is genuinely no event to hook — the socket just stops delivering.
Bottom line: This PR + the staleness timer from #27241 would be the complete fix. Either way, this is a significant improvement over the current "hope for the best" approach.
— Mort (AI assistant reviewing on behalf of @markshields-tl)
|
Thanks for the review. I agree on the staleness-watchdog gap. I kept this PR scoped to reconnect-on-disconnect/error plus listener-leak fix for safe mergeability. I will implement |
|
Quick status ping on this PR. Current intent remains the same: keep this PR focused on reconnect-on-disconnect/error and listener-leak handling for a safe, narrow merge path. I still agree the If maintainers prefer combining both into this PR now, I can update it accordingly. |
336a4c4 to
e04db96
Compare
🔒 Aisle Security AnalysisWe found 1 potential security issue(s) in this PR:
1. 🟡 Potential sensitive data exposure & log forging via JSON.stringify of Slack socket errors
DescriptionIn
Vulnerable code: function formatUnknownError(error: unknown): string {
if (error instanceof Error) {
return error.message;
}
if (typeof error === "string") {
return error;
}
try {
return JSON.stringify(error);
} catch {
return "unknown error";
}
}
runtime.error?.(
`slack socket mode failed to start ... (${formatUnknownError(err)})`,
);
...
runtime.error?.(
`slack socket disconnected ...${disconnect.error ? ` (${formatUnknownError(disconnect.error)})` : ""}`,
);RecommendationAvoid logging raw/structured error objects from external libraries; log stable, redacted, single-line summaries. Suggested approach:
Example safer formatter: function toSingleLine(s: string): string {
return s.replace(/[\r\n\t]+/g, " ").slice(0, 500);
}
function redactSecrets(s: string): string {
// Slack tokens commonly start with xoxb-, xapp-, xoxp-, etc.
return s.replace(/xox[baprs]-[A-Za-z0-9-]+/g, "<redacted>")
.replace(/xapp-\d-[A-Za-z0-9-]+/g, "<redacted>");
}
function formatUnknownErrorSafe(err: unknown): string {
if (err instanceof Error) {
return redactSecrets(toSingleLine(`${err.name}: ${err.message}`));
}
if (typeof err === "string") {
return redactSecrets(toSingleLine(err));
}
if (err && typeof err === "object") {
const e = err as Record<string, unknown>;
const summary = {
name: typeof e.name === "string" ? e.name : undefined,
code: typeof e.code === "string" ? e.code : undefined,
status: typeof e.status === "number" ? e.status : undefined,
message: typeof e.message === "string" ? e.message : undefined,
};
return redactSecrets(toSingleLine(JSON.stringify(summary)));
}
return "unknown error";
}Then use Analyzed PR: #27232 at commit |
|
Merged via squash in 949faff. Thanks for the contribution. For traceability, I added the required changelog entry before merge and validated the PR with:
Both gate runs passed on the merged head. |
Cherry-pick of upstream 949faff.
* fix(slack): reconnect socket mode after disconnect * fix(slack): avoid orphaned disconnect waiters on start failure * docs(changelog): record slack socket reconnect reliability fix --------- Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
* fix(slack): reconnect socket mode after disconnect * fix(slack): avoid orphaned disconnect waiters on start failure * docs(changelog): record slack socket reconnect reliability fix --------- Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
* fix(slack): reconnect socket mode after disconnect * fix(slack): avoid orphaned disconnect waiters on start failure * docs(changelog): record slack socket reconnect reliability fix --------- Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
* fix(slack): reconnect socket mode after disconnect * fix(slack): avoid orphaned disconnect waiters on start failure * docs(changelog): record slack socket reconnect reliability fix --------- Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
* fix(slack): reconnect socket mode after disconnect * fix(slack): avoid orphaned disconnect waiters on start failure * docs(changelog): record slack socket reconnect reliability fix --------- Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
* fix(slack): reconnect socket mode after disconnect * fix(slack): avoid orphaned disconnect waiters on start failure * docs(changelog): record slack socket reconnect reliability fix --------- Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
* fix(slack): reconnect socket mode after disconnect * fix(slack): avoid orphaned disconnect waiters on start failure * docs(changelog): record slack socket reconnect reliability fix --------- Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
* fix(slack): reconnect socket mode after disconnect * fix(slack): avoid orphaned disconnect waiters on start failure * docs(changelog): record slack socket reconnect reliability fix --------- Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
* fix(slack): reconnect socket mode after disconnect * fix(slack): avoid orphaned disconnect waiters on start failure * docs(changelog): record slack socket reconnect reliability fix --------- Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
* fix(slack): reconnect socket mode after disconnect * fix(slack): avoid orphaned disconnect waiters on start failure * docs(changelog): record slack socket reconnect reliability fix --------- Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
* fix(slack): reconnect socket mode after disconnect * fix(slack): avoid orphaned disconnect waiters on start failure * docs(changelog): record slack socket reconnect reliability fix --------- Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
Cherry-pick of upstream 949faff.
Summary
monitorSlackProviderapp.start()lifecycle with backoffValidation
Closes #27077