feat(slack): persistent thread participation + bounded fallback #58731

soichiyo wants to merge 6 commits into openclaw:main
Conversation
Greptile Summary

This PR adds persistent thread-participation state for Slack's requireMention channels.

Key changes:

Issues found:
Confidence Score: 5/5

Safe to merge; all issues found are P2 (observability, performance, and test-quality concerns) with no correctness or data-integrity impact on the primary happy path. The persistent store, TTL pruning, atomic write, and mention-gating split are all correctly implemented. The in-memory sent-thread-cache remains the hot path and continues to work correctly after the refactor. No P0 or P1 findings — the conversations.replies API misuse is always caught and handled, the `_ambiguousThreadReply` cache gap only affects log verbosity on retries, and the single-slot module cache is a performance concern rather than a correctness issue.

- `extensions/slack/src/monitor/thread-resolution.ts` — the `resolveThreadTsFromReplies` fallback should be validated against the Slack API to confirm it can actually resolve a child-reply ts to its parent thread ts before shipping to production.
```typescript
const response = (await params.client.conversations.replies({
  channel: params.channelId,
  ts: params.messageTs,
  limit: 1,
  inclusive: true,
})) as { messages?: Array<{ ts?: string; thread_ts?: string }> };
const msg = response.messages?.[0];
const threadTs = normalizeThreadTs(msg?.thread_ts);
// Only return if the resolved thread_ts differs from the message ts
// (i.e., this message belongs to a parent thread).
if (threadTs && threadTs !== params.messageTs) {
  return threadTs;
}
```
conversations.replies receives the reply's own ts, not the parent's
The ts parameter for conversations.replies must be the parent/root message timestamp. When this fallback is reached, we know the message is a reply (it has parent_user_id) but we don't know its parent's ts — which is exactly what we're trying to recover. Passing the reply's own ts will cause Slack to return thread_not_found in most cases, making the fallback silently inert.
Since the result is always caught and returns undefined, this doesn't cause crashes, but it does fire an unnecessary Tier-3 API call (conversations.replies is rate-limited at Tier 3) for every message that conversations.history couldn't resolve. In channels where history access is restricted, this could burn rate-limit quota without any benefit.
The only scenario where this could succeed is if Slack's API internally resolves a reply ts to its parent thread — which is not documented behavior. Consider either:
- Documenting this explicitly if it is confirmed to work in testing, or
- Removing the call if tests show it never resolves anything that history didn't
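If option (a) is taken, a minimal sketch of what "documenting the assumption at the call site" could look like follows; the `RepliesClient` type and the helper name are illustrative, not the PR's actual code, and the behavior described in the comments is exactly the undocumented case the review calls out:

```typescript
// A documented-assumption variant of the fallback (option a). Names mirror the snippet
// above; the RepliesClient type and the explanatory comments are illustrative only.
type RepliesClient = {
  conversations: {
    replies: (args: { channel: string; ts: string; limit: number; inclusive: boolean }) =>
      Promise<{ messages?: Array<{ ts?: string; thread_ts?: string }> }>;
  };
};

async function resolveThreadTsFromRepliesSketch(params: {
  client: RepliesClient;
  channelId: string;
  messageTs: string;
}): Promise<string | undefined> {
  try {
    const response = await params.client.conversations.replies({
      channel: params.channelId,
      // NOTE: this is the reply's own ts, not the documented parent timestamp, so
      // thread_not_found is the expected result; keep the call only if testing shows
      // Slack sometimes resolves the parent anyway (undocumented behavior).
      ts: params.messageTs,
      limit: 1,
      inclusive: true,
    });
    const threadTs = response.messages?.[0]?.thread_ts;
    // Only treat the result as a resolution when it points at a different (parent) ts.
    return threadTs && threadTs !== params.messageTs ? threadTs : undefined;
  } catch {
    return undefined; // thread_not_found and rate-limit errors degrade to a plain miss
  }
}
```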
```typescript
it("survives simulated process restart (re-reads from disk)", () => {
  persistThreadParticipation(sessionStorePath, "A1", "C123", "1700000000.000001", "main");

  // Clear the module-level cache by importing fresh — we simulate restart
  // by directly reading from disk in a new call after clearing internal cache.
  // The simplest way: just verify the file exists and has the right content.
  const filePath = deriveParticipationStorePath(sessionStorePath);
  const raw = JSON.parse(fs.readFileSync(filePath, "utf-8"));
  expect(raw["A1:C123:1700000000.000001"]).toBeDefined();
  expect(raw["A1:C123:1700000000.000001"].agentId).toBe("main");
  expect(raw["A1:C123:1700000000.000001"].repliedAt).toBeGreaterThan(0);
});
```
Restart-survival test doesn't actually exercise the cache-miss path
The test comment acknowledges the limitation: the module-level cachedStore / cachedStorePath is not cleared between calls in the same process, so hasPersistedThreadParticipation in this test reads from the in-memory cache, not from disk. The test only verifies that the file was written correctly, not that hasPersistedThreadParticipation re-reads from disk after a restart (cache miss).
A true restart simulation would need to reset the module's internal state — either via vi.resetModules() + dynamic re-import, or by exposing a _resetCacheForTesting() helper. Without that, the test named "survives simulated process restart" is misleading: it validates saveStore but not loadStore's disk-read path after a cold start.
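A minimal sketch of the `vi.resetModules()` approach; the module path, the `sessionStorePath` fixture, and the argument order of `hasPersistedThreadParticipation` are assumptions based on the snippet above, not the PR's actual test:

```typescript
import { expect, it, vi } from "vitest";

// Fixture from the surrounding test file; declared here only so the sketch stands alone.
declare const sessionStorePath: string;

it("survives simulated process restart (re-reads from disk)", async () => {
  const store = await import("./thread-participation-store");
  store.persistThreadParticipation(sessionStorePath, "A1", "C123", "1700000000.000001", "main");

  // Simulate a restart: reset the module registry so cachedStore / cachedStorePath are
  // dropped, then re-import and force hasPersistedThreadParticipation down the
  // loadStore() disk-read path instead of the in-process cache.
  vi.resetModules();
  const fresh = await import("./thread-participation-store");

  expect(
    fresh.hasPersistedThreadParticipation(sessionStorePath, "A1", "C123", "1700000000.000001"),
  ).toBe(true);
});
```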
```typescript
let cachedStore: ParticipationStore | null = null;
let cachedStorePath: string | null = null;
```
Module-level single-path cache is silently inefficient with multiple store paths
cachedStore and cachedStorePath hold exactly one store at a time. If multiple agents (each with a different storePath) process messages concurrently and interleave their persistThreadParticipation / hasPersistedThreadParticipation calls, the cache will alternate between paths and every call will result in a synchronous fs.readFileSync. The Tier-1 in-memory sent-thread-cache remains the hot path so correctness is unaffected, but the "avoid repeated reads within the same event loop tick" comment no longer holds in multi-agent deployments.
A Map<string, ParticipationStore> keyed by filePath would make the cache work correctly for any number of agents without behavioural changes.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6e6985a44a
```typescript
if (!resolved) {
  logVerbose(
    `slack inbound: conversations.history miss for channel=${message.channel} ts=${message.ts}, trying conversations.replies fallback`,
  );
  resolved = await resolveThreadTsFromReplies({
```
Deduplicate replies fallback before caching thread_ts result
When two handlers resolve the same Slack event concurrently (for example message and app_mention with the same channel:ts), both callers await the shared conversations.history promise, then each independently runs conversations.replies and finally calls setCached. This means a later failing fallback can overwrite an earlier successful resolution with null, causing that message key to be treated as unresolved for the cache TTL and also doubling fallback API calls under contention. Keep the fallback+cache write inside the same inflight promise so only one final result wins.
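A minimal sketch of the single-winner shape suggested here; `resolveFromHistory`, `resolveFromReplies`, and `setCached` are hypothetical stand-ins for the real internals:

```typescript
// Hypothetical helper signatures assumed for this sketch.
declare function resolveFromHistory(channel: string, ts: string): Promise<string | undefined>;
declare function resolveFromReplies(channel: string, ts: string): Promise<string | undefined>;
declare function setCached(key: string, value: string | null, now: number): void;

const inflight = new Map<string, Promise<string | null>>();

// Both the replies fallback and the final cache write live inside one shared promise,
// so concurrent handlers for the same channel:ts await a single result instead of each
// running the fallback and racing on setCached.
async function resolveThreadTsOnce(channel: string, ts: string): Promise<string | null> {
  const cacheKey = `${channel}:${ts}`;
  let pending = inflight.get(cacheKey);
  if (!pending) {
    pending = (async () => {
      const resolved =
        (await resolveFromHistory(channel, ts)) ?? (await resolveFromReplies(channel, ts));
      setCached(cacheKey, resolved ?? null, Date.now()); // single final cache write
      return resolved ?? null;
    })().finally(() => inflight.delete(cacheKey));
    inflight.set(cacheKey, pending);
  }
  return pending;
}
```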
Force-pushed 363f8af to 8259995.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8259995e47
```typescript
}

if (shouldLogVerbose()) {
setCached(cacheKey, resolved ?? null, Date.now());
```
Preserve ambiguous reply state in cache
setCached(cacheKey, resolved ?? null, ...) records unresolved lookups as null, but unresolved lookups now return a decorated event with _ambiguousThreadReply: true. When the same Slack event is processed again within TTL (for example concurrent message and app_mention handlers), the cache-hit path treats null as a plain miss and returns the original message, so the second caller loses the ambiguity marker and can fall back to root-message behavior. Store an explicit ambiguous-miss sentinel (or cached decorated result) so all callers get consistent thread-reply semantics.
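A minimal sketch of an explicit cached union instead of `null`; the names and variants are illustrative, not the PR's actual cache types:

```typescript
// Cached outcome variants: a genuine root message stays distinguishable from an
// ambiguous reply whose parent could not be resolved.
type CachedThreadResolution =
  | { kind: "resolved"; threadTs: string }
  | { kind: "ambiguous" }
  | { kind: "root" };

const resolutionCache = new Map<string, { value: CachedThreadResolution; at: number }>();

function setCachedResolution(key: string, value: CachedThreadResolution): void {
  resolutionCache.set(key, { value, at: Date.now() });
}

// On a cache hit of kind "ambiguous", the second handler can re-apply the
// _ambiguousThreadReply decoration instead of falling back to root-message behavior.
function getCachedResolution(key: string): CachedThreadResolution | undefined {
  return resolutionCache.get(key)?.value;
}
```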
martingarramon left a comment
Thanks for the write-up and the test coverage. Core pattern (persistent store behind hot in-memory cache, diagnostic log promotion, bounded fallback) reads soundly. A few concerns, none strictly blocking but worth addressing before merge — three beyond what the bots already flagged.
Bot-flagged risk cluster. Greptile's three P2s (reply-ts-as-parent-ts fallback, single-path module cache, test not exercising disk re-read) and Codex's two P2s (concurrent-handler race overwriting cache, ambiguous sentinel not preserved in cache) collectively suggest the thread-resolution flow hasn't been pressure-tested for concurrency or multi-agent layouts. The Map<string, ParticipationStore> fix from Greptile and the shared-inflight-promise fix from Codex both seem correct and cheap.
Beyond the bots:
- `threadContinuation?` added to `MentionGateParams` / `MentionGateWithBypassParams` is dead code. Both types are `@deprecated`, and the deprecated `resolveMentionGating` at `src/channels/mention-gating.ts:194-206` only reads `implicitMention`, never `threadContinuation`. The new API (`resolveInboundMentionDecision` via `InboundMentionFacts.implicitMentionKinds`) is what `prepare.ts` actually uses — correctly, via `implicitMentionKindWhen("bot_thread_participant", ...)` at the persistent-check site. So the field additions to the deprecated types have no runtime effect. Either remove them or wire them into the deprecated gate for non-Slack callers that may still use it. The field comment claiming it "contributes to effectiveWasMentioned independently" is currently untrue.
- `_ambiguousThreadReply` has zero consumers. The PR introduces the sentinel on `SlackMessageEvent` (`types.ts:43`) and sets it in `thread-resolution.ts` when both history + replies fallbacks fail, with a comment saying "downstream should not treat this as a genuine root message." But grepping the repo finds no code that reads the field. As shipped, an ambiguous reply is functionally indistinguishable from a root message in the downstream path. Is a follow-up PR intended to wire this, or was a consumer missed? If the former, worth saying so explicitly in the body; if the latter, the PR isn't complete yet.
- Write amplification on the hot path. `persistThreadParticipation` is called from `dispatch.ts:793-808` on every reply delivery. Each call rewrites the entire store via sync `fs.writeFileSync` + `renameSync` (`thread-participation-store.ts:57-64`). At `MAX_ENTRIES = 10_000` and ~100 bytes per entry, that's ~1MB rewritten synchronously per reply. On a tenant with high concurrent thread throughput, this blocks the event loop and raises IOPS pressure. Consider: (a) debounced writes batched every N seconds (a sketch follows this list), (b) an append-only log with periodic compaction, or (c) keeping persistence but making `saveStore` async + fire-and-forget so it stays off the ack path. `pruneExpired` also runs on every write — worth lazy-pruning only when size exceeds the cap given the O(n log n) sort cost.
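A minimal sketch of option (a), debounced persistence; it assumes a `getStore(filePath)` accessor for the in-memory store and keeps the temp-file-plus-rename atomic write the current `saveStore` uses:

```typescript
import fs from "node:fs";

const FLUSH_INTERVAL_MS = 5_000;
const dirtyPaths = new Set<string>();
let flushTimer: NodeJS.Timeout | null = null;

// Instead of rewriting the whole store on every reply, mark the path dirty and flush
// all dirty stores at most once per interval, off the per-message delivery path.
function scheduleFlush(filePath: string, getStore: (path: string) => object): void {
  dirtyPaths.add(filePath);
  if (flushTimer) return;
  flushTimer = setTimeout(() => {
    flushTimer = null;
    for (const path of dirtyPaths) {
      const tmp = `${path}.tmp`;
      fs.writeFileSync(tmp, JSON.stringify(getStore(path)));
      fs.renameSync(tmp, path); // atomic replace, as in the current saveStore
    }
    dirtyPaths.clear();
  }, FLUSH_INTERVAL_MS);
  flushTimer.unref?.(); // avoid keeping the process alive only for a pending flush
}
```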
Architectural question on the replies fallback. Greptile's P2 claim that conversations.replies with the reply's own ts returns thread_not_found matches the documented Slack API contract. The guard if (threadTs && threadTs !== params.messageTs) return threadTs; at thread-resolution.ts:155 implies you've observed it occasionally working — is that documented behavior or undocumented? If undocumented, worth flagging in the code comment and in the PR body so it doesn't turn into a Tier-3-quota silent regression when Slack tightens the API. If documented or test-validated, please link the evidence — otherwise the fallback is strictly quota burn for inert lookups.
Minor, non-blocking: loadStore at thread-participation-store.ts:43-51 catches all errors (including JSON parse) and returns {}, which the next saveStore will then persist. If the file is ever briefly corrupt for any reason outside the atomic-rename path (manual edit, backup-restore mishap), the recovery pattern silently wipes historical data. A noisier failure on parse error (fail-open in-memory only, never write back {} as replacement) is safer for durability. Low probability but zero-cost to log loud and skip the overwrite.
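A minimal sketch of that fail-open recovery, assuming a plain JSON store shape; the `writable` flag is a hypothetical guard a later `saveStore` could check before persisting:

```typescript
import fs from "node:fs";

type StoreLoad = { store: Record<string, unknown>; writable: boolean };

function loadStoreSafe(filePath: string): StoreLoad {
  if (!fs.existsSync(filePath)) {
    return { store: {}, writable: true }; // genuinely fresh store, safe to write later
  }
  try {
    return { store: JSON.parse(fs.readFileSync(filePath, "utf-8")), writable: true };
  } catch (err) {
    // Loud failure, in-memory fallback: never persist {} over a file that failed to parse.
    console.error(
      `thread-participation-store: could not parse ${filePath}; running in-memory only`,
      err,
    );
    return { store: {}, writable: false };
  }
}
```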
Happy to re-review once the concurrency + dead-code cleanup lands.
This pull request has been automatically marked as stale due to inactivity.
Thanks for the context here. I swept through the related work, and this is now duplicate or superseded. Close as superseded. The central Slack restart-survival problem has landed on current main through the merged SDK-backed persistence path, while this branch still carries a competing JSON participation store and is no longer the right merge target. So I’m closing this here and keeping the remaining discussion on the canonical linked item.

Review details

Best possible solution: Keep the merged SDK-backed Slack participation persistence on main as the canonical implementation, and handle any future missing-

Do we have a high-confidence way to reproduce the issue? Yes. The original bug is reproducible on the pre-fix path by recording Slack thread participation, restarting or clearing process memory, then processing an unmentioned reply in a requireMention channel.

Is this the best way to solve the issue? No. This PR was useful source work, but the best solution is the merged #75583 path because it uses the current plugin state-store API instead of adding a second custom JSON persistence layer.

Security review: cleared. The PR diff adds local Slack participation persistence, Slack API handling, and tests, but no new dependency, workflow, secret handling, package source, or command-execution surface.

What I checked:
Likely related people:
Codex review notes: model gpt-5.5, reasoning high; reviewed against ad1e14af532f; fix evidence: commit d0ec3d1f09b0, main fix timestamp 2026-05-01T12:10:09Z.
…ase 1)

Thread continuation in Slack channels with requireMention was fragile because participation state lived only in a 24-hour in-memory cache that was lost on process restart. This caused threads to silently drop after gateway updates.

Phase 1 changes:
- Add persistent thread participation store (JSON file alongside session store) so restart no longer loses thread continuation
- Add structured diagnostic logs at mention gating, thread continuation decisions, and participation recording to make silent failures traceable
- Elevate thread_ts resolution failure from verbose-only to always-visible warning since it causes silent degradation
- Pass storePath through PreparedSlackMessage so dispatch can persist participation without re-resolving the path

Note: pre-commit tsgo blocked by pre-existing matrix test errors (5 errors in extensions/matrix/ unrelated to this change).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…on gating (Phase 2)

Phase 2A — thread_ts resolution improvement:
- Add conversations.replies as bounded fallback when conversations.history fails to resolve thread_ts for messages with parent_user_id
- Mark unresolvable messages as _ambiguousThreadReply instead of silently treating them as root messages
- Log ambiguous thread reply status in mention skip reasons

Phase 2B — gating separation:
- Split implicitMention into two distinct signals:
  - implicitMention: bot is the thread parent (root-message signal)
  - threadContinuation: bot has previously replied (participation signal)
- Add threadContinuation as a first-class parameter to mention gating so root-message gating and thread-continuation gating are independently visible in logs and code
- Backward compatible: threadContinuation is optional, existing callers (Teams, Google Chat, etc.) are unaffected

Note: pre-commit tsgo blocked by pre-existing matrix test errors (5 errors in extensions/matrix/ unrelated to this change).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed e57b7e9 to c825064.
I rebased the branch onto current main. Since this PR was closed as superseded before the updated head propagated, I opened a clean follow-up here: #75630.
Summary
- Persistent thread participation: thread continuation in `requireMention` channels was lost on gateway restart because participation state was in-memory only (24h TTL cache). Now writes to a JSON file alongside the session store, surviving restarts.
- Bounded thread_ts fallback: when `conversations.history` fails to resolve `thread_ts` for messages with `parent_user_id`, tries `conversations.replies` as a second attempt before marking the message as ambiguous (instead of silently treating it as a root message).
- Mention-gating split: splits `implicitMention` into `implicitMention` (parent is bot) and `threadContinuation` (bot has previously replied), making root-message gating and thread-continuation gating independently visible and controllable. Backward compatible for other channels.

Test plan
- `thread-participation-store.test.ts`: 9 tests (persist, read, TTL, key isolation, restart survival)
- `prepare.test.ts`: 23 tests pass (no regressions)
- `mention-gating.test.ts`: 5 tests pass
- `message-handler.test.ts`, `sent-thread-cache.test.ts`, `missing-thread-ts.test.ts`: all pass
- Manual: `requireMention=true` channel thread continuation survives `docker compose restart`
- Manual: checked `gateway.log` for thread replies