Skip to content

fix(slack): persist thread participation cache to survive gateway restarts [claude, human developer oversight]#33845

Closed
zeroaltitude wants to merge 22 commits intoopenclaw:mainfrom
zeroaltitude:fix/slack-thread-participation-persist
Closed

fix(slack): persist thread participation cache to survive gateway restarts [claude, human developer oversight]#33845
zeroaltitude wants to merge 22 commits intoopenclaw:mainfrom
zeroaltitude:fix/slack-thread-participation-persist

Conversation

@zeroaltitude
Copy link
Copy Markdown
Contributor

@zeroaltitude zeroaltitude commented Mar 4, 2026

Summary

Slack thread participation is tracked in-memory so the bot can keep replying in a thread after its first response (without requiring repeated @mentions). That cache was lost on every gateway restart, so the bot would silently stop responding to non-mention thread messages until someone re-@mentioned it. This PR persists the Slack thread participation cache to disk so participation survives restarts.

Review focus: src/slack/sent-thread-cache.ts only; file format is JSON {key→timestamp}.

Before / After

Before: restart gateway → thread cache cleared → bot ignores non-mention messages in previously-participated threads
After: restart gateway → cache restored → bot continues participating normally (within TTL)

What changed

src/slack/sent-thread-cache.ts

  • Persist to $STATE_DIR/slack-thread-participation.json
  • Lazy-load on first access (no Slack startup cost)
  • Debounced best-effort writes (≤1 write / 5s; timer unref() so it won't hold the process open)
  • TTL filter on load (expired entries dropped)
  • Missing/corrupt file handled gracefully (starts empty)
  • clearSlackThreadParticipationCache() now persists empty state so "clear" survives restarts (also wipes stale persist files)

src/slack/sent-thread-cache.test.ts

  • Adds persistence + clear/restart tests (round-trip, TTL expiry, missing/corrupt file, clear-before-load wipe)
  • Tests isolated via temp persist path so local runs don't touch real $STATE_DIR

Risk

Low. This cache is non-critical: worst case is the current behavior (one extra @mention to re-establish participation). Persistence is best-effort; failures are swallowed and the bot falls back to in-memory behavior.

Testing

vitest unit coverage for persistence + clear semantics, plus manual restart verification.

AI-Assisted Development 🤖

This PR was developed with AI assistance (Claude/OpenClaw agent "Tank").

  • Degree of testing: Well-tested — 14 unit tests. pnpm check clean. Deployed locally and in daily use for ~4 weeks.
  • Human oversight: All changes reviewed and directed by Eddie Abrams
  • Understanding confirmed: The contributor understands the persistence mechanism and data model of the Slack thread cache

@openclaw-barnacle openclaw-barnacle Bot added channel: slack Channel integration: slack size: S labels Mar 4, 2026
@zeroaltitude zeroaltitude changed the title fix(slack): persist thread participation cache to survive gateway restarts fix(slack): persist thread participation cache to survive gateway restarts [claude, human developer oversight] Mar 4, 2026
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 4, 2026

Greptile Summary

This PR persists the Slack thread participation cache to $STATE_DIR/slack-thread-participation.json so that the bot's auto-reply behaviour in threads survives gateway restarts. The implementation is thorough: lazy hydration with TTL/size filtering, debounced atomic writes (UUID-named temp file + wx flag + rename), 0o600/0o700 file/dir modes, a 4 MB parse guard against large files, and test-only helpers gated behind an environment check. The remaining file changes are Prettier reformats with no functional impact.

Confidence Score: 5/5

This PR is safe to merge — the persistence path is non-critical (fall-through to in-memory behavior on any failure) and the security hardening is comprehensive.

No P0 or P1 findings. The logic is carefully guarded at every layer: bounded file size before parsing, schema validation, future-timestamp rejection, atomic write semantics (CWE-59 mitigation), and correct insertion-order maintenance for the eviction loop. All previous review concerns have been addressed and regression-tested. Formatting changes in other files are purely cosmetic.

No files require special attention.

Reviews (6): Last reviewed commit: "Merge remote-tracking branch 'origin/mai..." | Re-trigger Greptile

Comment thread extensions/slack/src/sent-thread-cache.ts Outdated
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 4, 2026

Additional Comments (1)

src/slack/sent-thread-cache.test.ts
afterEach leaves loaded as true, risking real state-dir reads in tests

The existing test suite's afterEach calls clearSlackThreadParticipationCache(), which clears the in-memory map but does not reset the loaded flag. As a result, the very first test to run in this suite will invoke loadFromDisk() and attempt to read from the real $STATE_DIR/slack-thread-participation.json (the default path, because persistPathOverride is undefined). Subsequent tests in the suite are unaffected because loaded stays true.

If someone runs these tests on a machine that has live Slack thread data in its state directory, the first test picks up that data, making the suite environment-dependent. The persistence suite correctly uses _resetForTests(undefined) which resets loaded and persistPathOverride. Using the same helper here would keep things consistent and safe:

  afterEach(() => {
    _resetForTests(undefined);
    vi.restoreAllMocks();
  });
Prompt To Fix With AI
This is a comment left during a code review.
Path: src/slack/sent-thread-cache.test.ts
Line: 14-17

Comment:
**`afterEach` leaves `loaded` as `true`, risking real state-dir reads in tests**

The existing test suite's `afterEach` calls `clearSlackThreadParticipationCache()`, which clears the in-memory map but does **not** reset the `loaded` flag. As a result, the very first test to run in this suite will invoke `loadFromDisk()` and attempt to read from the real `$STATE_DIR/slack-thread-participation.json` (the default path, because `persistPathOverride` is `undefined`). Subsequent tests in the suite are unaffected because `loaded` stays `true`.

If someone runs these tests on a machine that has live Slack thread data in its state directory, the first test picks up that data, making the suite environment-dependent. The persistence suite correctly uses `_resetForTests(undefined)` which resets `loaded` and `persistPathOverride`. Using the same helper here would keep things consistent and safe:

```suggestion
  afterEach(() => {
    _resetForTests(undefined);
    vi.restoreAllMocks();
  });
```

How can I resolve this? If you propose a fix, please make it concise.

zeroaltitude added a commit to zeroaltitude/openclaw that referenced this pull request Mar 4, 2026
Comment thread extensions/slack/src/sent-thread-cache.test.ts Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0a210d0aa0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/slack/sent-thread-cache.ts Outdated
zeroaltitude added a commit to zeroaltitude/openclaw that referenced this pull request Mar 4, 2026
- clearSlackThreadParticipationCache now always persists empty state,
  even before the cache has been loaded (wipes stale persist files)
- Added 2 tests: clear-then-restart survival, clear-before-load wipes

Addresses greptile + codex-connector review on PR openclaw#33845.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 4, 2026

Confidence Score: 5/5

The only gap from the previous review has been fully addressed. You've added two excellent tests for the clearSlackThreadParticipationCache() persistence fix:

  1. "clear persists empty state so entries do not return after restart" - tests the primary restart scenario
  2. "clear before first load still wipes existing persist file" - tests the edge case where clear happens before any cache access

The test suite is now comprehensive (6 persistence tests total), the implementation is solid, and there's complete regression protection for the clear functionality. No changes needed — this is ready to merge.

@zeroaltitude
Copy link
Copy Markdown
Contributor Author

Confidence Score: 5/5

The only gap from the previous review has been fully addressed. You've added two excellent tests for the clearSlackThreadParticipationCache() persistence fix:

  1. "clear persists empty state so entries do not return after restart" - tests the primary restart scenario
  2. "clear before first load still wipes existing persist file" - tests the edge case where clear happens before any cache access

The test suite is now comprehensive (6 persistence tests total), the implementation is solid, and there's complete regression protection for the clear functionality. No changes needed — this is ready to merge.

@greptile-apps: if you could update your main score comment with this change, it would be somewhat easier for reviewers. Thank you in advance!

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Mar 4, 2026

Additional Comments (1)

src/slack/sent-thread-cache.test.ts, line 55
Unit test silently writes to real $STATE_DIR on clear

clearSlackThreadParticipationCache() now synchronously calls persistToDisk(), and the first describe block never sets a persistPathOverride. This means when this test calls clearSlackThreadParticipationCache() at line 52, it resolves the real $STATE_DIR path and overwrites (or creates) slack-thread-participation.json with {}.

On a developer's machine running pnpm test / vitest, this silently wipes any real Slack thread participation cache that may have existed in their ~/.openclaw/ state dir. In CI it's harmless since there's no state dir.

The other tests in this block are safe — schedulePersist() timers are always cancelled by _resetForTests() in afterEach before they can fire — but persistToDisk() in clearSlackThreadParticipationCache() is immediate and synchronous.

The straightforward fix is to add path isolation to this describe block's afterEach, mirroring the persistence suite:

describe("slack sent-thread-cache", () => {
  let tempDir: string;

  beforeEach(() => {
    tempDir = fs.mkdtempSync(path.join(os.tmpdir(), "slack-thread-cache-test-"));
    _resetForTests(path.join(tempDir, "slack-thread-participation.json"));
  });

  afterEach(() => {
    _resetForTests(undefined);
    vi.restoreAllMocks();
    try { fs.rmSync(tempDir, { recursive: true, force: true }); } catch { /* ignore */ }
  });
  ...

Alternatively, if keeping the two suites cleanly separated is preferred, the "clears all entries" test could be moved into the persistence suite (where setup() is already called), since it now inherently tests a persisted-state concern.

@zeroaltitude
Copy link
Copy Markdown
Contributor Author

Just a note to reviewers that greptile added it's 5/5 score as a secondary comment in this PR.

@zeroaltitude
Copy link
Copy Markdown
Contributor Author

CI/secrets failing due to pnpm-audit-prod on transitive deps (@buape/carbon -> @hono/node-server -> hono); PR does not modify package.json/lockfile. Happy to send a separate dependency bump/override PR if that’s the preferred fix

@zeroaltitude zeroaltitude force-pushed the fix/slack-thread-participation-persist branch from 6084bfb to 78bcd3e Compare March 7, 2026 17:57
zeroaltitude added a commit to zeroaltitude/openclaw that referenced this pull request Mar 7, 2026
zeroaltitude added a commit to zeroaltitude/openclaw that referenced this pull request Mar 7, 2026
- clearSlackThreadParticipationCache now always persists empty state,
  even before the cache has been loaded (wipes stale persist files)
- Added 2 tests: clear-then-restart survival, clear-before-load wipes

Addresses greptile + codex-connector review on PR openclaw#33845.
@zeroaltitude zeroaltitude force-pushed the fix/slack-thread-participation-persist branch from 78bcd3e to 6619dc6 Compare March 7, 2026 19:49
@zeroaltitude
Copy link
Copy Markdown
Contributor Author

@greptile-apps: I'd apprecate you renewing your greptile summary as the 4/5 score there seems to be superceded by your 5/5 score in comments.

@zeroaltitude
Copy link
Copy Markdown
Contributor Author

@joshavant — small channel/gateway fix, would appreciate your eyes on when you have a moment! thanks in advance!

@zeroaltitude zeroaltitude force-pushed the fix/slack-thread-participation-persist branch from 6619dc6 to f313a70 Compare March 8, 2026 19:57
CI for the previous commit (ecf48d4) flagged two real issues:

1. oxfmt: extensions/slack/src/sent-thread-cache.test.ts had a
   trailing-comma / line-break inconsistency. Fixed by running
   `pnpm format`.

2. oxlint: 7 `if-without-braces` errors introduced by my
   defense-in-depth commit. Six in the test file (env-var save/restore
   blocks for the test-helper-guard tests), one in the source
   (insertion-cap eviction loop). Wrapped each single-line `if`
   body in braces.

Note on remaining 7 lint errors visible in CI: those are PRE-EXISTING
on origin/main (qr-runtime missing module, provider-catalog/tools
type narrowing, and OutputRuntimeEnv 'error' type). Unrelated to this
PR — confirmed by stashing my changes and running the same command
on a clean main checkout, where the same 7 errors appear.

Tests: 23/23 still pass.
opus-4-7 independent review found three issues missed by earlier
review rounds:

P1 — _persistPathOverride was still module-local
  Despite the Codex thread reply claiming the override moved into the
  shared persistState global, it was still a module-local 'let'. Tests
  passed by accident: in-memory dedupe shared via persistState.entries,
  but disk side effects from re-imported module copies leaked into
  STATE_DIR instead of the per-test temp dir. Moved into
  PersistState.persistPathOverride on the global singleton.

P1 — CWE-59 symlink-follow on tmpPath (Aisle medium #2, unaddressed)
  tmpPath was ${filePath}.${process.pid}.tmp + default 'w' flag. A
  predictable name combined with a truncating O_CREAT would let an
  attacker who can write to the state dir pre-create a symlink at the
  tmp path pointing to e.g. /etc/passwd, and our write would follow it.
  Now uses randomUUID() for the name AND flag 'wx' (O_CREAT | O_EXCL)
  so any existing entry — file or symlink — fails the write safely.

P2 — hydration ignored MAX_ENTRIES (Aisle medium #1)
  The 4 MB file-size cap permits ~250K entries through, blowing past
  the 5K in-memory MAX_ENTRIES on first load. Now collects valid
  candidates first, sorts DESC by ts, and slices to MAX_ENTRIES before
  inserting into the dedupe cache. Bounds memory regardless of file
  size up to MAX_PERSIST_FILE_BYTES.

Tests: 26/26 pass (3 new tests covering each fix).

Architectural note (per opus review): Land standalone. Don't migrate
to plugin-sdk/persistent-dedupe — its hasRecent reads JSON on every
miss, adding ~2-10ms to every Slack message in a thread the bot
hasn't replied to (the dominant path for hasSlackThreadParticipation
on the message-dispatch hot path). Warm-cache + debounced-write is
correct here. Don't block on openclaw#70626.
gpt-5.5 independent review (running in parallel, no shared context with
opus) confirmed both opus P1s and opus P2-A. It also surfaced three
additional findings worth landing in this PR:

P2-B — Duplicated persist logic between schedulePersist callback and
  _flushPersist. Maintenance trap: any future hardening had to be
  mirrored in two places. Extracted into a shared persistToDisk()
  function called from both. Single source of truth for prune/cap/
  serialize/atomic-write.

P2-C / P3-C — Future timestamps from a crafted persist file (or a
  backwards system-clock jump) survive forever. The previous guard
  was 'now - ts < TTL_MS' which trivially holds for ts > now.
  Added 'ts > 0 && ts <= now' to the hydration filter so:
   * future-near (now + 60s)        -> rejected
   * future-far (MAX_SAFE_INTEGER)  -> rejected
   * zero / negative                -> rejected
   * monotonic-clock-skew tolerance -> small backwards jumps still
     reject the affected entries on next hydrate; corrected on next
     write cycle. Trade-off vs. monotonic-clock complexity is
     intentional.

P2-D — mkdirSync called with no mode flag -> directory inherits umask
  (often 0o755 — group/other readable). Now passes mode 0o700 to
  match the SDK helper and complement the 0o600 file mode. Multi-tenant
  hosts can no longer enumerate or symlink-squat the persist directory.

P3-B — Iteration-order inversion after over-capacity slice. When >
  MAX_ENTRIES candidates were sorted DESC + sliced, the resulting
  Map insertion order was ts-DESC, but the eviction loop in
  recordSlackThreadParticipation assumes Map insertion order = ts-ASC
  (oldest first). Without this fix, the first post-hydration eviction
  drops the NEWEST entry instead of the oldest. Added an ASC re-sort
  on both hydration and serialize paths so insertion order always
  matches recency.

Tests: 29/29 pass (4 new tests covering each finding).

P3-A (TOCTOU stat->read race) deferred to follow-up: low-impact given
the 4 MB file cap already bounds the read, and a fix requires
fs.openSync + fstatSync + readSync changes that don't align with the
current stat-then-read structure. Worth its own PR.

Per gpt-5.5 architectural call: keep this PR standalone; converge
with plugin-sdk/persistent-keyed-store (openclaw#70626) as a follow-up after
this lands. The sync->async migration is straightforward (3 call sites,
all in async contexts) but should be coordinated with maintainers.
…icipation-persist

# Conflicts:
#	extensions/slack/api.ts
#	extensions/slack/src/sent-thread-cache.ts
@openclaw-barnacle openclaw-barnacle Bot added app: web-ui App: web-ui gateway Gateway runtime channel: feishu Channel integration: feishu labels Apr 27, 2026
@zeroaltitude
Copy link
Copy Markdown
Contributor Author

@greptile-apps: can I get a review?

…icipation-persist

# Conflicts:
#	extensions/slack/src/sent-thread-cache.test.ts
@openclaw-barnacle openclaw-barnacle Bot added the commands Command implementations label Apr 28, 2026
@zeroaltitude zeroaltitude force-pushed the fix/slack-thread-participation-persist branch from c322244 to 1c99594 Compare April 29, 2026 18:15
Revert files unrelated to Slack thread-participation persistence back to
origin/main:
- extensions/feishu/src/outbound.{ts,test.ts}: prettier quote/wrap noise
- extensions/slack/src/monitor/context.ts: import order only
- src/commands/agent/session.test.ts: prettier line wrap
- src/gateway/server-chat.stream-text-merge.test.ts: prettier line wrap
- src/gateway/server/health-state.test.ts: prettier line wrap
- ui/src/ui/app-channels.test.ts: prettier line wrap

In-scope diff vs origin/main is now:
- extensions/slack/src/sent-thread-cache.{ts,test.ts}: the persistence fix
- extensions/slack/api.ts: 4-line security comment about explicit named
  exports keeping _flushPersist/_resetForTests test-only
@zeroaltitude zeroaltitude force-pushed the fix/slack-thread-participation-persist branch from 1c99594 to b76b013 Compare April 29, 2026 18:16
@openclaw-barnacle openclaw-barnacle Bot removed app: web-ui App: web-ui gateway Gateway runtime commands Command implementations channel: feishu Channel integration: feishu labels Apr 29, 2026
…icipation-persist

# Conflicts:
#	extensions/slack/src/sent-thread-cache.test.ts
#	extensions/slack/src/sent-thread-cache.ts
@openclaw-barnacle openclaw-barnacle Bot added channel: line Channel integration: line commands Command implementations agents Agent runtime and tooling size: XS and removed size: L labels May 1, 2026
@clawsweeper clawsweeper Bot closed this May 1, 2026
@zeroaltitude zeroaltitude deleted the fix/slack-thread-participation-persist branch May 1, 2026 21:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agents Agent runtime and tooling channel: line Channel integration: line channel: slack Channel integration: slack commands Command implementations size: XS

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant