Skip to content

fix(feishu): persist dedup cache across gateway restarts via warmup#31605

Merged
Takhoffman merged 2 commits intoopenclaw:mainfrom
Sid-Qin:fix/feishu-dedup-persist-31541
Mar 2, 2026
Merged

fix(feishu): persist dedup cache across gateway restarts via warmup#31605
Takhoffman merged 2 commits intoopenclaw:mainfrom
Sid-Qin:fix/feishu-dedup-persist-31541

Conversation

@Sid-Qin
Copy link
Contributor

@Sid-Qin Sid-Qin commented Mar 2, 2026

Summary

  • Problem: Feishu uses at-least-once delivery, re-pushing recent unacknowledged messages on gateway reconnect. The persistent dedup writes message IDs to disk but never loads that state back into memory on startup, so restarts cause duplicate message processing.
  • Why it matters: Users see their last message re-processed (and re-replied to) every time the gateway restarts.
  • What changed:
    1. Added warmup(namespace) method to PersistentDedupe that pre-reads the disk file into the in-memory cache.
    2. On disk write failure, the key is now recorded in the in-memory cache so same-process retries are caught.
    3. Added warmupDedupFromDisk() export in Feishu dedup module.
    4. Called warmup in monitorSingleAccount before registering event handlers.
  • What did NOT change: Dedup file format, TTL, max entries, disk error fallback semantics (still returns true to avoid blocking on unrecoverable disk failure).

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • After gateway restart, previously processed Feishu messages are reliably deduplicated and not re-processed.
  • Startup logs show feishu[<accountId>]: dedup warmup loaded N entries from disk when entries are loaded.

Security Impact (required)

  • New permissions/capabilities? No
  • Secrets/tokens handling changed? No
  • New/changed network calls? No
  • Command/tool execution surface changed? No
  • Data access scope changed? No — reads existing dedup files from ~/.openclaw/feishu/dedup/

Repro + Verification

Environment

  • OS: macOS / Linux
  • Runtime: Node.js 22+
  • Channel: Feishu DM

Steps

  1. Chat with agent via Feishu DM
  2. Restart gateway (openclaw gateway restart)
  3. Observe whether the last message is re-processed

Expected

  • After restart, last message is deduplicated (skipped)

Actual

  • Before: Last message re-processed as new
  • After: Warmup loads dedup entries, message skipped

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)
npx vitest run src/plugin-sdk/persistent-dedupe.test.ts
# 6 tests pass (3 new: warmup loads entries, warmup empty file, warmup skips expired)

Human Verification (required)

  • Verified scenarios: warmup with existing disk file, warmup with no file, warmup with expired entries, disk error memory fallback
  • Edge cases checked: empty namespace, concurrent checkAndRecord during warmup, TTL expiration
  • What you did not verify: Live Feishu gateway restart (unit tests only)

Compatibility / Migration

  • Backward compatible? Yes — additive; warmup is a new call that safely reads existing files
  • Config/env changes? No
  • Migration needed? No

Failure Recovery (if this breaks)

  • How to disable: Remove warmupDedupFromDisk call in monitor.account.ts
  • Files to restore: src/plugin-sdk/persistent-dedupe.ts, extensions/feishu/src/dedup.ts, extensions/feishu/src/monitor.account.ts

Risks and Mitigations

  • Risk: Large dedup files could slow startup
    • Mitigation: Max 10,000 entries per file, expired entries skipped; warmup is async and non-blocking
  • Risk: Warmup reads stale data and blocks legitimate re-sends
    • Mitigation: TTL (24h) naturally expires old entries; same behavior as disk-based persistent check

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 2, 2026

Greptile Summary

This PR fixes duplicate Feishu message processing after gateway restarts by adding a warmup() method to PersistentDedupe that pre-loads the disk dedup cache into memory before event handlers are registered. The core approach is sound — timestamps from disk are preserved correctly so TTL continues counting from original receipt time, scoping is consistent with checkAndRecord, and warmup is awaited before any events can arrive. It also fixes a secondary issue where disk-write failures previously lost the key from the in-memory cache.

Key issue found:

  • The onError / error-logging callback wired through warmupwarmupDedupFromDiskmonitorSingleAccount is dead code. readJsonFileWithFallback catches all exceptions internally (including non-ENOENT errors like permission denied or a corrupt file) and returns the fallback without re-throwing, so the catch block in warmup can never execute. This means a real disk read failure at startup would silently return 0, re-exposing the duplicate-message bug this PR aims to fix, with no diagnostic output. The onError parameter and its wiring should either be removed (if silent failure is intentional) or the error-surface of readJsonFileWithFallback should be revised to re-throw non-ENOENT errors.

Confidence Score: 3/5

  • Safe to merge if the dead onError path is an acceptable trade-off; needs attention before merging if reliable error surfacing on startup is required.
  • The warmup logic itself (TTL handling, scoping, insertion order, call placement) is correct and well-tested. The score is lowered because the onError callback is silently unreachable: readJsonFileWithFallback absorbs all errors, so a real disk failure at startup would produce no log output and return 0, re-exposing the duplicate-message bug. This is a functional gap that undermines the PR's stated goal of reliable deduplication across restarts.
  • src/plugin-sdk/persistent-dedupe.ts — the warmup function's catch block and onError parameter are dead code due to readJsonFileWithFallback's error-swallowing behavior.

Last reviewed commit: b400af4

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +136 to +156
async function warmup(namespace = "global", onError?: (error: unknown) => void): Promise<number> {
const filePath = options.resolveFilePath(namespace);
const now = Date.now();
try {
const { value } = await readJsonFileWithFallback<PersistentDedupeData>(filePath, {});
const data = sanitizeData(value);
let loaded = 0;
for (const [key, ts] of Object.entries(data)) {
if (ttlMs > 0 && now - ts >= ttlMs) {
continue;
}
const scopedKey = `${namespace}:${key}`;
memory.check(scopedKey, ts);
loaded++;
}
return loaded;
} catch (error) {
onError?.(error);
return 0;
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onError callback is unreachable dead code

readJsonFileWithFallback catches every error internally — including non-ENOENT cases like permission denied or a corrupt file — and returns the fallback {} without re-throwing. This means the catch block in warmup can never execute, so onError is silently never called.

Looking at json-store.ts:

  • ENOENT → returns fallback, no throw
  • Any other error → also returns fallback, no throw

As a result:

  • The error logging callback wired up in warmupDedupFromDisk (dedup.ts) will never fire.
  • If the dedup file is unreadable at startup (e.g. a permission issue), warmup silently returns 0 with no diagnostic, and messages that should be deduplicated will be re-processed — the exact problem this PR aims to fix.

Two options to address this:

Option A — Use the exists flag returned by readJsonFileWithFallback to distinguish "file absent" from a real read failure, and surface the latter via onError. Since readJsonFileWithFallback already swallows errors, you would need a prior existence probe (e.g. fs.access) or refactor readJsonFileWithFallback to re-throw non-ENOENT errors.

Option B — If disk errors at warmup time are intentionally silent, remove the onError parameter and all associated wiring to avoid a misleading API surface (callers reasonably expect the callback to fire on failure).

Prompt To Fix With AI
This is a comment left during a code review.
Path: src/plugin-sdk/persistent-dedupe.ts
Line: 136-156

Comment:
**`onError` callback is unreachable dead code**

`readJsonFileWithFallback` catches every error internally — including non-ENOENT cases like permission denied or a corrupt file — and returns the fallback `{}` without re-throwing. This means the `catch` block in `warmup` can **never** execute, so `onError` is silently never called.

Looking at `json-store.ts`:
- `ENOENT` → returns fallback, no throw
- Any other error → also returns fallback, no throw

As a result:
- The error logging callback wired up in `warmupDedupFromDisk` (`dedup.ts`) will never fire.
- If the dedup file is unreadable at startup (e.g. a permission issue), warmup silently returns 0 with no diagnostic, and messages that should be deduplicated will be re-processed — the exact problem this PR aims to fix.

Two options to address this:

**Option A** — Use the `exists` flag returned by `readJsonFileWithFallback` to distinguish "file absent" from a real read failure, and surface the latter via `onError`. Since `readJsonFileWithFallback` already swallows errors, you would need a prior existence probe (e.g. `fs.access`) or refactor `readJsonFileWithFallback` to re-throw non-ENOENT errors.

**Option B** — If disk errors at warmup time are intentionally silent, remove the `onError` parameter and all associated wiring to avoid a misleading API surface (callers reasonably expect the callback to fire on failure).

How can I resolve this? If you propose a fix, please make it concise.

SidQin-cyber and others added 2 commits March 2, 2026 17:25
Feishu uses at-least-once delivery, re-pushing recent messages on
gateway reconnect. The persistent dedup writes to disk but never
loads that state back on startup, so restarts cause duplicate
processing until the persistent check re-reads the file.

Add a warmup() method to PersistentDedupe that pre-reads the disk
file into the in-memory cache on startup. Also record keys in memory
when disk writes fail, so same-process retries are caught.

Call warmupDedupFromDisk() in monitorSingleAccount before registering
event handlers to pre-seed the dedup cache.

Closes openclaw#31541

Made-with: Cursor
@Takhoffman Takhoffman force-pushed the fix/feishu-dedup-persist-31541 branch from b400af4 to bb95a66 Compare March 2, 2026 23:30
@Takhoffman Takhoffman merged commit 481da21 into openclaw:main Mar 2, 2026
9 checks passed
@Takhoffman
Copy link
Contributor

PR #31605 - fix(feishu): persist dedup cache across gateway restarts via warmup (#31605)

Merged after verification.

  • Merge commit: 481da21
  • Verified: pnpm install --frozen-lockfile, pnpm build, pnpm check, pnpm test:macmini (with unrelated baseline failure in src/config/config.legacy-config-detection.rejects-routing-allowfrom.test.ts)
  • Autoland updates:
    M\tCHANGELOG.md
  • Rationale:
    Rebased onto latest main and preserved dedup warmup behavior while integrating with previously landed inbound debounce and per-chat queue changes.
  • Changelog: CHANGELOG.md updated=true required=true opt_out=false

dawi369 pushed a commit to dawi369/davis that referenced this pull request Mar 3, 2026
…penclaw#31605) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm build
- pnpm check
- pnpm test:macmini (fails on unrelated baseline test: src/config/config.legacy-config-detection.rejects-routing-allowfrom.test.ts)

Co-authored-by: Sid-Qin <201593046+Sid-Qin@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
OWALabuy pushed a commit to kcinzgg/openclaw that referenced this pull request Mar 4, 2026
…penclaw#31605) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm build
- pnpm check
- pnpm test:macmini (fails on unrelated baseline test: src/config/config.legacy-config-detection.rejects-routing-allowfrom.test.ts)

Co-authored-by: Sid-Qin <201593046+Sid-Qin@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
zooqueen pushed a commit to hanzoai/bot that referenced this pull request Mar 6, 2026
…penclaw#31605) thanks @Sid-Qin

Verified:
- pnpm install --frozen-lockfile
- pnpm build
- pnpm check
- pnpm test:macmini (fails on unrelated baseline test: src/config/config.legacy-config-detection.rejects-routing-allowfrom.test.ts)

Co-authored-by: Sid-Qin <201593046+Sid-Qin@users.noreply.github.com>
Co-authored-by: Tak Hoffman <781889+Takhoffman@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

channel: feishu Channel integration: feishu size: S

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feishu] Duplicate message delivery after gateway restart

2 participants