fix(infra): lazy prune in DedupeCache to resolve Discord 33-122s message delays #52762

Open
Erprabhat8423 wants to merge 2 commits into openclaw:main from Erprabhat8423:fix/discord-message-listener-perf

Conversation

@Erprabhat8423

Fixes #4453

Root Cause

The dedupe fix (#51950) added recentInboundMessages.check() to the
Discord message handler. check() called prune() on every message,
which did a full linear scan of up to 5000 cache entries each time.

On active servers these scans blocked the event loop for tens of
milliseconds per message, and that time accumulated into the 33-122s
delays reported in the issue.

Changes

  • src/infra/dedupe.ts — only prune when cache.size > maxSize (see the
    sketch after this list)
  • src/infra/map-size.ts — fixed corrupted duplicate code and rewrote
    pruneMapToMaxSize to delete in a single forward pass instead of
    repeatedly calling map.keys().next() in a while loop
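
For orientation, a minimal sketch of the lazy-prune shape, using names
that appear in this thread (createDedupeCache, check, prune); the
internal structure is an assumption, not the actual src/infra/dedupe.ts
source:

interface Entry {
  expiresAt: number;
}

// Sketch only: the real factory has more surface (touch, clear, claims).
function createDedupeCache(ttlMs: number, maxSize: number) {
  const cache = new Map<string, Entry>();

  // Full O(n) pass: drop TTL-expired entries, then trim the oldest
  // (insertion-ordered) keys until the cache fits maxSize again.
  function prune(now: number): void {
    for (const [key, entry] of cache) {
      if (entry.expiresAt <= now) cache.delete(key);
    }
    for (const key of cache.keys()) {
      if (cache.size <= maxSize) break;
      cache.delete(key);
    }
  }

  function check(key: string, now = Date.now()): boolean {
    const entry = cache.get(key);
    if (entry !== undefined && entry.expiresAt > now) {
      return true; // duplicate within the TTL window
    }
    // Delete-then-set moves the key to the Map's tail, so a fresh key
    // is never the first candidate for size-based pruning.
    cache.delete(key);
    cache.set(key, { expiresAt: now + ttlMs });
    // The fix: prune only once the cache outgrows maxSize, instead of
    // scanning up to 5000 entries on every message.
    if (cache.size > maxSize) {
      prune(now);
    }
    return false;
  }

  return { check };
}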

Testing

  • src/infra/dedupe.test.ts — passing (5 tests)
  • src/infra/map-size.test.ts — passing (2 tests)
  • extensions/discord/src/monitor/message-handler.queue.test.ts — passing (11 tests)

@greptile-apps
Contributor

greptile-apps Bot commented Mar 23, 2026

Greptile Summary

This PR fixes a real performance regression introduced by #51950: calling prune() on every check() caused a full O(n) linear scan over up to 5,000 cache entries, blocking the Node.js event loop and producing the 33–122 s message delays on active Discord servers.

Changes:

  • src/infra/dedupe.ts: prune() is now called lazily — only when cache.size > maxSize — eliminating the per-message O(n) scan. The newly touched key is always placed at the tail of the Map, so it is safe from the subsequent pruneMapToMaxSize call.
  • src/infra/map-size.ts: The while loop is replaced with a single forward pass (sketched below) that pre-computes deleteCount = map.size - limit and breaks after exactly that many deletions. Deleting already-yielded keys during for…of iteration is well-defined per the ECMAScript spec.
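
The rewritten helper, reconstructed from the diff quoted later in this
thread (the Math.floor normalization and the early return are assumptions
about the surrounding lines):

export function pruneMapToMaxSize<K, V>(map: Map<K, V>, maxSize: number): void {
  const limit = Math.floor(maxSize);
  if (map.size <= limit) return;
  // Map iteration follows insertion order, so the first deleteCount
  // keys are the oldest entries.
  const deleteCount = map.size - limit;
  let i = 0;
  for (const key of map.keys()) {
    if (i++ >= deleteCount) break;
    map.delete(key);
  }
}

Note the caveat flagged below: with limit = NaN, map.size <= limit is
false (the guard does not return) and i++ >= deleteCount is false on
every iteration (the break never fires), so the loop deletes every entry.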

Minor behavioral trade-off: TTL-expired entries are no longer eagerly removed on every call; they linger until the cache grows beyond maxSize. For the dedupe use-case this is fine — individual lookups still perform per-key TTL checks in hasUnexpired, and a full prune fires whenever the cache tips over the size limit. Maximum retained entry count remains bounded at maxSize + 1, since check() inserts the new key before pruning and the size-gated prune fires as soon as the cache exceeds maxSize.
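
The per-key TTL test referred to above might look like the following;
hasUnexpired is named in this summary, but its body here is an assumption:

function hasUnexpired(
  cache: Map<string, { expiresAt: number }>,
  key: string,
  now: number,
): boolean {
  // A stale entry counts as absent even if lazy pruning has not
  // physically removed it from the Map yet.
  const entry = cache.get(key);
  return entry !== undefined && entry.expiresAt > now;
}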

The fix is well-tested (5 dedupe tests + 2 map-size tests) and logically sound.

Confidence Score: 5/5

  • Safe to merge — targeted, correct fix with no regressions and good test coverage.
  • The root cause is well understood, the fix is minimal and scoped entirely to the hot path, the Map-mutation-during-iteration pattern is spec-compliant, and the intentional TTL trade-off is acceptable for a bounded dedupe cache. All edge cases (maxSize=0, fractional maxSize, duplicate-key promotion, explicit clear) are covered by tests.
  • No files require special attention.

Reviews (1): Last reviewed commit: "fix(infra): lazy prune in DedupeCache to..."


@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6dd2088bb5

Comment thread src/infra/dedupe.ts
Comment on lines +60 to +61
if (cache.size > maxSize) {
prune(now);

P1: Avoid full-cache prune once the dedupe window saturates

check() still calls the full prune(now) path whenever cache.size > maxSize, so this only helps until the live working set reaches the cap. In the Discord handler the window is ttlMs=5*60_000 with maxSize=5000 (extensions/discord/src/monitor/message-handler.ts:34-35), which means any account receiving more than about 17 distinct inbound messages/sec will keep the cache saturated; after that, every new message still pays the same O(n) scan through ~5,000 entries, so the delay regression this patch is meant to fix will continue on the busiest servers.
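
For reference, the 17/sec figure follows from the window parameters: with
maxSize = 5000 and a 300 s TTL, entries can age out at most 5000 per 300 s,
so a sustained unique-message rate above 5000 / 300 ≈ 16.7 messages/s keeps
the cache pinned at the cap.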

Author

Thanks for the detailed review! The saturated-cache edge case is a valid follow-up optimization. For now this fix addresses the core regression for the majority of servers. Happy to open a separate issue for the high-throughput case if that's useful.

@chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c7de07837c

Comment thread src/infra/map-size.ts
Comment on lines +10 to +14
const deleteCount = map.size - limit;
let i = 0;
for (const key of map.keys()) {
if (i++ >= deleteCount) break;
map.delete(key);

P2: Preserve NaN/invalid-limit behavior in pruneMapToMaxSize

When maxSize is NaN at runtime, this loop now deletes every entry instead of leaving the map untouched. That matters because several in-repo callers still normalize with Math.max(..., Math.floor(...)), which also yields NaN for malformed numeric input (for example src/plugin-sdk/webhook-request-guards.ts:95-115 and src/plugin-sdk/webhook-memory-guards.ts:60-97). After this change, a limiter created with an invalid numeric config immediately forgets its active keys, so duplicate in-flight requests or rate-limited keys are no longer tracked at all; before this commit, pruneMapToMaxSize(..., NaN) was effectively a no-op and those guards kept working.
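
A small illustration of that failure mode, assuming callers normalize
limits roughly as described; safePrune is a hypothetical repair sketch,
not the PR's code:

// Malformed numeric config can survive the usual normalization:
const maxTrackedKeys = Math.max(1, Math.floor(Number("oops"))); // NaN
// Old helper: while (map.size > NaN) is false, so pruning is a no-op.
// New helper: deleteCount = map.size - NaN = NaN, and i++ >= NaN is
// false for every i, so the break never fires and every key is deleted.

function safePrune<K, V>(map: Map<K, V>, maxSize: number): void {
  const limit = Math.floor(maxSize);
  // Treat non-finite limits as "do not prune", preserving the old
  // no-op behavior for NaN instead of clearing the map.
  if (!Number.isFinite(limit) || map.size <= limit) return;
  let remaining = map.size - limit;
  for (const key of map.keys()) {
    if (remaining-- <= 0) break;
    map.delete(key);
  }
}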

@openclaw-barnacle

This pull request has been automatically marked as stale due to inactivity.
Please add updates or it will be closed.

@openclaw-barnacle Bot added the "stale" label (Marked as stale due to inactivity) Apr 29, 2026
@clawsweeper
Contributor

clawsweeper Bot commented Apr 29, 2026

Codex review: needs changes before merge.

Summary
The branch changes createDedupeCache().check() to prune only when the cache exceeds maxSize and rewrites pruneMapToMaxSize() to delete old entries with a counted forward pass.

Reproducibility: yes. Source inspection plus read-only Node simulations reproduce the PR defects: once a 5000-entry cache saturates, the size-gated prune still rescans the full cache on every new key (millions of entries cumulatively), and the map-size rewrite deletes all entries when the limit is NaN.

Next step before merge
A replacement or revision is narrow enough for one repair PR: fix the shared pruning algorithm, preserve invalid-limit semantics, add focused tests, and include the changelog entry.

Security
Needs attention: The diff does not change CI, dependencies, secrets, or package resolution, but the shared map-pruning rewrite can weaken webhook guard state for invalid numeric limits.

Review findings

  • [P1] Avoid full-cache prune after saturation — src/infra/dedupe.ts:57-59
  • [P2] Preserve invalid-limit behavior in map pruning — src/infra/map-size.ts:10-14
Review details

Best possible solution:

Revise or replace this with a bounded pruning strategy that avoids saturated-cache full scans, preserves or explicitly validates invalid numeric limit handling, adds focused helper and Discord replay tests, and includes the required changelog entry.

Do we have a high-confidence way to reproduce the issue?

Yes. Source inspection plus read-only Node simulations reproduce the PR defects: once a 5000-entry cache saturates, the size-gated prune still rescans the full cache on every new key (millions of entries cumulatively), and the map-size rewrite deletes all entries when the limit is NaN.

Is this the best way to solve the issue?

No. The direction is useful, but this patch is not the best fix because it delays the O(n) scan until saturation and changes shared SDK helper semantics instead of making pruning bounded and stable.

Full review comments:

  • [P1] Avoid full-cache prune after saturation — src/infra/dedupe.ts:57-59
    check() still calls the full prune(now) path whenever cache.size > maxSize. With Discord's five-minute, 5000-entry replay guard, a saturated unique-message stream still scans about the whole cache on every new key, so the targeted delay can continue on busy servers.
    Confidence: 0.91
  • [P2] Preserve invalid-limit behavior in map pruning — src/infra/map-size.ts:10-14
    When maxSize is NaN, deleteCount becomes NaN and this loop never breaks, deleting every entry. Current behavior is effectively a no-op for NaN, and SDK guard callers can pass NaN after numeric normalization, so active guard state can be cleared.
    Confidence: 0.88

Overall correctness: patch is incorrect
Overall confidence: 0.9

Security concerns:

  • [medium] Preserve webhook guard state for invalid limits — src/infra/map-size.ts:10
    The counted loop deletes every tracked entry when maxSize is NaN; webhook in-flight and rate-limit helpers can normalize malformed maxTrackedKeys input to NaN before pruning, which can clear active request or rate-limit tracking.
    Confidence: 0.84

Acceptance criteria:

  • pnpm test src/infra/dedupe.test.ts src/infra/map-size.test.ts extensions/discord/src/monitor/message-handler.queue.test.ts
  • pnpm test src/plugin-sdk/webhook-request-guards.test.ts src/plugin-sdk/webhook-memory-guards.test.ts
  • pnpm check:changed in Testbox before merge because shared plugin SDK/runtime helpers are affected

Likely related people:

  • Takhoffman: Merged PR fix(discord): dedupe inbound message deliveries #51950 added the Discord inbound replay guard and queue coverage that put createClaimableDedupe() on the Discord inbound message path. (role: introduced related Discord inbound dedupe behavior; confidence: high; commits: 0e825b225f5a; files: extensions/discord/src/monitor/message-handler.ts, extensions/discord/src/monitor/inbound-dedupe.ts, extensions/discord/src/monitor/message-handler.queue.test.ts)
  • steipete: The PR timeline shows steipete was mentioned/subscribed after the stale marker was removed, so this shared Discord/SDK helper change likely needs maintainer routing through him if replacement work proceeds. (role: recent maintainer routing signal; confidence: low; files: src/infra/dedupe.ts, src/infra/map-size.ts, src/plugin-sdk/webhook-request-guards.ts)

Codex review notes: model gpt-5.5, reasoning high; reviewed against a3f6f24b79a5.

@clawsweeper
Contributor

clawsweeper Bot commented Apr 29, 2026

Codex review: found issues before merge.

What this changes:

The PR changes createDedupeCache().check() to prune only after the cache exceeds maxSize and rewrites pruneMapToMaxSize() to delete old entries with a counted forward pass.

Maintainer follow-up before merge:

This is a small but shared public-SDK/helper change with unresolved functional and security-relevant semantic findings, so maintainers should decide whether to ask for a revision, open a replacement, or close only after replacement work exists.

Review findings:

  • [P1] Avoid full-cache prune once the cache saturates — src/infra/dedupe.ts:57-59
  • [P2] Preserve invalid-limit behavior in map pruning — src/infra/map-size.ts:10-14
Review details

Best possible solution:

Land or replace this with a focused shared-helper fix that removes per-message full scans in both normal and saturated cache states, preserves or explicitly validates invalid numeric limit handling, keeps public SDK helper semantics stable, and proves the helper plus Discord replay path with targeted tests and changed-gate validation.

Full review comments:

  • [P1] Avoid full-cache prune once the cache saturates — src/infra/dedupe.ts:57-59
    This still calls the full prune(now) path whenever cache.size > maxSize. With Discord's 5-minute, 5000-entry replay guard, a server above roughly 17 unique inbound messages per second keeps the cache saturated, so every new message still scans about 5000 entries and the delay this PR targets can continue.
    Confidence: 0.91
  • [P2] Preserve invalid-limit behavior in map pruning — src/infra/map-size.ts:10-14
    When maxSize is NaN, deleteCount becomes NaN and the loop never breaks, deleting every entry. Current behavior is effectively a no-op for NaN; several SDK guard callers can pass NaN after Math.max(..., Math.floor(...)), so this can make in-flight or rate-limit guards forget active keys.
    Confidence: 0.88

Overall correctness: patch is incorrect
Overall confidence: 0.89

Acceptance criteria:

  • pnpm test src/infra/dedupe.test.ts src/infra/map-size.test.ts extensions/discord/src/monitor/message-handler.queue.test.ts
  • pnpm test src/plugin-sdk/webhook-request-guards.test.ts src/plugin-sdk/webhook-memory-guards.test.ts
  • pnpm check:changed in Testbox before merge because shared plugin SDK/runtime helpers are affected

What I checked:

  • current-main-eager-prune: createDedupeCache().check() still records a missed key with touch(key, now) and immediately calls prune(now), so current main has not implemented the PR's lazy guard. (src/infra/dedupe.ts:67, e46dccb35374)
  • discord-uses-shared-dedupe: Discord's inbound replay guard uses createClaimableDedupe() with a 5-minute TTL and memoryMaxSize of 5000, and that helper stores committed claims through createDedupeCache(). (extensions/discord/src/monitor/inbound-dedupe.ts:5, e46dccb35374)
  • public-sdk-surface: The affected helpers are public plugin SDK surfaces: dedupe-runtime, collection-runtime, and persistent-dedupe are package exports and documented SDK entrypoints. (package.json:292, e46dccb35374)
  • map-size-current-semantics: Current pruneMapToMaxSize() floors the limit and uses while (map.size > limit), which leaves maps untouched for runtime NaN limits because comparisons with NaN are false. (src/infra/map-size.ts:1, e46dccb35374)
  • guard-callers-invalid-limit-risk: Webhook in-flight, fixed-window, and bounded-counter guards normalize maxTrackedKeys with Math.max(..., Math.floor(...)) and then call pruneMapToMaxSize(), so malformed numeric input can propagate as NaN into the shared helper. (src/plugin-sdk/webhook-request-guards.ts:104, e46dccb35374)
  • latest-release-still-unfixed: The latest release tag v2026.4.26 still calls prune(now) on every dedupe miss and still uses the old map-size loop, so the PR's intended behavior is not shipped. (src/infra/dedupe.ts:67, be8c24633aaa)

Likely related people:

  • Takhoffman: The provided merged PR context for fix(discord): dedupe inbound message deliveries #51950 shows Takhoffman added the Discord inbound replay guard and queue coverage that this PR is trying to optimize. (role: introduced related Discord inbound dedupe behavior; confidence: high; commits: 0e825b225f5a; files: extensions/discord/src/monitor/message-handler.ts, extensions/discord/src/monitor/inbound-dedupe.ts, extensions/discord/src/monitor/message-handler.queue.test.ts)
  • steipete: The local grafted checkout attributes the current snapshot for the affected files to Peter Steinberger's current-main commit, and the latest release commit is also authored by Peter Steinberger. (role: recent maintainer and release-history owner; confidence: medium; commits: 34d11d57579d, be8c24633aaa; files: src/infra/dedupe.ts, src/infra/map-size.ts, extensions/discord/src/monitor/inbound-dedupe.ts)

Remaining risk / open question:

  • The proposed lazy guard still runs the full TTL and size prune path whenever the cache is saturated, so high-throughput Discord servers can keep paying the O(n) scan the PR is meant to remove.
  • The map-size rewrite changes runtime NaN behavior from no-op to deleting every entry, which can weaken webhook in-flight, rate-limit, and anomaly guard tracking if malformed numeric config reaches those helpers.
  • No tests were run because this was a read-only cleanup review; validation should be done before merge.

Codex review notes: model gpt-5.5, reasoning high; reviewed against e46dccb35374.

@openclaw-barnacle Bot removed the "stale" label (Marked as stale due to inactivity) Apr 30, 2026

Successfully merging this pull request may close these issues.

[Bug]: Discord MessageListener performance regression (33-122s delays)
