fix(infra): lazy prune in DedupeCache to resolve Discord 33-122s message delays #52762
Erprabhat8423 wants to merge 2 commits into openclaw:main
Conversation
The dedupe fix in openclaw#51950 introduced recentInboundMessages.check() in the Discord message handler. This called prune() on every single message, which did a full linear scan of up to 5000 cache entries on each invocation. On active Discord servers, this blocked the event loop for tens of milliseconds per message, stacking up to the 33-122s delays reported in openclaw#4453. Also fixed map-size.ts, which had corrupted duplicate code.

Changes:
- dedupe.ts: only call prune() when cache.size exceeds maxSize
- map-size.ts: rewrite pruneMapToMaxSize to delete in a single pass instead of calling map.keys().next() in a while loop

Fixes openclaw#4453
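The shape of the change can be sketched as a minimal Map-backed TTL cache. This is a hypothetical reconstruction from the description above (class and method names mirror the PR text; the in-repo dedupe.ts may differ):

```typescript
// Minimal sketch of the lazy-prune change, assuming a Map-backed TTL cache.
// DedupeCache/check/prune mirror the names in the PR description; the
// in-repo implementation may differ.
class DedupeCache {
  private cache = new Map<string, number>(); // key -> expiresAt

  constructor(private ttlMs: number, private maxSize: number) {}

  /** Returns true if `key` was already seen within the TTL window. */
  check(key: string, now = Date.now()): boolean {
    const expiresAt = this.cache.get(key);
    if (expiresAt !== undefined && expiresAt > now) return true;

    // Before this PR: prune(now) ran unconditionally here, scanning up to
    // maxSize entries on every single message.
    // After this PR: prune only once the cache exceeds its size cap.
    if (this.cache.size > this.maxSize) this.prune(now);

    this.cache.set(key, now + this.ttlMs);
    return false;
  }

  private prune(now: number): void {
    // Full scan: drop every TTL-expired entry. Deleting during Map
    // iteration is safe in JavaScript.
    for (const [key, expiresAt] of this.cache) {
      if (expiresAt <= now) this.cache.delete(key);
    }
  }
}
```

With the Discord handler's reported settings (ttlMs = 5 * 60_000, maxSize = 5000), the O(n) scan now runs only when the cache overflows instead of on every inbound message.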
Greptile Summary

This PR fixes a real performance regression introduced by #51950: calling prune() on every message performed a full linear scan of up to 5000 cache entries, blocking the event loop on busy Discord servers.

Minor behavioral trade-off: TTL-expired entries are no longer eagerly removed on every call; they linger until the cache grows beyond maxSize. The fix is well-tested (5 dedupe tests + 2 map-size tests) and logically sound.

Confidence Score: 5/5
Reviews (1): Last reviewed commit: "fix(infra): lazy prune in DedupeCache to..."
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6dd2088bb5
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
    if (cache.size > maxSize) {
      prune(now);
Avoid full-cache prune once the dedupe window saturates
check() still calls the full prune(now) path whenever cache.size > maxSize, so this only helps until the live working set reaches the cap. In the Discord handler the window is ttlMs=5*60_000 with maxSize=5000 (extensions/discord/src/monitor/message-handler.ts:34-35), which means any account receiving more than about 17 distinct inbound messages/sec will keep the cache saturated; after that, every new message still pays the same O(n) scan through ~5,000 entries, so the delay regression this patch is meant to fix will continue on the busiest servers.
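One way to bound that cost once the cache saturates (a sketch of a possible follow-up, not code from this PR; function name and budget value are assumptions) is to cap the number of entries inspected per call, relying on Map's insertion-order iteration so the oldest entries are visited first:

```typescript
// Hypothetical budgeted prune: each call inspects at most `budget` entries,
// so a saturated 5000-entry cache costs O(budget) per message instead of
// O(n). Assumes insertion order ~= age, which holds when entries are only
// ever appended with a fixed TTL.
function pruneBudgeted(
  cache: Map<string, number>, // key -> expiresAt
  maxSize: number,
  now: number,
  budget = 32,
): void {
  let inspected = 0;
  for (const [key, expiresAt] of cache) {
    if (inspected++ >= budget) break;
    // Evict expired entries always; evict live ones only while over cap.
    if (expiresAt <= now || cache.size > maxSize) cache.delete(key);
    else break; // oldest live entry while within cap: nothing older remains
  }
}
```

Because eviction work is amortized across calls, a server receiving more than maxSize distinct messages per TTL window pays a small constant per message rather than a full scan.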
Useful? React with 👍 / 👎.
Thanks for the detailed review! The saturated-cache edge case is a valid follow-up optimization. For now this fix addresses the core regression for the majority of servers. Happy to open a separate issue for the high-throughput case if that's useful.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c7de07837c
    const deleteCount = map.size - limit;
    let i = 0;
    for (const key of map.keys()) {
      if (i++ >= deleteCount) break;
      map.delete(key);
Preserve NaN/invalid-limit behavior in pruneMapToMaxSize
When maxSize is NaN at runtime, this loop now deletes every entry instead of leaving the map untouched. That matters because several in-repo callers still normalize with Math.max(..., Math.floor(...)), which also yields NaN for malformed numeric input (for example src/plugin-sdk/webhook-request-guards.ts:95-115 and src/plugin-sdk/webhook-memory-guards.ts:60-97). After this change, a limiter created with an invalid numeric config immediately forgets its active keys, so duplicate in-flight requests or rate-limited keys are no longer tracked at all; before this commit, pruneMapToMaxSize(..., NaN) was effectively a no-op and those guards kept working.
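The inversion comes from NaN comparison semantics: every `>` / `>=` comparison involving NaN is false, so the two loop shapes fail in opposite directions. A minimal demonstration:

```typescript
// NaN poisons both loop guards, but in opposite directions.
const limit = Number.NaN;
const map = new Map([["a", 1], ["b", 2]]);

// Old shape: `while (map.size > limit)` — the guard itself compares
// against NaN, is always false, and the loop is a no-op.
console.log(map.size > limit); // false: nothing is deleted

// New shape: NaN only feeds deleteCount, and the break test
// `i++ >= deleteCount` is also always false — so the loop never breaks
// and every key is deleted.
const deleteCount = map.size - limit; // NaN
let i = 0;
for (const key of map.keys()) {
  if (i++ >= deleteCount) break; // never true
  map.delete(key);
}
console.log(map.size); // 0: the map was emptied
```

A `Number.isFinite(deleteCount)` guard before the loop would restore the old no-op behavior for invalid limits.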
This pull request has been automatically marked as stale due to inactivity.
Codex review: needs changes before merge.

Summary

Reproducibility: yes. Source inspection plus read-only Node simulations reproduce the PR defects: the proposed size-gated prune still scans millions of entries after a 5000-entry cache saturates, and the map-size rewrite deletes all entries for a NaN limit.

Review details

Best possible solution: Revise or replace this with a bounded pruning strategy that avoids saturated-cache full scans, preserves or explicitly validates invalid numeric limit handling, adds focused helper and Discord replay tests, and includes the required changelog entry.

Do we have a high-confidence way to reproduce the issue? Yes (see Summary).

Is this the best way to solve the issue? No. The direction is useful, but this patch is not the best fix because it delays the O(n) scan until saturation and changes shared SDK helper semantics instead of making pruning bounded and stable.

Overall correctness: patch is incorrect.

Security concerns:
Acceptance criteria:
What I checked:
Likely related people:

Codex review notes: model gpt-5.5, reasoning high; reviewed against a3f6f24b79a5.
Codex review: found issues before merge.

What this changes: The PR changes dedupe.ts to prune only when cache.size exceeds maxSize, and rewrites pruneMapToMaxSize in map-size.ts to delete in a single pass.

Maintainer follow-up before merge: This is a small but shared public-SDK/helper change with unresolved functional and security-relevant semantic findings, so maintainers should decide whether to ask for a revision, open a replacement, or close only after replacement work exists.

Review details

Best possible solution: Land or replace this with a focused shared-helper fix that removes per-message full scans in both normal and saturated cache states, preserves or explicitly validates invalid numeric limit handling, keeps public SDK helper semantics stable, and proves the helper plus Discord replay path with targeted tests and changed-gate validation.

Overall correctness: patch is incorrect.

Acceptance criteria:
What I checked:
Likely related people:
Remaining risk / open question:

Codex review notes: model gpt-5.5, reasoning high; reviewed against e46dccb35374.
Fixes #4453
Root Cause
The dedupe fix (#51950) added recentInboundMessages.check() to the Discord message handler. check() called prune() on every message, which did a full linear scan of up to 5000 cache entries each time. On active servers this blocked the event loop cumulatively, producing the exact 33-122s delays reported in the issue.
Changes
- src/infra/dedupe.ts — only prune when cache.size > maxSize
- src/infra/map-size.ts — fixed corrupted duplicate code + rewrote pruneMapToMaxSize to delete in a single forward pass
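The single-forward-pass rewrite can be sketched as follows. This is a hypothetical reconstruction, not the in-repo code; the explicit finiteness guard is an addition that keeps an invalid (e.g. NaN) limit a no-op instead of emptying the map:

```typescript
// Hypothetical single-pass version of pruneMapToMaxSize. Computes how many
// entries must go, then deletes that many in one forward walk. Maps iterate
// in insertion order, so the oldest entries are removed first.
function pruneMapToMaxSize<K, V>(map: Map<K, V>, maxSize: number): void {
  const deleteCount = map.size - maxSize;
  // Guard: a NaN/Infinity limit yields a non-finite deleteCount; bail out
  // rather than deleting everything (NaN comparisons are always false).
  if (!Number.isFinite(deleteCount) || deleteCount <= 0) return;
  let deleted = 0;
  for (const key of map.keys()) {
    if (deleted++ >= deleteCount) break;
    map.delete(key); // safe to delete while iterating a Map
  }
}
```

Compared with repeatedly calling map.keys().next() in a while loop, a single iterator walk avoids re-creating an iterator per deletion and makes the pass count explicit.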
Testing