[fix][ml] Fix ledger trimming race causing cursor to point to deleted ledgers #24855
… ledgers

Problem:
- getNumberOfEntriesInBacklog(isPrecise=false) could return negative values
- Cursors could point to deleted ledgers after trim operations
- Root cause: persistence ordering race between cursor advancement and ledger trimming

Root Cause:
When messages are acknowledged:
1. Cursor position advances in memory immediately
2. Cursor state persists to BookKeeper asynchronously
3. If ledger trimming occurs during the persistence delay, it uses the in-memory position
4. Ledgers get deleted before the cursor state is durably saved
5. On topic reload, cursor reverts to old persistent position pointing to deleted ledgers

The Fix:
Changed maybeUpdateCursorBeforeTrimmingConsumedLedger() to use the persistent cursor position (getPersistentMarkDeletedPosition) instead of the in-memory position (getMarkDeletedPosition) when determining which ledgers can be safely trimmed. This ensures ledgers are only deleted after the cursor advancement has been durably persisted to BookKeeper, preventing the cursor from pointing to deleted ledgers.

Test Coverage:
Added testCursorPointsToDeletedLedgerAfterTrim() which:
- Simulates BookKeeper persistence delay (30 seconds)
- Acknowledges messages asynchronously during the delay
- Triggers ledger trimming
- Verifies ledgers are NOT trimmed when persistent position hasn't advanced

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@codelipenghui Please add the following content to your PR description and select a checkbox:
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #24855 +/- ##
=============================================
+ Coverage 38.36% 74.30% +35.93%
- Complexity 13123 33451 +20328
=============================================
Files 1856 1913 +57
Lines 145070 149281 +4211
Branches 16836 17325 +489
=============================================
+ Hits 55662 110923 +55261
+ Misses 81843 29521 -52322
- Partials 7565 8837 +1272
lhotari left a comment
LGTM, good work @codelipenghui
@codelipenghui Does this problem apply to 3.0.x and 3.3.x?
@lhotari Yes, I think so.
… ledgers (#24855) Co-authored-by: Claude <noreply@anthropic.com>
… ledgers (apache#24855) Co-authored-by: Claude <noreply@anthropic.com> (cherry picked from commit d56758f)
Summary
Fixes a critical race condition where ledger trimming could delete ledgers while cursors still pointed to them, causing:
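- getNumberOfEntriesInBacklog(isPrecise=false) returning negative values
- Cursors pointing to deleted ledgers after trim operations

Here is an example of topic internal-stats output that shows the issue: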
{
  "entriesAddedCounter" : 947356,
  "numberOfEntries" : 14437764,
  "totalSize" : 7582862344,
  "currentLedgerEntries" : 402806,
  "currentLedgerSize" : 221336947,
  "lastLedgerCreatedTimestamp" : "2025-10-14T16:05:55.171Z",
  "waitingCursorsCount" : 0,
  "pendingAddEntriesCount" : 3,
  "lastConfirmedEntry" : "339371:402802",
  "state" : "LedgerOpened",
  "ledgers" : [
    { "ledgerId" : 333163, "entries" : 76809, "size" : 173537945, "offloaded" : false, "underReplicated" : false },
    ...
    { "ledgerId" : 339362, "entries" : 98219, "size" : 47366060, "offloaded" : false, "underReplicated" : false },
    { "ledgerId" : 339365, "entries" : 544553, "size" : 288444337, "offloaded" : false, "underReplicated" : false },
    { "ledgerId" : 339371, "entries" : 0, "size" : 0, "offloaded" : false, "underReplicated" : false }
  ],
  "cursors" : {
    "subscription" : {
      "markDeletePosition" : "333160:740812",
      "readPosition" : "339371:402809",
      "waitingReadOp" : true,
      "pendingReadOps" : 0,
      "messagesConsumedCounter" : 947397,
      "cursorLedger" : 339366,
      "cursorLedgerLastEntry" : 1044,
      "individuallyDeletedMessages" : "[(333160:742256..333160:742354],(333163:-1..333163:76808],(335756:-1..335756:153],(335761:-1..335761:16],(335765:-1..335765:193],(335770:-1..335770:16],(335773:-1..335773:13],(335776:-1..335776:32],(335779:-1..335779:98],(335783:-1..335783:19],(335787:-1..335787:120],(335792:-1..335792:350],(335795:-1..335795:40],(335797:-1..335797:71],(335799:-1..335799:161],(335801:-1..335801:32],(335803:-1..335803:100],(335805:-1..335805:293],(335807:-1..335807:30],(335809:-1..335809:58],(335811:-1..335811:1126],(335813:-1..335813:250],(335815:-1..335815:175],(335817:-1..335817:153],(335819:-1..335819:4853],(335821:-1..335821:291],(335823:-1..335823:121],(335825:-1..335825:369],(335827:-1..335827:8149],(335829:-1..335829:208],(335833:-1..335833:5],(335835:-1..335835:16138],(336566:-1..336566:0],(339193:-1..339193:281215],(339197:-1..339197:528770],(339201:-1..339201:567213],(339205:-1..339205:526913],(339209:-1..339209:221685],(339213:-1..339213:513837],(339217:-1..339217:218062],(339221:-1..339221:528247],(339225:-1..339225:281650],(339229:-1..339229:407551],(339233:-1..339233:414568],(339238:-1..339238:412851],(339242:-1..339242:72557],(339249:-1..339249:530444],(339257:-1..339257:523422],(339261:-1..339261:523159],(339265:-1..339265:552096],(339279:-1..339279:61660],(339283:-1..339283:529604],(339289:-1..339289:556833],(339293:-1..339293:541437],(339297:-1..339297:534261],(339306:-1..339306:294508],(339307:-1..339307:188358],(339314:-1..339314:81424],(339319:-1..339319:198963],(339323:-1..339323:521405],(339327:-1..339327:521142],(339342:-1..339342:379129],(339346:-1..339346:531631],(339350:-1..339350:519627],(339354:-1..339354:464938],(339358:-1..339358:252548],(339362:-1..339362:98218],(339365:-1..339365:544552],(339371:-1..339371:402720]]",
      "lastLedgerSwitchTimestamp" : "2025-10-14T15:55:55.155Z",
      "state" : "Open",
      "active" : true,
      "numberOfEntriesSinceFirstNotAckedMessage" : 14437771,
      "totalNonContiguousDeletedMessagesRange" : 69,
      "subscriptionHavePendingRead" : true,
      "subscriptionHavePendingReplayRead" : false,
      "properties" : { }
    }
  },
  "schemaLedgers" : [ ],
  "compactedLedger" : { "ledgerId" : -1, "entries" : -1, "size" : -1, "offloaded" : false, "underReplicated" : false }
}
Root Cause
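The issue occurs when:
1. Messages are acknowledged and the cursor position advances in memory immediately
2. The cursor state persists to BookKeeper asynchronously
3. Ledger trimming runs during the persistence delay and uses the in-memory position
4. Ledgers get deleted before the cursor state is durably saved
5. On topic reload, the cursor reverts to the old persistent position, which points to deleted ledgers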
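The sequence can be replayed in a few lines of plain Java. This is a toy model for illustration only: TrimRaceSketch, its fields, and the three-ledger layout are hypothetical and do not use any Pulsar or BookKeeper API.

```java
import java.util.TreeMap;

/** Toy model of the race; every name here is hypothetical, not a Pulsar API. */
final class TrimRaceSketch {

    /**
     * Replays the sequence: acks advance the in-memory position, trimming runs
     * before that position is persisted, then the topic reloads.
     * Returns true when the recovered cursor references a deleted ledger.
     */
    static boolean cursorDanglesAfterReload() {
        // Ledger id -> entry count; ledger 3 is the currently open ledger.
        TreeMap<Long, Long> ledgers = new TreeMap<>();
        ledgers.put(1L, 100L);
        ledgers.put(2L, 100L);
        ledgers.put(3L, 0L);

        long persistedMarkDeleteLedger = 1L; // last durably saved cursor state
        long inMemoryMarkDeleteLedger = 3L;  // advanced by acks, write still pending

        // Trimming trusts the in-memory position and drops every older ledger.
        ledgers.headMap(inMemoryMarkDeleteLedger).clear();

        // Reload recovers the persisted position, whose ledger is now gone.
        return !ledgers.containsKey(persistedMarkDeleteLedger);
    }

    public static void main(String[] args) {
        System.out.println("cursor points to deleted ledger: " + cursorDanglesAfterReload());
    }
}
```

The model ignores entry ids and concurrency. It only shows that the order of "trim" and "persist" decides whether the recovered position is still valid.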
The Fix
Changed maybeUpdateCursorBeforeTrimmingConsumedLedger() in ManagedLedgerImpl.java:2704-2705 to use the persistent cursor position instead of the in-memory position, keeping it consistent with the mark delete entry update in CursorContainer from cursor.delete().

This ensures ledgers are only deleted after cursor advancement has been durably persisted to BookKeeper.
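A minimal sketch of the idea behind the change, assuming a simplified model in which positions are reduced to ledger ids. TrimBoundarySketch and its nested Cursor class are hypothetical names invented for this example. The actual patch works with the cursor's getPersistentMarkDeletedPosition and getMarkDeletedPosition values inside ManagedLedgerImpl and is not reproduced here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical model: the trim boundary is derived from persistent positions only. */
final class TrimBoundarySketch {

    /** Stand-in for a cursor; not the Pulsar ManagedCursor interface. */
    static final class Cursor {
        final long inMemoryMarkDeleteLedger;   // moves as soon as a message is acked
        final long persistentMarkDeleteLedger; // moves only after the async write completes

        Cursor(long inMemoryMarkDeleteLedger, long persistentMarkDeleteLedger) {
            this.inMemoryMarkDeleteLedger = inMemoryMarkDeleteLedger;
            this.persistentMarkDeleteLedger = persistentMarkDeleteLedger;
        }
    }

    /** Returns ledger ids strictly older than the slowest persistent position. */
    static List<Long> trimmable(NavigableSet<Long> ledgerIds, List<Cursor> cursors) {
        if (ledgerIds.isEmpty()) {
            return new ArrayList<>();
        }
        long boundary = ledgerIds.last(); // never trim the open (last) ledger
        for (Cursor c : cursors) {
            // The in-memory position is deliberately ignored here.
            boundary = Math.min(boundary, c.persistentMarkDeleteLedger);
        }
        return new ArrayList<>(ledgerIds.headSet(boundary, false));
    }

    public static void main(String[] args) {
        NavigableSet<Long> ids = new TreeSet<>(List.of(1L, 2L, 3L));
        // Acked up to ledger 3 in memory, but only ledger 1 is persisted.
        System.out.println(trimmable(ids, List.of(new Cursor(3L, 1L))));
        // Once the write completes, both positions agree.
        System.out.println(trimmable(ids, List.of(new Cursor(3L, 3L))));
    }
}
```

Using the persistent position makes trimming conservative: a ledger may be kept slightly longer than strictly necessary, but it is never removed while a reload could still land on it.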
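Test Coverage
Added testCursorPointsToDeletedLedgerAfterTrim() in ManagedLedgerTest.java which:
- Simulates a BookKeeper persistence delay (30 seconds)
- Acknowledges messages asynchronously during the delay
- Triggers ledger trimming
- Verifies ledgers are NOT trimmed when the persistent position hasn't advanced

Verification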
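Without the fix, the test fails because:
- Trimming uses the in-memory mark-delete position, so ledgers are deleted while the persistent position still points to them

With the fix, the test passes because:
- Trimming uses the persistent mark-delete position, so ledgers are NOT trimmed until the cursor advancement has been persisted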
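The shape of such a test can be sketched without a broker by holding the persistence write open with a CompletableFuture. This is an illustrative stand-in, not the code of testCursorPointsToDeletedLedgerAfterTrim(): DelayedPersistenceSketch, ack, and canTrim are hypothetical names, and the 30-second delay is replaced by a future that the caller completes explicitly.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical cursor model whose persistence step can be delayed by the caller. */
final class DelayedPersistenceSketch {
    final AtomicLong inMemoryLedger = new AtomicLong(1);
    final AtomicLong persistentLedger = new AtomicLong(1);

    /**
     * Acknowledge up to ledgerId: the in-memory position moves immediately,
     * the persistent position moves only once persistDone completes.
     */
    CompletableFuture<Void> ack(long ledgerId, CompletableFuture<Void> persistDone) {
        inMemoryLedger.set(ledgerId);
        return persistDone.thenRun(() -> persistentLedger.set(ledgerId));
    }

    /** Trim decision mirroring the fix: only the persistent position counts. */
    boolean canTrim(long ledgerId) {
        return ledgerId < persistentLedger.get();
    }

    public static void main(String[] args) {
        DelayedPersistenceSketch cursor = new DelayedPersistenceSketch();
        CompletableFuture<Void> write = new CompletableFuture<>(); // simulated slow write

        CompletableFuture<Void> acked = cursor.ack(3, write);
        System.out.println("trim ledger 1 while write is pending: " + cursor.canTrim(1));

        write.complete(null); // persistence finishes
        acked.join();
        System.out.println("trim ledger 1 after write completes: " + cursor.canTrim(1));
    }
}
```

Completing the future by hand keeps the scenario deterministic: the "trim during the delay" check always runs before the persistent position moves.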
Documentation
- doc
- doc-required
- doc-not-needed
- doc-complete