Skip to content

Conversation

@poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Aug 5, 2025

Motivation

If more the one task named expireMessage were executed at the same time, the mechanism expirationCheckInProgress prevents them from executing concurrently, but it is not well enough.

task: expire msg 1 task: expire msg 2
expirationCheckInProgress -> true[1]
find target position[2]
start to reset cursor[3]
expirationCheckInProgress -> falsesup>[3]
expirationCheckInProgress -> true[1]
find target position[2]
start to reset cursor[3]
get an error: cursor reset in progress - ignoring mark delete on position [{}] for cursor [{}]
end resetting cursor
  • [1]: PersistentSubscription.expireMessages
  • [2]: PersistentMessageExpiryMonitor.expireMessages
  • [3]: PersistentMessageExpiryMonitor.findEntryComplete

The above concurrency wasted a finding target position, which is a heavy operation.

The logs that were printed when our cluster encountered the performance issue

2025-07-28T09:13:38,645+0000 [pulsar-msg-expiry-monitor-OrderedScheduler-0-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentMessageExpiryMonitor - [persistent://{tenant}/{ns}/{topic}][{subscription}] Starting message expiry check, ttl= 86400 seconds
2025-07-28T09:13:38,745+0000 [BookKeeperClientWorker-OrderedExecutor-1-0] INFO  org.apache.pulsar.broker.service.persistent.PersistentMessageExpiryMonitor - [persistent://{tenant}/{ns}/{topic}][{subscription}] Expiring all messages until position 2719761:33957
2025-07-28T09:13:38,745+0000 [BookKeeperClientWorker-OrderedExecutor-1-0] WARN  org.apache.pulsar.broker.service.persistent.PersistentMessageExpiryMonitor - [persistent://{tenant}/{ns}/{topic}][{subscription}] Message expiry failed - mark delete failed
org.apache.bookkeeper.mledger.ManagedLedgerException: Reset cursor in progress - unable to mark delete position 2719761:33957

Modifications

Release the lock expirationCheckInProgress after resetting the cursor successfully.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode added this to the 4.1.0 milestone Aug 5, 2025
@poorbarcode poorbarcode self-assigned this Aug 5, 2025
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Aug 5, 2025
@3pacccccc
Copy link
Contributor

After further investigation, I found that this PR may not address the root cause. The cursor reset in progress error does not appear to be caused by concurrent calls to expireMessage or ManagedCursor.asyncMarkDelete. Instead, the issue likely stems from another code path that triggers ManagedCursorImpl.asyncResetCursor.

@poorbarcode poorbarcode requested a review from 3pacccccc August 6, 2025 04:07
@poorbarcode
Copy link
Contributor Author

/pulsarbot rerun-failure-checks

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. However, this looks like a bug fix instead of an improvement.

@3pacccccc
Copy link
Contributor

They are different cases, the current PR focuses on multiple expiring messages executing concurrently, I will submit a proposal to solve the race condition between reset cursor and expiring messages later

yes, They are different cases.I'm just saying this change might not really fix the issue you mentioned. But I'm +1 with this change.

@codecov-commenter
Copy link

codecov-commenter commented Aug 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.32%. Comparing base (d272825) to head (c1ba490).
⚠️ Report is 10 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #24607      +/-   ##
============================================
+ Coverage     74.21%   74.32%   +0.10%     
+ Complexity    33142    32654     -488     
============================================
  Files          1881     1881              
  Lines        146770   146778       +8     
  Branches      16859    16857       -2     
============================================
+ Hits         108922   109089     +167     
+ Misses        29181    29014     -167     
- Partials       8667     8675       +8     
Flag Coverage Δ
inttests 26.68% <0.00%> (?)
systests 23.28% <0.00%> (-0.08%) ⬇️
unittests 73.80% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...ice/persistent/PersistentMessageExpiryMonitor.java 65.83% <100.00%> (+4.81%) ⬆️

... and 108 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@poorbarcode poorbarcode merged commit 66b512c into apache:master Aug 7, 2025
140 of 145 checks passed
gaozhangmin pushed a commit to gaozhangmin/pulsar that referenced this pull request Aug 13, 2025
Technoboy- pushed a commit to Technoboy-/pulsar that referenced this pull request Aug 14, 2025
lhotari pushed a commit to lhotari/pulsar that referenced this pull request Aug 14, 2025
@lhotari
Copy link
Member

lhotari commented Aug 14, 2025

@poorbarcode Please handle backporting to branch-3.3 . I tried that, but the test fails

my backport attempt to branch-3.3: branch-3.3...lhotari:pulsar:lh-pr24607-branch-3.3, backport commit 4485218

[ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 23.75 s <<< FAILURE! - in org.apache.pulsar.broker.service.persistent.PersistentMessageExpiryMonitorTest
[ERROR] org.apache.pulsar.broker.service.persistent.PersistentMessageExpiryMonitorTest.testConcurrentlyExpireMessages  Time elapsed: 6.277 s  <<< FAILURE!
java.lang.AssertionError: expected:<1> but was:<3>
        at org.testng.AssertJUnit.fail(AssertJUnit.java:65)
        at org.testng.AssertJUnit.failNotEquals(AssertJUnit.java:467)
        at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:88)
        at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:318)
        at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:328)
        at org.apache.pulsar.broker.service.persistent.PersistentMessageExpiryMonitorTest.testConcurrentlyExpireMessages(PersistentMessageExpiryMonitorTest.java:135)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:569)
        at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
        at org.testng.internal.invokers.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:47)
        at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:76)
        at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)

poorbarcode added a commit to poorbarcode/pulsar that referenced this pull request Aug 14, 2025
Technoboy- pushed a commit that referenced this pull request Aug 18, 2025
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 21, 2025
manas-ctds added a commit to datastax/pulsar that referenced this pull request Aug 22, 2025
manas-ctds added a commit to datastax/pulsar that referenced this pull request Aug 25, 2025
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 26, 2025
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 26, 2025
@lhotari
Copy link
Member

lhotari commented Aug 28, 2025

@poorbarcode Please handle backporting to branch-3.3 . I tried that, but the test fails

my backport attempt to branch-3.3: branch-3.3...lhotari:pulsar:lh-pr24607-branch-3.3, backport commit 4485218

[ERROR] Tests run: 2, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 23.75 s <<< FAILURE! - in org.apache.pulsar.broker.service.persistent.PersistentMessageExpiryMonitorTest
[ERROR] org.apache.pulsar.broker.service.persistent.PersistentMessageExpiryMonitorTest.testConcurrentlyExpireMessages  Time elapsed: 6.277 s  <<< FAILURE!
java.lang.AssertionError: expected:<1> but was:<3>
        at org.testng.AssertJUnit.fail(AssertJUnit.java:65)
        at org.testng.AssertJUnit.failNotEquals(AssertJUnit.java:467)
        at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:88)
        at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:318)
        at org.testng.AssertJUnit.assertEquals(AssertJUnit.java:328)
        at org.apache.pulsar.broker.service.persistent.PersistentMessageExpiryMonitorTest.testConcurrentlyExpireMessages(PersistentMessageExpiryMonitorTest.java:135)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:569)
        at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
        at org.testng.internal.invokers.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:47)
        at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:76)
        at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)

@poorbarcode ping

manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Aug 28, 2025
@poorbarcode
Copy link
Contributor Author

@lhotari I removed the label release/3.3.9 since it is outdated

srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Sep 12, 2025
KannarFr pushed a commit to CleverCloud/pulsar that referenced this pull request Sep 22, 2025
walkinggo pushed a commit to walkinggo/pulsar that referenced this pull request Oct 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants