Skip to content

Conversation

@dao-jun
Copy link
Member

@dao-jun dao-jun commented Aug 22, 2025

Motivation

In our cluster, we encountered an exception on a change_events partition:
image
Thousands of get last message id failed was thrown due to No such ledger exception.

In the SystemTopicBasedTopicPoliciesService, it will create thousands of Reader instances.

For some reason, all the Readers didn't closed successfully(maybe we were unload the ns at the time), after the change_events partition transferred to a new broker, all the Readers was reconnected to the new broker, and never be closed.
image

There was more than 20k consumers of the partition of change_events:
image

Modifications

Don't clean the readerCache and close the Reader unless:

  1. The consumer was already closed
  2. All ns bundles were removed from the broker
  3. The reader creation was failed
  4. TopicPoliciesService was closed

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@dao-jun dao-jun added this to the 4.1.0 milestone Aug 22, 2025
@dao-jun dao-jun self-assigned this Aug 22, 2025
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Aug 22, 2025
@dao-jun dao-jun requested a review from nodece August 22, 2025 05:57
@codelipenghui
Copy link
Contributor

For some reason, all the Readers didn't closed successfully(maybe we were unload the ns at the time), after the change_events partition transferred to a new broker, all the Readers was reconnected to the new broker, and never be closed.

@dao-jun Do you see the logs from https://github.com/apache/pulsar/pull/24658/files#diff-9d2948d863c111e4be6d508a1c573667a1326b98c4314e917ba9e344bb61dc27L546 ? or what is the reason for the reader close failure? The behavior you mentioned seems not expected. If user triggered the reader.close(), it should not reconnect again for any reason.

@dao-jun
Copy link
Member Author

dao-jun commented Aug 23, 2025

For some reason, all the Readers didn't closed successfully(maybe we were unload the ns at the time), after the change_events partition transferred to a new broker, all the Readers was reconnected to the new broker, and never be closed.

@dao-jun Do you see the logs from https://github.com/apache/pulsar/pull/24658/files#diff-9d2948d863c111e4be6d508a1c573667a1326b98c4314e917ba9e344bb61dc27L546 ? or what is the reason for the reader close failure? The behavior you mentioned seems not expected. If user triggered the reader.close(), it should not reconnect again for any reason.

I don't have a chance to find out the pulsar client's bug, @nodece will handle it.

@codecov-commenter
Copy link

codecov-commenter commented Aug 23, 2025

Codecov Report

❌ Patch coverage is 68.08511% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.20%. Comparing base (e5e7981) to head (249d6d0).
⚠️ Report is 24 commits behind head on master.

Files with missing lines Patch % Lines
.../service/SystemTopicBasedTopicPoliciesService.java 68.08% 12 Missing and 3 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #24658      +/-   ##
============================================
+ Coverage     74.01%   74.20%   +0.19%     
+ Complexity    33224    32832     -392     
============================================
  Files          1858     1885      +27     
  Lines        146500   146977     +477     
  Branches      16880    16930      +50     
============================================
+ Hits         108425   109070     +645     
+ Misses        29394    29209     -185     
- Partials       8681     8698      +17     
Flag Coverage Δ
inttests 26.71% <42.55%> (-0.21%) ⬇️
systests 22.67% <6.38%> (?)
unittests 73.70% <68.08%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
.../service/SystemTopicBasedTopicPoliciesService.java 73.04% <68.08%> (-5.49%) ⬇️

... and 174 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@BewareMyPower BewareMyPower left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a new parameter and keeping two overloads of cleanCacheAndCloseReader (now cleanCache) makes code hard to read.

Method invocations like clean(ns, true, false) or clean(ns, true, true) really harms the readability. Since you're already touching this part, could you do a further refactoring by splitting the original cleanCacheAndCloseReader method into three methods? For example:

  • cleanWriterCache
  • cleanReaderCache
  • cleanPolicyInitCache

@dao-jun dao-jun requested a review from BewareMyPower August 25, 2025 04:17
@nodece
Copy link
Member

nodece commented Aug 26, 2025

@Technoboy- Makes sense. I agree this is an improvement to avoid creating the subscription multiple times, but we still need to find the root cause of the issue.

@dao-jun
Copy link
Member Author

dao-jun commented Aug 26, 2025

@Technoboy-

The root cause is No such ledger exception happened while reading compacted ledger.

At broker 1, the ledger ensemble is:
image
After a few minutes, change_events-partition-0 transferred to another broker, at the time, the ensemble is :
image

# Conflicts:
#	pulsar-broker/src/main/java/org/apache/pulsar/broker/service/SystemTopicBasedTopicPoliciesService.java
@lhotari
Copy link
Member

lhotari commented Aug 26, 2025

There seems to be a remaining race condition issue: a namespace might be getting removed while it is being initialized again. One possible solution to this would be to break down the logic into another class. In the map for namespaces, the value could be an instance of this class that is in different states. This way it would be possible to properly handle different cases and avoid race conditions.

@dao-jun
Copy link
Member Author

dao-jun commented Sep 1, 2025

There seems to be a remaining race condition issue: a namespace might be getting removed while it is being initialized again. One possible solution to this would be to break down the logic into another class. In the map for namespaces, the value could be an instance of this class that is in different states. This way it would be possible to properly handle different cases and avoid race conditions.

The concurrency control is based on policyCacheInitMap, and I remove the init future in the last step. I believe it's OK

@coderzc coderzc modified the milestones: 4.1.0, 4.2.0 Sep 1, 2025
@dao-jun dao-jun requested a review from Technoboy- September 8, 2025 09:23
@nodece nodece dismissed Technoboy-’s stale review September 9, 2025 03:31

This is an improvement, not a bugfix.

@nodece nodece merged commit 0cda4f4 into apache:master Sep 9, 2025
139 of 145 checks passed
@dao-jun dao-jun deleted the dev/optimize_reader_cache branch September 11, 2025 02:53
nodece pushed a commit to nodece/pulsar that referenced this pull request Sep 16, 2025
…pache#24658)

Co-authored-by: Zixuan Liu <nodeces@gmail.com>

(cherry picked from commit 0cda4f4)
Signed-off-by: Zixuan Liu <nodeces@gmail.com>
KannarFr pushed a commit to CleverCloud/pulsar that referenced this pull request Sep 22, 2025
walkinggo pushed a commit to walkinggo/pulsar that referenced this pull request Oct 8, 2025
@lhotari
Copy link
Member

lhotari commented Dec 15, 2025

Cherry-picking to branch-4.0 and branch-4.1 since #24980 depends on this PR.

lhotari pushed a commit that referenced this pull request Dec 15, 2025
…24658)

Co-authored-by: Zixuan Liu <nodeces@gmail.com>
(cherry picked from commit 0cda4f4)
lhotari pushed a commit that referenced this pull request Dec 15, 2025
…24658)

Co-authored-by: Zixuan Liu <nodeces@gmail.com>
(cherry picked from commit 0cda4f4)
ganesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Dec 19, 2025
…pache#24658)

Co-authored-by: Zixuan Liu <nodeces@gmail.com>
(cherry picked from commit 0cda4f4)
(cherry picked from commit 0868e21)
ganesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Dec 19, 2025
…pache#24658)

Co-authored-by: Zixuan Liu <nodeces@gmail.com>
(cherry picked from commit 0cda4f4)
(cherry picked from commit 0868e21)
ganesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Dec 19, 2025
…pache#24658)

Co-authored-by: Zixuan Liu <nodeces@gmail.com>
(cherry picked from commit 0cda4f4)
(cherry picked from commit 0868e21)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants