Conversation

@TakaHiR07
Contributor

@TakaHiR07 TakaHiR07 commented Nov 14, 2025

Fixes #24977

Motivation

As shown in the issue, this PR fixes two problems: 1. cleanCacheAndCloseReader() is executed twice, which causes a concurrency error and leaves many orphan readers in SystemTopicBasedTopicPoliciesService. 2. A double update of policyCacheInitMap causes a recursive update error.

Modifications

  1. Call cleanPoliciesCacheInitMap only once when an exception is thrown.
  2. Avoid the double update of policyCacheInitMap by using putIfAbsent instead of computeIfAbsent. It is not appropriate to put so many operations inside the compute function.
  3. Add two tests that simulate exceptions thrown in createReader, initPolicyCache, and readMorePolicy within prepareInitPoliciesCacheAsync. By the way, SystemTopicBasedTopicPoliciesService seems to lack unit tests.
  4. Remove some logic from newReader(); it was confusing when readCompletableFuture threw an exception.
  5. Keep cleanPoliciesCacheInitMap() in initPolicesCache() when closed.get()==true; since the broker is closed, cleaning twice is harmless.
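
To make the putIfAbsent change concrete, here is a minimal sketch, not the actual Pulsar code. It assumes policyCacheInitMap behaves like a ConcurrentHashMap keyed by namespace; the class InitMapSketch and the methods prepareInit and mutateInsideCompute are hypothetical names used only for illustration. The first method registers the future with putIfAbsent and does all cleanup outside of any map lambda. The second shows the pattern being avoided, where the mapping function touches the same map.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

class InitMapSketch {
    // Hypothetical stand-in for policyCacheInitMap, keyed by namespace name.
    static final ConcurrentHashMap<String, CompletableFuture<Void>> initMap =
            new ConcurrentHashMap<>();

    /** Registers the init future with putIfAbsent; only the registering caller runs initWork. */
    static CompletableFuture<Void> prepareInit(String namespace, Runnable initWork) {
        CompletableFuture<Void> created = new CompletableFuture<>();
        CompletableFuture<Void> existing = initMap.putIfAbsent(namespace, created);
        if (existing != null) {
            return existing; // another caller already started initialization
        }
        try {
            initWork.run();
            created.complete(null);
        } catch (RuntimeException e) {
            // Cleanup happens outside any map lambda, so modifying the map is safe here.
            initMap.remove(namespace, created);
            created.completeExceptionally(e);
        }
        return created;
    }

    /**
     * Pattern being avoided: the mapping function modifies the same map.
     * Recent JDKs typically report this as IllegalStateException("Recursive update"),
     * but detection is best effort and depends on the JDK version and map state.
     */
    static boolean mutateInsideCompute(String namespace) {
        try {
            initMap.computeIfAbsent(namespace, k -> {
                initMap.remove(k); // cleanup from inside the mapping function
                return new CompletableFuture<>();
            });
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }
}
```

With putIfAbsent the map operation stays atomic and short, and failure handling is ordinary code that runs after the map call has returned.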

There is one point to consider in this PR:

  1. With putIfAbsent, if many getTopicPolicy() calls trigger prepareInitPoliciesCacheAsync, each call creates an empty CompletableFuture that is immediately discarded. A double check in the code could avoid this garbage, at the cost of uglier code.
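
The double check mentioned above could look roughly like the following sketch. It is an illustration under assumptions, not code from this PR: DoubleCheckSketch, getOrRegister, and the allocations counter are hypothetical, and the map is assumed to be a ConcurrentHashMap. A plain get() serves the common case, and a new future is allocated only when no entry is visible yet.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class DoubleCheckSketch {
    // Hypothetical stand-in for policyCacheInitMap.
    static final ConcurrentHashMap<String, CompletableFuture<Void>> initMap =
            new ConcurrentHashMap<>();
    // Counts how many futures were allocated, only to make the effect observable.
    static final AtomicInteger allocations = new AtomicInteger();

    static CompletableFuture<Void> getOrRegister(String namespace) {
        // First check: cheap read, no allocation when initialization is already registered.
        CompletableFuture<Void> existing = initMap.get(namespace);
        if (existing != null) {
            return existing;
        }
        // Second check: putIfAbsent resolves the race between concurrent first callers.
        allocations.incrementAndGet();
        CompletableFuture<Void> created = new CompletableFuture<>();
        existing = initMap.putIfAbsent(namespace, created);
        return existing != null ? existing : created;
    }
}
```

Concurrent first callers can still allocate a future that loses the putIfAbsent race, so this only reduces the garbage rather than eliminating it.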

Besides, one case still exists: if cleanCacheAndCloseReader() fails to close the reader, the closing reader may get a chance to reconnect and become an orphan reader. That is outside the scope of this PR.

Verifying this change

  • Make sure that the change passes the CI checks.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Member

@lhotari lhotari left a comment


LGTM, thanks for the great work @TakaHiR07

@codecov-commenter

codecov-commenter commented Nov 14, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.29%. Comparing base (6fdb4b9) to head (33ae945).
⚠️ Report is 10 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #24980   +/-   ##
=========================================
  Coverage     74.29%   74.29%           
- Complexity    34026    34066   +40     
=========================================
  Files          1920     1920           
  Lines        150252   150252           
  Branches      17428    17428           
=========================================
+ Hits         111634   111636    +2     
- Misses        29706    29735   +29     
+ Partials       8912     8881   -31     
Flag Coverage Δ
inttests 26.17% <75.00%> (-0.39%) ⬇️
systests 22.87% <67.85%> (-0.02%) ⬇️
unittests 73.84% <100.00%> (+0.02%) ⬆️


Files with missing lines Coverage Δ
.../service/SystemTopicBasedTopicPoliciesService.java 77.86% <100.00%> (+0.19%) ⬆️

... and 78 files with indirect coverage changes



Copilot AI left a comment


Pull Request Overview

This PR fixes issues in SystemTopicBasedTopicPoliciesService.prepareInitPoliciesCacheAsync related to duplicate cleanup execution and recursive update errors when exceptions occur during policy cache initialization.

Key Changes:

  • Replaced computeIfAbsent with putIfAbsent to avoid recursive update errors when modifying the map during computation
  • Consolidated exception handling to ensure cleanPoliciesCacheInitMap is called only once per exception
  • Removed redundant cleanup calls from initPolicesCache method to prevent double cleanup
  • Simplified reader creation logic in newReader by removing special exception handling that is now redundant
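
As an illustration of the consolidated exception handling described above, the following sketch attaches a single terminal handler to the whole chain, so cleanup runs once no matter which stage fails. It is a simplified model under assumptions: SingleCleanupSketch and initWithSingleCleanup are hypothetical names, the reader is modelled as a String, and the real method has more stages than shown here.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

class SingleCleanupSketch {
    /**
     * Chains reader creation and policy initialization, then runs cleanup
     * from one terminal handler if any stage completed exceptionally.
     */
    static CompletableFuture<Void> initWithSingleCleanup(
            CompletableFuture<String> createReader,
            Function<String, CompletableFuture<Void>> initPolicies,
            Runnable cleanup) {
        return createReader
                .thenCompose(initPolicies)
                .whenComplete((ignored, error) -> {
                    if (error != null) {
                        // Only place where cleanup is triggered for this chain.
                        cleanup.run();
                    }
                });
    }
}
```

Because inner stages no longer clean up on their own, a failure cannot trigger a second cleanup that interferes with a newer initialization attempt.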

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File and description:

  • SystemTopicBasedTopicPoliciesService.java: Refactored prepareInitPoliciesCacheAsync to use putIfAbsent instead of computeIfAbsent, consolidated exception handling to prevent double cleanup, simplified newReader logic, and changed cleanPoliciesCacheInitMap visibility for testing
  • SystemTopicBasedTopicPoliciesServiceTest.java: Added two comprehensive test cases to verify correct cleanup behavior when exceptions occur during reader creation and policy cache initialization


@Technoboy-
Contributor

Which logic could cause this issue?

Request-1: policyCacheInitMap put future1
Request-1: create reader1
Request-1: readerCaches put reader1
reader1 read error
Request-1: first time cleanCacheAndCloseReader(), include:
        remove reader1 in readerCaches
        close reader1
        remove future1 in policyCacheInitMap

Request-2: policyCacheInitMap put future2
Request-1: second time cleanCacheAndCloseReader(), only remove future2 in policyCacheInitMap
Request-2: create reader2
Request-2: readerCaches put reader2

Request-3: policyCacheInitMap put future3
Request-3: create reader3
Request-3: readerCaches put reader3

Does this bug exist only in 3.0.x, and not in the latest version?
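
The interleaving quoted above can be replayed deterministically with a small model. This is only a sketch under assumptions: OrphanReaderReplay is a hypothetical class, futures and readers are represented by plain strings, and the two maps are simplified stand-ins for policyCacheInitMap and readerCaches. It shows why the second cleanupCacheAndCloseReader-style call matters: it removes future2, which lets Request-3 start a new reader that overwrites reader2 in readerCaches while reader2 is still open.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OrphanReaderReplay {
    final Map<String, String> policyCacheInitMap = new HashMap<>(); // namespace -> future id
    final Map<String, String> readerCaches = new HashMap<>();       // namespace -> reader id
    final List<String> openReaders = new ArrayList<>();             // readers not yet closed

    /** Returns true if this request registered its future and may start initialization. */
    boolean registerFuture(String ns, String futureId) {
        return policyCacheInitMap.putIfAbsent(ns, futureId) == null;
    }

    void createReader(String ns, String readerId) {
        openReaders.add(readerId);
        readerCaches.put(ns, readerId); // silently replaces any reader already cached
    }

    void cleanCacheAndCloseReader(String ns) {
        String reader = readerCaches.remove(ns);
        if (reader != null) {
            openReaders.remove(reader); // close the cached reader
        }
        policyCacheInitMap.remove(ns);  // removes whichever future is registered right now
    }

    /** Readers that are still open but no longer tracked in readerCaches. */
    List<String> orphanReaders() {
        List<String> orphans = new ArrayList<>(openReaders);
        orphans.removeAll(readerCaches.values());
        return orphans;
    }

    static List<String> replay() {
        OrphanReaderReplay s = new OrphanReaderReplay();
        String ns = "tenant/ns";
        s.registerFuture(ns, "future1");   // Request-1
        s.createReader(ns, "reader1");     // Request-1
        s.cleanCacheAndCloseReader(ns);    // Request-1, first cleanup after the read error
        s.registerFuture(ns, "future2");   // Request-2
        s.cleanCacheAndCloseReader(ns);    // Request-1, second cleanup: removes future2 only
        s.createReader(ns, "reader2");     // Request-2
        if (s.registerFuture(ns, "future3")) { // Request-3 is admitted because future2 is gone
            s.createReader(ns, "reader3");
        }
        return s.orphanReaders();
    }
}
```

Calling OrphanReaderReplay.replay() returns a list containing only reader2, the reader that is still open but no longer reachable through readerCaches.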

@TakaHiR07
Contributor Author

TakaHiR07 commented Nov 19, 2025

Which logic could cause this issue?

@Technoboy- Restart brokers running version 3.0.x: restart broker-1, and a short time later restart broker-2. While broker-1 loads a topic and calls getTopicPolicy, the corresponding __change_event topic on broker-2 is unloaded.

I don't use the latest version. In the latest version this concurrent case may be avoided by pr-24658, but the code still catches the exception and runs cleanCacheAndPolicyMap twice, which is dangerous.

@Technoboy-
Contributor

Which logic could cause this issue?

@Technoboy- Restart brokers running version 3.0.x: restart broker-1, and a short time later restart broker-2. While broker-1 loads a topic and calls getTopicPolicy, the corresponding __change_event topic on broker-2 is unloaded.

I don't use the latest version. In the latest version this concurrent case may be avoided by pr-24658, but the code still catches the exception and runs cleanCacheAndPolicyMap twice, which is dangerous.

How could the latest code cause the issue? I don't understand.

@TakaHiR07
Contributor Author

TakaHiR07 commented Nov 19, 2025

How could the latest code cause the issue? I don't understand.

You can see the code in branch-3.0. The latest code is a bit different; the concurrent case was found on branch-3.0.

@Technoboy-
Contributor

How could the latest code cause the issue? I don't understand.

You can see the code in branch-3.0. The latest code is a bit different; the concurrent case was found on branch-3.0.

So it's better to fix it in branch-3.0; for the master branch, I don't think it's needed.

@TakaHiR07
Contributor Author

So it's better to fix it in branch-3.0; for the master branch, I don't think it's needed.

@Technoboy- I think it is better to fix it in the master branch as well, since the current code in the master branch is an improvement, not a true fix, and still carries risk.

@lhotari
Member

lhotari commented Dec 10, 2025

As shown in the issue, this PR fixes two problems: 1. cleanCacheAndCloseReader() is executed twice, which causes a concurrency error and leaves many orphan readers in SystemTopicBasedTopicPoliciesService. 2. A double update of policyCacheInitMap causes a recursive update error.

I think that this problem also exists in the master branch, and therefore merging this PR and cherry-picking it to maintenance branches makes sense.

@lhotari lhotari merged commit 47b8d5d into apache:master Dec 10, 2025
52 checks passed
@lhotari
Member

lhotari commented Dec 15, 2025

Depends on #24658 for branch-4.0 and branch-4.1

lhotari pushed a commit that referenced this pull request Dec 15, 2025
…picPoliciesService (#24980)

Co-authored-by: fanjianye <fanjianye@bigo.sg>
(cherry picked from commit 47b8d5d)
@lhotari
Member

lhotari commented Dec 16, 2025

Flaky test #25081, please take a look

ganesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Dec 19, 2025
…picPoliciesService (apache#24980)

Co-authored-by: fanjianye <fanjianye@bigo.sg>
(cherry picked from commit 47b8d5d)
(cherry picked from commit 9ca5241)
Development

Successfully merging this pull request may close these issues.

[Bug] [broker] Concurrent error in SystemTopicBasedTopicPoliciesService#prepareInitPoliciesCacheAsync

5 participants