Skip to content

Conversation

@lhotari
Copy link
Member

@lhotari lhotari commented Oct 8, 2025

Motivation

PulsarDecoder unnecessarily keeps a reference to the last parsed message. It's better to clear it immediately to avoid unnecessary memory growth when there are multiple connections.
This mainly impacts GET_TOPICS_OF_NAMESPACE_RESPONSE which can be large.

While making this change, it was noticed that this revealed a major reliability issue in Pulsar. Lookup requests and Partition metadata requests were getting reused and potentially mixed up in cases where there would be multiple concurrent lookup requests in progress on the same connection. The impact of this has been that lookup or partition metadata requests have failed and also that they have returned incorrect information.
The problem in these cases has been a violation of the rule documented in #18998. The implementations of handle* methods shouldn't keep references to the command after the method returns. When the command is handle asynchronously, a copy is usually needed if the asynchronous handler makes a reference to the command.

Modifications

  • call cmd.clear() in the finally block of channelRead.
  • fix bugs in handleLookup, handlePartitionMetadataRequest and handleCommandWatchTopicList
    • make copy of command before handling it asynchronously

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@lhotari lhotari added this to the 4.2.0 milestone Oct 8, 2025
@lhotari lhotari requested a review from codelipenghui October 8, 2025 17:51
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Oct 8, 2025
@lhotari lhotari force-pushed the lh-fix-memory-leak-in-PulsarDecoder branch from ad79165 to b66dee8 Compare October 9, 2025 07:14
@lhotari lhotari changed the title [fix][client] Fix temporary memory leak in PulsarDecoder/ClientCnx/ServerCnx [fix] Fix temporary memory leak in PulsarDecoder/ClientCnx/ServerCnx and mixed lookup/partition metadata requests Oct 9, 2025
@lhotari lhotari requested a review from poorbarcode October 9, 2025 07:40
@lhotari lhotari changed the title [fix] Fix temporary memory leak in PulsarDecoder/ClientCnx/ServerCnx and mixed lookup/partition metadata requests [fix] Fix mixed lookup/partition metadata requests and temporary memory leak in PulsarDecoder/ClientCnx/ServerCnx Oct 9, 2025
@lhotari lhotari changed the title [fix] Fix mixed lookup/partition metadata requests and temporary memory leak in PulsarDecoder/ClientCnx/ServerCnx [fix] Fix mixed lookup/partition metadata requests causing reliability issues and incorrect responses Oct 9, 2025
@lhotari lhotari requested a review from BewareMyPower October 9, 2025 07:41
@codecov-commenter
Copy link

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.24%. Comparing base (9cc15dd) to head (8f830e1).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #24832       +/-   ##
=============================================
+ Coverage     30.69%   74.24%   +43.55%     
- Complexity       51    33787    +33736     
=============================================
  Files          1854     1913       +59     
  Lines        144987   149178     +4191     
  Branches      16818    17304      +486     
=============================================
+ Hits          44502   110760    +66258     
+ Misses        93567    29584    -63983     
- Partials       6918     8834     +1916     
Flag Coverage Δ
inttests 26.62% <66.66%> (+0.18%) ⬆️
systests 22.82% <83.33%> (+0.01%) ⬆️
unittests 73.76% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...va/org/apache/pulsar/broker/service/ServerCnx.java 72.29% <100.00%> (+37.02%) ⬆️
.../java/org/apache/pulsar/client/impl/ClientCnx.java 69.83% <100.00%> (+24.78%) ⬆️
...g/apache/pulsar/common/protocol/PulsarDecoder.java 64.72% <100.00%> (+10.82%) ⬆️

... and 1498 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Member

@nodece nodece left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Although part of the issue has been addressed, I'm still concerned that asynchronous handling may lead to accessing variables after cmd have been cleared (for example, in the interceptor).

@lhotari lhotari marked this pull request as draft October 9, 2025 11:11
BewareMyPower
BewareMyPower previously approved these changes Oct 9, 2025
@BewareMyPower BewareMyPower dismissed their stale review October 9, 2025 11:12

draft again

@BewareMyPower
Copy link
Contributor

I'm still concerned that asynchronous handling may lead to accessing variables after cmd have been cleared (for example, in the interceptor).

Do you mean cmd is published to a different thread? If so, we should not do that. In the interceptor case, the interceptor should be responsible to make a deep copy on cmd. A better approach is to copy necessary fields to a new object.

BTW, I checked the code in Pulsar that cmd is never accessed out of the channelRead method. Please correct me if I'm wrong.

@nodece
Copy link
Member

nodece commented Oct 9, 2025

Deep copy is a good idea.

@lhotari lhotari marked this pull request as ready for review October 9, 2025 11:34
@lhotari lhotari merged commit 4457b08 into apache:master Oct 9, 2025
51 checks passed
lhotari added a commit that referenced this pull request Oct 10, 2025
…y issues and incorrect responses (#24832)

(cherry picked from commit 4457b08)
lhotari added a commit that referenced this pull request Oct 10, 2025
…y issues and incorrect responses (#24832)

(cherry picked from commit 4457b08)
lhotari added a commit that referenced this pull request Oct 10, 2025
…y issues and incorrect responses (#24832)

(cherry picked from commit 4457b08)
lhotari added a commit that referenced this pull request Oct 10, 2025
…y issues and incorrect responses (#24832)

(cherry picked from commit 4457b08)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Oct 15, 2025
…y issues and incorrect responses (apache#24832)

(cherry picked from commit 4457b08)
(cherry picked from commit 4600efe)
manas-ctds pushed a commit to datastax/pulsar that referenced this pull request Oct 15, 2025
…y issues and incorrect responses (apache#24832)

(cherry picked from commit 4457b08)
(cherry picked from commit 5c4ee53)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Oct 30, 2025
…y issues and incorrect responses (apache#24832)

(cherry picked from commit 4457b08)
(cherry picked from commit 4600efe)
nodece pushed a commit to ascentstream/pulsar that referenced this pull request Oct 30, 2025
…y issues and incorrect responses (apache#24832)

(cherry picked from commit 4457b08)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Nov 6, 2025
…y issues and incorrect responses (apache#24832)

(cherry picked from commit 4457b08)
(cherry picked from commit 5c4ee53)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants