Skip to content

Conversation

@poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Sep 10, 2025

Motivation

Our pulsar cluster encountered an issue where the response of the command pulsar-admin topics stats --get-precise-backlog --get-subscription-backlog-size <topic> has a negative backlog.

And topic internal stats show that a consumer acknowledged messages that do not exist in topic.ledgers

  • ledgerInfo: {"ledgerId":1140377,"entries":5,"size":45573,"offloaded":false,"metadata":null,"underReplicated":false}
  • But the acknowledged more entries: (1140377:-1..1140377:6]

The issue happened this way:

  • Configurations
    • managedLedgerDefaultEnsembleSize=2
    • managedLedgerDefaultWriteQuorum=2
    • managedLedgerDefaultAckQuorum=2
  • Topic state
    • ledgers: [3: {entries: 1}]
  • Scenarios
    • bookie-2 is slow
    • ZK connections are unstable, which may cause a race condition on owning topics
time broker-1 broker-2 bookie-1 bookie-2
1 owned the topic
2 start publish message 3:1(has not completed yet)
3 receives the write request 3:1 receives the write request 3:1
4 write disk finished write disk finished
5 responds to broker-1
2 start publish message 3:2(has not sent to Bookies yet)
6 metadata store connection unstable
7 onwed the topic
8 assumes itself is still the owner of the topic
9 start to close the ledger 3
10 receives the close request of the ledger 3 receives the close request of the ledger 3
11 fence & close the ledger 3
12 receives the write request 3:2 receives the write request 3:2
13 responds fenced error since the ledger has been closed
14 received a fenced error
15 switch ledger to 4 and write topic metadata with [3: {entries: 1}] since bookie-2 has not responded to the request of writing 3:1
16 responds to broker-1 for the writing 3:1
17 fence & close the ledger 3
18 closed ledger 3, and it has 2 entries
19 received the response of writing 3:1

Highlight: The issue also leads to messages being lost.

You can reproduce the issue with the new test testConcurrentlyCloseCurrentLedger

Modifications

Re-check the ledger entries from the metadata store if received a fenced error when writing Bookies

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@poorbarcode poorbarcode added this to the 4.2.0 milestone Sep 10, 2025
@poorbarcode poorbarcode self-assigned this Sep 10, 2025
@poorbarcode poorbarcode added type/bug The PR fixed a bug or issue reported a bug cherry-picked/branch-4.0 labels Sep 10, 2025
@github-actions
Copy link

@poorbarcode Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@poorbarcode
Copy link
Contributor Author

/pulsarbot rerun-failure-checks

@lhotari
Copy link
Member

lhotari commented Sep 10, 2025

@poorbarcode Regarding the title "Negative backlog & acked positions does not exist & message lost when concurrently occupying topic owner" is describing the current symptoms and impact. Perhaps it should be emphasized in the title that this PR is addressing fundamentals of how Pulsar Brokers should be handling topic ownership? It's surprising that this hasn't already been implemented properly by now. Thanks for addressing this issue!

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some questions and comments.

@poorbarcode poorbarcode requested a review from lhotari September 11, 2025 08:42
@codecov-commenter
Copy link

codecov-commenter commented Sep 11, 2025

Codecov Report

❌ Patch coverage is 92.30769% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.24%. Comparing base (7547fab) to head (5425c2f).
⚠️ Report is 9 commits behind head on master.

Files with missing lines Patch % Lines
...che/bookkeeper/mledger/impl/ManagedLedgerImpl.java 95.23% 0 Missing and 1 partial ⚠️
...org/apache/bookkeeper/mledger/impl/OpAddEntry.java 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #24722      +/-   ##
============================================
+ Coverage     74.16%   74.24%   +0.08%     
- Complexity    33498    33636     +138     
============================================
  Files          1896     1900       +4     
  Lines        148109   148420     +311     
  Branches      17163    17211      +48     
============================================
+ Hits         109844   110199     +355     
+ Misses        29488    29453      -35     
+ Partials       8777     8768       -9     
Flag Coverage Δ
inttests 26.29% <7.69%> (-0.10%) ⬇️
systests 22.72% <15.38%> (+0.08%) ⬆️
unittests 73.77% <92.30%> (+0.09%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...che/bookkeeper/mledger/impl/ManagedLedgerImpl.java 81.15% <95.23%> (+0.21%) ⬆️
...org/apache/bookkeeper/mledger/impl/OpAddEntry.java 77.20% <80.00%> (-0.15%) ⬇️

... and 120 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check the review comments.

@poorbarcode poorbarcode requested a review from lhotari September 12, 2025 01:41
@Technoboy- Technoboy- merged commit e417aca into apache:master Sep 16, 2025
51 checks passed
@poorbarcode poorbarcode deleted the fix/msg_lost branch September 16, 2025 03:22
poorbarcode added a commit that referenced this pull request Sep 16, 2025
… lost when concurrently occupying topic owner (#24722)

(cherry picked from commit e417aca)
poorbarcode added a commit that referenced this pull request Sep 16, 2025
… lost when concurrently occupying topic owner (#24722)

(cherry picked from commit e417aca)
ganesh-ctds pushed a commit to datastax/pulsar that referenced this pull request Sep 18, 2025
… lost when concurrently occupying topic owner (apache#24722)

(cherry picked from commit e417aca)
(cherry picked from commit ec9039d)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Sep 18, 2025
… lost when concurrently occupying topic owner (apache#24722)

(cherry picked from commit e417aca)
(cherry picked from commit ec9039d)
KannarFr pushed a commit to CleverCloud/pulsar that referenced this pull request Sep 22, 2025
walkinggo pushed a commit to walkinggo/pulsar that referenced this pull request Oct 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants