[WIP] Fix client can OOM when there are some bookies slow #4556

dao-jun · 2025-02-20T11:09:21Z

Related to:
apache/pulsar#12169
apache/pulsar#9562
apache/pulsar#10439
#3139
apache/pulsar#14861
etc.

Background

Our customer runs a cluster of 12 bookies and 12 brokers.
Pulsar version: 2.6.3
BookKeeper version: 4.11.1

They enabled the BookKeeper client addEntryTimeout feature and set addEntryTimeoutSec to 30.

At first, their EWA (ensemble size, write quorum, ack quorum) was 332, and they hit broker OOM errors.
Following apache/pulsar#12169, we recommended that they set EWA to 222 and observe for a period of time.

After a few days, they hit a broker OOM again.

So we suspected a memory leak in the broker and asked them to enable the Netty ByteBuf leak detector (add -Dpulsar.allocator.leak_detection=Paranoid to the broker JVM args and restart).

But searching the broker logs for the LEAK keyword returned nothing, which suggests there were no ByteBuf leaks in the broker.

We did find log lines of the form "New ensemble: [aaa,bbb] is not adhering to Placement Policy. quarantinedBookies: [xxx]", and the quarantined bookie was always the same one.

We checked the monitoring for this bookie and found that it had received no traffic for a long time (weeks). We tried to restart it, but the shutdown hung for a long time until we ran kill -9. This suggests the bookie had run into blocked threads or something similar and could not respond to requests.

After we restarted the bookie, no more broker OOMs occurred and the brokers ran normally.

When I analyzed the broker heap dump, I found that some Netty channels held a large amount of direct memory, and all of these channels were connected to the quarantined bookie.

There were 6 channels, each retaining over 100MB of direct memory.

Because our customer enabled the addEntryTimeout feature, broker backpressure does not work in this case.
Enabling failfast can prevent the situation from escalating, but it does not solve the root cause.
If we set EWA to 332 and 1 bookie is SLOW or HANGING, OOM can still happen.
If we set EWA to 222, disable addEntryTimeout, and 1 bookie is SLOW or HANGING, the broker may be unable to serve requests.

The key point: if a bookie is slow or hanging and failfast is not enabled, the client keeps sending data to it even though the data cannot be sent out. All of that data backs up in the client.
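To make the backlog effect concrete, here is a minimal toy simulation. It is not BookKeeper or Netty code: OutboundBuffer, SlowBookieBacklogDemo, the 64 KiB high-water mark, and the per-tick drain model are all assumptions made for illustration. It only shows that a sender that ignores writability accumulates everything it writes when the peer drains nothing, while a sender that checks a water mark stays bounded.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy driver: one writer, one connection, one (possibly hanging) peer. */
class SlowBookieBacklogDemo {
    /**
     * Attempts to send `entries` entries of `entrySize` bytes, one per tick.
     * Returns the number of bytes still queued on the client at the end.
     */
    static long simulate(int entries, int entrySize, long drainPerTick, boolean respectWritability) {
        OutboundBuffer buffer = new OutboundBuffer(64 * 1024); // hypothetical high-water mark
        for (int i = 0; i < entries; i++) {
            // A real client would pause, fail fast, or pick another bookie here;
            // this sketch simply does not enqueue when the buffer is not writable.
            if (!respectWritability || buffer.isWritable()) {
                buffer.enqueue(entrySize);
            }
            buffer.drain(drainPerTick);
        }
        return buffer.pendingBytes();
    }

    public static void main(String[] args) {
        System.out.println("hanging peer, writability ignored:   " + simulate(1000, 1024, 0, false));
        System.out.println("hanging peer, writability respected: " + simulate(1000, 1024, 0, true));
        System.out.println("healthy peer, writability ignored:   " + simulate(1000, 1024, 1024, false));
    }
}

/** Toy model of the client-side outbound queue for a single connection. */
class OutboundBuffer {
    private final Deque<Integer> pending = new ArrayDeque<>();
    private final long highWaterMark;
    private long pendingBytes;

    OutboundBuffer(long highWaterMark) {
        this.highWaterMark = highWaterMark;
    }

    /** Loosely mirrors the idea of a writability flag: false once enough bytes are queued. */
    boolean isWritable() {
        return pendingBytes < highWaterMark;
    }

    void enqueue(int entrySize) {
        pending.add(entrySize);
        pendingBytes += entrySize;
    }

    /** The peer accepts whole entries up to maxBytes per tick; a hanging peer accepts 0. */
    void drain(long maxBytes) {
        while (!pending.isEmpty() && pending.peek() <= maxBytes) {
            int size = pending.poll();
            maxBytes -= size;
            pendingBytes -= size;
        }
    }

    long pendingBytes() {
        return pendingBytes;
    }
}
```

In this model the backlog for a hanging peer grows linearly with the number of writes unless the sender consults isWritable(), which is the same shape as the per-channel direct memory growth seen in the heap dump.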

Motivation

Fix the BookKeeper client OOM that can occur when a bookie in the ensemble is SLOW or HANGING.

Changes

Close all channels connected to a quarantined bookie to release their memory.
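The idea can be sketched in isolation as follows. This is not the actual diff and does not use BookKeeper classes: QuarantineChannelCloser, FakeChannel, onBookieQuarantined, and the bookie names are hypothetical, and the exact 100 MB per channel is an illustrative figure based on the heap dump described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical registry that closes every channel to a bookie once it is quarantined. */
class QuarantineChannelCloser {
    private final Map<String, List<FakeChannel>> channelsByBookie = new HashMap<>();

    void register(FakeChannel channel) {
        channelsByBookie.computeIfAbsent(channel.bookieId, k -> new ArrayList<>()).add(channel);
    }

    /** Closes all channels to the given bookie and returns the number of bytes released. */
    long onBookieQuarantined(String bookieId) {
        List<FakeChannel> channels = channelsByBookie.remove(bookieId);
        if (channels == null) {
            return 0;
        }
        long released = 0;
        for (FakeChannel channel : channels) {
            released += channel.close();
        }
        return released;
    }

    int trackedChannelCount(String bookieId) {
        List<FakeChannel> channels = channelsByBookie.get(bookieId);
        return channels == null ? 0 : channels.size();
    }

    public static void main(String[] args) {
        QuarantineChannelCloser closer = new QuarantineChannelCloser();
        long hundredMb = 100L * 1024 * 1024;
        for (int i = 0; i < 6; i++) {
            closer.register(new FakeChannel("bookie-x", hundredMb));
        }
        closer.register(new FakeChannel("bookie-y", 1024));
        System.out.println("released bytes: " + closer.onBookieQuarantined("bookie-x"));
    }
}

/** Stand-in for a network channel that pins some amount of buffered outbound data. */
class FakeChannel {
    final String bookieId;
    long bufferedBytes;
    boolean open = true;

    FakeChannel(String bookieId, long bufferedBytes) {
        this.bookieId = bookieId;
        this.bufferedBytes = bufferedBytes;
    }

    /** In this model, closing drops all pending writes and frees their buffers. */
    long close() {
        long released = bufferedBytes;
        bufferedBytes = 0;
        open = false;
        return released;
    }
}
```

The trade-off is that a closed channel has to be re-established if the bookie recovers, so this buys memory at the cost of reconnect latency.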

dlg99

I don't think this is a good approach. There is client backpressure (see the PRs that you linked) that should address the problem.

Quarantined bookies aren't necessarily dead. Quarantine is a soft state in which we try not to choose the bookie for requests unless there are no other options. E.g. the bookie could be in a long GC and will come back even though a request timed out. Disconnecting channels means a longer process of reconnecting later.
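As a rough illustration of what "soft state" means here, the sketch below prefers non-quarantined bookies and falls back to quarantined ones only when there are not enough others. It is not BookKeeper's placement policy: SoftQuarantineSelector, selectEnsemble, and the bookie names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical selector that treats quarantine as a preference, not an exclusion. */
class SoftQuarantineSelector {
    static List<String> selectEnsemble(List<String> allBookies, Set<String> quarantined, int ensembleSize) {
        List<String> chosen = new ArrayList<>();
        // First pass: take bookies that are not quarantined.
        for (String bookie : allBookies) {
            if (chosen.size() < ensembleSize && !quarantined.contains(bookie)) {
                chosen.add(bookie);
            }
        }
        // Second pass: fall back to quarantined bookies only if still short.
        for (String bookie : allBookies) {
            if (chosen.size() < ensembleSize && quarantined.contains(bookie)) {
                chosen.add(bookie);
            }
        }
        if (chosen.size() < ensembleSize) {
            throw new IllegalStateException("not enough bookies for ensemble size " + ensembleSize);
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> bookies = List.of("bookie-a", "bookie-b", "bookie-c");
        Set<String> quarantined = Set.of("bookie-b");
        System.out.println(selectEnsemble(bookies, quarantined, 2));
        System.out.println(selectEnsemble(bookies, quarantined, 3));
    }
}
```

Because a quarantined bookie can still be picked in the fallback pass, keeping its channels open avoids a reconnect at exactly the moment the cluster is short on options.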

Fix client can OOM when there are some bookies slow

eaf8dee

dlg99 reviewed Mar 31, 2025

dao-jun mentioned this pull request Jun 18, 2025

[improve][broker] Part-1 of PIP-434: Expose Netty channel configuration WRITE_BUFFER_WATER_MARK to pulsar conf and pause receive requests when channel is unwritable apache/pulsar#24423

Merged

4 tasks
