-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[fix][broker]Fix never recovered metadata store bad version issue if received a large response from ZK #24580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
...r-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataNodePayloadLenEstimator.java
Outdated
Show resolved
Hide resolved
...roker/src/main/java/org/apache/pulsar/zookeeper/MaxValueMetadataNodePayloadLenEstimator.java
Outdated
Show resolved
Hide resolved
...roker/src/main/java/org/apache/pulsar/zookeeper/MaxValueMetadataNodePayloadLenEstimator.java
Outdated
Show resolved
Hide resolved
...roker/src/main/java/org/apache/pulsar/zookeeper/MaxValueMetadataNodePayloadLenEstimator.java
Outdated
Show resolved
Hide resolved
...roker/src/main/java/org/apache/pulsar/zookeeper/MaxValueMetadataNodePayloadLenEstimator.java
Outdated
Show resolved
Hide resolved
...roker/src/main/java/org/apache/pulsar/zookeeper/MaxValueMetadataNodePayloadLenEstimator.java
Outdated
Show resolved
Hide resolved
lhotari
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that there's another way with ZooKeeper which this fix is targeting.
In current implementation the async version of "multi" is used:
pulsar/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/impl/ZKMetadataStore.java
Lines 195 to 202 in 4bfdcd8
| protected void batchOperation(List<MetadataOp> ops) { | |
| try { | |
| zkc.multi(ops.stream().map(this::convertOp).collect(Collectors.toList()), (rc, path, ctx, results) -> { | |
| if (results == null) { | |
| Code code = Code.get(rc); | |
| if (code == Code.CONNECTIONLOSS) { | |
| // There is the chance that we caused a connection reset by sending or requesting a batch | |
| // that passed the max ZK limit. |
The ZooKeeper client also contains a synchronous version:
https://javadoc.io/doc/org.apache.zookeeper/zookeeper/latest/org/apache/zookeeper/ZooKeeper.html#multi(java.lang.Iterable)
Using the async multi API causes the problem in the first place. It would be possible to retry the reads by splitting the batch etc until it succeeds.
|
/pulsarbot rerun-failure-checks |
...roker/src/main/java/org/apache/pulsar/zookeeper/MaxValueMetadataNodePayloadLenEstimator.java
Outdated
Show resolved
Hide resolved
I checked this and it won't be useful since it would require that pipelining wouldn't be used, which would impact performance a lot. It seems that stats/estimation based approach is the way to go to mitigate the problem. @poorbarcode I noticed that ZooKeeper has Another source of problems in Pulsar is the ZNode data size for ml cursors (ManagedCursorInfo). I guess that stats/estimation based approach would work very well for that. It would be possible to set the estimated size high so that such operations would be executed in small batches. |
...r-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataNodePayloadLenEstimator.java
Outdated
Show resolved
Hide resolved
...r-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataNodePayloadLenEstimator.java
Outdated
Show resolved
Hide resolved
...roker/src/test/java/org/apache/pulsar/broker/service/ZKMetadataStoreBatchIOperationTest.java
Show resolved
Hide resolved
pulsar-metadata/src/main/java/org/apache/pulsar/metadata/impl/AbstractMetadataStore.java
Outdated
Show resolved
Hide resolved
pulsar-metadata/src/main/java/org/apache/pulsar/metadata/api/MetadataStoreConfig.java
Show resolved
Hide resolved
...roker/src/main/java/org/apache/pulsar/zookeeper/MaxValueMetadataNodePayloadLenEstimator.java
Outdated
Show resolved
Hide resolved
...roker/src/main/java/org/apache/pulsar/zookeeper/MaxValueMetadataNodePayloadLenEstimator.java
Outdated
Show resolved
Hide resolved
...roker/src/main/java/org/apache/pulsar/zookeeper/MaxValueMetadataNodePayloadLenEstimator.java
Show resolved
Hide resolved
In Pulsar, we have added |
codelipenghui
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just had a online sync with @poorbarcode , here is the summary
- This PR will not 100% fix the issue, but it will provide a better metadata batch abstraction and implementation to handle large response from batch get or list request
- For this PR, it's better to have a MetadataBatchStrategy abstraction instead of MetadataNodePayloadLenEstimator. So that we can have different batch strategy implementations.
- The MetadataBatchStrategy implementation can leverage the children size of the znode to decide batch the list request or not, and for get request we can continue use the metrics (max response size of the znode path) to decide how many get request will be batched
For further improvement or fix, we can also consider follow potential solution (need to double confirm)
- Use different session for list and get request for non-ephemeral nodes to avoid the write impact due to large response size
- Add session ID to the znode, so that the broker can know the request was succeed and no need to retry
- And we can also consider to have hierarchical metadata organization for managed ledgers, partitioned topic for the super big Pulsar cluster (optional since it will bring more complexity to Pulsar)
|
I have rewritten with a new solution, could you please review again? |
codelipenghui
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, please fix the checkstyle issue @poorbarcode
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #24580 +/- ##
=============================================
+ Coverage 36.00% 74.21% +38.21%
- Complexity 12799 33604 +20805
=============================================
Files 1825 1900 +75
Lines 142779 148342 +5563
Branches 16393 17202 +809
=============================================
+ Hits 51402 110097 +58695
+ Misses 84275 29469 -54806
- Partials 7102 8776 +1674
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
…received a large response from ZK (apache#24580) (cherry picked from commit 1cb64a9) (cherry picked from commit 28e1874)
…received a large response from ZK (apache#24580) (cherry picked from commit 1cb64a9) (cherry picked from commit 28e1874)
…received a large response from ZK (apache#24580) (cherry picked from commit 1cb64a9) (cherry picked from commit 9b27ba2)
…received a large response from ZK (apache#24580) (cherry picked from commit 1cb64a9) (cherry picked from commit 9b27ba2)
…received a large response from ZK (apache#24580) (cherry picked from commit 1cb64a9) (cherry picked from commit 9b27ba2)
…received a large response from ZK (apache#24580)
…received a large response from ZK (apache#24580)
Motivation
Background: how ZK metadata store handles the node exists error when creating a new node
ZK metadata store throws a
BadVersionExceptionerror if it receives aNODEEXISTSerror when creating a new node.https://github.com/apache/pulsar/blob/v4.0.5/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/impl/ZKMetadataStore.java#L269-L272
Background: how ZK metadata store handles disconnect error when executing operations
ZK metadata store will retry the operation if it receives a
CONNECTIONLOSSerror when executing operations.https://github.com/apache/pulsar/blob/v4.0.5/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/impl/ZKMetadataStore.java#L200-L227
Background: how ZK clients/servers limit the max length of packet
jute.maxbufferto change the maximum packet length, which controls the maximum packet length for both requests and receives.Issue: ZK client throws an error and reconnects when the max length of the packet is exceeded by the server
GetorGET_CHILDRENwill receive a response from the server, and the response packet may be larger than the max length of packet length.BadVersionExceptionerror mentioned in the background "How ZK metadata store handles the node exists error when creating a new node".You can reproduce the issue by running the new test
testReceivedHugeResponseModifications
Add a mechanism to limit
get/getChildrenoperations of the ZK metadata store. This mechanism merely avoids problems with as few resources as possible, but it cannot prevent all problems. To solve all the problems, there will be a PIP later to make the Settings configurable, see also the section "Next things to do"Next things to do
I will submit a PIP to do the following things
MetadataNodePayloadLenEstimatorconfigurableThe Metadata store can not confirm whether the bad version error is expected or not after a reconnection.
Putrequest.Putrequest was responded.Putrequest is sent again, but the node already exists(or already modified), so Pulsar get aBadVersionExceptionerror.Documentation
docdoc-requireddoc-not-neededdoc-completeMatching PR in forked repository
PR in forked repository: x