
Conversation

@tisonkun
Member

@tisonkun tisonkun commented Jul 13, 2022

See also:

Livelock in detail

Here we have two race conditions that cause the livelock:

Case 1. Suppose there are two participants, p0 and p1:

T0. p1 is about to watch the preceding node, which belongs to p0.
T1. p0 gets reconnected, thus resets its node and creates a new node, preparing to watch p1's node.
T2. p1 finds the preceding node has gone and resets itself.

At this point, p0 and p1 can enter a livelock in which they never see each other's node and keep resetting themselves forever. This is the case reported by CURATOR-645.
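
For context, here is a minimal sketch of the predecessor-watching pattern that LeaderLatch implements internally, written against public Curator APIs. The real implementation uses protected node names and a sequence-aware sorter, so treat this only as an illustration of what "preceding node", "watch", and "reset" refer to above:

```java
import java.util.Collections;
import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.utils.ZKPaths;
import org.apache.zookeeper.CreateMode;

public class PredecessorWatchSketch
{
    // Join the election: create our ephemeral-sequential node, then either take
    // leadership (we are first) or watch the node immediately preceding ours.
    static void joinElection(CuratorFramework client, String latchPath) throws Exception
    {
        String ourPath = client.create()
            .creatingParentContainersIfNeeded()
            .withMode(CreateMode.EPHEMERAL_SEQUENTIAL)
            .forPath(ZKPaths.makePath(latchPath, "latch-"));
        String ourNode = ZKPaths.getNodeFromPath(ourPath);

        List<String> children = client.getChildren().forPath(latchPath);
        Collections.sort(children); // plain names like latch-0000000007 sort by sequence
        int ourIndex = children.indexOf(ourNode);

        if (ourIndex == 0)
        {
            // we are the leader
            return;
        }

        String predecessor = ZKPaths.makePath(latchPath, children.get(ourIndex - 1));
        // Watch the predecessor. If it is already gone, the background callback sees
        // NONODE -- the question in this PR is whether that should trigger reset()
        // (delete and recreate our node) or just another getChildren() pass.
        client.getData()
            .usingWatcher((org.apache.zookeeper.Watcher) event -> { /* re-check leadership */ })
            .inBackground((c, event) -> { /* NONODE here means the predecessor vanished */ })
            .forPath(predecessor);
    }
}
```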

Case 2. A similar case can happen even if there is only one participant:

[image: diagram illustrating the single-participant (Case 2) race]

If we still call reset when the preceding node is deleted because the later request set a new node, it is a livelock.

I cannot find a livelock here, though. As long as the background callbacks in the same client are executed serially, there are always three nodes to create, while with this patch there are two; it should not create millions of nodes. In case 1 it is possible, since there is no such guarantee between different clients.

This is the case reported by CURATOR-644.

Solution

I make two significant changes to resolve these livelock cases:

  1. Call getChildren instead of reset when the preceding node is not found in the callback. This was previously reported in ff4ec29#r31770630. I don't see a reason to behave differently between the callback and the watcher for the same condition, and concurrent resets are the trigger for these livelocks.
  2. Call getChildren instead of reset when recovering from connection loss. The reason is similar to 1; if a connection loss or session expiry caused our node to be deleted, checkLeadership will see that condition and call reset.

These changes should fix CURATOR-645 and CURATOR-644.
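
To make change 1 concrete, here is a hedged sketch of what the callback branch could look like after the fix. The class, field, and helper names here are illustrative only, not the actual LeaderLatch code:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.api.BackgroundCallback;
import org.apache.zookeeper.KeeperException;

public class PredecessorCallbackSketch
{
    private final String latchPath;

    PredecessorCallbackSketch(String latchPath)
    {
        this.latchPath = latchPath;
    }

    // Callback attached to the getData() call on the preceding node.
    BackgroundCallback predecessorCallback()
    {
        return (client, event) -> {
            if (event.getResultCode() == KeeperException.Code.NONODE.intValue())
            {
                // Before this PR: reset() -- delete our own node and recreate it.
                // Two participants doing that concurrently can chase each other forever.
                // After this PR: keep our node and simply re-evaluate leadership.
                getChildrenAndCheckLeadership(client);
            }
        };
    }

    // Re-read the candidate list in the background and re-run the leadership check,
    // mirroring what the real getChildren()/checkLeadership() pair does.
    private void getChildrenAndCheckLeadership(CuratorFramework client) throws Exception
    {
        client.getChildren()
            .inBackground((c, e) -> { /* checkLeadership(e.getChildren()) */ })
            .forPath(latchPath);
    }
}
```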

I'm trying to add test cases, and such changes need more eyes to review.

tisonkun added 8 commits July 12, 2022 22:46
…llback

Signed-off-by: tison <wander4096@gmail.com>
… or session expire

Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
This reverts commit 8a76593.
@tisonkun
Member Author

tisonkun commented Jul 13, 2022

(Incorrect analysis, kept for the record.) However, as long as it is possible to trigger concurrent checkLeadership calls, a participant can race with itself. I once thought we could use a checkLeadershipLock here, but since all client requests are handled in callbacks, the lock would protect very little.

If you have an idea for fixing the race condition between multiple threads of a single participant, please comment.

The race condition is:

T0. In thread 0 (th0), p is about to call getChildren.
T1. In thread 1 (th1), p gets reconnected and calls getChildren.
T2. Suppose the node has gone due to session expiry: th0 cannot find ourPath and resets.
T3. th1 gets a children set without the new node th0 created, or even reaches checkLeadership before th0 creates the new node. Then th1 resets.

The problem here is that we create the node and get the children asynchronously, so even if we add a lock in checkLeadership, the callback can overwrite the status. Even if we lock the callbacks, the creation can still be incomplete. The root cause is that we should not have two competing threads for one participant.

UPDATE 1 - I notice that background callbacks run serially, in the same order the requests were sent. Perhaps we can set up an invariant based on this premise. This is the case of CURATOR-644; CURATOR-645 is about two participants.

UPDATE 2 - It seems that with this assumption CURATOR-644 should be fixed as well. The race condition shown above cannot play out to the end: if th1's getChildren runs after th0's create, th1 finds that the node exists; if th1's getChildren runs before th0's create, th0 becomes the leader first and th1 becomes the leader later. Although this causes one extra leader switch, it eventually converges deterministically.
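
The premise in UPDATE 1 can be checked in isolation. Below is a small, hedged demo (using curator-test's TestingServer) suggesting that background callbacks of a single client are delivered by one event thread in the order the requests were sent, barring retries; this is a property of the ZooKeeper client rather than anything this patch adds:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.RetryOneTime;
import org.apache.curator.test.TestingServer;

public class SerialCallbackDemo
{
    public static void main(String[] args) throws Exception
    {
        try (TestingServer server = new TestingServer();
             CuratorFramework client = CuratorFrameworkFactory.newClient(server.getConnectString(), new RetryOneTime(1)))
        {
            client.start();
            client.blockUntilConnected();

            List<String> order = new CopyOnWriteArrayList<>();
            CountDownLatch done = new CountDownLatch(3);

            // Three background requests issued back to back; their callbacks are
            // expected to fire in the same order the requests were sent.
            client.create().inBackground((c, e) -> { order.add("created " + e.getPath()); done.countDown(); }).forPath("/a");
            client.getChildren().inBackground((c, e) -> { order.add("children " + e.getChildren()); done.countDown(); }).forPath("/");
            client.create().inBackground((c, e) -> { order.add("created " + e.getPath()); done.countDown(); }).forPath("/b");

            done.await();
            System.out.println(order); // expected: created /a, children [...], created /b
        }
    }
}
```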

Signed-off-by: tison <wander4096@gmail.com>
@tisonkun
Member Author

cc @rikimberley @yuri-tceretian

@tisonkun tisonkun requested review from Randgalt and cammckenzie July 14, 2022 01:45
@tisonkun tisonkun requested a review from eolivelli July 14, 2022 06:53
Contributor

@XComp XComp left a comment


What's the motivation behind putting the solution for both Jira issues into a single PR? Wouldn't it be more reasonable to split it up into two PRs, analogously to the Jira issues?

@tisonkun
Member Author

tisonkun commented Aug 24, 2022

@XComp they're logically resolved simultaneously. That is, if you resolve CURATOR-644, you resolve CURATOR-645 - they're the same sort of issue.

In other words, you can check out the diff and tell me how to split it up into two PRs.

@XComp
Contributor

XComp commented Aug 24, 2022

Fair enough. I did go through some of the commits in my IDE but stopped (and started writing my comment) before noticing that most of the commits get reverted again. That made me think that there was more stuff needed for CURATOR-644. Never mind...

Contributor

@XComp XComp left a comment


I guess we have to iterate over the test once more. It succeeded even with the fix in the production code reverted.

@XComp
Contributor

XComp commented Aug 29, 2022

> they're logically resolved simultaneously. That is, if you resolve CURATOR-644, you resolve CURATOR-645 - they're the same sort of issue.

> In other words, you can check out the diff and tell me how to split it up into two PRs.

I was thinking about it once more. CURATOR-645 could be covered separately, in my opinion. CURATOR-645 was identified in FLINK-27078, where we run almost no logic before revoking the leadership by calling LeaderLatch#close. That caused the current leader's LeaderLatch instance to trigger its child node deletion while other LeaderLatch instances were right in the middle of setting up the watcher for their child node's predecessor.

Hence, I see CURATOR-645 as not that tightly related to the reconnect issue covered in CURATOR-644. CURATOR-645 just needs to be resolved before CURATOR-644 can be resolved.
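
A rough, hedged sketch of that scenario (not the actual Flink code or the test added in this PR): several LeaderLatch instances share one client, and whichever latch currently holds leadership is closed immediately, so the remaining latches may still be mid-way through setting up the watch on their predecessor when the leader's node disappears.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.RetryOneTime;
import org.apache.curator.test.TestingServer;

public class ImmediateCloseScenario
{
    public static void main(String[] args) throws Exception
    {
        try (TestingServer server = new TestingServer();
             CuratorFramework client = CuratorFrameworkFactory.newClient(server.getConnectString(), new RetryOneTime(1)))
        {
            client.start();
            client.blockUntilConnected();

            List<LeaderLatch> latches = new ArrayList<>();
            for (int i = 0; i < 5; ++i)
            {
                LeaderLatch latch = new LeaderLatch(client, "/leader", "participant-" + i);
                latch.start();
                latches.add(latch);
            }

            // Close every latch the moment it becomes leader, i.e. run "almost no logic"
            // before revoking leadership, which is what provokes the CURATOR-645 race.
            for (int round = 0; round < latches.size(); ++round)
            {
                LeaderLatch leader = null;
                while (leader == null)
                {
                    for (LeaderLatch latch : latches)
                    {
                        if (latch.getState() == LeaderLatch.State.STARTED && latch.hasLeadership())
                        {
                            leader = latch;
                            break;
                        }
                    }
                    Thread.sleep(10);
                }
                leader.close();
            }
        }
    }
}
```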

Anyway, the changes are not so big in the end that we couldn't resolve both in the same PR. ¯\_(ツ)_/¯

@tisonkun
Member Author

Thanks for your input, @XComp!

I'll try to integrate your comments this week or next.

Since we have merged several fixes into the master branch and there are users asking for a new release, if we don't reach a consensus on this patch, I'll push the debug-logging changes first so that the reporters of CURATOR-644 and CURATOR-645 can use the new version to provide exact error log output :)

tisonkun and others added 3 commits September 12, 2022 00:02
Signed-off-by: tison <wander4096@gmail.com>
Co-authored-by: Matthias Pohl <matthias.pohl@aiven.io>
Signed-off-by: tison <wander4096@gmail.com>
Co-authored-by: Matthias Pohl <matthias.pohl@aiven.io>
@tisonkun
Member Author

Updated. I believe this patch is ready to merge.

Please help with reviewing @eolivelli @Randgalt @cammckenzie

@tisonkun tisonkun requested a review from XComp September 11, 2022 16:11
Contributor

@XComp XComp left a comment


I went over my proposed test once more and added a few comments. Please see them below.

Co-authored-by: Matthias Pohl <matthias.pohl@aiven.io>
Contributor

@XComp XComp left a comment


Going over the PR once more with the new test proposal for CURATOR-645, I would vote for splitting up the two Jira issues instead of handling them in a single PR. CURATOR-645 is now clearly defined by the change you proposed and the test case we came up with together.

We still have to revisit CURATOR-644 (see my comment below on the production code change). Additionally, we still have to come up with a test case for CURATOR-644. Anyway, blocking CURATOR-645 on CURATOR-644, just because the latter builds on the former, so that both can be solved in a single PR is not needed and might also cause confusion later on. WDYT?

Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
@tisonkun tisonkun requested a review from XComp September 17, 2022 02:02
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Contributor

@XComp XComp left a comment


Thanks, @tisonkun, for applying my comments. I have a few minor things to add, mostly cosmetic. I'd be curious about your opinion.

@tisonkun tisonkun requested a review from XComp September 21, 2022 02:06
…ed if we haven't lost the child node after a reconnect. (#2)
@tisonkun
Member Author

@eolivelli @cammckenzie @Randgalt Perhaps we can release 5.4.0 later this month, and I'd like to ask for a review of this patch so we can reach a consensus on whether we include it or only the debug-logging part.

Contributor

@XComp XComp left a comment


Looks good from my end. Both changes make sense in my opinion and are covered by tests now. 👍

…ipes/leader/LeaderLatch.java

Co-authored-by: Matthias Pohl <matthias.pohl@aiven.io>
Member

@Randgalt Randgalt left a comment


I don't see any issues; however, it's been a very long time since I've looked at this code.

Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
@tisonkun
Member Author

Merging...

I'm looking into a related change, #398, and will then prepare the next release.

@ImagineBrain

It caused a new issue: CURATOR-724.

@tisonkun
Member Author

@ImagineBrain It may be fixed by #500. Would you upgrade to Curator 5.7.0 and recheck?

@ImagineBrain

> @ImagineBrain It may be fixed by #500. Would you upgrade to Curator 5.7.0 and recheck?

@tisonkun I've tried 5.7.1; it didn't help when the leaderPath didn't come back after reconnecting. But I found a workaround: adding a ConnectionStateListener to create the leaderPath node, and then everything works fine.
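
For readers hitting the same situation, a hedged sketch of that workaround might look like the following. The path handling is illustrative only, and whether the parent should be a persistent or container node depends on your setup:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;

public class ReconnectWorkaround
{
    // Recreate the latch's base path when the connection comes back, in case it
    // disappeared together with the session.
    static void installReconnectListener(CuratorFramework client, String leaderPath)
    {
        client.getConnectionStateListenable().addListener((c, newState) -> {
            if (newState == ConnectionState.RECONNECTED)
            {
                try
                {
                    if (c.checkExists().forPath(leaderPath) == null)
                    {
                        c.create().creatingParentsIfNeeded().forPath(leaderPath);
                    }
                }
                catch (Exception e)
                {
                    // best effort; a later reconnect will try again
                }
            }
        });
    }
}
```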
