Addressing 2 Coordinators Elected As Leader (#16411) #16528

razinbouzar · 2024-05-31T16:59:37Z

Description

Introduces a new private method handleConnectionStateChanged to handle connection state changes, which recreates the leader latch if the connection state is SUSPENDED or LOST.
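
A minimal sketch of the idea described above, using stand-in types rather than the real Curator classes: `ConnState` mirrors a subset of Curator's `ConnectionState` values, and `FakeLatch` is a hypothetical placeholder for `LeaderLatch`. This is not the code from the PR; it only illustrates swapping in a fresh latch when the connection becomes SUSPENDED or LOST.

```java
import java.util.concurrent.atomic.AtomicReference;

final class ConnectionStateSketch {
    // Stand-in for a subset of Curator's ConnectionState values (assumption for illustration).
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    // Hypothetical placeholder for a leader latch; it only tracks whether it was closed.
    static final class FakeLatch {
        private volatile boolean closed;

        void close() {
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }

    private final AtomicReference<FakeLatch> latchRef = new AtomicReference<>(new FakeLatch());

    FakeLatch currentLatch() {
        return latchRef.get();
    }

    // Recreate the latch only for states that suggest the session may be gone.
    void handleConnectionStateChanged(ConnState newState) {
        if (newState == ConnState.SUSPENDED || newState == ConnState.LOST) {
            FakeLatch old = latchRef.getAndSet(new FakeLatch());
            old.close();
        }
    }
}
```

Swapping the reference before closing the old latch means callers of `currentLatch()` never observe a closed latch as the current one.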

Added listener method that tracks ZK leader state

razinbouzar · 2024-05-31T17:00:15Z

@kfaraz I took an attempt at resolving this. If you can take a look and provide feedback, it would be appreciated. Thank you!

kfaraz · 2024-05-31T17:23:16Z

Thanks a lot, @razinbouzar ! You beat me to it. 🙂

I have been doing some investigation on this. Overall, your solution makes sense to me. But I am still going to try it out in my local setup that I have been using for testing.

One important thing that I have noticed is that the isLeader() method may be called on the leader latch listener even after the latch has been closed. So we would need to make sure that if this method is called on a closed latch, we just ignore that event.

I will share more details here in a bit.

kfaraz

I have left some suggestions. Overall approach makes sense to me.

I still need to test out the changes in my local setup.
An alternative solution I was thinking of was to simply recreate the latch in the notLeader() method. We would need to evaluate whether one is better than the other.
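
The alternative mentioned here could look roughly like the sketch below. `FakeLatch` is a hypothetical placeholder for Curator's `LeaderLatch`, and the class and field names are invented for illustration; the real listener in `CuratorDruidLeaderSelector` may be structured differently.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

final class NotLeaderRecreateSketch {
    // Hypothetical placeholder for a leader latch; it only tracks start/close calls.
    static final class FakeLatch {
        private volatile boolean started;
        private volatile boolean closed;

        void start() {
            started = true;
        }

        void close() {
            closed = true;
        }

        boolean isStarted() {
            return started;
        }

        boolean isClosed() {
            return closed;
        }
    }

    private final AtomicReference<FakeLatch> latchRef = new AtomicReference<>(new FakeLatch());
    // This sketch starts out as the leader so that notLeader() has something to give up.
    private final AtomicBoolean leader = new AtomicBoolean(true);

    FakeLatch currentLatch() {
        return latchRef.get();
    }

    boolean isLeader() {
        return leader.get();
    }

    // Called when leadership is lost: step down, then rejoin the election with a fresh latch.
    void notLeader() {
        if (!leader.compareAndSet(true, false)) {
            return; // already not the leader, nothing to give up
        }
        FakeLatch fresh = new FakeLatch();
        FakeLatch old = latchRef.getAndSet(fresh);
        old.close();
        fresh.start();
    }
}
```

The `compareAndSet` guard keeps a repeated `notLeader()` notification from recreating the latch a second time.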

server/src/main/java/org/apache/druid/curator/discovery/CuratorDruidLeaderSelector.java

kfaraz · 2024-05-31T17:50:40Z

One important thing that I have noticed is that the isLeader() method may be called on the leader latch listener even after the latch has been closed. So we would need to make sure that if this method is called on a closed latch, we just ignore that event.

For more information, I encountered this in my local testing when I was trying to recreate the latch in notLeader().

Timeline:

1. [curator-thread] Leader loses connection.
2. Leader gets notified with notLeader().
3. [curator-thread] Connection is re-established.
4. Create new latch, close old one. This immediately causes some other node to become leader.
5. Start new latch after a delay to allow other nodes to become leader.
6. isLeader() is called on the old latch, which is already closed. This leads to double leaders if not properly handled. This seems like another Curator bug, since there is no point in calling isLeader() if there is already another leader (see step 4 above).

An example sequence of events that I observed

2024-05-30T14:19:11,556 WARN [main-SendThread(localhost:2181)] org.apache.zookeeper.ClientCnxn - Session 0x1002399b0fc0000 for server localhost/127.0.0.1:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException.
2024-05-30T14:19:11,670 INFO [LeaderSelector[/druid/overlord/_OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidLeaderSelector - [http://localhost:8081][1] Giving up leadership
2024-05-30T14:19:11,754 INFO [main-SendThread(localhost:2183)] org.apache.zookeeper.ClientCnxn - Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2183, session id = 0x1002399b0fc0000, negotiated timeout = 30000
2024-05-30T14:19:11,766 INFO [LeaderSelector[/druid/overlord/_OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidLeaderSelector - [http://localhost:8081][1] Recreating leader latch to allow other nodes to become leader.
2024-05-30T14:19:13,379 INFO [LeaderSelector[/druid/overlord/_OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidLeaderSelector - [http://localhost:8081][1] Now starting the latch
2024-05-30T14:19:13,379 INFO [LeaderSelector[/druid/overlord/_OVERLORD]] org.apache.druid.curator.discovery.CuratorDruidLeaderSelector - [http://localhost:8081][1] I am now the leader. Latch state[CLOSED]

Just for clarification, the scenario described above is slightly different from the one we are trying to solve.
In the original problem, the connection is not re-established until the znodes created by the current leader are just about to expire and it tries to reacquire leadership.
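
The timeline above suggests one possible guard: bind each listener to the latch it was registered on and ignore isLeader() when that latch is already closed. The sketch below assumes stand-in types; `LatchState` mirrors the `LATENT`, `STARTED` and `CLOSED` values of Curator's `LeaderLatch.State`, while `FakeLatch` and `Listener` are hypothetical names used only for illustration.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class StaleCallbackSketch {
    // Stand-in for the latch lifecycle states (assumption for illustration).
    enum LatchState { LATENT, STARTED, CLOSED }

    // Hypothetical placeholder for a leader latch; only its state matters here.
    static final class FakeLatch {
        volatile LatchState state = LatchState.LATENT;
    }

    // Listener bound to the single latch it was registered on.
    static final class Listener {
        private final FakeLatch latch;
        private final AtomicBoolean leader;

        Listener(FakeLatch latch, AtomicBoolean leader) {
            this.latch = latch;
            this.leader = leader;
        }

        // Returns true if leadership was accepted, false if the event was ignored.
        boolean isLeader() {
            if (latch.state == LatchState.CLOSED) {
                // Late event from a closed latch: another node may already be the leader.
                return false;
            }
            leader.set(true);
            return true;
        }
    }
}
```

With this guard, the late isLeader() call in the last step of the timeline becomes a no-op instead of producing a second leader.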

kfaraz

Please add a unit test that verifies the new behaviour.
Also the case state = CLOSED still needs to be handled in the isLeader() method of the listener.

kfaraz

I have done some basic testing with these changes and they look good.
I have a follow-up PR, #16544, in which I intend to add tests to verify the behaviour and fix other race conditions.

Thanks a lot for the changes, @razinbouzar !

Razin Bouzar added 2 commits May 31, 2024 12:34

Addressing apache#16411

b95b944

Added listener method that tracks ZK leader state

Eliminate whitespace

c4ba9ff

kfaraz self-requested a review May 31, 2024 17:20

kfaraz reviewed May 31, 2024

kfaraz requested a review from gianm May 31, 2024 17:35

Razin Bouzar and others added 2 commits May 31, 2024 17:56

Format cleanup

90ea5e8

Revert "Format cleanup"

1b55b69

This reverts commit 90ea5e8.

kfaraz reviewed Jun 1, 2024

server/src/main/java/org/apache/druid/curator/discovery/CuratorDruidLeaderSelector.java

razinbouzar and others added 2 commits June 2, 2024 01:07

Incorporate feedback

2bd922f

- Cleanup file formatting and comments
- Reduce complexity of the first go by calling the recreateLeaderLatch in the notLeader() method

Update server/src/main/java/org/apache/druid/curator/discovery/Curato…

a037faf

…rDruidLeaderSelector.java Co-authored-by: Kashif Faraz <kashif.faraz@gmail.com>

kfaraz reviewed Jun 2, 2024

server/src/main/java/org/apache/druid/curator/discovery/CuratorDruidLeaderSelector.java

kfaraz reviewed Jun 2, 2024

server/src/main/java/org/apache/druid/curator/discovery/CuratorDruidLeaderSelector.java

kfaraz reviewed Jun 2, 2024

razinbouzar added 3 commits June 5, 2024 10:33

Updates on feedback

d4ff87c

- Remove handleConnectionStateChanged method
- Remove duplicate code and use recreate leader latch method
- Handle LeaderLatch.State.CLOSED in the isLeader() function, log a warning.

Merge branch 'master' of https://github.com/razinbouzar/druid

77f2ca6

Style check.

4b63a23

Remove unused import

cryptoe mentioned this pull request Jun 6, 2024

Pin Curator dependencies to 5.3.0 util CURATOR-696 has been resolved #16444

Closed

10 tasks

kfaraz approved these changes Jun 7, 2024

kfaraz merged commit 844b217 into apache:master Jun 7, 2024

kfaraz mentioned this pull request Jun 18, 2024

Curator 5.7.1 Upgrade #16617

Closed

10 tasks

kfaraz added this to the 31.0.0 milestone Oct 4, 2024
