KAFKA-9184: Redundant task creation and periodic rebalances after zombie Connect worker rejoins the group by kkonstantine · Pull Request #7771 · apache/kafka

kkonstantine · 2019-12-03T21:36:13Z

Zombie workers, defined as workers that lose connectivity with the Kafka broker coordinator and get kicked out of the group but don't experience a JVM restart, have been keeping their tasks running. This side effect is more disruptive with the new Incremental Cooperative rebalance protocol. When such workers return:
a) they join the group with existing assignment and this leads to redundant tasks running in the Connect cluster, and
b) they interfere with the computation of lost tasks, which before this fix would lead to the scheduled rebalance delay not being reset correctly back to 0. This results in periodic rebalances.

This fix focuses on resolving the above side-effects as follows:

Each worker now tracks its connectivity with the broker coordinator in a non-blocking manner. This allows the worker to detect that the broker coordinator is unreachable. The timeout is set equal to the heartbeat interval. If the connection remains inactive during this time, the worker proactively stops all its connectors and tasks and keeps attempting to connect to the coordinator.
The incremental cooperative assignor will keep the delay to a positive value as long as it can detect lost tasks. If the set of tasks that are computed as lost becomes empty, the delay will be set to zero and no additional rebalancing will be scheduled.
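The two rules above can be sketched as a pair of pure functions. This is a minimal illustration, not the code in this PR: the class name `ZombieGuardSketch`, both method names, and the use of a fixed `configuredDelayMs` (instead of the remaining time of an already scheduled delay) are assumptions made for the example.

```java
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Connect runtime.
final class ZombieGuardSketch {
    private ZombieGuardSketch() { }

    /**
     * Returns true when the coordinator has not been reachable for at least one
     * heartbeat interval, which is the point at which the worker would stop its
     * connectors and tasks while it keeps trying to reconnect.
     */
    static boolean coordinatorUnreachable(long lastContactMs, long nowMs, long heartbeatIntervalMs) {
        return nowMs - lastContactMs >= heartbeatIntervalMs;
    }

    /**
     * Keeps the scheduled rebalance delay positive only while lost tasks are
     * detected; an empty set resets the delay to zero so that no further
     * rebalance is scheduled. Assumes configuredDelayMs is positive.
     */
    static long nextRebalanceDelayMs(Set<String> lostTasks, long configuredDelayMs) {
        return lostTasks.isEmpty() ? 0L : configuredDelayMs;
    }
}
```

Keeping both checks free of side effects makes them easy to call from a polling loop without blocking it.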

Besides the test included in this PR, the improvements are being tested with a framework that deploys a Connect cluster on docker images and introduces network partitions between all or selected workers and the Kafka brokers.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

gharris1727

Great log messages.
Two nits, and one logic question.

gharris1727 · 2019-12-04T01:21:33Z

What is the significance of this check?
If a worker has one failed task and one running task, will the snapshot be reset?
Will the running task be stopped?

This is a standard check we have when we check the assignmentSnapshot. I prefer to err on the safe side and use the second condition too.

The snapshot refers to your complete assignment (sets of connectors, tasks, revoked connectors and revoked tasks). The failure here corresponds to whether the assignment was successful or not as a whole. Currently the options are NO_ERROR and CONFIG_MISMATCH; in the latter case, as the comments in the code explain, the worker needs to read to the end of the config log and rejoin. A failed assignment does not have assigned tasks (the sets are empty). Even if there's an issue with assignment overwrites elsewhere, what we solve here remains unaffected.

Thanks for explaining the meaning of a failed snapshot, I wasn't clear on that before. Your explanation sounds reasonable.
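To make the "failed as a whole" semantics concrete, here is a minimal sketch of a snapshot guard. `AssignmentSketch`, its `Status` enum, and `hasWorkToStop` are hypothetical names for this example; the exact condition used in the PR is not quoted in this thread.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical stand-in for the worker's assignment snapshot; not the runtime class.
final class AssignmentSketch {
    enum Status { NO_ERROR, CONFIG_MISMATCH }

    private final Status status;
    private final Set<String> tasks;

    AssignmentSketch(Status status, Set<String> tasks) {
        this.status = status;
        // A failed assignment carries no tasks, so the set is forced to empty here.
        this.tasks = status == Status.NO_ERROR ? tasks : Collections.<String>emptySet();
    }

    // The assignment fails or succeeds as a whole, never per task.
    boolean failed() {
        return status != Status.NO_ERROR;
    }

    Set<String> tasks() {
        return tasks;
    }

    // Guard with both conditions: a snapshot must exist and must not have failed.
    static boolean hasWorkToStop(AssignmentSketch snapshot) {
        return snapshot != null && !snapshot.failed();
    }
}
```

Under this model, a worker with one failed task and one running task still holds a successful snapshot, because task-level failures do not change the assignment status.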

rhauch

Thanks, @kkonstantine. One question below, and is there any way to easily test this from within IncrementalCooperativeAssignorTest or RebalanceSourceConnectorsIntegrationTest?

We'd probably have to enable EmbeddedConnectCluster to shut down the Kafka cluster for a period of time, perhaps by stopping the Kafka cluster and then bringing it back up using the same ports (which might actually be problematic if other tests are run in parallel and their Kafka clusters happen to reuse the same ports). Not sure if this is really feasible. Thoughts?

rhauch · 2019-12-04T04:12:11Z

Will all of these methods to clear state within the assignment snapshot appear to happen atomically w/r/t when the assignment snapshot is read elsewhere in this class?

Good point. We don't lock access to this collection. Setting to null to avoid any concurrent access issues (from the threads that export metrics potentially).
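A minimal sketch of the pattern under discussion: the field is `volatile`, writers replace or null the whole reference, and readers copy it into a local variable once so that the null check and the use see the same object. `SnapshotHolderSketch` and its methods are hypothetical names; only the `assignmentSnapshot` field name comes from this PR, and the real snapshot type is replaced here by a plain set of task ids.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical holder for illustration; not the herder class from this PR.
final class SnapshotHolderSketch {
    // volatile makes a write by one thread visible to later reads by other threads.
    private volatile Set<String> assignmentSnapshot;

    // Publishes a new immutable snapshot by swapping the reference.
    void update(Set<String> tasks) {
        assignmentSnapshot = Collections.unmodifiableSet(new HashSet<>(tasks));
    }

    // Drops the snapshot, for example after the worker has stopped its tasks.
    void clear() {
        assignmentSnapshot = null;
    }

    // Reads the field exactly once; a concurrent clear() cannot slip in between
    // the null check and the size() call because both use the local copy.
    int assignedTaskCount() {
        Set<String> localSnapshot = assignmentSnapshot;
        return localSnapshot == null ? 0 : localSnapshot.size();
    }
}
```

This gives each reader a consistent view without locking, at the cost that a reader may briefly act on a snapshot that has just been replaced.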

kkonstantine · 2019-12-04T16:07:28Z

Thanks for the reviews @rhauch @gharris1727 !
I believe I've addressed all the comments.
The latest build succeeded on JDK 8/Scala 2.11 and JDK 11/Scala 2.12.
On JDK 11/Scala 2.13 it failed on a single, unrelated test:
kafka.admin.ReassignPartitionsClusterTest shouldTriggerReassignmentWithZnodePrecedenceOnControllerStartup FAILED

gharris1727 · 2019-12-04T17:26:30Z

+        connect.kafka().startOnlyKafkaOnSamePorts();
+
+        // Allow for the workers to discover that the coordinator is unavailable
+        Thread.sleep(TimeUnit.SECONDS.toMillis(10));


Shouldn't the workers discover that the coordinator is unavailable while it is down?
I'm imagining this test going like this:

1. steady-state workers are running
2. brokers stop
3. workers discover the coordinator is unavailable
4. workers stop their tasks
5. brokers start
6. workers discover the next coordinator
7. workers start their tasks
8. workers are running unaffected

That was a bit misplaced, because I actually need 3 explicit delays (due to the current lack of appropriate handles from the embedded Kafka and Connect clusters):

1. Bring Kafka down and allow the workers to discover it's down (heartbeat * 2 + 4 sec)
2. Allow for Kafka to come back up
3. Allow for the worker cluster to stabilize after the very last rebalance (delay = 5 sec)
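The first and third delays above reduce to simple arithmetic, sketched below. `OutageTestTimingSketch` and its members are hypothetical names, and the heartbeat interval is left as a parameter because its value is not stated in this thread; the second delay (waiting for Kafka to come back up) has no stated duration, so it is not modeled.

```java
import java.time.Duration;

// Hypothetical timing helper for illustration; not part of the integration test.
final class OutageTestTimingSketch {
    // Time for the cluster to stabilize after the last rebalance (delay = 5 sec).
    static final Duration REBALANCE_SETTLE = Duration.ofSeconds(5);

    private OutageTestTimingSketch() { }

    // Time for workers to notice the coordinator is down: heartbeat * 2 + 4 sec.
    static Duration discoverCoordinatorDown(Duration heartbeatInterval) {
        return heartbeatInterval.multipliedBy(2).plusSeconds(4);
    }
}
```

For example, assuming a 3-second heartbeat interval, `discoverCoordinatorDown(Duration.ofSeconds(3))` yields 10 seconds.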

Added another commit.

rhauch · 2019-12-04T20:44:52Z

Two of the 3 jobs passed with the latest commits, and the other is still running:

Waiting for the 3rd job to complete (almost done):

https://builds.apache.org/job/kafka-pr-jdk11-scala2.12/9727/

(This, despite the fact that Jenkins previously showed the status of all of these jobs as failed, but apparently they were still running. Evidently others have seen similar behavior recently.)

rhauch · 2019-12-04T20:50:39Z

Now, all 3 jobs passed!

kkonstantine · 2019-12-04T20:52:24Z

Thanks @rhauch for keeping track of the build progress. Looks green to me too 😄
Just a note that this fix applies to 2.3, 2.4 and trunk release branches.

If we'd like to merge the zombie fix further back, it's better to issue a separate PR.
Cheers!

rhauch

LGTM. Thanks for all the hard work on this PR, @kkonstantine. And thanks to @gharris1727 and @ncliang for help with testing to identify this.

rhauch · 2019-12-04T22:01:45Z

A run of the Connect system tests passed: https://jenkins.confluent.io/job/system-test-kafka-branch-builder/3614/

gharris1727

LGTM.

Thanks @kkonstantine !

…bie Connect worker rejoins the group (#7771)

Check connectivity with broker coordinator in intervals and stop tasks if coordinator is unreachable by setting `assignmentSnapshot` to null and resetting rebalance delay when there are no lost tasks.

And, because we're now sometimes setting `assignmentSnapshot` to null and reading it from other methods and thread, made this member volatile and used local references to ensure consistent reads.

Adapted existing unit tests to verify additional debug calls, added more specific log messages to `DistributedHerder`, and added a new integration test that verifies the behavior when the brokers are stopped and restarted only after the workers lose their heartbeats with the broker coordinator.

Author: Konstantine Karantasis <konstantine@confluent.io>
Reviewers: Greg Harris <gregh@confluent.io>, Randall Hauch <rhauch@gmail.com>

gharris1727 reviewed Dec 4, 2019

rhauch reviewed Dec 4, 2019

kkonstantine added 6 commits December 4, 2019 00:26

KAFKA-9184: Add a lightweight method to confirm connection to the broker coordinator is up from WorkerGroupMember

de1c093

KAFKA-9184: Check connectivity with broker coordinator in intervals and stop tasks if coordinator is unreachable

ce4f95e

KAFKA-9184: Reset rebalance delay when there are no lost tasks

9b37c17

KAFKA-9184: Remove unused import

1005a61

KAFKA-9184: Set assignmentSnapshot to null after stopping tasks

9560d0b

KAFKA-9184: Extend embedded kafka cluster to allow restarts and add integration test

01caf29

kkonstantine force-pushed the kafka-9184 branch from fd95e90 to 01caf29 Compare December 4, 2019 08:45

kkonstantine added 4 commits December 4, 2019 00:47

KAFKA-9184: Fix checkstyle

6fdae3a

KAFKA-9184: Remove unused method

6e6eb0a

KAFKA-9184: Adapt unit test to the additional debug calls

4e9a616

KAFKA-9184: Add more specific logs to DistributedHerder

f7ded4f

rhauch reviewed Dec 4, 2019

Comment thread ...ct/runtime/src/main/java/org/apache/kafka/connect/runtime/distributed/WorkerCoordinator.java

gharris1727 reviewed Dec 4, 2019

kkonstantine added 2 commits December 4, 2019 09:41

KAFKA-9184: Declare assignmentSnapshot volatile and use local copies of the reference

6135111

KAFKA-9184: Add some more time between phases in the integration test

a031894

rhauch approved these changes Dec 4, 2019

gharris1727 approved these changes Dec 4, 2019

rhauch merged commit 0e57a39 into apache:trunk Dec 4, 2019

kkonstantine mentioned this pull request Dec 5, 2019

KAFKA-9184 (port on 2.3): Redundant task creation and periodic rebalances after zombie Connect worker rejoins the group #7783

Merged

3 tasks

kkonstantine deleted the kafka-9184 branch December 5, 2019 02:38

kkonstantine added the connect label Oct 16, 2020
