KAFKA-8869: Remove task configs for deleted connectors from config snapshot #8444

Merged
kkonstantine merged 2 commits into apache:trunk from C0urante:kafka-8869
May 21, 2020


@C0urante C0urante commented Apr 8, 2020

Contributor

Jira

Currently, if a connector is deleted, its task configurations will remain in the config snapshot tracked by the KafkaConfigBackingStore. This causes issues with incremental cooperative rebalancing, which utilizes that config snapshot to determine which connectors and tasks need to be assigned across the cluster. Specifically, it first checks to see which connectors are present in the config snapshot, and then, for each of those connectors, queries the snapshot for that connector's task configs.
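As a rough illustration of the two-step lookup described above, here is a minimal sketch. `SnapshotSketch` and `assignableUnits` are hypothetical names invented for this example; they are not the real `ClusterConfigState` or assignor APIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified stand-in for the config snapshot (not the real ClusterConfigState).
final class SnapshotSketch {
    // Names of the connectors currently present in the snapshot.
    final Set<String> connectors = new LinkedHashSet<>();
    // Connector name -> one config map per task.
    final Map<String, List<Map<String, String>>> taskConfigs = new LinkedHashMap<>();

    // Step 1: iterate over the connectors present in the snapshot.
    // Step 2: for each connector, look up its task configs and derive task IDs.
    List<String> assignableUnits() {
        List<String> units = new ArrayList<>();
        for (String connector : connectors) {
            units.add(connector);
            List<Map<String, String>> tasks =
                taskConfigs.getOrDefault(connector, Collections.emptyList());
            for (int i = 0; i < tasks.size(); i++) {
                units.add(connector + "-" + i);
            }
        }
        return units;
    }
}
```

In this sketch, task configs whose connector is absent from `connectors` are never visited, so leftover entries only matter once a connector with the same name appears in the snapshot again.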

A connector's lifecycle proceeds as follows: its configuration is written to the config topic; the workers in the cluster pick up that write, which triggers a rebalance; the connector is assigned to and started by a worker; the connector generates task configs, which are then written to the config topic; the workers pick up that second write, which triggers a second rebalance; and finally, the tasks are assigned to and started by workers across the cluster.
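The sequence above can be traced with a small sketch that shows one rebalance per config-topic write. Everything here (`ConnectorLifecycleSketch`, `Write`, `trace`) is a hypothetical name made up for illustration, and the sketch ignores details of the real protocol.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical trace of the lifecycle steps; none of these names are Connect APIs.
final class ConnectorLifecycleSketch {
    // The two kinds of writes to the config topic that matter for this lifecycle.
    enum Write { CONNECTOR_CONFIG, TASK_CONFIGS }

    static List<String> trace(String connector, int numTasks) {
        List<String> events = new ArrayList<>();
        int rebalance = 0;
        for (Write write : Write.values()) {
            events.add("write " + write + " for " + connector + " to config topic");
            // Every write that the workers read triggers its own rebalance.
            events.add("workers read the write; rebalance #" + (++rebalance));
            if (write == Write.CONNECTOR_CONFIG) {
                events.add("connector " + connector + " assigned and started");
                events.add("connector generates " + numTasks + " task configs");
            } else {
                events.add(numTasks + " tasks assigned and started");
            }
        }
        return events;
    }

    public static void main(String[] args) {
        trace("my-connector", 2).forEach(System.out::println);
    }
}
```

The point of the trace is that the connector is already running after the first rebalance, while its freshly generated task configs only take effect after the second.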

If a deleted connector is later re-created, there is a brief window, between the connector's first start and the completion of the second rebalance, during which the framework uses the stale task configs left over from the previously deleted version of the connector to start tasks for it.

This fix aims to eliminate that window by preemptively clearing the task configs from the config snapshot for a connector whenever it has been deleted.
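The effect of the fix can be sketched with a simplified in-memory model. This is not the actual `KafkaConfigBackingStore` code: `BackingStoreSketch`, its methods, and the `clearTasksOnDelete` flag are assumptions introduced only to contrast the behavior before and after the change. The `connectorTaskCounts` map mirrors the review question about keeping it in sync with `taskConfigs`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory model of the store's state; the real KafkaConfigBackingStore differs.
final class BackingStoreSketch {
    private final Map<String, Map<String, String>> connectorConfigs = new HashMap<>();
    private final Map<String, List<Map<String, String>>> taskConfigs = new HashMap<>();
    private final Map<String, Integer> connectorTaskCounts = new HashMap<>();
    // false models the behavior before the fix, true the behavior after it.
    private final boolean clearTasksOnDelete;

    BackingStoreSketch(boolean clearTasksOnDelete) {
        this.clearTasksOnDelete = clearTasksOnDelete;
    }

    void putConnector(String name, Map<String, String> config) {
        connectorConfigs.put(name, config);
    }

    void putTaskConfigs(String name, List<Map<String, String>> configs) {
        taskConfigs.put(name, new ArrayList<>(configs));
        connectorTaskCounts.put(name, configs.size());
    }

    void deleteConnector(String name) {
        connectorConfigs.remove(name);
        if (clearTasksOnDelete) {
            // Clear both maps together so they cannot drift apart.
            taskConfigs.remove(name);
            connectorTaskCounts.remove(name);
        }
    }

    // Number of tasks that would be started for this connector from the current state.
    int startableTasks(String name) {
        if (!connectorConfigs.containsKey(name)) {
            return 0;
        }
        return taskConfigs.getOrDefault(name, Collections.emptyList()).size();
    }

    int taskCount(String name) {
        return connectorTaskCounts.getOrDefault(name, 0);
    }
}
```

In this model, deleting and re-creating a connector with `clearTasksOnDelete` set to `false` still reports the old tasks as startable, while setting it to `true` reports none until new task configs are written.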

An existing unit test is modified to verify this behavior, and should provide sufficient guarantees that the bug has been fixed, since the cause of the behavior has been narrowed down to incorrect values in the taskConfigs field for the ClusterConfigState provided by the KafkaConfigBackingStore.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

C0urante commented Apr 8, 2020

Contributor Author

@ncliang, @gharris1727, @MichaelDrogalis, would you mind taking a look at this when you have a chance?

@ncliang ncliang left a comment

Contributor

Thanks, @C0urante. LGTM in general; just a question about keeping connectorTaskCounts in sync with taskConfigs.

@rhauch rhauch added the connect label Apr 8, 2020

@ncliang ncliang left a comment

Contributor

LGTM. Thanks, @C0urante!

C0urante commented Apr 9, 2020

Contributor Author

@rhauch, @kkonstantine would one of you mind taking a look when you have a chance?

@kkonstantine

Contributor

ok to test

@kkonstantine kkonstantine left a comment

Contributor

LGTM

@kkonstantine

Contributor

jdk8: success
jdk11: success
jdk14: single failure on known flaky test: org.apache.kafka.streams.integration.EosBetaUpgradeIntegrationTest.shouldUpgradeFromEosAlphaToEosBeta[true]

@kkonstantine kkonstantine merged commit 82f5efa into apache:trunk May 21, 2020
kkonstantine pushed a commit that referenced this pull request May 21, 2020
…apshot (#8444)

Currently, if a connector is deleted, its task configurations will remain in the config snapshot tracked by the KafkaConfigBackingStore. This causes issues with incremental cooperative rebalancing, which utilizes that config snapshot to determine which connectors and tasks need to be assigned across the cluster. Specifically, it first checks to see which connectors are present in the config snapshot, and then, for each of those connectors, queries the snapshot for that connector's task configs.

The lifecycle of a connector is for its configuration to be written to the config topic, that write to be picked up by the workers in the cluster and trigger a rebalance, the connector to be assigned to and started by a worker, task configs to be generated by the connector and then written to the config topic, that write to be picked up by the workers in the cluster and trigger a second rebalance, and finally, the tasks to be assigned to and started by workers across the cluster.

There is a brief period in between the first time the connector is started and when the second rebalance has completed during which those stale task configs from a previously-deleted version of the connector will be used by the framework to start tasks for that connector. This fix aims to eliminate that window by preemptively clearing the task configs from the config snapshot for a connector whenever it has been deleted.

An existing unit test is modified to verify this behavior, and should provide sufficient guarantees that the bug has been fixed.

Reviewers: Nigel Liang <nigel@nigelliang.com>, Konstantine Karantasis <konstantine@confluent.io>
@kkonstantine

Contributor

Merged to trunk and backported to 2.5, 2.4 and 2.3

Kvicii pushed a commit to Kvicii/kafka that referenced this pull request May 22, 2020
* 'trunk' of github.com:apache/kafka:
  KAFKA-9980: Fix bug where alterClientQuotas could not set default client quotas (apache#8658)
  KAFKA-9780: Deprecate commit records without record metadata (apache#8379)
  MINOR: Deploy VerifiableClient in constructor to avoid test timeouts (apache#8651)
  MINOR: Added unit tests for ConnectionQuotas (apache#8650)
  MINOR: Correct MirrorMaker2 integration test configs for Connect internal topics (apache#8653)
  KAFKA-9855 - return cached Structs for Schemas with no fields (apache#8472)
  KAFKA-9950: Construct new ConfigDef for MirrorTaskConfig before defining new properties (apache#8608)
  KAFKA-8869: Remove task configs for deleted connectors from config snapshot (apache#8444)
  KAFKA-9409: Supplement immutability of ClusterConfigState class in Connect (apache#7942)
@C0urante C0urante deleted the kafka-8869 branch February 27, 2022 03:55