Skip to content

[LI-HOTFIX] Event based fetcher part 3/3: Adopting the event based fetcher model#124

Merged
gitlw merged 5 commits into
linkedin:2.4-lifrom
gitlw:event_based_fetcher_3
Apr 12, 2021
Merged

[LI-HOTFIX] Event based fetcher part 3/3: Adopting the event based fetcher model#124
gitlw merged 5 commits into
linkedin:2.4-lifrom
gitlw:event_based_fetcher_3

Conversation

@gitlw

@gitlw gitlw commented Mar 23, 2021

Copy link
Copy Markdown

This PR starts using the event based fetcher model, which replaces the existing lock based classes AbstractFetcherThread
and ReplicaFetcherThread.

To make implementation easier, I've decided to disable the following features for now

  1. the JBOD logic implemented via ReplicaAlterLogDirsManager for now.
    JBOD is not used within LinkedIn, so this change should be safe.

  2. support for dynamically changing the fetcher thread pool size,
    I checked the clusters in a given production fabric, including krc, queuing, metrics, venice, tracking, kac, data-deployment
    and don't see a ZooKeeper override for the "num.replica.fetchers" config.

  3. Many tests within the ConsumerBounceTest class are flaky, which is echod by the open source tickets
    https://issues.apache.org/jira/browse/KAFKA-7893?jql=text%20~%20%22ConsumerBounceTest%22
    Thus I'm disabling the entire test class for now.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@gitlw gitlw requested review from smccauliff and xiowu0 March 30, 2021 23:16
@gitlw gitlw marked this pull request as ready for review March 30, 2021 23:16
@xiowu0

xiowu0 commented Mar 30, 2021

Copy link
Copy Markdown

would someone take down the broker by sending "ALTER_REPLICA_LOG_DIRS" requests to the clusters?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function is called on every leaderAndIsr request as part of "becomeLeaderOrFollower". So how much delay can be introduced by sending "getPartitionsCount" request to event manager and waiting for result?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In upstream Kafka, the request handler thread needs to compete for the partitionMapLock in order to get the partitions count. And the fetcher.idle variable is a hotfix made internally by us.
When switching to the event-based model, I feel it's cleaner to request the partition count using an event + future.
I'm not sure about the performance difference yet, but I have plans to eventually merge the remove partition, add partition and get partition count into one event if necessary. So ultimately the extra delay introduced by the GetPartitionCount event may not matter.

@gitlw gitlw force-pushed the event_based_fetcher_3 branch from a04854e to 324a3d2 Compare April 1, 2021 18:01
@gitlw

gitlw commented Apr 1, 2021

Copy link
Copy Markdown
Author

ALTER_REPLICA_LOG_DIRS

I've added a change in the latest commit to disallow the AlterReplicaLogDirs request in the authorization step,
which will make sure we are not vulnerable to this request.

Comment thread core/src/test/scala/unit/kafka/server/EventBasedFetcherTest.scala Outdated

@xiowu0 xiowu0 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Do you plan to re-enable JBOD in the future?

@gitlw

gitlw commented Apr 12, 2021

Copy link
Copy Markdown
Author

@xiowu0 Thanks for the review. I'll revisit JBOD only if it turns out to be needed within LI.

@gitlw gitlw merged commit 8005824 into linkedin:2.4-li Apr 12, 2021
@gitlw gitlw deleted the event_based_fetcher_3 branch April 12, 2021 18:27
gitlw added a commit to gitlw/kafka that referenced this pull request Feb 24, 2022
…tcher model (linkedin#124)

TICKET = KAFKA-10734
LI_DESCRIPTION = Part 3 of 3 PRs to change the fetcher into an event-based model
EXIT_CRITERIA = When KAFKA-10734 is closed and the changes are pulled in as a part of a release
lmr3796 pushed a commit to lmr3796/kafka that referenced this pull request Mar 25, 2022
…tcher model (linkedin#124)

TICKET = KAFKA-10734
LI_DESCRIPTION = Part 3 of 3 PRs to change the fetcher into an event-based model
EXIT_CRITERIA = When KAFKA-10734 is closed and the changes are pulled in as a part of a release
lmr3796 pushed a commit to lmr3796/kafka that referenced this pull request Jun 2, 2022
…tcher model (linkedin#124)

TICKET = KAFKA-10734
LI_DESCRIPTION = Part 3 of 3 PRs to change the fetcher into an event-based model
EXIT_CRITERIA = When KAFKA-10734 is closed and the changes are pulled in as a part of a release
earlcoder added a commit that referenced this pull request Apr 28, 2026
The async/event-based replica fetcher series (TransferLeaderManager,
AbstractAsyncFetcher, AsyncReplicaFetcher, FetcherEventBus,
FetcherEventManager — PRs #121/#123/#124/#143/#144/#403/#406) was
removed in the PR #538 squash. The li.async.fetcher.enable config key
was left behind in KafkaConfig as dead surface area: it parsed and
validated, but had no consumers.

Going with audit P1 #6 option (a): retire the feature, remove the dead
config key. Operators with li.async.fetcher.enable=<anything> in their
server.properties may see an 'Unknown configuration' warning at broker
startup; this is harmless — the broker still starts and the value would
have been a no-op anyway.

Removes:
- Defaults.LiAsyncFetcherEnabled
- KafkaConfig.LiAsyncFetcherEnableProp
- The brokerConfigDef.define(...) registration
- KafkaConfig.liAsyncFetcherEnable accessor

Audit option (b) — porting the entire async fetcher series forward
into 3.6-li — was deferred. If the optimization is needed in the
future, it would need to be reimplemented from scratch on top of
upstream's current ReplicaFetcherThread / ReplicaFetcherManager.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants