Disable delete snapshot file during topic deletion and fix bug in gro… by kehuum · Pull Request #7 · linkedin/kafka

kehuum · 2019-03-27T17:34:44Z

Problem:

Snapshot files are always deleted during topic deletion. Because this involves disk I/O and is done for every topic partition, it can slow down deletion and block the controller from picking up other requests when there is a large number of topic partitions. Snapshot files are only used for transactions, which are not used at LinkedIn, so this deletion can be disabled.
When a replica goes offline, the controller is expected to send a single batched STOP_REPLICA request to the destination broker, with the callback set to null. Currently the callback is set to (_, _) => (), which is not null, so the requests are not grouped and we end up sending one STOP_REPLICA request per partition.
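
The STOP_REPLICA grouping behaviour described above can be illustrated with a simplified model. This is only a sketch and not the controller's actual code: `StopReplicaGroupingSketch`, `Entry`, `entriesFor` and `countRequests` are hypothetical names, and the only rule it encodes is the one stated in the description (entries with a null callback are batched per broker, entries with any non-null callback go out one request each).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiConsumer;

// Hypothetical, simplified model of the grouping rule; not Kafka's controller code.
final class StopReplicaGroupingSketch {

    // One queued stop-replica entry: target broker, partition, optional callback.
    static final class Entry {
        final int brokerId;
        final String topicPartition;
        final BiConsumer<String, Short> callback; // null means "no callback"

        Entry(int brokerId, String topicPartition, BiConsumer<String, Short> callback) {
            this.brokerId = brokerId;
            this.topicPartition = topicPartition;
            this.callback = callback;
        }
    }

    // Builds one entry per partition of a made-up topic, all sharing the same callback.
    static List<Entry> entriesFor(int brokerId, int partitions, BiConsumer<String, Short> callback) {
        List<Entry> entries = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            entries.add(new Entry(brokerId, "topic-" + p, callback));
        }
        return entries;
    }

    // Entries with a null callback are batched into one request per broker;
    // entries with any non-null callback are each sent as their own request.
    static int countRequests(List<Entry> entries) {
        Set<Integer> brokersWithBatch = new HashSet<>();
        int individualRequests = 0;
        for (Entry e : entries) {
            if (e.callback == null) {
                brokersWithBatch.add(e.brokerId);
            } else {
                individualRequests++;
            }
        }
        return brokersWithBatch.size() + individualRequests;
    }
}
```

In this model, 100 partitions on one broker produce a single request when the callback is null, and 100 requests when a no-op lambda is passed instead, which is the symptom the fix addresses.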

Testing:
Verified the fix in the cert2 cluster by creating and deleting topics at the same time. With snapshot file deletion enabled during topic deletion, we see a ~1.5 min delay in the controller sending/receiving LEADER_AND_ISR requests, and deleting snapshot files takes ~500 ms per partition, which accounts for most of the broker's processing time for a STOP_REPLICA request. Without the deletion, the processing time drops to ~10 ms.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

hzxa21 · 2019-03-27T17:50:06Z

Snapshot files are always deleted during topic deletion. Because this involves disk I/O and is done for every topic partition, it can slow down deletion and block the controller from picking up other requests when there is a large number of topic partitions. Snapshot files are only used for transactions, which are not used at LinkedIn, so this deletion can be disabled.

One thing to note is that a background thread already periodically checkpoints the recovery offsets and deletes the snapshot files, so not deleting snapshot files in the critical path of StopReplica should be okay.

hzxa21

Thanks for the investigation and the fix! LGTM.

Let's wait for the CI to finish before pushing the change.

jonlee2

Thanks for working on this!

xiowu0

Thanks for the fix. LGTM.
Please make sure all unit tests pass.

…#7) The Partition.isOneAboveMinIsr method is still defined in 3.6-li but the gauge that exposes it as a JMX/yammer metric was lost in the PR #538 squash. Restoring matches 3.0-li ReplicaManager.scala:296. Audit also flagged the dead-code method as P1 #8 — restoring the gauge wires it back as the original caller.

Original 3.0-li addition: commit a2bf781 'Add metrics for log compaction threads alive'. The gauge counts cleaner threads that are isAlive(), complementing the already-present DeadThreadCount gauge. Required by LI ops dashboards.
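
A gauge of this kind can be sketched in a few lines. The example below is an illustration only: `CleanerThreadGaugeSketch` and `aliveThreadCount` are hypothetical names, and a plain `Supplier<Integer>` stands in for the metrics library's gauge type. The only real API it relies on is `Thread.isAlive()`.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch: a gauge is modelled as a Supplier evaluated on each read.
final class CleanerThreadGaugeSketch {

    // Returns a gauge that, each time it is read, counts the threads that are alive.
    static Supplier<Integer> aliveThreadCount(List<Thread> cleanerThreads) {
        return () -> (int) cleanerThreads.stream().filter(Thread::isAlive).count();
    }
}
```

Because the count is recomputed on every read, the value follows thread state without any explicit update call.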

Original 3.0-li commit: 0e7ab47 'Mark FetchSession cache misses'. The CACHE_MISSES constant and INCREMENTAL_FETCH_SESSION_CACHE_MISSES_PER_SEC metric name were already defined in 3.6-li FetchSession companion object, but the meter registration in FetchSessionCache and the mark() call site on session-not-found were lost in the squash.

Restore:
- cacheMissesMeter registration in FetchSessionCache (mirrors evictionsMeter pattern, uses 3.6-li Collections.emptyMap() vs 3.0-li Map.empty).
- markCacheMiss() method on FetchSessionCache.
- markCacheMiss() invocation when a fetch session lookup fails (returning Errors.FETCH_SESSION_ID_NOT_FOUND).

Required by LI ops dashboards.
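
The miss-marking path described in this commit message can be sketched roughly as follows. This is a simplified stand-in, not the FetchSessionCache code: `FetchSessionCacheSketch`, `put`, `lookup` and `cacheMissCount` are hypothetical, sessions are modelled as plain strings, and a `LongAdder` takes the place of the Yammer meter. Only the `markCacheMiss()` name comes from the text above.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch of marking a metric when a session lookup fails.
final class FetchSessionCacheSketch {
    private final Map<Integer, String> sessions = new ConcurrentHashMap<>();
    private final LongAdder cacheMisses = new LongAdder(); // stands in for a rate meter

    void put(int sessionId, String session) {
        sessions.put(sessionId, session);
    }

    // Records one cache miss.
    void markCacheMiss() {
        cacheMisses.increment();
    }

    long cacheMissCount() {
        return cacheMisses.sum();
    }

    // On a failed lookup the miss is marked before the caller reports the
    // session-not-found error; on a hit nothing is marked.
    Optional<String> lookup(int sessionId) {
        String session = sessions.get(sessionId);
        if (session == null) {
            markCacheMiss();
            return Optional.empty();
        }
        return Optional.of(session);
    }
}
```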

Original 3.0-li commit: b3489a1 'Add BytesInTotal & MessagesInTotal counters'. The full counter infrastructure was lost in the PR #538 squash: the constants, CounterWrapper, counterMetricTypeMap, accessor methods, and call sites were all gone, plus the KafkaMetricsGroup.newCounter API itself disappeared when KafkaMetricsGroup migrated from Scala trait to Java class (org.apache.kafka.server.metrics).

Restore in three parts:

1. KafkaMetricsGroup (Java, server-common): add public newCounter(name) and newCounter(name, tags) methods that delegate to KafkaYammerMetrics.defaultRegistry().newCounter, mirroring the newGauge / newMeter API surface.
2. KafkaRequestHandler.scala (BrokerTopicMetrics + BrokerTopicStats):
   - Add CounterWrapper case class (mirrors MeterWrapper lazy-init pattern) using metricsGroup.newCounter.
   - Add counterMetricTypeMap[String, CounterWrapper] populated with MessagesInTotal and BytesInTotal at construction.
   - Add bytesInTotal and messagesInTotal accessor methods.
   - Add counterMetricMap test accessor.
   - Wire close() and closeMetric() to also close counter wrappers.
   - Add MessagesInTotal and BytesInTotal constants on BrokerTopicStats.
3. ReplicaManager.scala: at the four post-append call sites (per-topic + all-topics for both bytes and messages), increment the counter alongside the existing rate meter mark.
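
The lazy-init wrapper and the name-to-wrapper map from part 2 can be sketched as below. This is an illustration under assumptions, not the restored code: the real implementation is described as a Scala case class backed by metricsGroup.newCounter, whereas this sketch is plain Java and uses an `AtomicLong` as the counter. `BrokerTopicCountersSketch`, `recordAppend` and `isInitialized` are hypothetical names, and the existing rate meter that the counters sit alongside is omitted.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical sketch of lazily created total counters keyed by metric name.
final class BrokerTopicCountersSketch {
    static final String MESSAGES_IN_TOTAL = "MessagesInTotal";
    static final String BYTES_IN_TOTAL = "BytesInTotal";

    // Creates the underlying counter on first access rather than at construction.
    static final class CounterWrapper {
        private final Supplier<AtomicLong> factory;
        private volatile AtomicLong counter;

        CounterWrapper(Supplier<AtomicLong> factory) {
            this.factory = factory;
        }

        AtomicLong counter() {
            AtomicLong c = counter;
            if (c == null) {
                synchronized (this) {
                    c = counter;
                    if (c == null) {
                        c = factory.get();
                        counter = c;
                    }
                }
            }
            return c;
        }

        boolean isInitialized() {
            return counter != null;
        }
    }

    private final Map<String, CounterWrapper> counterMetricTypeMap = new LinkedHashMap<>();

    BrokerTopicCountersSketch() {
        counterMetricTypeMap.put(MESSAGES_IN_TOTAL, new CounterWrapper(AtomicLong::new));
        counterMetricTypeMap.put(BYTES_IN_TOTAL, new CounterWrapper(AtomicLong::new));
    }

    AtomicLong bytesInTotal() {
        return counterMetricTypeMap.get(BYTES_IN_TOTAL).counter();
    }

    AtomicLong messagesInTotal() {
        return counterMetricTypeMap.get(MESSAGES_IN_TOTAL).counter();
    }

    boolean isInitialized(String name) {
        return counterMetricTypeMap.get(name).isInitialized();
    }

    // Models a post-append call site: both totals are incremented together.
    void recordAppend(long bytes, long messages) {
        bytesInTotal().addAndGet(bytes);
        messagesInTotal().addAndGet(messages);
    }
}
```

Unlike a rate meter, a total counter only ever accumulates, so a dashboard can derive its own rates over any window from two readings.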

Disable delete snapshot file during topic deletion and fix bug in gro…

c3fa769

…uping stop replica request when callback is null

kehuum requested review from hzxa21, jjkoshy, jonlee2 and xiowu0 March 27, 2019 17:35

hzxa21 approved these changes Mar 27, 2019

jonlee2 approved these changes Mar 27, 2019

xiowu0 reviewed Mar 27, 2019

xiowu0 approved these changes Mar 27, 2019

kehuum merged this pull request into linkedin:2.0-li Mar 27, 2019

kehuum merged 1 commit into linkedin:2.0-li from kehuum:2.0-li
