KAFKA-10500: Add failed-stream-threads metric for adding + removing stream threads by lct45 · Pull Request #9614 · apache/kafka

lct45 · 2020-11-18T18:12:29Z

Per KIP-663, this PR adds a metric that records the number of failed stream threads over the life of a client.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

lct45 · 2020-11-18T18:15:48Z

lct45 · 2020-11-18T18:37:29Z

        );
        checkCacheMetrics(builtInMetricsVersion);
-
+        verifyFailedStreamThreadsSensor(0.0);


We also need to verify that the metric works when there is a failed stream thread. Options are (1) create a custom processor now and (IIRC) run the test suite twice, once with failing stream threads and once without, to confirm that the metric works; I'm not sure whether the custom processor will let us fail just one stream thread right before closing the app. Or (2) wait until add/remove stream threads is implemented, then remove some threads and test the metric before closing the app. WDYT?

I think you should try using the custom processor. You can find an example in StreamsUncaughtExceptionHandlerIntegrationTest.java

I would put the test that checks whether the metric is recorded correctly in StreamThreadTest. An example of such a test is shouldLogAndRecordSkippedRecordsForInvalidTimestamps(). I do not think an integration test is needed. The test for the existence of the metric, i.e., checkMetricByName(listMetricThread, FAILED_STREAM_THREADS, 1);, should stay here.

After looking at both test classes, I think it actually might make the most sense to put the test for this metric in StreamsUncaughtExceptionHandlerIntegrationTest, since the metric is so closely aligned with the exception handler anyways and the setup works nicely with what we're trying to test with the metric. From the size + complexity of the other test classes, I think creating an overloaded processor for one test out of 20+ tests seems tricky.

wcarlson5

I am not sure it is necessary, but you might want to add an integration test that kills a few threads and checks the metrics. As of now, you would need to set the old handler to get a single thread to die.

cadonna

@lct45 Thank you for the PR!

Here is my feedback.

cadonna · 2020-11-19T14:09:34Z

                log.info("State transition from {} to {}", oldState, newState);
+                if (newState == State.DEAD) {
+                    failedStreamThreadSensor.record();
+                }


Not every dead stream thread is a failed stream thread. You should record this metric where the uncaught exception handler is called, because there we know that a stream thread died unexpectedly.

Would that just be in run() of the GlobalStreamThread then?

No, that would be in StreamThread#runLoop().
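
To illustrate the distinction made in this thread, here is a minimal sketch that does not use Kafka's classes. SketchThread, its State enum, and the AtomicInteger counter are hypothetical stand-ins for StreamThread, its state machine, and the failed-stream-threads sensor. The counter is incremented only where an unexpected exception escapes the work loop, not on every transition to DEAD.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, simplified stand-in for a stream thread; this is NOT Kafka's StreamThread.
final class SketchThread {
    enum State { RUNNING, DEAD }

    // Stand-in for the failed-stream-threads sensor (assumption: a plain counter).
    private final AtomicInteger failedThreads;
    private final Runnable work;
    private State state = State.RUNNING;

    SketchThread(final AtomicInteger failedThreads, final Runnable work) {
        this.failedThreads = failedThreads;
        this.work = work;
    }

    void run() {
        try {
            work.run(); // stand-in for the thread's run loop
        } catch (final RuntimeException e) {
            // Record only here: the thread died unexpectedly.
            // A real implementation would also hand the exception to its handler.
            failedThreads.incrementAndGet();
        } finally {
            // Every thread ends up DEAD, whether it failed or shut down cleanly,
            // so recording on the DEAD transition would over-count.
            state = State.DEAD;
        }
    }

    State state() {
        return state;
    }
}

final class FailedThreadMetricSketch {
    // Runs one clean thread and one crashing thread; returns the recorded failures.
    static int countFailures() {
        final AtomicInteger failed = new AtomicInteger();
        final SketchThread clean = new SketchThread(failed, () -> { });
        final SketchThread crashing = new SketchThread(failed, () -> {
            throw new IllegalStateException("simulated processing error");
        });
        clean.run();
        crashing.run();
        return failed.get();
    }

    public static void main(final String[] args) {
        System.out.println("failed threads: " + countFailures());
    }
}
```

Both sketch threads end in the DEAD state, but only the crashing one increments the counter, which is the behavior the review asks for.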

cadonna · 2020-11-19T14:37:06Z

        );
        checkCacheMetrics(builtInMetricsVersion);
-
+        verifyFailedStreamThreadsSensor(0.0);


I would put the test that checks whether the metric is recorded correctly in StreamThreadTest. An example of such a test is shouldLogAndRecordSkippedRecordsForInvalidTimestamps(). I do not think an integration test is needed. The test for the existence of the metric, i.e., checkMetricByName(listMetricThread, FAILED_STREAM_THREADS, 1);, should stay here.

cadonna

@lct45 Thank you for the updates!

I have rather minor comments.

cadonna

LGTM!

Call for committer review and merge: @ableegoldman @mjsax @vvcephei @guozhangwang @abbccdda

mjsax

Just some nits

mjsax · 2020-12-01T01:51:48Z

+                                          final Sensor... parents) {
+        synchronized (clientLevelSensors) {
+            final String fullSensorName = CLIENT_LEVEL_GROUP + SENSOR_NAME_DELIMITER + sensorName;
+            final Sensor sensor = metrics.getSensor(fullSensorName);


Should we rewrite this the same way threadLevelSensor is written (i.e., using orElseGet) for consistency?

I requested this. See my comment #9614 (comment)

I'm good either way (:

I am fine either way, too, but I prefer consistency... So should we rewrite the other method as a side cleanup?

I am fine with consistency and clean-up, but I would like to have the clean-up in a separate PR.

I changed it back for consistency and will open a follow-up PR to update both of them to the new syntax.
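
For readers comparing the two lookup styles debated in this thread, the sketch below shows them side by side on a plain map. The class, its method names, and the use of a String in place of a Sensor are all hypothetical; the explicit null check is an assumption about what the non-orElseGet variant looks like, since only the getSensor call is visible in the diff above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical registry sketch; a String stands in for a Sensor object.
final class SensorLookupSketch {
    private final Map<String, String> registry = new HashMap<>();
    int created = 0; // counts how many sensors were actually created

    // Style A (assumed shape): look up, then create on an explicit null check.
    String getOrCreateWithNullCheck(final String name) {
        String sensor = registry.get(name);
        if (sensor == null) {
            sensor = create(name);
        }
        return sensor;
    }

    // Style B: wrap the lookup in an Optional and create lazily via orElseGet.
    String getOrCreateWithOrElseGet(final String name) {
        return Optional.ofNullable(registry.get(name))
            .orElseGet(() -> create(name));
    }

    private String create(final String name) {
        created++;
        final String sensor = "sensor:" + name;
        registry.put(name, sensor);
        return sensor;
    }
}
```

In this sketch both methods return the existing entry when one is registered and create it exactly once otherwise, so the choice between them is a matter of consistency rather than behavior.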

mjsax · 2020-12-01T02:00:40Z

+        metrics.removeSensor(sensorKeys.getValues().get(0));
+        metrics.removeSensor(sensorKeys.getValues().get(1));
+        expect(metrics.removeMetric(metricName1)).andReturn(mock(KafkaMetric.class));
+        expect(metrics.removeMetric(metricName2)).andReturn(mock(KafkaMetric.class));


Why did we change this from andStubReturn(null) to andReturn(mock(KafkaMetric.class))?

Must've been an accidental change when trying to get the test to work. shouldRemoveStateStoreLevelSensors uses andReturn(mock(KafkaMetric.class)), so that's where it came from, but this test works with andStubReturn(null), so I changed it back to that.

lct45 · 2020-12-01T17:20:30Z


    private void setupRemoveSensorsTest(final Metrics metrics,
-                                        final String level,
-                                        final RecordingLevel recordingLevel) {


This wasn't being used so I went ahead and took it out

mjsax · 2020-12-03T19:06:27Z

Retest this please

mjsax

LGTM. Will merge after the build passes.

Leah Thomas added 3 commits November 17, 2020 15:17

adding failed stream metric

17843bf

adding testing

b32c64d

Merge branch 'trunk' into thread_metrics

845b175

lct45 commented Nov 18, 2020

wcarlson5 reviewed Nov 18, 2020

Comment thread streams/src/main/java/org/apache/kafka/streams/KafkaStreams.java Outdated

Comment thread streams/src/main/java/org/apache/kafka/streams/processor/internals/StreamThread.java

Walker's updates

44db7af

cadonna reviewed Nov 19, 2020

Updated testing and fixes

f6db8cd

lct45 force-pushed the thread_metrics branch from 65aa29a to f6db8cd Compare November 23, 2020 16:20

cadonna reviewed Nov 24, 2020

Review clean up

e395ba7

cadonna approved these changes Nov 24, 2020

mjsax reviewed Dec 1, 2020

mjsax added the streams label Dec 1, 2020

Fixes from Matthias's comments

fc6cf52

lct45 commented Dec 1, 2020

updated consistency

c5df6eb

mjsax approved these changes Dec 3, 2020

mjsax merged commit 4cc6d20 into apache:trunk Dec 4, 2020

lct45 mentioned this pull request Dec 4, 2020

MINOR: Clean up streams metric sensors #9696

Merged

lct45 mentioned this pull request Jan 26, 2021

KAFKA-10500: Add docs for failed stream thread metric #9974

Merged

3 tasks

mjsax added the kip Requires or implements a KIP label Jan 27, 2021
