KAFKA-9113: Unit test for ProcessorStateManager and ChangelogReader by guozhangwang · Pull Request #4 · guozhangwang/kafka

guozhangwang · 2020-01-10T23:04:02Z

Further refactoring on APIs between state manager and changelog reader per @cadonna's comments. Also renamed ProcessorStateManager to TaskStateManager per comments.
Complete unit test for ProcessorStateManager.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

guozhangwang · 2020-01-10T23:06:42Z


            checkpointFile.delete();
-        } catch (final IOException e) {
+        } catch (final IOException | RuntimeException e) {


I realized that some parsing exceptions are plain runtime exceptions, so I am catching both here.
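For readers following along, here is a minimal sketch of why the multi-catch is needed. It assumes a checkpoint file with lines of the form `<partition> <offset>`; the class `CheckpointParseSketch` and its methods are hypothetical and are not the actual checkpoint reader. `Long.parseLong` throws the unchecked `NumberFormatException`, which a bare `catch (IOException e)` would let escape.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a checkpoint reader; not the real Kafka Streams class.
class CheckpointParseSketch {

    // Parses lines of the assumed form "<partition> <offset>".
    // Malformed lines surface as unchecked exceptions, not as IOException.
    static Map<String, Long> parse(final String contents) throws IOException {
        final Map<String, Long> offsets = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(contents))) {
            String line;
            while ((line = reader.readLine()) != null) {
                final String[] parts = line.trim().split("\\s+");
                offsets.put(parts[0], Long.parseLong(parts[1]));
            }
        }
        return offsets;
    }

    // Falls back to an empty map for both I/O failures and parsing failures.
    static Map<String, Long> parseOrEmpty(final String contents) {
        try {
            return parse(contents);
        } catch (final IOException | RuntimeException e) {
            return new HashMap<>();
        }
    }

    public static void main(final String[] args) {
        System.out.println(parseOrEmpty("changelog-0 42"));
        System.out.println(parseOrEmpty("changelog-0 not-a-number"));
    }
}
```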

guozhangwang · 2020-01-10T23:07:19Z

        final Map<TopicPartition, Long> changelogOffsets = new HashMap<>();
        for (final StateStoreMetadata storeMetadata : stores.values()) {
-            if (storeMetadata.changelogPartition != null && storeMetadata.offset != null) {
+            if (storeMetadata.changelogPartition != null) {


Here we still want to return the changelog -> offset entry even if the offset is null (indicating that the current position is unknown).
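A small sketch of the intended behavior, using simplified hypothetical types (`StoreMeta` with `String` partitions rather than `TopicPartition`): a logged store with an unknown position still gets an entry whose value is null, while a store without a changelog is skipped. Note that this relies on `HashMap` permitting null values.

```java
import java.util.HashMap;
import java.util.Map;

class NullableOffsetSketch {

    // Hypothetical, simplified stand-in for the store metadata in the diff.
    static final class StoreMeta {
        final String changelogPartition; // null if the store is not logged
        final Long offset;               // null if the current position is unknown

        StoreMeta(final String changelogPartition, final Long offset) {
            this.changelogPartition = changelogPartition;
            this.offset = offset;
        }
    }

    // Keeps every logged store, including those whose offset is still null.
    static Map<String, Long> changelogOffsets(final Iterable<StoreMeta> stores) {
        final Map<String, Long> offsets = new HashMap<>();
        for (final StoreMeta store : stores) {
            if (store.changelogPartition != null) {
                offsets.put(store.changelogPartition, store.offset);
            }
        }
        return offsets;
    }
}
```

Callers then need `containsKey` rather than `get(...) != null` to tell "registered but unknown" apart from "not registered".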

guozhangwang · 2020-01-10T23:08:06Z


    }

-    public void resetRestoredBatch() {


Some minor cleanup on mock functions that are not used, ditto elsewhere.

guozhangwang · 2020-01-10T23:11:27Z

@cadonna @vvcephei @ableegoldman for reviews. Note that a large portion of the change is the renaming; the main part to review is TaskStateManager/Test, which is around 450 LOC.

guozhangwang · 2020-01-11T02:45:52Z

@cadonna Git turned the renaming into file deletions and creations, so to help reviews I've reverted the renaming and will do it in a later PR. Now the diff is less scary :)

abbccdda

Thanks for the PR. Left a few comments, will read again once StoreChangelogReaderTest is fixed.

abbccdda · 2020-01-14T18:55:15Z

     * Register a state store for restoration.
     *
-     * @param partition the changelog topic partition to restore
+     * @param partition the state store's shcangelog partition for restoring


req: spelling

abbccdda · 2020-01-14T19:01:03Z

-        final StateStoreMetadata store = mustFindStore(changelogPartition);
+    StateStoreMetadata storeMetadata(final TopicPartition partition) {
+        for (final StateStoreMetadata storeMetadata : stores.values()) {
+            if (storeMetadata.changelogPartition != null && storeMetadata.changelogPartition.equals(partition)) {


q: Could we just do partition.equals(storeMetadata.changelogPartition)?
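To make the trade-off concrete, here is a sketch of the three variants, using `String` in place of `TopicPartition` and hypothetical method names. Flipping the receiver removes the explicit null check on the stored value but throws if the argument is null; `Objects.equals` never throws but treats two nulls as equal, which may or may not be wanted for stores without a changelog.

```java
import java.util.Objects;

class NullSafeEqualsSketch {

    // Form used in the diff: guards against a null stored partition.
    static boolean matchesVerbose(final String stored, final String wanted) {
        return stored != null && stored.equals(wanted);
    }

    // Flipped receiver: shorter, but throws NullPointerException if 'wanted' is null.
    static boolean matchesFlipped(final String stored, final String wanted) {
        return wanted.equals(stored);
    }

    // Never throws, but returns true when both arguments are null.
    static boolean matchesNullSafe(final String stored, final String wanted) {
        return Objects.equals(wanted, stored);
    }
}
```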

abbccdda · 2020-01-14T19:05:09Z

+
+    private StateStoreMetadata findStore(final TopicPartition changelogPartition) {
+        final List<StateStoreMetadata> found = stores.values().stream()
+            .filter(metadata -> metadata.changelogPartition != null &&


Same here, do we also have a concern that passed in changelogPartition could be null?

abbccdda · 2020-01-14T19:08:48Z

 import static org.junit.Assert.assertTrue;
 import static org.junit.Assert.fail;

 // TODO K9113: fix tests


req: remove this

abbccdda · 2020-01-14T20:36:13Z

    }

-    private ProcessorStateManager getStandByStateManager(final TaskId taskId) {
+    private ProcessorStateManager getStandByStateManager() {


nit: this could be merged with getActiveStateManager as getStateManager(AbstractTask.TaskType type)

abbccdda · 2020-01-14T20:59:48Z

+
+            assertThat(store.keys.size(), is(1));
+            assertTrue(store.keys.contains(key));
+            assertEquals(17, store.values.get(0).length);


nit: similarly, add a comment to explain 17

abbccdda · 2020-01-14T21:02:37Z

-        ackedOffsets.put(new TopicPartition(ProcessorStateManager.storeChangelogTopic(applicationId, "otherTopic"), 1), 789L);
+        ackedOffsets.put(persistentStorePartition, 123L);
+        ackedOffsets.put(nonPersistentStorePartition, 456L);
+        ackedOffsets.put(new TopicPartition("otherTopic", 1), 789L);


nit: s/otherTopic/nonRegisteredTopic

abbccdda · 2020-01-14T21:05:51Z

+
+        try {
+            stateMgr.initStoresFromCheckpointedOffsets();
+            fail("should have thrown procesor state exception when IO exception happens");


req: spelling

abbccdda · 2020-01-14T21:06:13Z

+
+        try {
+            stateMgr.restore(storeMetadata, singletonList(consumerRecord));
+            fail("should have thrown procesor state exception when IO exception happens");


req: same here

abbccdda · 2020-01-14T21:07:43Z

+    }
+
    @Test
    public void shouldFlushAllStoresEvenIfStoreThrowsException() {


nit: s/shouldFlushAllStoresEvenIfStoreThrowsException/shouldFlushGoodStoresEvenSomeThrowsException

guozhangwang

Added unit test coverage for StoreChangelogReader

guozhangwang · 2020-01-14T23:07:35Z

                        subscriptions.position(entry.getKey(), newPosition);
                    }
                }
+                entry.getValue().clear();


Found this bug while working on the unit tests: when a partition is paused, we should not clear its records since they would not be returned; we should only clear the corresponding records which are included in the results.

@guozhangwang There is already a PR for this issue (apache#7505). Can you maybe review and merge that PR? Thanks to @ableegoldman for pointing out the overlap.
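The intended semantics can be sketched with a toy buffer. This is not the MockConsumer code; the class `PausedPartitionPollSketch` and its `String` partitions and records are assumptions made for illustration. Records of a paused partition stay buffered, and only the records that are actually returned get cleared.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model of a consumer buffer; hypothetical, for illustration only.
class PausedPartitionPollSketch {
    private final Map<String, List<String>> buffered = new HashMap<>();
    private final Set<String> paused = new HashSet<>();

    void addRecord(final String partition, final String record) {
        buffered.computeIfAbsent(partition, k -> new ArrayList<>()).add(record);
    }

    void pause(final String partition) {
        paused.add(partition);
    }

    void resume(final String partition) {
        paused.remove(partition);
    }

    Map<String, List<String>> poll() {
        final Map<String, List<String>> results = new HashMap<>();
        for (final Map.Entry<String, List<String>> entry : buffered.entrySet()) {
            final List<String> records = entry.getValue();
            if (paused.contains(entry.getKey()) || records.isEmpty()) {
                continue; // paused records are neither returned nor dropped
            }
            results.put(entry.getKey(), new ArrayList<>(records));
            records.clear(); // clear only what was included in the results
        }
        return results;
    }
}
```

With this shape, a record added while its partition is paused is returned by the first `poll()` after `resume(...)`.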

guozhangwang · 2020-01-14T23:07:53Z

- * Thus, the task raising this exception can be cleaned up and closed as "zombie".
+ * Indicates that one or more tasks got migrated to another thread.
+ *
+ * 1) if the task field is specified, then that single task should be cleaned up and closed as "zombie" while the


This is the proposed exception semantics.

Can you elaborate on which version of TaskMigratedException should be used when? I.e., when is it possible for just one task to be a zombie while all the others can/should continue as normal?

It is summarized in https://confluentinc.atlassian.net/browse/KSTREAMS-3302. Note it is just one proposal and we can debate whether we agree or disagree about that :)

I see, that makes sense. I have some thoughts on the handling in that case but I'll save it for the sync

We can do this in a follow-up PR, but just to be clear we should always treat a TaskMigratedException as applying to all tasks, and close them all as zombies. There's no possible way for one task to be a zombie while the others are able to continue processing

guozhangwang · 2020-01-14T23:08:42Z

                    }
                }
-
-                // this is an optimization: if there's no buffered records so far, then we can reuse


We have to give up this optimization since the records returned from the consumer are an unmodifiable view, so we cannot call clear on them directly.
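As a sketch of the constraint, assuming the polled records behave like a list wrapped by `Collections.unmodifiableList`: calling `clear()` on such a view throws `UnsupportedOperationException`, so the records have to be copied into a buffer the reader owns. The class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class UnmodifiableViewSketch {

    // Returns true if clearing the given list is rejected.
    static boolean clearFails(final List<String> records) {
        try {
            records.clear();
            return false;
        } catch (final UnsupportedOperationException e) {
            return true;
        }
    }

    // Copying gives the caller a list it may mutate freely.
    static List<String> bufferCopy(final List<String> polled) {
        return new ArrayList<>(polled);
    }

    public static void main(final String[] args) {
        final List<String> backing = new ArrayList<>();
        backing.add("record-1");
        final List<String> polled = Collections.unmodifiableList(backing);
        System.out.println(clearFails(polled));             // the view rejects clear()
        System.out.println(clearFails(bufferCopy(polled))); // the copy accepts it
    }
}
```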

guozhangwang · 2020-01-14T23:09:40Z

    public void init(final ProcessorContext context,
                     final StateStore root) {
        context.register(root, stateRestoreCallback);
-        if (simulateForwardOnFlush) {


Piggybacked a minor cleanup, as simulateForwardOnFlush is not used anymore.

guozhangwang · 2020-01-14T23:11:49Z

ping @abbccdda @ableegoldman @cadonna again for final reviews.

abbccdda

Took a more thorough look at StoreChangelogReaderTest. Overall I think we are still missing unit test coverage for exception states. It would also be good to either cover both standby and active changelogs in the general tests, or state in the meta comment that these are common tests.

abbccdda · 2020-01-16T17:19:04Z


            if (endOffset != null && committedOffset != null) {
+                if (changelogMetadata.restoreEndOffset != null)
+                    throw new IllegalStateException("End offset for " + partition +


req: this is not covered in unit test.

Most of the places where IllegalStateException is thrown are assertions that something should never happen (i.e. it does not depend on other modules) unless there is a bug, in which case we should always shout out and fail fast. So I think we do not need unit tests for such cases.

The other places where IllegalStateException is thrown are when the caller (e.g. ProcessorStateManager) has a bug, like trying to register the same store twice; only those cases that depend on bugs in other modules would have a test case.

abbccdda · 2020-01-16T17:19:55Z

-        final ChangelogMetadata changelogMetadata = new ChangelogMetadata(partition, stateManager);
+        final StateStoreMetadata storeMetadata = stateManager.storeMetadata(partition);
+        if (storeMetadata == null) {
+            throw new IllegalStateException("Cannot find the corresponding state store metadata for changelog " +


req: this is not covered in the unit test

abbccdda · 2020-01-16T17:20:34Z

+        final Long currentOffset = metadata.storeMetadata.offset();
        if (endOffset == null) {
            // end offset is not initialized meaning that it is from a standby task,
            // this should never happen since we only call this function for active task in restoring phase


req: endOffset == null is not covered

Same as above, this should never happen (indicating a bug).

abbccdda · 2020-01-16T17:22:49Z

+        assertEquals(StoreChangelogReader.ChangelogState.COMPLETED, changelogReader.changelogMetadata(tp).state());
+        assertEquals(10L, (long) changelogReader.changelogMetadata(tp).endOffset());
+        assertEquals(0L, changelogReader.changelogMetadata(tp).totalRestored());
+        assertEquals(Collections.singleton(tp), changelogReader.completedChangelogs());


prop: I have seen multiple tests for completedChangelogs, but none of them adds multiple topic partitions. Could we add a test where some partitions are completed while others are not?

Yes, there's shouldRestoreMultipleChangelogs.

abbccdda · 2020-01-16T17:48:01Z

+// TODO K9113: we need to consider how to handle InvalidOffsetException for consumer#poll / position
 public class StoreChangelogReader implements ChangelogReader {

    enum ChangelogState {


prop: I just noticed that although this enum only has 3 states, do we still want to build some state transition check for it?

The state transition is simply registered -> restoring -?-> completed, so I did not build one, but just to be safe I will add a transition check for it.
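A minimal sketch of such a check, assuming the linear order described above. Only `COMPLETED` appears verbatim in the diff; the constant names `REGISTERED` and `RESTORING`, the holder class, and the method names are assumptions for illustration and may differ from what was actually committed.

```java
class ChangelogStateSketch {

    // Assumed constant names, modeled on "registered -> restoring -> completed".
    enum State {
        REGISTERED, RESTORING, COMPLETED;

        boolean canTransitionTo(final State next) {
            switch (this) {
                case REGISTERED:
                    return next == RESTORING;
                case RESTORING:
                    return next == COMPLETED;
                default:
                    return false; // COMPLETED is terminal in this sketch
            }
        }
    }

    private State state = State.REGISTERED;

    State state() {
        return state;
    }

    // Fails fast on an illegal transition, in line with the IllegalStateException usage discussed above.
    void transitionTo(final State next) {
        if (!state.canTransitionTo(next)) {
            throw new IllegalStateException("Invalid transition from " + state + " to " + next);
        }
        state = next;
    }
}
```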

abbccdda · 2020-01-16T17:52:38Z

@@ -364,22 +423,14 @@ public void restore() {
                    final ConsumerRecord<byte[], byte[]> record = iterator.next();


nit: this iterator interface could be replaced with a for each loop
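A sketch of the suggested shape, with hypothetical names and plain `Long` offsets standing in for `ConsumerRecord`: as long as the loop only reads records and may stop early, a for-each with `break` expresses the same thing as an explicit iterator. An explicit iterator is only needed if elements are removed during iteration.

```java
import java.util.List;

class ForEachRestoreSketch {

    // Counts the records that lie before the (exclusive) end offset.
    static int countRestorable(final List<Long> recordOffsets, final long endOffset) {
        int restorable = 0;
        for (final long offset : recordOffsets) {
            if (offset >= endOffset) {
                break; // stop at the first record at or beyond the end offset
            }
            restorable++;
        }
        return restorable;
    }
}
```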

abbccdda · 2020-01-16T17:55:57Z

        for (final ChangelogMetadata changelogMetadata: newPartitionsToRestore) {
-            final TopicPartition partition = changelogMetadata.changelogPartition;
-            final StateStoreMetadata storeMetadata = changelogMetadata.stateManager.storeMetadata(partition);
+            final StateStoreMetadata storeMetadata = changelogMetadata.storeMetadata;


req: comment here for L651 for no unit test coverage

abbccdda · 2020-01-16T17:56:47Z

+        try {
+            restoreConsumer.unsubscribe();
+        } catch (KafkaException e) {
+            throw new StreamsException("Restore consumer get unexpected error unsubscribing", e);


req: this case is not tested.

abbccdda · 2020-01-16T17:58:32Z

+        assertEquals(mkSet(tp1, tp2), consumer.paused());

-        changelogReader.register(topicPartition, stateManager);
+        // transition to restore active is idempotent


q: what does this comment suggest? We are not calling transitToRestoreActive multiple times here

When the changelog reader is created it is always in restoreActive state already. I can add a few lines to make it more explicit.

abbccdda · 2020-01-16T18:04:01Z

-        expect(active.restoringTaskFor(topicPartition)).andStubReturn(task);
-        replay(active, task);
-        changelogReader.restore();
+    public void shouldThrowIfRestoreCallbackThrows() {


nit: for any task type neutral tests, I would feel more comfortable if we could allow testing for both active and standby state managers in two separate tests, or share the same skeleton somehow

I've made the test parameterized.

abbccdda

LGTM, thanks for the work!

This PR is a collaboration between Guozhang Wang and John Roesler. It is a significant tech debt cleanup on task management and state management, and is broken down into several sub-tasks listed below:

1) Extract embedded clients (producer and consumer) into RecordCollector from StreamTask. guozhangwang#2 guozhangwang#5
2) Consolidate the standby updating and active restoring logic into ChangelogReader and extract out of StreamThread. guozhangwang#3 guozhangwang#4
3) Introduce Task state life cycle (created, restoring, running, suspended, closing), and refactor the task operations based on the current state. guozhangwang#6 guozhangwang#7
4) Consolidate AssignedTasks into TaskManager and simplify the logic of changelog management and task management (since they are already moved in step 2) and 3)). guozhangwang#8 guozhangwang#9
5) Also simplified the StreamThread logic a bit as the embedded clients / changelog restoration logic has been moved into step 1) and 2). guozhangwang#10

Reviewers: A. Sophie Blee-Goldman <sophie@confluent.io>, Bruno Cadonna <bruno@confluent.io>, Boyang Chen <boyang@confluent.io>

guozhangwang added 3 commits January 10, 2020 09:28

compile unit tests

f4cace4

unit test for state manager

eb4b073

rebase from k9113-base

ec531d7

guozhangwang commented Jan 10, 2020

guozhangwang force-pushed the K9113-unit-tests-store-changelog branch from 436efb6 to ec531d7 Compare January 11, 2020 02:44

guozhangwang mentioned this pull request Jan 12, 2020

KAFKA-9113: Unit test for RecordCollector #5

Merged

3 tasks

guozhangwang added 4 commits January 13, 2020 17:20

first unit tests

1cfbef9

more unit tests

d3b5e1a

remove unnecessary unit tests

2921a2c

catch exception in restore callback

0a5e2df

abbccdda reviewed Jan 14, 2020

guozhangwang added 2 commits January 14, 2020 14:06

add exception handling

a6d7a71

finish all unit tests

024bd96

guozhangwang commented Jan 14, 2020

guozhangwang changed the title ~~KAFKA-9113: Unit test for ProcessorStateManager~~ KAFKA-9113: Unit test for ProcessorStateManager and ChangelogReader Jan 14, 2020

address comments

68799a8

abbccdda reviewed Jan 16, 2020

address comments

dc0ac04

abbccdda approved these changes Jan 16, 2020

guozhangwang merged commit 7053099 into k9113-base Jan 16, 2020

guozhangwang mentioned this pull request Jan 28, 2020

KAFKA-9113: Clean up task management and state management apache/kafka#7997

Merged

3 tasks

guozhangwang deleted the K9113-unit-tests-store-changelog branch April 24, 2020 23:43
