KAFKA-6049: Add non-windowed Cogroup operator (KIP-150) by wcarlson5 · Pull Request #7538 · apache/kafka

wcarlson5 · 2019-10-16T22:35:23Z

More detailed description of your change,
if necessary. The PR title and PR message become
the squashed commit message, so use a separate
comment to ping reviewers.

Summary of testing strategy (including rationale)
for the feature or bug fix. Unit and/or integration
tests are expected for any behaviour change and
system tests should be considered for larger changes.

Committer Checklist (excluded from commit message)

Verify design and implementation
Verify test coverage and CI build status
Verify documentation (including upgrade notes)

wcarlson5 · 2019-10-16T22:38:38Z

It is necessary to do an unchecked cast for the input value type. This is because a cogrouped stream can take grouped streams with any value type as input.
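The need for the cast can be illustrated with a minimal stdlib-only sketch. It is not the PR's code: the `Aggregator` interface, the `Cogroup` class, and the stream names below are hypothetical stand-ins. It only shows why the compiler cannot check the value type once aggregators for differently typed streams are stored in one collection.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CogroupCastSketch {

    /** Hypothetical stand-in for an aggregator over one input stream. */
    interface Aggregator<K, V, VOut> {
        VOut apply(K key, V value, VOut aggregate);
    }

    /** Collects aggregators whose input value types differ per stream. */
    static final class Cogroup<K, VOut> {
        // The value type is a wildcard because every stream may use a different one.
        private final Map<String, Aggregator<K, ?, VOut>> aggregators = new LinkedHashMap<>();

        <V> Cogroup<K, VOut> add(final String stream, final Aggregator<K, V, VOut> aggregator) {
            aggregators.put(stream, aggregator);
            return this;
        }

        @SuppressWarnings("unchecked")
        VOut process(final String stream, final K key, final Object value, final VOut current) {
            // The static value type was lost when the aggregator was stored,
            // so the compiler cannot verify this cast.
            final Aggregator<K, Object, VOut> aggregator =
                (Aggregator<K, Object, VOut>) aggregators.get(stream);
            return aggregator.apply(key, value, current);
        }
    }

    /** Feeds one String-valued and one Integer-valued record into the same aggregate. */
    static int demo() {
        final Cogroup<String, Integer> cogroup = new Cogroup<String, Integer>()
            .add("names", (String k, String v, Integer agg) -> agg + v.length())
            .add("counts", (String k, Integer v, Integer agg) -> agg + v);
        int total = cogroup.process("names", "k", "abc", 0);
        total = cogroup.process("counts", "k", 4, total);
        return total;
    }
}
```

The cast is safe in this sketch only because each record is routed to the aggregator registered for its own stream.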

wcarlson5 · 2019-10-16T22:40:27Z

A new case will be added for windowed streams

mjsax

Did an initial pass. Mostly nits.

We need a test that triggers repartitioning though -- I think, atm, repartitioning would not work correctly. This might require an integration test.

wcarlson5 · 2019-10-24T22:54:39Z

retest this please

wcarlson5 · 2019-11-12T23:31:58Z

retest this please

mjsax

Sorry for the long wait... I'll stay on this one to hopefully merge by Wednesday.

Btw: Can you rebase your PR to trunk -- a PR was merged that changes how the gradle build works, and you should pick up this change.

mjsax · 2019-11-26T00:39:25Z

Where do you validate that this name is picked up? -- Also, do we need to actually pipe input via TDD? It seems that builder.build().describe() would be sufficient to verify the name without the need to process any data.

With regard to naming: we should also check that the store is only queryable if a name is specified via Materialized.

I was using this test to print the topology and it shows two sub-topologies while it should be one (it seems the reason is that you use the same StreamsBuilder as in the setup() method).

Also, the naming of the operators seems to be incorrect. I am also wondering if KGroupedStream#cogroup() needs an overload that takes a Named parameter? Maybe not, but the specified Named from aggregate() would need to be used for other processors, too. Atm there is this weird COGROUPKSTREAM-AGGREGATE-KSTREAM-SOURCE-0000000001test

Topologies:
  Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [topic])
      --> none
  Sub-topology: 1
    Source: KSTREAM-SOURCE-0000000001 (topics: [one])
      --> COGROUPKSTREAM-AGGREGATE-KSTREAM-SOURCE-0000000001test
    Processor: COGROUPKSTREAM-AGGREGATE-KSTREAM-SOURCE-0000000001test (stores: [COGROUPKSTREAM-AGGREGATE-STATE-STORE-0000000002])
      --> test
      <-- KSTREAM-SOURCE-0000000001
    Processor: test (stores: [COGROUPKSTREAM-AGGREGATE-STATE-STORE-0000000002])
      --> KTABLE-TOSTREAM-0000000005
      <-- COGROUPKSTREAM-AGGREGATE-KSTREAM-SOURCE-0000000001test
    Processor: KTABLE-TOSTREAM-0000000005 (stores: [])
      --> KSTREAM-SINK-0000000006
      <-- test
    Sink: KSTREAM-SINK-0000000006 (topic: output)
      <-- KTABLE-TOSTREAM-0000000005

Alright, I changed the tests so that they use only the correct number of builders. I also discovered that orElseGenerateWithPrefix exists, so that should fix these problems.

I still don't see that the test verifies the names? Again the question: why do we need to process data to verify that the right names are assigned?

I added a test for this

mjsax

If you want, we can also split the PR into two, and add auto-repartitioning in a follow up PR. We also need tests for this case.

mjsax · 2019-11-27T03:14:46Z

I still don't see that the test verifies the names? Again the question: why do we need to process data to verify that the right names are assigned?

Improved JavaDocs
Code reformatting
Added some more tests
Fixed naming and updated naming-test

mjsax · 2019-12-01T00:48:25Z

     */
-    KTable<K, VOut> aggregate(final Initializer<VOut> initializer,
-                              final Materialized<K, VOut, KeyValueStore<Bytes, byte[]>> materialized);
+    KTable<K, VOut> aggregate(final Initializer<VOut> initializer);


I reordered the methods from "few parameters" to "more parameters" to make it easier to navigate within the file.

mjsax · 2019-12-01T00:48:50Z

+     * streams of this {@code CogroupedKStream}.
+     * If this is not the case, you would need to call {@link KStream#through(String)} before
+     * {@link KStream#groupByKey() grouping} the {@link KStream}, using a pre-created topic with the "correct" number of
+     * partitions.


New paragraph.

mjsax · 2019-12-01T00:49:33Z

-     * {@link KeyValue} pairs.
-     * It is an intermediate representation of one or more {@link KStream}s
-     * in order to apply one or more aggregation operations on the original {@link KStream}
-     * records.


Removed this, as it describes CogroupedKStream, which is not appropriate here.

mjsax · 2019-12-01T00:49:59Z

-     * StreamsConfig#CACHE_MAX_BYTES_BUFFERING_CONFIG cache size}, and {@link
-     * StreamsConfig#COMMIT_INTERVAL_MS_CONFIG commit intervall}.
+     * To compute the aggregation the corresponding {@link Aggregator} as specified in
+     * {@link #cogroup(KGroupedStream, Aggregator) cogroup(...)} is used per input stream.


New sentence.

mjsax · 2019-12-01T00:50:46Z

+     * To compute the aggregation the corresponding {@link Aggregator} as specified in
+     * {@link #cogroup(KGroupedStream, Aggregator) cogroup(...)} is used per input stream.
+     * The specified {@link Initializer} is applied once per key, directly before the first input record per key is
+     * processed to provide an initial intermediate aggregation result that is used to process the first record.


Added "per key" (twice).

mjsax · 2019-12-01T00:51:23Z

+     * The specified {@link Aggregator} is applied in the actual {@link CogroupedKStream#aggregate(Initializer)
+     * aggregation} step for each input record and computes a new aggregate using the current aggregate (or for the very
+     * first record per key using the initial intermediate aggregation result provided via the {@link Initializer} that
+     * is passed into {@link CogroupedKStream#aggregate(Initializer)}) and the record's value.


New paragraph
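The semantics described in these JavaDoc paragraphs can be sketched without Kafka. This is an illustration only, with hypothetical names (`InitializerPerKeySketch`, `process`) and a plain `HashMap` standing in for the state store. The initializer runs once per key, right before that key's first record, and each input stream brings its own aggregator.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Supplier;

public class InitializerPerKeySketch {
    // Hypothetical stand-in for the state store that holds one aggregate per key.
    private final Map<String, Integer> store = new HashMap<>();
    private final Supplier<Integer> initializer;
    private int initializerCalls = 0;

    InitializerPerKeySketch(final Supplier<Integer> initializer) {
        this.initializer = initializer;
    }

    /** Processes one record with the aggregator that belongs to its input stream. */
    <V> int process(final String key, final V value,
                    final BiFunction<V, Integer, Integer> aggregator) {
        Integer current = store.get(key);
        if (current == null) {
            // First record for this key: create the initial intermediate result.
            current = initializer.get();
            initializerCalls++;
        }
        final Integer updated = aggregator.apply(value, current);
        store.put(key, updated);
        return updated;
    }

    int initializerCalls() {
        return initializerCalls;
    }

    /** Returns {aggregate for "a", aggregate for "b", number of initializer calls}. */
    static int[] demo() {
        final InitializerPerKeySketch sketch = new InitializerPerKeySketch(() -> 0);
        final BiFunction<String, Integer, Integer> addLength = (v, agg) -> agg + v.length();
        final BiFunction<Integer, Integer, Integer> addValue = (v, agg) -> agg + v;
        sketch.process("a", "xy", addLength);           // stream one, key "a"
        final int a = sketch.process("a", 5, addValue); // stream two, same key
        final int b = sketch.process("b", 1, addValue); // stream two, new key
        return new int[] {a, b, sketch.initializerCalls()};
    }
}
```

Key "a" receives records from both streams but is initialized only once, so the demo counts two initializer calls for two distinct keys.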

mjsax · 2019-12-01T00:51:40Z

 public class CogroupedKStreamImpl<K, VOut> extends AbstractStream<K, VOut> implements CogroupedKStream<K, VOut> {

    static final String AGGREGATE_NAME = "COGROUPKSTREAM-AGGREGATE-";
+    static final String MERGE_NAME = "COGROUPKSTREAM-MERGE-";


New default name for the merge-node

mjsax · 2019-12-01T00:51:51Z

-        this.groupPatterns = new LinkedHashMap<>();
-        this.aggregateBuilder = new CogroupedStreamAggregateBuilder<>(builder);
+        groupPatterns = new LinkedHashMap<>();
+        aggregateBuilder = new CogroupedStreamAggregateBuilder<>(builder);


Removed the unnecessary this. qualifier.

mjsax · 2019-12-01T00:52:22Z

-        Objects.requireNonNull(materialized, "materialized can't be null");
-        final NamedInternal named = NamedInternal.empty();
-        return aggregate(initializer, named, materialized);
+        return aggregate(initializer, NamedInternal.empty(), materialized);


Unified the non-null check into a single place.
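As a rough sketch of this pattern (not the actual CogroupedKStreamImpl code), the shorter overload only fills in a default and delegates, while the overload with the most parameters performs every null check. The String parameters and the `NO_NAME` constant are hypothetical placeholders for the real Initializer, NamedInternal, and Materialized types. Only the message "materialized can't be null" comes from the diff above; the other messages are assumed.

```java
import java.util.Objects;

public class NullCheckSketch {
    // Hypothetical stand-in for NamedInternal.empty().
    static final String NO_NAME = "";

    /** Shorter overload: no checks here, it only supplies the default name. */
    static String aggregate(final String initializer, final String materialized) {
        return aggregate(initializer, NO_NAME, materialized);
    }

    /** Full overload: the single place where all arguments are validated. */
    static String aggregate(final String initializer, final String named, final String materialized) {
        Objects.requireNonNull(initializer, "initializer can't be null");
        Objects.requireNonNull(named, "named can't be null");
        Objects.requireNonNull(materialized, "materialized can't be null");
        return initializer + "/" + named + "/" + materialized;
    }

    /** Returns the exception message produced when materialized is null. */
    static String demo() {
        try {
            aggregate("init", null);
            return "no exception";
        } catch (final NullPointerException e) {
            return e.getMessage();
        }
    }
}
```

Callers of the shorter overload still get the same exception and message, because the check runs in the overload they delegate to.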

mjsax · 2019-12-01T00:53:33Z

-                    sessionMerger);
+                kGroupedStream.getValue(),
+                initializer,
+                named.suffixWithOrElseGet(


No need to pass in a Named -- we can just pass in the actual name as String directly -- otherwise we call suffixWithOrElseGet twice for no reason

mjsax · 2019-12-01T00:54:50Z

        }
-        final String mergeProcessorName = named.orElseGenerateWithPrefix(builder, CogroupedKStreamImpl.AGGREGATE_NAME);
+        final String mergeProcessorName = named.suffixWithOrElseGet(
+            "-cogroup-merge",


Changed this to generate a name <userName>-cogroup-merge to align to <userName>-cogroup-agg-<counter> instead of just <userName> for the merge node.
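The naming rule described here can be sketched in plain Java. The class and method names below are hypothetical and do not mirror the real NamedInternal signatures. Two details are assumptions: the caller supplies the per-stream counter, and generated names use a ten-digit zero-padded index, as in the topology printed earlier in this thread.

```java
public class CogroupNamingSketch {
    static final String AGGREGATE_NAME = "COGROUPKSTREAM-AGGREGATE-";
    static final String MERGE_NAME = "COGROUPKSTREAM-MERGE-";

    // Hypothetical stand-in for the builder's running processor index.
    private int index = 0;

    /** Generated default name: prefix plus a zero-padded running index. */
    private String generate(final String prefix) {
        return prefix + String.format("%010d", index++);
    }

    /** User name plus suffix if a name was given, otherwise a generated name. */
    private String nameOrDefault(final String userName, final String suffix, final String prefix) {
        return userName != null ? userName + suffix : generate(prefix);
    }

    String mergeName(final String userName) {
        return nameOrDefault(userName, "-cogroup-merge", MERGE_NAME);
    }

    String aggregateName(final String userName, final int counter) {
        return nameOrDefault(userName, "-cogroup-agg-" + counter, AGGREGATE_NAME);
    }
}
```

With a user name of "test", the merge node becomes test-cogroup-merge and the aggregate nodes become test-cogroup-agg-0, test-cogroup-agg-1, and so on. Without a user name, the default prefixes apply.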

mjsax · 2019-12-01T00:55:38Z

-            final SessionWindows sessionWindows,
-            final Merger<? super K, VOut> sessionMerger) {
-
-        final String processorName = named.orElseGenerateWithPrefix(builder, CogroupedKStreamImpl.AGGREGATE_NAME);


Removed this and pass in the processorName as String parameter directly.

mjsax · 2019-12-01T00:56:15Z


    @Test(expected = NullPointerException.class)
    public void shouldNotHaveNullInitializerOnAggregate() {
+        cogroupedStream.aggregate(null);


Added a couple more permutations for the NPE tests.

mjsax · 2019-12-01T00:56:34Z

+        final KTable<String, String> customers = groupedOne
+            .cogroup(STRING_AGGREGATOR)
+            .cogroup(groupedTwo, STRING_AGGREGATOR)
+            .aggregate(STRING_INITIALIZER, Named.as("test"), Materialized.as("store"));


Also set the store name in this test.

mjsax · 2019-12-01T00:58:46Z

@wcarlson5 I pushed a commit to fix the Jenkins checkstyle error -- will merge this after Jenkins is green.

We will need 4 follow up PRs:

add auto-repartitioning (including integration test)
time-windowed-cogroup
session-windowed-cogroup
update Scala API

The first three can be done in parallel IMHO. The last one only at the very end.


wcarlson5 commented Oct 16, 2019

wcarlson5 marked this pull request as ready for review October 18, 2019 16:43

mjsax changed the title ~~Kafka 6049 key value cogroup~~ KAFKA-6049: Add non-windowed CoGroup operator (KIP-150) Oct 22, 2019

mjsax added the streams label Oct 22, 2019

mjsax reviewed Oct 22, 2019

wcarlson5 mentioned this pull request Oct 24, 2019

Draft!: Broken into many PRs created cogroup option #7400

Closed

3 tasks

mjsax reviewed Nov 26, 2019

Walker Carlson and others added 21 commits November 26, 2019 11:29

first commit on Cogroup operator

b60bd7c

first commit on Cogroup operator

7bc3e00

create cogroup for key value stores

585eeb0

fixed indents

2cc8dcf

updated javadocs and naming schemes

5a94085

updated test structure

353e984

added naming for processors

03e18ce

added naming for processors

c2c4b41

added naming for processors

ccc0b6d

added tests for cogrouped with three grouped streams

eb3228c

first commit on Cogroup operator

ddd7727

first commit on Cogroup operator

62248ea

create cogroup for key value stores

b65d8b2

fixed indents

0d18e6f

updated javadocs and naming schemes

26e4c41

updated test structure

63bfff4

added naming for processors

32f5b19

added naming for processors

7d0d19d

added naming for processors

bfd0837

added tests for cogrouped with three grouped streams

5e532a4

addressed comments on the PR

515cace

mjsax changed the title ~~KAFKA-6049: Add non-windowed CoGroup operator (KIP-150)~~ KAFKA-6049: Add non-windowed Cogroup operator (KIP-150) Nov 27, 2019

mjsax reviewed Nov 27, 2019

wcarlson5 and others added 4 commits November 27, 2019 14:40

addressed comments on the PR part 2

a663636

addressed comments on the PR part 3

ad3f2bb

remove extra import

a02317e

Fix checkstyle error

3553929

Improved JavaDocs
Code reformatting
Added some more tests
Fixed naming and updated naming-test

mjsax reviewed Dec 1, 2019

mjsax merged commit 0b8ea7e into apache:trunk Dec 1, 2019

wcarlson5 deleted the KAFKA-6049_key_value_cogroup branch December 2, 2019 21:11

mjsax pushed a commit that referenced this pull request Dec 12, 2019

KAFKA-6049: Add time window support for cogroup (#7774)

d1161bf

Follow up to PR #7538 (KIP-150) Reviewer: Matthias J. Sax <matthias@confluent.io>

mjsax pushed a commit that referenced this pull request Dec 13, 2019

KAFKA-6049: Add auto-repartitioning for cogroup (#7792)

8b57f6c

Follow up to PR #7538 (KIP-150) Reviewers: Bill Bejeck <bill@confluent.io>, Matthias J. Sax <matthias@confluent.io>

mjsax pushed a commit that referenced this pull request Dec 15, 2019

KAFKA-6049: Add session window support for cogroup (#7782)

dd8af2b

Follow up to PR #7538 (KIP-150) Reviewer: Matthias J. Sax <matthias@confluent.io>

mjsax added the kip (Requires or implements a KIP) label Jun 12, 2020
