Add support for global sequence processing to the "ordered" extension in Java SDK #32540
damccorm merged 36 commits into apache:master on Oct 9, 2024
Global sequence processing is used to ensure that events for a given key are processed only once it is guaranteed that all the elements for that particular key have been received.
Consider a PCollection which contains these event tuples (first element of the tuple is the global sequence number):
[1, key1, data], [2, key2, data], [3, key1, data], [4, key1, data], [5, key2, data], [7, key2, data]. Elements for key1 must be processed in the following order: 1, 3, 4. Elements for key2 must be processed in the following order: 2, 5. The event with sequence number 7 can't be processed yet because sequence number 6 is missing.
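The rule in this example can be sketched in plain Java, outside of Beam. This is only an illustration of the ordering semantics, not the extension's implementation or API: the `GlobalSequenceSketch` class, the `Event` type, and the `processable` and `sampleEvents` methods are hypothetical names, and the sketch assumes the global sequence starts at 1 and that all events are available in memory at once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GlobalSequenceSketch {

  /** Hypothetical event tuple: (global sequence number, key, payload). */
  static final class Event {
    final long sequence;
    final String key;
    final String data;

    Event(long sequence, String key, String data) {
      this.sequence = sequence;
      this.key = key;
      this.data = data;
    }
  }

  /** The events from the example above; sequence 6 is deliberately absent. */
  static List<Event> sampleEvents() {
    return Arrays.asList(
        new Event(1, "key1", "data"),
        new Event(2, "key2", "data"),
        new Event(3, "key1", "data"),
        new Event(4, "key1", "data"),
        new Event(5, "key2", "data"),
        new Event(7, "key2", "data"));
  }

  /**
   * Returns, for each key, the sequence numbers that can be processed, in order.
   * Only events inside the contiguous range starting at firstSequence qualify;
   * everything after the first gap in the global sequence is held back.
   */
  static Map<String, List<Long>> processable(List<Event> events, long firstSequence) {
    // Order the events by global sequence number, regardless of arrival order.
    TreeMap<Long, Event> bySequence = new TreeMap<>();
    for (Event event : events) {
      bySequence.put(event.sequence, event);
    }

    Map<String, List<Long>> result = new TreeMap<>();
    long expected = firstSequence;
    for (Map.Entry<Long, Event> entry : bySequence.entrySet()) {
      if (entry.getKey() != expected) {
        break; // Gap in the global sequence: later events must wait.
      }
      result
          .computeIfAbsent(entry.getValue().key, k -> new ArrayList<>())
          .add(entry.getKey());
      expected++;
    }
    return result;
  }

  public static void main(String[] args) {
    // Prints {key1=[1, 3, 4], key2=[2, 5]}; sequence 7 is held back.
    System.out.println(processable(sampleEvents(), 1L));
  }
}
```

If an event with sequence number 6 were added to the input, the contiguous range would extend through 7 and the held-back event would become processable.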
The approach used to implement ordered processing in the presence of global sequencing:
This high level diagram illustrates the overall approach.
There are dedicated unit tests to cover both per-key and global sequence processing. Please refer to them to understand the details of use cases.
Note that the batch unit tests for global processing don't automatically run under global sequencing. This is due to the apparent incorrectness of the DirectRunner implementation (it is supposed to block the event-processing DoFns until the side input has been calculated once). The batch processing tests were successfully run manually using DataflowRunner. Additional work will be needed to either fix the DirectRunner, switch to PrismRunner (once it supports all the primitives used in this transform), or enable the tests to run on DataflowRunner.