Use UUIDs instead of object hashes to avoid collisions#29542
Use UUIDs instead of object hashes to avoid collisions#29542AnandInguva merged 18 commits intoapache:masterfrom
Conversation
009242b to
af6f640
Compare
|
Assigning reviewers. If you would like to opt out of this review, comment R: @damccorm for label python. Available commands:
The PR bot will only process comments in the main thread (not review comments). |
There was a problem hiding this comment.
Thinking about this, we can just attach a unique id to each element. We just need unique id so that we can group transformed pcoll with untransformed pcoll with the downstream CoGroupByKey Pcoll
There was a problem hiding this comment.
Thinking about this, we can just attach a unique id to each element.
Not sure how accurate this is:
https://stackoverflow.com/questions/72989272/python-generates-same-uuid-over-multiple-docker-containers
but I was also thinking about how the seed will be initialized in docker context.
and UUID1 might be not thread-safe:
ClickHouse/clickhouse-connect#194
I'd expect the combination of uuid1 + os PID + uuid4 would be exceedingly unlikely to collide in beam/dataflow context even with the above considerations. we can and should detect collisions though, we can fail the pipeline if that happens.
A few other notes:
- Please add a known issue to Changes.MD and file a GH issue that will be fixed by this PR.
- Let's rename the ComputeAndAttachHashKey to ComputeAndAttachUniqueID
- I wonder if performance and pipeline cost would improve if we can find a way to pass-through columns that do not need to be processed to tft, converting them to bytes if necessary, avoiding the shuffle step.
There was a problem hiding this comment.
also i wonder what is the overhead of MLTransform compared to plain usage of TFT, although that's a separate question
There was a problem hiding this comment.
also if keys can be bytes instead of str, using raw bytes instead of hex str would cut key size in ~half
There was a problem hiding this comment.
I think we shouldn't raise an error as of now since the chance of collision is very low.
I wonder if performance and pipeline cost would improve if we can find a way to pass-through columns that do not need to be processed to tft, converting them to bytes if necessary, avoiding the shuffle step.
I can try this and test the performance. If this works, we can remove the CoGroupByKey altogether.
There was a problem hiding this comment.
os.getpid() might make some sense in combination with uuid1 since with uuid1 is evaluated based on hostname info, which is likely the same in all processes; with uuid4 it shouldn't add much additional entropy as in docker containers from my observation process ids for sdk harness are usually the same.
|
I will create a github issue to track >> I wonder if performance and pipeline cost would improve if we can find a way to pass-through columns that do not need to be processed to tft, converting them to bytes if necessary, avoiding the shuffle step. Performance testing MLTransform should help here. |
There was a problem hiding this comment.
@AnandInguva please check that this condition makes sense. In particular, are there exactly two elements per key or two elements per each transformation?
There was a problem hiding this comment.
There will be two elements per hash key after CoGroupByKey, one element would be transformed dict and other will untransformed dict.
each value in the dict should consist of list of length 1. I changed the condition a bit.
There was a problem hiding this comment.
Thanks, I made a few clarifying edits, PTAL.
|
added some suggestions, ptal and run MLTransform integration tests. Thanks. |
sdks/python/apache_beam/examples/snippets/transforms/elementwise/mltransform_test.py
Outdated
Show resolved
Hide resolved
…ave the same length.
c45fd04 to
128e92d
Compare
|
@AnandInguva please merge if tests pass and new commits look good to you. |
|
lgtm |
Fixes: #29600
Elements before CoGroupByKey with the same hash are not grouped together and treated as duplicates. Add unique id to the hash to prevent collisions.
We should be creating unique id in the first place instead of hash object
Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, commentfixes #<ISSUE NUMBER>instead.CHANGES.mdwith noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.