Use UUIDs instead of object hashes to avoid collisions by AnandInguva · Pull Request #29542 · apache/beam

AnandInguva · 2023-11-28T04:41:54Z

Elements with the same hash before CoGroupByKey are not grouped together and are treated as duplicates. Add a unique ID to the hash to prevent collisions.

We should be creating a unique ID in the first place instead of hashing the object.

github-actions · 2023-11-28T17:34:46Z

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @damccorm for label python.

AnandInguva · 2023-11-29T21:28:45Z

sdks/python/apache_beam/ml/transforms/handlers.py

Thinking about this, we can just attach a unique ID to each element. We only need the unique ID so that the downstream CoGroupByKey can group the transformed PCollection with the untransformed PCollection.
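To make the difference concrete, here is a minimal stdlib-only sketch (not the Beam implementation) of why content-derived keys collapse identical elements while per-element unique IDs do not. The helper names `key_by_hash`, `key_by_uuid`, and `group` are hypothetical, and a plain dict stands in for the grouping that CoGroupByKey would perform.

```python
import uuid
from collections import defaultdict

# Two of the three input elements have identical content.
elements = [{"x": 1}, {"x": 1}, {"x": 2}]


def key_by_hash(element):
    # Content-derived key: identical elements always get the same key.
    return hash(tuple(sorted(element.items())))


def key_by_uuid(element):
    # Per-element key: independent of the element's content.
    return uuid.uuid4().bytes


def group(items, key_fn):
    # Stand-in for a group-by-key step: collect elements under their key.
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return groups


hash_groups = group(elements, key_by_hash)
uuid_groups = group(elements, key_by_uuid)

print(len(hash_groups))  # identical elements share one key
print(len(uuid_groups))  # every element keeps its own key
```

With hash keys the two identical elements land under one key, which is the situation the PR description refers to as a collision.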

> Thinking about this, we can just attach a unique ID to each element.

Not sure how accurate this is:
https://stackoverflow.com/questions/72989272/python-generates-same-uuid-over-multiple-docker-containers
but I was also thinking about how the seed will be initialized in a Docker context.

Also, uuid1 might not be thread-safe:
ClickHouse/clickhouse-connect#194

I'd expect the combination of uuid1 + OS PID + uuid4 to be exceedingly unlikely to collide in a Beam/Dataflow context, even with the above considerations. We can and should detect collisions, though; we can fail the pipeline if that happens.
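A rough sketch of the combination suggested here, using only the standard library. The names `make_unique_id` and `find_collisions` are hypothetical, the 8-byte PID encoding is an arbitrary choice for illustration, and this is not necessarily what the merged code does.

```python
import os
import uuid


def make_unique_id() -> bytes:
    # Hypothetical helper: concatenate a time/host-based uuid1, the OS
    # process ID, and a random uuid4 into one 40-byte key.
    pid_bytes = os.getpid().to_bytes(8, "big")  # 8 bytes is an assumption
    return uuid.uuid1().bytes + pid_bytes + uuid.uuid4().bytes


def find_collisions(ids):
    # Return the set of IDs that occur more than once.
    seen = set()
    duplicates = set()
    for unique_id in ids:
        if unique_id in seen:
            duplicates.add(unique_id)
        seen.add(unique_id)
    return duplicates


ids = [make_unique_id() for _ in range(1000)]
collisions = find_collisions(ids)
if collisions:
    # Failing loudly mirrors the "fail the pipeline" suggestion.
    raise RuntimeError(f"{len(collisions)} duplicate unique IDs generated")
```

An in-memory check like `find_collisions` only works for a local list; in a distributed pipeline the equivalent check would have to happen after grouping by key.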

A few other notes:

Please add a known issue to CHANGES.md and file a GitHub issue that will be fixed by this PR.

Let's rename ComputeAndAttachHashKey to ComputeAndAttachUniqueID.

I wonder if performance and pipeline cost would improve if we can find a way to pass-through columns that do not need to be processed to tft, converting them to bytes if necessary, avoiding the shuffle step.

Also, I wonder what the overhead of MLTransform is compared to plain usage of TFT, although that's a separate question.

Also, if keys can be bytes instead of str, using raw bytes instead of a hex str would cut the key size roughly in half.
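The size difference in the last note can be checked directly with the standard `uuid` module; this is only an illustration of the two encodings, with `hex_key` and `raw_key` as illustrative variable names.

```python
import uuid

u = uuid.uuid4()

hex_key = u.hex    # 32 hexadecimal characters (str)
raw_key = u.bytes  # the same 128 bits as 16 raw bytes

# Encoded as UTF-8, the hex form is exactly twice the size of the raw form.
print(len(hex_key.encode("utf-8")), len(raw_key))

# The raw bytes round-trip back to the same UUID.
assert uuid.UUID(bytes=raw_key) == u
```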

I think we shouldn't raise an error as of now since the chance of collision is very low.

> I wonder if performance and pipeline cost would improve if we can find a way to pass-through columns that do not need to be processed to tft, converting them to bytes if necessary, avoiding the shuffle step.

I can try this and test the performance. If this works, we can remove the CoGroupByKey altogether.

tvalentyn · 2023-11-30T05:03:36Z

sdks/python/apache_beam/ml/transforms/handlers.py

os.getpid() might make some sense in combination with uuid1, since uuid1 is evaluated based on hostname info, which is likely the same in all processes. With uuid4 it shouldn't add much additional entropy, as from my observation the process IDs for the SDK harness in Docker containers are usually the same.

Changed it.

AnandInguva · 2023-11-30T17:16:46Z

I will create a GitHub issue to track this suggestion: "I wonder if performance and pipeline cost would improve if we can find a way to pass-through columns that do not need to be processed to tft, converting them to bytes if necessary, avoiding the shuffle step."

Performance testing MLTransform should help here.

tvalentyn · 2023-12-02T00:31:58Z

sdks/python/apache_beam/ml/transforms/handlers.py

@AnandInguva please check that this condition makes sense. In particular, are there exactly two elements per key or two elements per each transformation?

There will be two elements per hash key after CoGroupByKey: one element will be the transformed dict and the other will be the untransformed dict.

Each value in the dict should consist of a list of length 1. I changed the condition a bit.
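The condition described here could look roughly like the following sketch. It assumes the grouped value is a dict from a tag to a list of element dicts; the function name `merge_cogrouped`, the tags `"transformed"` and `"untransformed"`, and the choice to let transformed columns override untransformed ones are all hypothetical and not taken from handlers.py.

```python
from typing import Any, Dict, List


def merge_cogrouped(
    key: bytes, grouped: Dict[str, List[Dict[str, Any]]]) -> Dict[str, Any]:
    # Each tag must hold exactly one element for this key. More than one
    # means two input elements ended up with the same unique ID.
    for tag, values in grouped.items():
        if len(values) != 1:
            raise RuntimeError(
                f"Expected exactly 1 element for tag {tag!r} and key {key!r}, "
                f"found {len(values)}.")
    # Assumption: start from the untransformed columns and let the
    # transformed columns override them.
    merged = dict(grouped["untransformed"][0])
    merged.update(grouped["transformed"][0])
    return merged


row = merge_cogrouped(
    b"id-1",
    {"untransformed": [{"a": 1, "b": "x"}], "transformed": [{"a": 0.5}]},
)
print(row)
```

Raising instead of silently picking one element makes a key collision visible rather than dropping data.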

Thanks, I made a few clarifying edits, PTAL.

tvalentyn · 2023-12-02T00:32:36Z

Added some suggestions; PTAL and run the MLTransform integration tests. Thanks.

sdks/python/apache_beam/examples/snippets/transforms/elementwise/mltransform_test.py

tvalentyn · 2023-12-04T19:55:13Z

@AnandInguva please merge if tests pass and new commits look good to you.

sdks/python/apache_beam/ml/transforms/handlers.py

AnandInguva · 2023-12-04T21:57:39Z

lgtm

github-actions bot added python examples labels Nov 28, 2023

AnandInguva force-pushed the add_hash_suffix branch from 009242b to af6f640 Compare November 28, 2023 16:07

AnandInguva marked this pull request as ready for review November 28, 2023 17:22

github-actions bot added the Next Action: Reviewers label Nov 28, 2023

AnandInguva requested a review from tvalentyn November 29, 2023 15:13

AnandInguva commented Nov 29, 2023

tvalentyn reviewed Nov 30, 2023

AnandInguva requested a review from tvalentyn November 30, 2023 17:16

tvalentyn reviewed Dec 2, 2023

AnandInguva mentioned this pull request Dec 4, 2023

[Bug]: MLTransform drops elements if they are already transformed before. #29600

Closed

16 tasks

tvalentyn changed the title ~~Add UUID to the hex object to avoid collisions~~ Use UUIDs instead of object hashes to avoid collisions Dec 4, 2023

tvalentyn reviewed Dec 4, 2023

sdks/python/apache_beam/examples/snippets/transforms/elementwise/mltransform_test.py

AnandInguva and others added 14 commits December 4, 2023 10:20

Add uuid

142e9d2

fix test fix test Fix test

Change class name resembling its functionality

f5e5e74

Add PID to the unique string

53da4eb

Change the unique id to be bytes

46ab687

remove decode

d3ce8ba

Replace hash computation with a combined uuid. Resulting key should have the same length.

1bbfa4d

Mark internal classes as such.

9a04ff9

misc fixup.

f166603

raise RuntimeError when more than 1 element in observed while CoGroupByKey

51d107c

Add MLTransform dropping elements to known issues

b2da3fa

Remove internal use comments since it is now evident from naming.

302fc14

Remove references to hash

535f65f

Remove references to hash

0f442f5

Remove references to hash

28dcadb

tvalentyn added 3 commits December 4, 2023 10:21

Remove references to hash

04f6b3e

Edit for clarity

804866e

Clarify helper code.

128e92d

tvalentyn force-pushed the add_hash_suffix branch from c45fd04 to 128e92d Compare December 4, 2023 19:48

tvalentyn approved these changes Dec 4, 2023

tvalentyn reviewed Dec 4, 2023

sdks/python/apache_beam/ml/transforms/handlers.py

yapf

caba332

AnandInguva merged commit c9c89fe into apache:master Dec 4, 2023
