Skip to content

Use UUIDs instead of object hashes to avoid collisions#29542

Merged
AnandInguva merged 18 commits intoapache:masterfrom
AnandInguva:add_hash_suffix
Dec 4, 2023
Merged

Use UUIDs instead of object hashes to avoid collisions#29542
AnandInguva merged 18 commits intoapache:masterfrom
AnandInguva:add_hash_suffix

Conversation

@AnandInguva
Copy link
Copy Markdown
Contributor

@AnandInguva AnandInguva commented Nov 28, 2023

Fixes: #29600

Elements before CoGroupByKey with the same hash are not grouped together and treated as duplicates. Add unique id to the hash to prevent collisions.

We should be creating unique id in the first place instead of hash object


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@AnandInguva AnandInguva marked this pull request as ready for review November 28, 2023 17:22
@github-actions
Copy link
Copy Markdown
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @damccorm for label python.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this, we can just attach a unique id to each element. We just need unique id so that we can group transformed pcoll with untransformed pcoll with the downstream CoGroupByKey Pcoll

Copy link
Copy Markdown
Contributor

@tvalentyn tvalentyn Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this, we can just attach a unique id to each element.

Not sure how accurate this is:
https://stackoverflow.com/questions/72989272/python-generates-same-uuid-over-multiple-docker-containers
but I was also thinking about how the seed will be initialized in docker context.

and UUID1 might be not thread-safe:
ClickHouse/clickhouse-connect#194

I'd expect the combination of uuid1 + os PID + uuid4 would be exceedingly unlikely to collide in beam/dataflow context even with the above considerations. we can and should detect collisions though, we can fail the pipeline if that happens.

A few other notes:

  1. Please add a known issue to Changes.MD and file a GH issue that will be fixed by this PR.
  2. Let's rename the ComputeAndAttachHashKey to ComputeAndAttachUniqueID
  3. I wonder if performance and pipeline cost would improve if we can find a way to pass-through columns that do not need to be processed to tft, converting them to bytes if necessary, avoiding the shuffle step.

Copy link
Copy Markdown
Contributor

@tvalentyn tvalentyn Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also i wonder what is the overhead of MLTransform compared to plain usage of TFT, although that's a separate question

Copy link
Copy Markdown
Contributor

@tvalentyn tvalentyn Nov 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also if keys can be bytes instead of str, using raw bytes instead of hex str would cut key size in ~half

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we shouldn't raise an error as of now since the chance of collision is very low.

I wonder if performance and pipeline cost would improve if we can find a way to pass-through columns that do not need to be processed to tft, converting them to bytes if necessary, avoiding the shuffle step.

I can try this and test the performance. If this works, we can remove the CoGroupByKey altogether.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.getpid() might make some sense in combination with uuid1 since with uuid1 is evaluated based on hostname info, which is likely the same in all processes; with uuid4 it shouldn't add much additional entropy as in docker containers from my observation process ids for sdk harness are usually the same.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it.

@AnandInguva
Copy link
Copy Markdown
Contributor Author

I will create a github issue to track >> I wonder if performance and pipeline cost would improve if we can find a way to pass-through columns that do not need to be processed to tft, converting them to bytes if necessary, avoiding the shuffle step.

Performance testing MLTransform should help here.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AnandInguva please check that this condition makes sense. In particular, are there exactly two elements per key or two elements per each transformation?

Copy link
Copy Markdown
Contributor Author

@AnandInguva AnandInguva Dec 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be two elements per hash key after CoGroupByKey, one element would be transformed dict and other will untransformed dict.

each value in the dict should consist of list of length 1. I changed the condition a bit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I made a few clarifying edits, PTAL.

@tvalentyn
Copy link
Copy Markdown
Contributor

added some suggestions, ptal and run MLTransform integration tests. Thanks.

@tvalentyn tvalentyn changed the title Add UUID to the hex object to avoid collisions Use UUIDs instead of object hashes to avoid collisions Dec 4, 2023
@tvalentyn
Copy link
Copy Markdown
Contributor

@AnandInguva please merge if tests pass and new commits look good to you.

@AnandInguva
Copy link
Copy Markdown
Contributor Author

lgtm

@AnandInguva AnandInguva merged commit c9c89fe into apache:master Dec 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: MLTransform drops elements if they are already transformed before.

2 participants