Reject duplicated keys in abstract join node by rui-mo · Pull Request #6016 · facebookincubator/velox

rui-mo · 2023-08-07T06:20:10Z

Reject join nodes with duplicated join keys because this case does not work in Velox, and the planner should avoid planning such joins.

rui-mo · 2023-08-07T06:21:00Z

@philo-he Could you help take a review? Thanks.

philo-he · 2023-08-08T08:16:55Z

> @philo-he Could you help take a review? Thanks.

LGTM. Let's wait for the community's feedback.

rui-mo · 2023-08-08T08:18:36Z

@mbasmanova Could you help review this PR? Thanks.

mbasmanova

This looks like a bug in the application. The join should not have duplicate join keys (we need to add a check to the plan node's constructor). The planner should have removed the duplicates.

rui-mo · 2023-08-08T13:54:31Z

@mbasmanova Thanks for your review. This behavior has been enabled in Spark since apache/spark@fdadc4b, which allowed the same key to be used repeatedly.

> The planner should have removed the duplicates.

To remove duplicated keys, I assume an extra projection is needed in the Spark plan to produce column aliases for this shared column, so that the aliases can serve as distinct keys in the Velox join.
Since Velox actually supports duplicate keys, could we consider enabling it with this small fix? I also tried duplicated keys on the probe side, and that also works in Velox.

mbasmanova · 2023-08-08T14:20:59Z

> Since Velox actually supports duplicate keys, could we consider enabling it with this small fix? I also tried duplicated keys on the probe side, and that also works in Velox.

It may happen to work today (even though it doesn't; otherwise there would be no need for the fix). However, supporting this case makes it difficult to reason about the code and increases maintenance costs. It would be best to add a check to the plan node to reject join nodes with duplicate join keys.

rui-mo · 2023-08-09T06:44:52Z

It looks like using std::sort and std::adjacent_find is not suitable for this case.

Constructing an unordered_map and comparing the map size with the number of fields requires copying objects.

So the for loop below is still used to detect duplicates.
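
For readers following along, the loop-based check described above could look roughly like the sketch below. This is not the code from the PR: `FieldRef`, `hasDuplicateKeys`, and `checkNoDuplicateKeys` are hypothetical names, keys are compared by column name as a simplifying assumption, and a standard exception stands in for whatever check macro the real plan node would use.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a join key expression; not a Velox type.
struct FieldRef {
  std::string name;
};
using FieldRefPtr = std::shared_ptr<const FieldRef>;

// Pairwise comparison: O(n^2), but it neither reorders nor copies the keys.
// Join key lists are usually short, so the quadratic cost is negligible.
bool hasDuplicateKeys(const std::vector<FieldRefPtr>& keys) {
  for (std::size_t i = 0; i < keys.size(); ++i) {
    for (std::size_t j = i + 1; j < keys.size(); ++j) {
      // Assumption: two keys are duplicates when they name the same column.
      if (keys[i]->name == keys[j]->name) {
        return true;
      }
    }
  }
  return false;
}

// Hypothetical constructor-time check that rejects duplicated keys.
void checkNoDuplicateKeys(const std::vector<FieldRefPtr>& keys) {
  if (hasDuplicateKeys(keys)) {
    throw std::invalid_argument("Join keys must not contain duplicates");
  }
}
```

With keys `{"u0", "u0"}` the check throws, while `{"u0", "u1"}` passes.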

rui-mo · 2023-08-09T06:46:02Z

@mbasmanova Changed this PR to reject duplicated join keys. Could you review again? Thanks.

mbasmanova · 2023-08-09T07:17:10Z

+          .project({"c0 AS t0", "c1 as t1"})
+          .hashJoin(
+              {"t0", "t1"},
+              {"u0", "u0"},


@rui-mo Rui, I may have misunderstood the use case. I was thinking it is SELECT * FROM t, u WHERE t.key = u.key AND t.key = u.key (a pair of duplicate join keys), but it looks like you have a different case: SELECT * FROM t, u WHERE t.key1 = u.key AND t.key2 = u.key. Is this so?

Are you changing the plan to add a filter for 't' before the join?

Join t.key1 = u.key
    Filter t.key1 = t.key2
        TableScan t
    TableScan u

rui-mo

Yes, that's the case. This is because Spark allows that behavior and generates such joins, but indeed, we need to change the plan if Velox rejects it.

mbasmanova

Got it. Thank you for clarifying.

We may need to think about this a bit. In the case of an inner join, a filter over the table scan is a better plan, but left|right|outer joins may still need to support this case. What do you think?

rui-mo

It is unclear to me which performs better. Adding a filter simplifies and speeds up the join, but the filter itself takes time to execute. If this kind of plan is functionally needed in Velox in some cases, I can continue to work on it in this PR.

mbasmanova

For non-inner joins, this would be a question of correctness, not performance. It may not be possible to rewrite these joins using a filter. Any chance you could create a separate PR with your original change?
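
To make the correctness concern concrete, here is a toy sketch (plain C++, not Velox code; all type and function names are hypothetical) comparing a left join on `t.key1 = u.key AND t.key2 = u.key` with the rewritten plan that filters `t.key1 = t.key2` before joining on `t.key1 = u.key`. The left join must keep every row of `t`, while the filter drops rows where the two keys differ.

```cpp
#include <optional>
#include <vector>

// Toy rows for illustration only; these are not Velox classes.
struct TRow {
  int key1;
  int key2;
};
struct URow {
  int key;
};
struct JoinedRow {
  TRow t;
  std::optional<URow> u; // empty when the t row has no match (NULL side)
};

// LEFT JOIN ON t.key1 = u.key AND t.key2 = u.key.
std::vector<JoinedRow> leftJoinBothKeys(
    const std::vector<TRow>& t, const std::vector<URow>& u) {
  std::vector<JoinedRow> out;
  for (const auto& tr : t) {
    bool matched = false;
    for (const auto& ur : u) {
      if (tr.key1 == ur.key && tr.key2 == ur.key) {
        out.push_back({tr, ur});
        matched = true;
      }
    }
    if (!matched) {
      out.push_back({tr, std::nullopt}); // unmatched t rows are preserved
    }
  }
  return out;
}

// Rewritten plan: Filter(t.key1 = t.key2), then LEFT JOIN ON t.key1 = u.key.
std::vector<JoinedRow> filterThenLeftJoin(
    const std::vector<TRow>& t, const std::vector<URow>& u) {
  std::vector<JoinedRow> out;
  for (const auto& tr : t) {
    if (tr.key1 != tr.key2) {
      continue; // the filter removes this row before the join sees it
    }
    bool matched = false;
    for (const auto& ur : u) {
      if (tr.key1 == ur.key) {
        out.push_back({tr, ur});
        matched = true;
      }
    }
    if (!matched) {
      out.push_back({tr, std::nullopt});
    }
  }
  return out;
}
```

With `t = {(1, 1), (1, 2)}` and `u = {(1)}`, the first function returns two rows (the second with an empty `u` side), while the rewritten plan returns only one. For an inner join the unmatched-row branch disappears and the two plans agree, which is why the filter rewrite is safe only there.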

rui-mo

Sure. I'll add more tests to verify left/right/outer joins.

rui-mo

@mbasmanova Opened #6084. Could you take a look? I found that in Velox, full outer join is supported via NestedLoopJoin, so a test for it has not been added.

mbasmanova

@rui-mo I'm out next week. Can you ping me on the following Monday?

BTW, NestedLoopJoin should only be used when there is no equi clause for the join.

rui-mo

@mbasmanova Do you have time to review #6084 again? Tests for full outer join have also been added.

rui-mo · 2023-08-25T07:03:55Z

Replaced by #6084. Closing this PR.

facebook-github-bot added the CLA Signed label Aug 7, 2023

mbasmanova requested changes Aug 8, 2023

rui-mo force-pushed the wip_reserve branch from 100ee96 to ecc0ab9 Compare August 9, 2023 02:42

rui-mo changed the title ~~Fix negative reserve in hash build when duplicated keys exist~~ Reject duplicated keys in abstract join node Aug 9, 2023

reject duplicated keys

97323d4

rui-mo commented Aug 9, 2023

rui-mo force-pushed the wip_reserve branch from ecc0ab9 to 97323d4 Compare August 9, 2023 06:48

mbasmanova reviewed Aug 9, 2023

rui-mo closed this Aug 25, 2023

rui-mo deleted the wip_reserve branch February 20, 2024 03:17

FelixYBW mentioned this pull request Feb 7, 2026

[VL] useful Velox PRs not merged into upstream apache/gluten#11585

Open
