Skip to content

Reject duplicated keys in abstract join node#6016

Closed
rui-mo wants to merge 1 commit intofacebookincubator:mainfrom
rui-mo:wip_reserve
Closed

Reject duplicated keys in abstract join node#6016
rui-mo wants to merge 1 commit intofacebookincubator:mainfrom
rui-mo:wip_reserve

Conversation

@rui-mo
Copy link
Copy Markdown
Contributor

@rui-mo rui-mo commented Aug 7, 2023

Reject join nodes with duplicated join keys because this case does not work in Velox, and the planner should avoid planning such join.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 7, 2023
@netlify
Copy link
Copy Markdown

netlify Bot commented Aug 7, 2023

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 97323d4
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/64d336b471362a0008c98605

@rui-mo
Copy link
Copy Markdown
Contributor Author

rui-mo commented Aug 7, 2023

@philo-he Could you help take a review? Thanks.

@philo-he
Copy link
Copy Markdown
Contributor

philo-he commented Aug 8, 2023

@philo-he Could you help take a review? Thanks.

LGTM. Let's wait for the community's feedback.

@rui-mo
Copy link
Copy Markdown
Contributor Author

rui-mo commented Aug 8, 2023

@mbasmanova Could you help review this PR? Thanks.

Copy link
Copy Markdown
Contributor

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a bug in the application. The join should not have duplicate join keys (we need to add a check to plan node's constructor). The planner should have removes the duplicates.

@rui-mo
Copy link
Copy Markdown
Contributor Author

rui-mo commented Aug 8, 2023

@mbasmanova Thanks for your review. This behavior was enabled in Spark since apache/spark@fdadc4b, which allowed the same key being used repeatedly.

The planner should have removes the duplicates.

To remove duplicated keys, I assume an exta projection is needed in Spark plan to produce column aliases for this shared column as keys in Velox join.
Since Velox actually supports duplicate keys, do we consider enabling it with this small fix? I also tried duplicated keys on probe side, and it also works in Velox.

@mbasmanova
Copy link
Copy Markdown
Contributor

Since Velox actually supports duplicate keys, do we consider enabling it with this small fix? I also tried duplicated keys on probe side, and it also works in Velox.

It may happen to work today (even though it doesn't; otherwise there would be no need for the fix). However, supporting this case makes it difficult to reason about the code and increases maintenance costs. It would be best to add a check to plan node to reject join nodes with duplicate join keys.

@rui-mo rui-mo changed the title Fix negative reserve in hash build when duplicated keys exist Reject duplicated keys in abstract join node Aug 9, 2023
Comment thread velox/core/PlanNode.cpp Outdated
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Looks using sort and std::adjacent_find is not proper for this case.
  • Constructing an unordered_map and comparing the map size with fields size requires objects copy.

So below for loop is still used to compare duplicates.

@rui-mo
Copy link
Copy Markdown
Contributor Author

rui-mo commented Aug 9, 2023

@mbasmanova Changed this PR to reject duplicated join keys. Could you review again? Thanks.

.project({"c0 AS t0", "c1 as t1"})
.hashJoin(
{"t0", "t1"},
{"u0", "u0"},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo Rui, I may have mis-understood the use case. I was thinking it is SELECT * FROM t, u WHERE t.key = u.key AND t.key = u.key (a pair of duplicate join keys), but looks like you have a difference case: SELECT * FROM t, u WHERE t.key1 = u.key AND t.key2 = u.key. Is this so?

Are you changing the plan to add a filter for 't' before the join?

  • Join t.key1 = u.key
    • Filter t.key1 = t.key2
      • TableScan t
    • TableScan u

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the case. This is because Spark allows that behavior and generates such join, but indeed, we need to change the plan if Velox rejects it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. Thank you for clarifying.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to think about this a bit. In case of inner join, a filter over table scan is a better plan, but left|right|outer joins may still need to support this case. What do you think?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is obscure to me which is better on performance. By adding a filter, join can be simplified and improved, but the filter requires some time for execution. If this kind of plan is functionally needed in Velox in some cases, I can continue to work on it in this PR.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For non-inner joins, this would the question of correctness, not performance. It may not be possible to rewrite these joins using a filter. Any chance you could create a separate PR with your original change?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I'll add more tests to verify left/right/outer join.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova Opened #6084. Could you take a look? I find in Velox, full outer join is supported with NestedLoopJoin so its test is not added.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rui-mo I'm out next week. Can you ping me on the following Monday?

BTW, NestedLoopJoin should only be used when there is no equi clause for the join.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova Do you have time to review #6084 again? Tests for full outer is also added.

@rui-mo
Copy link
Copy Markdown
Contributor Author

rui-mo commented Aug 25, 2023

Replaced by #6084. Closing this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants