Reject duplicated keys in abstract join node#6016
Reject duplicated keys in abstract join node#6016rui-mo wants to merge 1 commit intofacebookincubator:mainfrom
Conversation
✅ Deploy Preview for meta-velox canceled.
|
|
@philo-he Could you help take a review? Thanks. |
LGTM. Let's wait for the community's feedback. |
|
@mbasmanova Could you help review this PR? Thanks. |
mbasmanova
left a comment
There was a problem hiding this comment.
This looks like a bug in the application. The join should not have duplicate join keys (we need to add a check to plan node's constructor). The planner should have removes the duplicates.
|
@mbasmanova Thanks for your review. This behavior was enabled in Spark since apache/spark@fdadc4b, which allowed the same key being used repeatedly.
To remove duplicated keys, I assume an exta projection is needed in Spark plan to produce column aliases for this shared column as keys in Velox join. |
It may happen to work today (even though it doesn't; otherwise there would be no need for the fix). However, supporting this case makes it difficult to reason about the code and increases maintenance costs. It would be best to add a check to plan node to reject join nodes with duplicate join keys. |
There was a problem hiding this comment.
- Looks using sort and std::adjacent_find is not proper for this case.
- Constructing an unordered_map and comparing the map size with fields size requires objects copy.
So below for loop is still used to compare duplicates.
|
@mbasmanova Changed this PR to reject duplicated join keys. Could you review again? Thanks. |
| .project({"c0 AS t0", "c1 as t1"}) | ||
| .hashJoin( | ||
| {"t0", "t1"}, | ||
| {"u0", "u0"}, |
There was a problem hiding this comment.
@rui-mo Rui, I may have mis-understood the use case. I was thinking it is SELECT * FROM t, u WHERE t.key = u.key AND t.key = u.key (a pair of duplicate join keys), but looks like you have a difference case: SELECT * FROM t, u WHERE t.key1 = u.key AND t.key2 = u.key. Is this so?
Are you changing the plan to add a filter for 't' before the join?
- Join t.key1 = u.key
- Filter t.key1 = t.key2
- TableScan t
- TableScan u
- Filter t.key1 = t.key2
There was a problem hiding this comment.
Yes, that's the case. This is because Spark allows that behavior and generates such join, but indeed, we need to change the plan if Velox rejects it.
There was a problem hiding this comment.
Got it. Thank you for clarifying.
There was a problem hiding this comment.
We may need to think about this a bit. In case of inner join, a filter over table scan is a better plan, but left|right|outer joins may still need to support this case. What do you think?
There was a problem hiding this comment.
It is obscure to me which is better on performance. By adding a filter, join can be simplified and improved, but the filter requires some time for execution. If this kind of plan is functionally needed in Velox in some cases, I can continue to work on it in this PR.
There was a problem hiding this comment.
For non-inner joins, this would the question of correctness, not performance. It may not be possible to rewrite these joins using a filter. Any chance you could create a separate PR with your original change?
There was a problem hiding this comment.
Sure. I'll add more tests to verify left/right/outer join.
There was a problem hiding this comment.
@mbasmanova Opened #6084. Could you take a look? I find in Velox, full outer join is supported with NestedLoopJoin so its test is not added.
There was a problem hiding this comment.
@rui-mo I'm out next week. Can you ping me on the following Monday?
BTW, NestedLoopJoin should only be used when there is no equi clause for the join.
There was a problem hiding this comment.
@mbasmanova Do you have time to review #6084 again? Tests for full outer is also added.
|
Replaced by #6084. Closing this PR. |
Reject join nodes with duplicated join keys because this case does not work in Velox, and the planner should avoid planning such join.