colexec: make hash table more dynamic and fix hash join in some cases#53226
colexec: make hash table more dynamic and fix hash join in some cases#53226craig[bot] merged 3 commits intocockroachdb:masterfrom
Conversation
asubiotto
left a comment
There was a problem hiding this comment.
but I think it'd still be good to include a description of the cleanup in the commit message
Reviewed 2 of 2 files at r2.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)
bd74c02 to
61cf897
Compare
This commit replaces all fixed allocations of constant size in the hash table in favor of allocating the slices dynamically which makes the hash table more dynamic as well. This allows us to place the logic of clearing up those slices for the next iteration in a single place which simplifies the code a bit. Additionally it moves `buckets` slice that is only used by the hash joiner out of the hash table. Release note: None
|
Alright, I've updated this PR significantly, so I think it needs another look. Here are the benchmarks (they show minor changes) of the first commit: hash aggregator, hash joiner, and unordered distinct. |
Release note (bug fix): Previously, CockroachDB could return incorrect results when performing LEFT ANTI hash join when right equality columns form a key when it was executed via the vectorized engine, and now this has been fixed.
|
I found some bugs with LEFT ANTI and set-operation joins, so I added another couple of commits. Left anti was introduced in 20.1 release, so the second commit will need to be backported. Update: bug with set-operation joins can only be produced in the test environment because we currently don't ever set |
This commit fixes several problems with set-operation hash joins: 1. similar to how LEFT ANTI hash joins are handled, INTERSECT ALL and EXCEPT ALL hash joins rely on `headID` to be populated. That step is skipped if the right side is distinct (meaning right equality columns form a key). This would result in incorrect output, and we now override `rightDistinct` flag to get the desired behavior (probably sub-optimal but correct). 2. actually, before we would get an incorrect result, we would hit an out of bounds error because `hashTable.visited` would not have been created previously. That slice is used by set-op joins to track which tuples from the right side have already been "deleted" (i.e. they were used for a match). This is now also fixed. However, currently the information whether the right equality columns form a key is not propagated for set-operation joins, so the bug would not have occurred in the non-test environment. Additionally, this commit optimizes the handling of LEFT ANTI and EXCEPT ALL joins by not allocating `same` slice because those two join types have separate `collectAnti*` methods that don't use that slice. Release note: None
asubiotto
left a comment
There was a problem hiding this comment.
but does it make sense to squash the second and third commit? The third commit implies that the second one could still produce a panic by itself.
Reviewed 8 of 8 files at r3, 2 of 2 files at r4, 6 of 6 files at r5.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)
pkg/sql/colexec/hashjoiner.go, line 674 at r4 (raw file):
// matching. However, headID is only populated when the right side is // considered to be non-distinct, so we override that information here. rightDistinct = false
This feels forced because to me it seems that we're setting a seemingly unrelated variable to enable some required behavior (indeed, it seems to be the opposite of what you'd expect with right equality columns forming a key). Consider introducing a more explicit way to enable the behavior you want. Not a requirement for this PR to merge.
yuzefovich
left a comment
There was a problem hiding this comment.
I specifically carved out the second commit into a separate one so that it was easier to backport it (set-op joins were introduced in 20.2 release, but Left Anti is present in 20.1). And the implication of the third commit is slightly different - the panic would occur only with set-op joins, but the optimization of not allocating same slice is relevant for Left Anti and Except All (I'm not backporting the optimization).
TFTR!
bors r+
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)
pkg/sql/colexec/hashjoiner.go, line 674 at r4 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
This feels forced because to me it seems that we're setting a seemingly unrelated variable to enable some required behavior (indeed, it seems to be the opposite of what you'd expect with right equality columns forming a key). Consider introducing a more explicit way to enable the behavior you want. Not a requirement for this PR to merge.
I agree, it's a bit "forced", but we already do a similar thing in Left Semi case above (just in the opposite direction), so I think it's ok. The reasoning is that we want the "non-distinctness" behavior to be triggered, regardless of the fact whether the right side is distinct or not.
|
Build failed (retrying...): |
|
Build succeeded: |
colexec: make hash table more dynamic
This commit replaces all fixed allocations of constant size in the hash
table in favor of allocating the slices dynamically which makes the hash
table more dynamic as well. This allows us to place the logic of
clearing up those slices for the next iteration in a single place which
simplifies the code a bit.
Additionally it moves
bucketsslice that is only used by the hashjoiner out of the hash table.
Release note: None
colexec: fix LEFT ANTI hash join when right eq cols are key
Release note (bug fix): Previously, CockroachDB could return incorrect
results when performing LEFT ANTI hash join when right equality columns
form a key when it was executed via the vectorized engine, and now this
has been fixed.
colexec: fix set-op hash joins and optimize them a bit
This commit fixes several problems with set-operation hash joins:
EXCEPT ALL hash joins rely on
headIDto be populated. That step isskipped if the right side is distinct (meaning right equality columns
form a key). This would result in incorrect output, and we now override
rightDistinctflag to get the desired behavior (probably sub-optimalbut correct).
out of bounds error because
hashTable.visitedwould not have beencreated previously. That slice is used by set-op joins to track which
tuples from the right side have already been "deleted" (i.e. they were
used for a match). This is now also fixed.
However, currently the information whether the right equality columns
form a key is not propagated for set-operation joins, so the bug would
not have occurred in the non-test environment.
Additionally, this commit optimizes the handling of LEFT ANTI and EXCEPT
ALL joins by not allocating
sameslice because those two join typeshave separate
collectAnti*methods that don't use that slice.Release note: None