Skip to content

colexec: make hash table more dynamic and fix hash join in some cases#53226

Merged
craig[bot] merged 3 commits intocockroachdb:masterfrom
yuzefovich:hashtable
Aug 24, 2020
Merged

colexec: make hash table more dynamic and fix hash join in some cases#53226
craig[bot] merged 3 commits intocockroachdb:masterfrom
yuzefovich:hashtable

Conversation

@yuzefovich
Copy link
Copy Markdown
Member

@yuzefovich yuzefovich commented Aug 21, 2020

colexec: make hash table more dynamic

This commit replaces all fixed allocations of constant size in the hash
table in favor of allocating the slices dynamically which makes the hash
table more dynamic as well. This allows us to place the logic of
clearing up those slices for the next iteration in a single place which
simplifies the code a bit.

Additionally it moves buckets slice that is only used by the hash
joiner out of the hash table.

Release note: None

colexec: fix LEFT ANTI hash join when right eq cols are key

Release note (bug fix): Previously, CockroachDB could return incorrect
results when performing LEFT ANTI hash join when right equality columns
form a key when it was executed via the vectorized engine, and now this
has been fixed.

colexec: fix set-op hash joins and optimize them a bit

This commit fixes several problems with set-operation hash joins:

  1. similar to how LEFT ANTI hash joins are handled, INTERSECT ALL and
    EXCEPT ALL hash joins rely on headID to be populated. That step is
    skipped if the right side is distinct (meaning right equality columns
    form a key). This would result in incorrect output, and we now override
    rightDistinct flag to get the desired behavior (probably sub-optimal
    but correct).
  2. actually, before we would get an incorrect result, we would hit an
    out of bounds error because hashTable.visited would not have been
    created previously. That slice is used by set-op joins to track which
    tuples from the right side have already been "deleted" (i.e. they were
    used for a match). This is now also fixed.

However, currently the information whether the right equality columns
form a key is not propagated for set-operation joins, so the bug would
not have occurred in the non-test environment.

Additionally, this commit optimizes the handling of LEFT ANTI and EXCEPT
ALL joins by not allocating same slice because those two join types
have separate collectAnti* methods that don't use that slice.

Release note: None

@yuzefovich yuzefovich requested review from a team and jordanlewis August 21, 2020 18:46
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

Copy link
Copy Markdown
Contributor

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: but I think it'd still be good to include a description of the cleanup in the commit message

Reviewed 2 of 2 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)

@yuzefovich yuzefovich force-pushed the hashtable branch 6 times, most recently from bd74c02 to 61cf897 Compare August 21, 2020 23:27
This commit replaces all fixed allocations of constant size in the hash
table in favor of allocating the slices dynamically which makes the hash
table more dynamic as well. This allows us to place the logic of
clearing up those slices for the next iteration in a single place which
simplifies the code a bit.

Additionally it moves `buckets` slice that is only used by the hash
joiner out of the hash table.

Release note: None
@yuzefovich yuzefovich changed the title colexec: minor cleanup of the hash table colexec: make hash table more dynamic Aug 21, 2020
@yuzefovich
Copy link
Copy Markdown
Member Author

yuzefovich commented Aug 22, 2020

Alright, I've updated this PR significantly, so I think it needs another look.

Here are the benchmarks (they show minor changes) of the first commit: hash aggregator, hash joiner, and unordered distinct.

Release note (bug fix): Previously, CockroachDB could return incorrect
results when performing LEFT ANTI hash join when right equality columns
form a key when it was executed via the vectorized engine, and now this
has been fixed.
@yuzefovich yuzefovich changed the title colexec: make hash table more dynamic colexec: make hash table more dynamic and fix hash join in some cases Aug 22, 2020
@yuzefovich
Copy link
Copy Markdown
Member Author

yuzefovich commented Aug 22, 2020

I found some bugs with LEFT ANTI and set-operation joins, so I added another couple of commits. Left anti was introduced in 20.1 release, so the second commit will need to be backported.

Update: bug with set-operation joins can only be produced in the test environment because we currently don't ever set RightEqColsAreKey for set-operation joins in prod.

This commit fixes several problems with set-operation hash joins:
1. similar to how LEFT ANTI hash joins are handled, INTERSECT ALL and
EXCEPT ALL hash joins rely on `headID` to be populated. That step is
skipped if the right side is distinct (meaning right equality columns
form a key). This would result in incorrect output, and we now override
`rightDistinct` flag to get the desired behavior (probably sub-optimal
but correct).
2. actually, before we would get an incorrect result, we would hit an
out of bounds error because `hashTable.visited` would not have been
created previously. That slice is used by set-op joins to track which
tuples from the right side have already been "deleted" (i.e. they were
used for a match). This is now also fixed.

However, currently the information whether the right equality columns
form a key is not propagated for set-operation joins, so the bug would
not have occurred in the non-test environment.

Additionally, this commit optimizes the handling of LEFT ANTI and EXCEPT
ALL joins by not allocating `same` slice because those two join types
have separate `collectAnti*` methods that don't use that slice.

Release note: None
Copy link
Copy Markdown
Contributor

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: but does it make sense to squash the second and third commit? The third commit implies that the second one could still produce a panic by itself.

Reviewed 8 of 8 files at r3, 2 of 2 files at r4, 6 of 6 files at r5.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis and @yuzefovich)


pkg/sql/colexec/hashjoiner.go, line 674 at r4 (raw file):

		// matching. However, headID is only populated when the right side is
		// considered to be non-distinct, so we override that information here.
		rightDistinct = false

This feels forced because to me it seems that we're setting a seemingly unrelated variable to enable some required behavior (indeed, it seems to be the opposite of what you'd expect with right equality columns forming a key). Consider introducing a more explicit way to enable the behavior you want. Not a requirement for this PR to merge.

Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I specifically carved out the second commit into a separate one so that it was easier to backport it (set-op joins were introduced in 20.2 release, but Left Anti is present in 20.1). And the implication of the third commit is slightly different - the panic would occur only with set-op joins, but the optimization of not allocating same slice is relevant for Left Anti and Except All (I'm not backporting the optimization).

TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto and @jordanlewis)


pkg/sql/colexec/hashjoiner.go, line 674 at r4 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

This feels forced because to me it seems that we're setting a seemingly unrelated variable to enable some required behavior (indeed, it seems to be the opposite of what you'd expect with right equality columns forming a key). Consider introducing a more explicit way to enable the behavior you want. Not a requirement for this PR to merge.

I agree, it's a bit "forced", but we already do a similar thing in Left Semi case above (just in the opposite direction), so I think it's ok. The reasoning is that we want the "non-distinctness" behavior to be triggered, regardless of the fact whether the right side is distinct or not.

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 24, 2020

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 24, 2020

Build succeeded:

@craig craig bot merged commit a40c8cb into cockroachdb:master Aug 24, 2020
@yuzefovich yuzefovich deleted the hashtable branch August 24, 2020 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants