Skip to content

colflow: fix premature vectorized stats collection#62179

Closed
yuzefovich wants to merge 1 commit intocockroachdb:masterfrom
yuzefovich:stats-cleanup
Closed

colflow: fix premature vectorized stats collection#62179
yuzefovich wants to merge 1 commit intocockroachdb:masterfrom
yuzefovich:stats-cleanup

Conversation

@yuzefovich
Copy link
Copy Markdown
Member

@yuzefovich yuzefovich commented Mar 17, 2021

Previously, we could call finishVectorizedStatsCollectors prematurely
in some edge cases (e.g. when the right side of the hash join is empty,
we could collect the stats from the left input tree too early). This was
caused by the fact that we accumulate the vectorized stats collectors
into a queue which is handled by the outboxes or the root materializer.
This commit begins sharing the responsibility of the collecting stats
with the hash router too. The reasoning for this change is that when the
hash router exits, all stats collectors in its input tree should have
the final info, and the hash router can collect correct stats. Those
stats are added as another metadata object to be returned by the last to
exit router output.

Fixes: #56928.

Release note (bug fix): Previously, CockroachDB when collecting
execution statistics (e.g. when running EXPLAIN ANALYZE) could collect
them prematurely which would result in incorrect stats. This is now
fixed.

@yuzefovich yuzefovich requested review from a team and asubiotto March 17, 2021 23:26
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@yuzefovich yuzefovich force-pushed the stats-cleanup branch 2 times, most recently from f18f901 to 439092f Compare March 18, 2021 02:08
@yuzefovich
Copy link
Copy Markdown
Member Author

yuzefovich commented Mar 18, 2021

I briefly looked into why spec-planning configs fail for explain_analyze logic test, and I don't think the failures have to do with this change, most likely it's about the implementation of EXPLAIN in the new factory. I think we can ignore that issue for now, and we'll investigate as part of #47473.

Copy link
Copy Markdown
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: (but I would wait for Alfonso's review too)

Yeah, I think there are some fixes that are necessary in the new factory, safe to ignore for now.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)


pkg/sql/colflow/vectorized_flow.go, line 728 at r1 (raw file):

	if s.recordingStats && len(s.vectorizedStatsCollectorsQueue) > 0 {
		vscs := append([]vectorizedStatsCollector(nil), s.vectorizedStatsCollectorsQueue...)
		s.vectorizedStatsCollectorsQueue = s.vectorizedStatsCollectorsQueue[:0]

[nit] we seem to repeat this code in quite a few places, would be helpful to move to a function. I don't know what to call this operation - "reset queue"?

Previously, we could call `finishVectorizedStatsCollectors` prematurely
in some edge cases (e.g. when the right side of the hash join is empty,
we could collect the stats from the left input tree too early). This was
caused by the fact that we accumulate the vectorized stats collectors
into a queue which is handled by the outboxes or the root materializer.
This commit begins sharing the responsibility of the collecting stats
with the hash router too. The reasoning for this change is that when the
hash router exits, all stats collectors in its input tree should have
the final info, and the hash router can collect correct stats. Those
stats are added as another metadata object to be returned by the last to
exit router output.

Release note (bug fix): Previously, CockroachDB when collecting
execution statistics (e.g. when running EXPLAIN ANALYZE) could collect
them prematurely which would result in incorrect stats. This is now
fixed.
Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto)


pkg/sql/colflow/vectorized_flow.go, line 728 at r1 (raw file):

Previously, RaduBerinde wrote…

[nit] we seem to repeat this code in quite a few places, would be helpful to move to a function. I don't know what to call this operation - "reset queue"?

Done.

Copy link
Copy Markdown
Contributor

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should mention in your commit message that the reason this works is that currently the hash router is the only component to split an input stream. We should probably add a regression test that would fail when a new component that splits a stream is added but does not finish its input stat collectors.

Reviewed 1 of 6 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)

@yuzefovich
Copy link
Copy Markdown
Member Author

I'll update the commit message, but what kind of unit test do you have in mind? It seems to me it is not possible to write such a unit test before we actually introduce that another component that would split streams.

@asubiotto
Copy link
Copy Markdown
Contributor

I was thinking more along the lines of checking invariants in logic tests.

@yuzefovich
Copy link
Copy Markdown
Member Author

I guess I see an option of introducing some special checker operator that intercept calls to finishVectorizedStatsCollectors. This would require updating the flow setup code as well so that we plan those checkers whenever we split streams, but then again we would need to update the planning code whenever such new component is introduced, and if we don't, then the checker framework wouldn't have caught such a regression.

@yuzefovich
Copy link
Copy Markdown
Member Author

yuzefovich commented Mar 18, 2021

Just to make sure we're on the same page: the problem is not that we don't finish some of the stats collectors, but that we do it too early. I imagine it would be easy to add a check for the former, but I don't see how we would enforce the latter.

Copy link
Copy Markdown
Contributor

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Can't we inject a dummy stat collector that is also a MetadataSource in each input tree and simply assert that the collector is not finished before DrainMeta is called?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)

@yuzefovich
Copy link
Copy Markdown
Member Author

That's an interesting idea, I'll need to think a bit more (I'm concerned that we might be entangling stats collection and metadata collection even more, but I guess it is acceptable), but I think it would work.

@yuzefovich
Copy link
Copy Markdown
Member Author

Closing in favor of #62255

@yuzefovich yuzefovich closed this Mar 19, 2021
@rafiss rafiss added this to the 21.1 milestone Apr 22, 2021
@yuzefovich yuzefovich deleted the stats-cleanup branch April 27, 2021 01:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql: vectorized stats collectors can finish prematurely

5 participants