Skip to content

colexec: derive a tracing span in materializer when collecting stats#81995

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
yuzefovich:jr-stats
Aug 11, 2022
Merged

colexec: derive a tracing span in materializer when collecting stats#81995
craig[bot] merged 1 commit intocockroachdb:masterfrom
yuzefovich:jr-stats

Conversation

@yuzefovich
Copy link
Copy Markdown
Member

@yuzefovich yuzefovich commented May 27, 2022

If we don't do this, then the stats would be attached to the span of
the materializer's user, and if that user itself has a lot of
payloads to attach (e.g. a joinReader attaching the KV keys it looked
up), then the stats might be dropped based on the maximum size of
structured payload per tracing span of 10KiB (see
tracing.maxStructuredBytesPerSpan). Deriving a separate span
guarantees that the stats won't be dropped.

This required some changes to make a test - that makes too many
assumptions about tracing infra - work.

Release note: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@yuzefovich
Copy link
Copy Markdown
Member Author

Here is an example of TPCH Q9 before and after the fix. Note how JoinReader/5 and JoinReader/6 are missing stats. Confusingly, the behavior is non-deterministic, at least partially (probably because the traced events in KV or in storage are different).

@yuzefovich yuzefovich requested review from a team, cucaroach and msirek May 28, 2022 01:01
@yuzefovich yuzefovich marked this pull request as ready for review May 28, 2022 01:01
Copy link
Copy Markdown
Contributor

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix!
:lgtm:

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)


pkg/sql/colexec/materializer.go line 257 at r2 (raw file):

	} else {
		ctx = m.StartInternalNoSpan(ctx)
	}

The code changes look good, but I have a question. The RowSource interface says: "RowSources that consume other RowSources are expected to Start() their inputs. " Is there are reason that Materializer.Start doesn't call Start on its input?

Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reproduction here requires a lot of data, so I think it's ok to merge without a regression test. There isn't really any downside to this change too (having an extra tracing span incurs some overhead, but when we're collecting stats, we don't really care about that).

TFTR!

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @msirek)


pkg/sql/colexec/materializer.go line 257 at r2 (raw file):

Previously, msirek (Mark Sirek) wrote…

The code changes look good, but I have a question. The RowSource interface says: "RowSources that consume other RowSources are expected to Start() their inputs. " Is there are reason that Materializer.Start doesn't call Start on its input?

Materializer is an adapter from the colexecop.Operator interface to the RowSource interface. Operator.Init is roughly equivalent to RowSource.Start, and a few lines below, wrapped in a panic catcher, we do call Init on the input. The adapter in the opposite direction (from the RowSource to Operator) is the Columnarizer, and Columnarizer.Init calls input.Start.

Copy link
Copy Markdown
Contributor

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)


pkg/sql/colexec/materializer.go line 257 at r2 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Materializer is an adapter from the colexecop.Operator interface to the RowSource interface. Operator.Init is roughly equivalent to RowSource.Start, and a few lines below, wrapped in a panic catcher, we do call Init on the input. The adapter in the opposite direction (from the RowSource to Operator) is the Columnarizer, and Columnarizer.Init calls input.Start.

👍

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 1, 2022

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 1, 2022

Build failed (retrying...):

@yuzefovich
Copy link
Copy Markdown
Member Author

Looks like the newly added tracing span is being used after it was finished, I'll need to look into that.

bors r-

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 1, 2022

Canceled.

@yuzefovich yuzefovich force-pushed the jr-stats branch 4 times, most recently from 7fee8d2 to fd0d5d3 Compare June 2, 2022 03:07
Copy link
Copy Markdown
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to make some tweaks to how the materializer is closed. Please take another look.

The problem was that some of the closers would try to log an event into the already finished tracing span (which happens automatically in InternalClose, which was previously done before closing the closers), so this would lead to use-after-finish crash with tracing infra. Some examples of such closers are rowexec.orderedAggregator.close, rowexec.hashAggregator.close, colexec.externalSorter.Close.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @msirek)


pkg/sql/colexec/materializer.go line 257 at r2 (raw file):

Previously, msirek (Mark Sirek) wrote…

👍

Done.

Copy link
Copy Markdown
Contributor

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. The updated code looks good.

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @msirek)

@yuzefovich
Copy link
Copy Markdown
Member Author

This change doesn't seem worth backporting given that it's not exactly trivial, so I think we'll just merge it on master.

TFTRs!

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 2, 2022

Build failed:

@yuzefovich
Copy link
Copy Markdown
Member Author

Hm, looks like this change slows down the tests somehow so that they time out (or more likely to time out). I'll need to take a closer look at that.

@yuzefovich yuzefovich marked this pull request as draft June 24, 2022 18:10
@yuzefovich yuzefovich added the do-not-merge bors won't merge a PR with this label. label Jun 24, 2022
@yuzefovich yuzefovich force-pushed the jr-stats branch 2 times, most recently from c88ee56 to 220b6ff Compare July 19, 2022 17:14
@yuzefovich yuzefovich force-pushed the jr-stats branch 3 times, most recently from ab41ee5 to 243bb42 Compare August 4, 2022 00:44
@yuzefovich yuzefovich force-pushed the jr-stats branch 2 times, most recently from f91d5da to 530db9c Compare August 10, 2022 16:55
If we don't do this, then the stats would be attached to the span of
the materializer's user, and if that user itself has a lot of
payloads to attach (e.g. a joinReader attaching the KV keys it looked
up), then the stats might be dropped based on the maximum size of
structured payload per tracing span of 10KiB (see
`tracing.maxStructuredBytesPerSpan`). Deriving a separate span
guarantees that the stats won't be dropped.

This required some changes to make a test - that makes too many
assumptions about tracing infra - work.

Release note: None
@yuzefovich yuzefovich marked this pull request as ready for review August 11, 2022 02:53
@yuzefovich yuzefovich requested a review from a team August 11, 2022 02:53
@yuzefovich
Copy link
Copy Markdown
Member Author

I think I figured it out - TestAlreadyRunningJobsAreHandledProperly relies on how the traces work, and this change was breaking those assumptions. I added some testing knobs to make those assumptions true again.

Since I only modified the testing knobs behavior since I got LGTMs, I don't think extra reviews are needed.

The only failure I saw on CI is #85934 which is present on master.

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 11, 2022

👎 Rejected by label

@yuzefovich yuzefovich removed the do-not-merge bors won't merge a PR with this label. label Aug 11, 2022
@yuzefovich
Copy link
Copy Markdown
Member Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 11, 2022

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 11, 2022

Build succeeded:

@craig craig bot merged commit e011dd4 into cockroachdb:master Aug 11, 2022
@yuzefovich yuzefovich deleted the jr-stats branch August 11, 2022 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants