sql: only create real spans when session tracing/sampling#61777
sql: only create real spans when session tracing/sampling#61777craig[bot] merged 2 commits intocockroachdb:masterfrom
Conversation
|
Looks like the failure is an OOB on this line: cockroach/pkg/sql/sem/builtins/builtins.go Line 3633 in c7f9851 A non-nil span with an expected recording has none. |
86415df to
5cdd4b0
Compare
fa2136b to
a948908
Compare
|
Looking just at the My first guess is that a |
a948908 to
9127386
Compare
|
Heh, I saw your comment too late. I was just about to ask for your (or @yuzefovich's) second pair of eyes. I don't think the colflow outbox code correctly deals with the possibility that the span available to it is a no-op one. What do you suppose should be done? Does it make sense to force real span creation at outboxes? I suppose not, which is exactly the behavior #61380 looks to change. |
|
cockroach/pkg/sql/execinfrapb/data.go Lines 254 to 255 in 70dd877 Is where the empty TraceData is being set. |
|
Looks like we were willing to propagate empty traces before. We'll need the following diff, I think: diff --git a/pkg/sql/colflow/colrpc/outbox.go b/pkg/sql/colflow/colrpc/outbox.go
index f9395956b0..b97d494943 100644
--- a/pkg/sql/colflow/colrpc/outbox.go
+++ b/pkg/sql/colflow/colrpc/outbox.go
@@ -298,7 +298,7 @@ func (o *Outbox) sendMetadata(ctx context.Context, stream flowStreamClient, errT
o.span.RecordStructured(s)
}
}
- if trace := execinfra.GetTraceData(ctx); trace != nil {
+ if trace := execinfra.GetTraceData(ctx); trace != nil && len(trace) != 0 {
msg.Data.Metadata = append(msg.Data.Metadata, execinfrapb.RemoteProducerMetadata{
Value: &execinfrapb.RemoteProducerMetadata_TraceData_{
TraceData: &execinfrapb.RemoteProducerMetadata_TraceData{ |
7159e29 to
b5f52f9
Compare
I'm not sure what to say here yet, but I'd encourage you to take a look at modifying
Also I'm not quite sure how remote child spans deal interact with root spans, but I imagine we do need to have "real" spans in the outboxes when the execution stats are being collected. Yeah, #61380 was trying to avoid doing that but also to fix the misuse of the API (where we started and stopped spans only for the purposes of the stats propagation). Overall, I think I might not have the right terminology and understanding of the impact here, possibly Alfonso will have some suggestions/clarifications. |
f721a58 to
2b70653
Compare
2b70653 to
f278b1f
Compare
|
This should now be ready for a real look. It's surprisingly small diff, but it was a bit unwieldy to get right (most of sql is unfamiliar to me). Given we're hoping to backport this to release-21.1, we should give this a close look and maybe let it bake for a few days. |
tbg
left a comment
There was a problem hiding this comment.
Looks good to me, but I'm not an expert on the SQL so my LGTM is really just for the tracing package changes.
Reviewed 1 of 1 files at r1, 11 of 11 files at r2.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)
asubiotto
left a comment
There was a problem hiding this comment.
Reviewed 1 of 1 files at r1.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @yuzefovich)
pkg/sql/exec_util.go, line 1656 at r2 (raw file):
"session tracing", tracing.WithForceRealSpan()) sp.SetVerbose(true) st.ex.state.Ctx = newTxnCtx
Does it make sense to use a ctxHolder here instead of overwriting the ctx directly?
pkg/sql/instrumentation.go, line 171 at r1 (raw file):
if sp := tracing.SpanFromContext(ctx); sp != nil { ih.sp = sp ih.finishSpan = false
I don't think we want to remove finishSpan here. cc @yuzefovich. The case I'm thinking about specifically is when statement tracing threshold is enabled. A span is created at a higher level so the instrumentation helper should not Finish it, but it is still worth it to get the span's recording and record the stats. We should probably add a regression test for this case to instrumentation_test.go
pkg/sql/logictest/testdata/logic_test/contention_event, line 30 at r2 (raw file):
statement ok SET TRACING = on
This is more of a macro question but I thought that the idea of the background tracing work was that we would always have some sort of tracing enabled, so having to turn on session tracing to get contention events feels wrong. How should I think about this?
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @irfansharif)
pkg/sql/instrumentation.go, line 171 at r1 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
I don't think we want to remove
finishSpanhere. cc @yuzefovich. The case I'm thinking about specifically is when statement tracing threshold is enabled. A span is created at a higher level so the instrumentation helper should not Finish it, but it is still worth it to get the span's recording and record the stats. We should probably add a regression test for this case toinstrumentation_test.go
Note that the first commit is the revert of #61532 which introduced finishSpan.
AFAICT the only remaining piece from that PR is few changes in execCmd. The revert looks good to me, but I wonder why you chose to do a "by-hand" revert?
irfansharif
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @asubiotto)
pkg/sql/exec_util.go, line 1656 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Does it make sense to use a
ctxHolderhere instead of overwriting the ctx directly?
Originally I didn't want to avoid a larger diff than necessary, but meh, done. PTAL.
pkg/sql/instrumentation.go, line 171 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Note that the first commit is the revert of #61532 which introduced
finishSpan.AFAICT the only remaining piece from that PR is few changes in
execCmd. The revert looks good to me, but I wonder why you chose to do a "by-hand" revert?
I probably caused more confusion than deserved. I wanted to keep the comments introduced in that commit, everything else was mechanical. So if I'm reading everything correctly, I don't think there's any thing to do with this review thread?
pkg/sql/logictest/testdata/logic_test/contention_event, line 30 at r2 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
This is more of a macro question but I thought that the idea of the background tracing work was that we would always have some sort of tracing enabled, so having to turn on session tracing to get contention events feels wrong. How should I think about this?
The way to think about it now is that we'll only get contention events for sampled statements/txns. The sql.txn_stats.sample_rate knob lets us traverse a spectrum of going between "always on tracing" as you're thinking of it, and "always off". For these tests SET TRACING = on is effectively doing the same thing as sql.txn_stats.sample_rate = 1.0
|
https://www.githubstatus.com Github is down, and my latest fixups aren't showing up in this PR. |
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @asubiotto)
pkg/sql/instrumentation.go, line 171 at r1 (raw file):
Previously, irfansharif (irfan sharif) wrote…
I probably caused more confusion than deserved. I wanted to keep the comments introduced in that commit, everything else was mechanical. So if I'm reading everything correctly, I don't think there's any thing to do with this review thread?
Yeah, I think the only thing that might be worth doing is explicitly mentioning the desire to keep the comments in the revert commit message.
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/sql/instrumentation.go, line 171 at r1 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
OK. So we should re-introduce finish span in another PR.
Yeah, I think the only thing that might be worth doing is explicitly mentioning the desire to keep the comments in the revert commit message.
Alfonso, which "finish span" are you thinking about? finishSpan boolean was introduced in #61532 only because ih.sp could be of two flavours - the txn span (which we don't want to finish) and and the sampling span (which we explicitly created in Setup and want to finish).
asubiotto
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/sql/instrumentation.go, line 171 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Yeah, I think the only thing that might be worth doing is explicitly mentioning the desire to keep the comments in the revert commit message.
Alfonso, which "finish span" are you thinking about?
finishSpanboolean was introduced in #61532 only becauseih.spcould be of two flavours - the txn span (which we don't want to finish) and and the sampling span (which we explicitly created inSetupand want to finish).
Right, and I think we still have a case where we want to collect stats from a higher level span without finishing it. Concretely, when tracing is enabled for the statement or txn threshold cases. So in the if statement below, we want to set ih.sp, return needsFinish=true but set finishSpan=false. Does that make sense? I feel like all this state can probably be simplified.
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
pkg/sql/instrumentation.go, line 171 at r1 (raw file):
Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Right, and I think we still have a case where we want to collect stats from a higher level span without finishing it. Concretely, when tracing is enabled for the statement or txn threshold cases. So in the if statement below, we want to set
ih.sp, returnneedsFinish=truebut setfinishSpan=false. Does that make sense? I feel like all this state can probably be simplified.
I see, yeah, it makes sense, but that would be a new behavior, so I think it is orthogonal to the revert commit and the main commit of this PR.
8d4f5dd to
4e48293
Compare
irfansharif
left a comment
There was a problem hiding this comment.
Looks like the txnState's Ctx lifetime is such that it's cleaned up before SessionTracing is. Specifically, we finishSQLTxn before we StopTracing:
cockroach/pkg/sql/conn_executor.go
Line 832 in 1b6c7f9
cockroach/pkg/sql/conn_executor.go
Lines 882 to 884 in 1b6c7f9
Does it make sense to use a ctxHolder here instead of overwriting the ctx directly?
The above makes this unwieldy. Yes, ideally we'd like to "unhijack" the context we just injected. But given it's cleaned up before we get to it, we won't have anything to clean up. This isn't try for the other usage of ctxHolder, where its lifetime is tied only to SessionTracer. For that reason, I'll keep things as was before and hijack Ctx directly instead of doing it through a ctxHolder. In the future, we can probably clean up the SessionTracer's lifecycle before we finishSQLTxn. When we do so, we can start using a ctxHolder instead. I'm not doing that in this PR to keep the blast radius as small as possible.
TFTR!
bors r+
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @irfansharif)
This drastically reduces the memory overhead for tracing we're observing in cockroachdb#59424. This commit does a few disparate things to make it happen: 1. We now access the tracing span through txnState.Ctx exclusively. This gives us a single point to hijack, which we'll later do. By default txn's are initialized with a no-op span. If later on session tracing is enabled, we'll create a real (verbose) span and swap it out with the txn's no-op one. This gives us the same semantics as earlier, and on the plus side, we're not re-using the same tracing span when session tracing is toggled. 2. Hard tracing methods to work with no-op spans. Specifically GetRecording and TraceID. 3. Remove a crash vector through crdb_internal.trace_id. It was previously reaching into the first recording to retrieve a trace ID. But it's not guaranteed that recordings are non-empty. This could be used to induce panics in the server. This PR will need to get backported to 21.1. Fixes cockroachdb#59424. Release note: None
4e48293 to
d7407d5
Compare
|
Github statuses are being goofy, and bors is confused. bors r+ |
|
Build succeeded: |
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was back when we were creating real tracing spans for all statements, and for sampled statements, we were propagating additional stats payloads. Consequently what cockroachdb#59379 ended up measuring (and finding the overhead acceptable) for was the performance hit we would incur for propagating stats payloads for statements already using real tracing spans. Since then, the landscape has changed. Notably we introduced cockroachdb#61777, which made it so that we were only using real tracing spans for sampled statements. This was done after performance analysis in cockroachdb#59424 showed that the use of real tracing spans in all statements resulted in tremendous overhead, for no real benefit. What this now leaves us with is a sampling rate that was tuned by only considering the stats payload overhead. What we want now is to also consider the overhead of using real tracing spans for sampled statements, vs. not. Doing this analysis gives us a very different picture for what the default rate should be. --- To find out what the overhead for sampled statements are currently, we experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple benchmark that does little more than one-off statements, so should give us a concise picture of the sampling overhead. We ran six experiments in total (each corresponding to a pair of read+write rows), done in groups of three (each group corresponding to a table below). Each run in turn is comprised of 10 iterations of kv95, and what's varied between each run is the default sampling rate. We pin a sampling rate of 0.0 as the baseline that effectively switches off sampling entirely (and tracing), and measure the throughput degradation as we vary the sampling rate. ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #1 | 69817.90 | 69406.37 0.01 / read / #1 | 69300.35 -0.74% | 68717.23 -0.99% 0.10 / read / #1 | 67743.35 -2.97% | 67601.81 -2.60% 0.00 / write / #1 | 3672.55 | 3653.63 0.01 / write / #1 | 3647.65 -0.68% | 3615.90 -1.03% 0.10 / write / #1 | 3567.20 -2.87% | 3558.90 -2.59% ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #2 | 69440.80 68893.24 0.01 / read / #2 | 69481.55 +0.06% 69463.13 +0.82% (probably in the noise margin) 0.10 / read / #2 | 67841.80 -2.30% 66992.55 -2.76% 0.00 / write / #2 | 3652.45 3625.24 0.01 / write / #2 | 3657.55 -0.14% 3654.34 +0.80% 0.10 / write / #2 | 3570.75 -2.24% 3526.04 -2.74% The results above suggest that the current default rate of 10% is too high, and a 1% rate is much more acceptable. --- The fact that the cost of sampling is largely dominated by tracing is extremely unfortunate. We have ideas for how that can be improved (prototyped in cockroachdb#62227), but they're much too invasive to backport to 21.1. It's unfortunate that we only discovered the overhead this late in the development cycle. It was due to two major reasons: - cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by propagating real tracing spans across rpc boundaries). We had done sanity checking for the tracing overhead before this point, but failed to realize that cockroachdb#59992 would merit re-analysis. - The test that alerted us to the degradation (tpccbench) had be persistently failing for a myriad of other reasons, so we didn't learn until too late that tracing was the latest offendor. tpccbench also doesn't deal with VM overload well (something cockroachdb#62361 hopes to address), and after tracing was enabled for realsies, this was the dominant failure mode. This resulted in perf data not making it's way to roachperf, which further hid possible indicators we had a major regression on our hands. We also didn't have a healthy process looking at roachperf on a continual basis, something we're looking to rectify going forward. We would've picked up on this regression had we been closely monitoring the kv95 charts. Release note: None
62998: sql: lower default sampling rate to 1% r=irfansharif a=irfansharif We arrived at the previous default rate of 10% back in #59379. This was back when we were creating real tracing spans for all statements, and for sampled statements, we were propagating additional stats payloads. Consequently what #59379 ended up measuring (and finding the overhead acceptable) for was the performance hit we would incur for propagating stats payloads for statements already using real tracing spans. Since then, the landscape has changed. Notably we introduced #61777, which made it so that we were only using real tracing spans for sampled statements. This was done after performance analysis in #59424 showed that the use of real tracing spans in all statements resulted in tremendous overhead, for no real benefit. What this now leaves us with is a sampling rate that was tuned by only considering the stats payload overhead. What we want now is to also consider the overhead of using real tracing spans for sampled statements, vs. not. Doing this analysis gives us a very different picture for what the default rate should be. --- To find out what the overhead for sampled statements are currently, we experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple benchmark that does little more than one-off statements, so should give us a concise picture of the sampling overhead. We ran six experiments in total (each corresponding to a pair of read+write rows), done in groups of three (each group corresponding to a table below). Each run in turn is comprised of 10 iterations of kv95, and what's varied between each run is the default sampling rate. We pin a sampling rate of 0.0 as the baseline that effectively switches off sampling entirely (and tracing), and measure the throughput degradation as we vary the sampling rate. ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #1 | 69817.90 | 69406.37 0.01 / read / #1 | 69300.35 -0.74% | 68717.23 -0.99% 0.10 / read / #1 | 67743.35 -2.97% | 67601.81 -2.60% 0.00 / write / #1 | 3672.55 | 3653.63 0.01 / write / #1 | 3647.65 -0.68% | 3615.90 -1.03% 0.10 / write / #1 | 3567.20 -2.87% | 3558.90 -2.59% ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #2 | 69440.80 68893.24 0.01 / read / #2 | 69481.55 +0.06% 69463.13 +0.82% (probably in the noise margin) 0.10 / read / #2 | 67841.80 -2.30% 66992.55 -2.76% 0.00 / write / #2 | 3652.45 3625.24 0.01 / write / #2 | 3657.55 -0.14% 3654.34 +0.80% 0.10 / write / #2 | 3570.75 -2.24% 3526.04 -2.74% The results above suggest that the current default rate of 10% is too high, and a 1% rate is much more acceptable. --- The fact that the cost of sampling is largely dominated by tracing is extremely unfortunate. We have ideas for how that can be improved (prototyped in #62227), but they're much too invasive to backport to 21.1. Release note: None Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was back when we were creating real tracing spans for all statements, and for sampled statements, we were propagating additional stats payloads. Consequently what cockroachdb#59379 ended up measuring (and finding the overhead acceptable) for was the performance hit we would incur for propagating stats payloads for statements already using real tracing spans. Since then, the landscape has changed. Notably we introduced cockroachdb#61777, which made it so that we were only using real tracing spans for sampled statements. This was done after performance analysis in cockroachdb#59424 showed that the use of real tracing spans in all statements resulted in tremendous overhead, for no real benefit. What this now leaves us with is a sampling rate that was tuned by only considering the stats payload overhead. What we want now is to also consider the overhead of using real tracing spans for sampled statements, vs. not. Doing this analysis gives us a very different picture for what the default rate should be. --- To find out what the overhead for sampled statements are currently, we experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple benchmark that does little more than one-off statements, so should give us a concise picture of the sampling overhead. We ran six experiments in total (each corresponding to a pair of read+write rows), done in groups of three (each group corresponding to a table below). Each run in turn is comprised of 10 iterations of kv95, and what's varied between each run is the default sampling rate. We pin a sampling rate of 0.0 as the baseline that effectively switches off sampling entirely (and tracing), and measure the throughput degradation as we vary the sampling rate. ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #1 | 69817.90 | 69406.37 0.01 / read / #1 | 69300.35 -0.74% | 68717.23 -0.99% 0.10 / read / #1 | 67743.35 -2.97% | 67601.81 -2.60% 0.00 / write / #1 | 3672.55 | 3653.63 0.01 / write / #1 | 3647.65 -0.68% | 3615.90 -1.03% 0.10 / write / #1 | 3567.20 -2.87% | 3558.90 -2.59% ops/sec ops/sec --------------------|------------------|------------------ rate op grp | median diff | mean diff --------------------|------------------|------------------ 0.00 / read / #2 | 69440.80 68893.24 0.01 / read / #2 | 69481.55 +0.06% 69463.13 +0.82% (probably in the noise margin) 0.10 / read / #2 | 67841.80 -2.30% 66992.55 -2.76% 0.00 / write / #2 | 3652.45 3625.24 0.01 / write / #2 | 3657.55 -0.14% 3654.34 +0.80% 0.10 / write / #2 | 3570.75 -2.24% 3526.04 -2.74% The results above suggest that the current default rate of 10% is too high, and a 1% rate is much more acceptable. --- The fact that the cost of sampling is largely dominated by tracing is extremely unfortunate. We have ideas for how that can be improved (prototyped in cockroachdb#62227), but they're much too invasive to backport to 21.1. It's unfortunate that we only discovered the overhead this late in the development cycle. It was due to two major reasons: - cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by propagating real tracing spans across rpc boundaries). We had done sanity checking for the tracing overhead before this point, but failed to realize that cockroachdb#59992 would merit re-analysis. - The test that alerted us to the degradation (tpccbench) had be persistently failing for a myriad of other reasons, so we didn't learn until too late that tracing was the latest offendor. tpccbench also doesn't deal with VM overload well (something cockroachdb#62361 hopes to address), and after tracing was enabled for realsies, this was the dominant failure mode. This resulted in perf data not making it's way to roachperf, which further hid possible indicators we had a major regression on our hands. We also didn't have a healthy process looking at roachperf on a continual basis, something we're looking to rectify going forward. We would've picked up on this regression had we been closely monitoring the kv95 charts. Release note: None
This drastically reduces the memory overhead for tracing we're observing
in #59424. This commit does a few disparate things to make it happen:
gives us a single point to hijack, which we'll later do. By default
txn's are initialized with a no-op span. If later on session tracing is
enabled, we'll create a real (verbose) span and swap it out with the
txn's no-op one. This gives us the same semantics as earlier, and on the
plus side, we're not re-using the same tracing span when session tracing
is toggled.
GetRecording and TraceID.
previously reaching into the first recording to retrieve a trace ID. But
it's not guaranteed that recordings are non-empty. This could be used to
induce panics in the server.
This PR will need to get backported to 21.1. Fixes #59424.
Release note: None
+cc @cockroachdb/kv-east.