sql: use adaptive sampling rate for telemetry logging#70786
sql: use adaptive sampling rate for telemetry logging#70786craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
b9ce8a3 to
0c7e084
Compare
knz
left a comment
There was a problem hiding this comment.
I like the simplification.
I understand this does not remove the QPS estimation code yet? If not, please ensure there's a follow-up issue for that.
Also we want the configuration setting to reflect the new purpose. "Rate" is not the right word.
Reviewed 4 of 4 files at r1, 8 of 8 files at r2, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @xinhaoz)
pkg/sql/telemetry_logging.go, line 27 at r2 (raw file):
var telemetrySampleRate = settings.RegisterFloatSetting( "sql.telemetry.query_sampling.sample_rate",
Please rename this variable and the cluster setting below to reflect its new purpose.
We'll aim for something like "max event frequency" as the configuration knob, and you'll need to invert the value in the code to get a wait delay.
knz
left a comment
There was a problem hiding this comment.
Sorry, I see the QPS estimation code is removed now. My other comments still apply.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @xinhaoz)
811af5b to
c2b8c37
Compare
xinhaoz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/sql/telemetry_logging.go, line 27 at r2 (raw file):
Previously, knz (kena) wrote…
Please rename this variable and the cluster setting below to reflect its new purpose.
We'll aim for something like "max event frequency" as the configuration knob, and you'll need to invert the value in the code to get a wait delay.
Done.
knz
left a comment
There was a problem hiding this comment.
Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @xinhaoz)
-- commits, line 23 at r3:
I don't understand why you changed "query sampling" to "query logging". What was your rationale?
Previously, knz (kena) wrote…
Oh, it seemed to reflect the purpose of the setting more, but I'm pretty impartial to either name. Should I change it back? |
knz
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @xinhaoz)
Previously, xinhaoz (Xin Hao Zhang) wrote…
Oh, it seemed to reflect the purpose of the setting more, but I'm pretty impartial to either name. Should I change it back?
my thinking is that the term "query logging" implies that we log every query.
The functionality, here and even with your change, is still sampling-based.
c2b8c37 to
a8ca39f
Compare
Previously, knz (kena) wrote…
Ah, fair enough! I've changed it back now. |
knz
left a comment
There was a problem hiding this comment.
Reviewed 5 of 5 files at r4, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @xinhaoz)
a8ca39f to
ca7c7e4
Compare
yuzefovich
left a comment
There was a problem hiding this comment.
Looks good!
In terms of benchmarking, you have already run the TPCC with the new change, but we also would like to see the TPCC without the new change. This way we get two sets of results, and we can compare the impact this change has on the TPCC numbers.
Reviewed 2 of 8 files at r2, 2 of 5 files at r4, 7 of 7 files at r5, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @xinhaoz)
pkg/sql/exec_log.go, line 365 at r5 (raw file):
if telemetryLoggingEnabled { // We only log to the telemetry channel if enough time has elapsed from the last event emission.
nit: wrap comments at 80 characters. If you're using GoLand, there is a "Wrap To Column" plugin that is pretty helpful.
pkg/sql/query_sampling.go, line 44 at r5 (raw file):
var telemetrySmoothingAlpha = settings.RegisterFloatSetting( "sql.telemetry.query_sampling.smoothing_alpha",
Removed settings should be added to retiredSettings map in settings/registry.go.
pkg/sql/telemetry_logging.go, line 59 at r5 (raw file):
// NewTelemetryLoggingMetrics returns a new TelemetryLoggingMetrics object. func NewTelemetryLoggingMetrics() *TelemetryLoggingMetrics {
nit: since now this method doesn't do much other than allocate a new struct, I'd just get rid off it entirely.
pkg/sql/telemetry_logging_test.go, line 119 at r5 (raw file):
// Even though the queries are executed within the required // elapsed interval, we should still see that they were all // logged, we log all statements that are not of type DML.
super nit: "logged, we log" -> "logged since we log".
Resolves cockroachdb#70553 Previously, telemetry logging used a configurable QPS threshold and sampling rate, for which we would log all statements if we were under the QPS threshold, and then start sampling at the given rate once at the threshold. Using this technique meant that we will often see a sharp decreaes in telemetry logging once the sampling rate increases, as sampling rates would typically need to be at low values to accomodate a high QPS. This commit replaces the above technique with an adaptive sampling rate which merely logs events to telemetry at a maximum frequency. Rather than relying on QPS, we will simply track when we have last logged to the telemtry channel, and decide whether or not to log a given event accordingly. Release note (sql change): The cluster setting `sql.telemetry.query_sampling.qps_threshold`, and `sql.telemetry.query_sampling.sample_rate` have been removed. A new setting, `sql.telemetry.query_sampling.max_event_frequency` has been introduced, with default value of 10 events per second.
ca7c7e4 to
ffaf876
Compare
xinhaoz
left a comment
There was a problem hiding this comment.
Addressed comments, will update message with results of TPCC with telemetry turned off once the run completes!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @knz and @yuzefovich)
pkg/sql/exec_log.go, line 365 at r5 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: wrap comments at 80 characters. If you're using GoLand, there is a "Wrap To Column" plugin that is pretty helpful.
Done.
pkg/sql/query_sampling.go, line 44 at r5 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Removed settings should be added to
retiredSettingsmap insettings/registry.go.
Done.
pkg/sql/telemetry_logging.go, line 59 at r5 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: since now this method doesn't do much other than allocate a new struct, I'd just get rid off it entirely.
Done.
pkg/sql/telemetry_logging_test.go, line 119 at r5 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
super nit: "logged, we log" -> "logged since we log".
Done.
|
Description has been updated with a roachprod run, where telemetry is disabled. |
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewed 5 of 5 files at r6, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @xinhaoz)
maryliag
left a comment
There was a problem hiding this comment.
Reviewed 5 of 8 files at r2, 4 of 5 files at r3, 2 of 5 files at r4, 4 of 7 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status:complete! 2 of 0 LGTMs obtained (waiting on @xinhaoz)
|
TFTR!!! |
|
Build succeeded: |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from ffaf876 to blathers/backport-release-21.2-70786: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 21.2.x failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
Resolves #70553
Previously, telemetry logging used a configurable QPS threshold
and sampling rate, for which we would log all statements if we
were under the QPS threshold, and then start sampling at the
given rate once at the threshold. Using this technique meant
that we will often see a sharp decreaes in telemetry logging
once the sampling rate increases, as sampling rates would typically
need to be at low values to accomodate a high QPS.
This commit replaces the above technique with an adaptive sampling
rate which merely logs events to telemetry at a maximum frequency.
Rather than relying on QPS, we will simply track when we have
last logged to the telemtry channel, and decide whether or not to
log a given event accordingly.
Release note (sql change): The cluster setting
sql.telemetry.query_sampling.qps_threshold, andsql.telemetry.query_sampling.sample_ratehave been removed.A new setting,
sql.telemetry.query_sampling.max_event_frequencyhas been introduced, with default value of 10 events per second.
roachrod results
cluster created with:
results of tpcc 1000
Telemetry OFF
Telemetry ON