Skip to content

sql: use adaptive sampling rate for telemetry logging#70786

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
xinhaoz:telemetry-adaptive-sampling
Sep 30, 2021
Merged

sql: use adaptive sampling rate for telemetry logging#70786
craig[bot] merged 1 commit intocockroachdb:masterfrom
xinhaoz:telemetry-adaptive-sampling

Conversation

@xinhaoz
Copy link
Copy Markdown
Contributor

@xinhaoz xinhaoz commented Sep 27, 2021

Resolves #70553

Previously, telemetry logging used a configurable QPS threshold
and sampling rate, for which we would log all statements if we
were under the QPS threshold, and then start sampling at the
given rate once at the threshold. Using this technique meant
that we will often see a sharp decreaes in telemetry logging
once the sampling rate increases, as sampling rates would typically
need to be at low values to accomodate a high QPS.

This commit replaces the above technique with an adaptive sampling
rate which merely logs events to telemetry at a maximum frequency.
Rather than relying on QPS, we will simply track when we have
last logged to the telemtry channel, and decide whether or not to
log a given event accordingly.

Release note (sql change): The cluster setting
sql.telemetry.query_sampling.qps_threshold, and
sql.telemetry.query_sampling.sample_rate have been removed.
A new setting, sql.telemetry.query_sampling.max_event_frequency
has been introduced, with default value of 10 events per second.


roachrod results

cluster created with:

roachprod create $CLUSTER -n 4 --gce-machine-type=n1-standard-16

results of tpcc 1000

Telemetry OFF

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151075           21.0     53.9     48.2     67.1    109.1   5905.6  delivery

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1509788          209.7     42.7     25.2     39.8     83.9   9663.7  newOrder

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151161           21.0      7.8      5.0      7.6     13.6   5905.6  orderStatus

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1511404          209.9     22.4     13.6     23.1     46.1   9663.7  payment

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151079           21.0     14.2     10.5     17.8     27.3   7247.8  stockLevel

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 7200.0s        0        3474507          482.6     31.6     19.9     46.1     75.5   9663.7
Audit check 9.2.1.7: PASS
Audit check 9.2.2.5.1: PASS
Audit check 9.2.2.5.2: PASS
Audit check 9.2.2.5.5: PASS
Audit check 9.2.2.5.6: PASS
Audit check 9.2.2.5.3: PASS
Audit check 9.2.2.5.4: PASS

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 7200.0s    12581.5  97.8%     42.7     25.2     33.6     39.8     83.9   9663.7

Telemetry ON

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         150953           21.0     61.9     54.5     71.3     96.5   5637.1  delivery

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1509498          209.7     36.8     26.2     39.8     60.8   6174.0  newOrder

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151110           21.0      9.2      5.5      7.9     14.2   4831.8  orderStatus

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0        1510864          209.8     34.0     14.2     23.1     35.7   8321.5  payment

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__total
 7200.0s        0         151026           21.0     17.0     11.5     18.9     29.4   5637.1  stockLevel

_elapsed___errors_____ops(total)___ops/sec(cum)__avg(ms)__p50(ms)__p95(ms)__p99(ms)_pMax(ms)__result
 7200.0s        0        3473451          482.4     34.6     21.0     46.1     67.1   8321.5
Audit check 9.2.1.7: PASS
Audit check 9.2.2.5.1: PASS
Audit check 9.2.2.5.2: PASS
Audit check 9.2.2.5.3: PASS
Audit check 9.2.2.5.4: PASS
Audit check 9.2.2.5.5: PASS
Audit check 9.2.2.5.6: PASS

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
 7200.0s    12579.1  97.8%     36.8     26.2     33.6     39.8     60.8   6174.0

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@xinhaoz xinhaoz force-pushed the telemetry-adaptive-sampling branch 4 times, most recently from b9ce8a3 to 0c7e084 Compare September 28, 2021 21:02
@xinhaoz xinhaoz marked this pull request as ready for review September 28, 2021 21:05
@xinhaoz xinhaoz requested review from a team and knz September 28, 2021 21:06
Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplification.
I understand this does not remove the QPS estimation code yet? If not, please ensure there's a follow-up issue for that.

Also we want the configuration setting to reflect the new purpose. "Rate" is not the right word.

Reviewed 4 of 4 files at r1, 8 of 8 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @xinhaoz)


pkg/sql/telemetry_logging.go, line 27 at r2 (raw file):

var telemetrySampleRate = settings.RegisterFloatSetting(
	"sql.telemetry.query_sampling.sample_rate",

Please rename this variable and the cluster setting below to reflect its new purpose.
We'll aim for something like "max event frequency" as the configuration knob, and you'll need to invert the value in the code to get a wait delay.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I see the QPS estimation code is removed now. My other comments still apply.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @xinhaoz)

@xinhaoz xinhaoz force-pushed the telemetry-adaptive-sampling branch 2 times, most recently from 811af5b to c2b8c37 Compare September 29, 2021 14:45
Copy link
Copy Markdown
Contributor Author

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/sql/telemetry_logging.go, line 27 at r2 (raw file):

Previously, knz (kena) wrote…

Please rename this variable and the cluster setting below to reflect its new purpose.
We'll aim for something like "max event frequency" as the configuration knob, and you'll need to invert the value in the code to get a wait delay.

Done.

@xinhaoz xinhaoz requested a review from knz September 29, 2021 14:57
Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @xinhaoz)


-- commits, line 23 at r3:
I don't understand why you changed "query sampling" to "query logging". What was your rationale?

@xinhaoz
Copy link
Copy Markdown
Contributor Author

xinhaoz commented Sep 29, 2021


-- commits, line 23 at r3:

Previously, knz (kena) wrote…

I don't understand why you changed "query sampling" to "query logging". What was your rationale?

Oh, it seemed to reflect the purpose of the setting more, but I'm pretty impartial to either name. Should I change it back?

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @xinhaoz)


-- commits, line 23 at r3:

Previously, xinhaoz (Xin Hao Zhang) wrote…

Oh, it seemed to reflect the purpose of the setting more, but I'm pretty impartial to either name. Should I change it back?

my thinking is that the term "query logging" implies that we log every query.
The functionality, here and even with your change, is still sampling-based.

@xinhaoz xinhaoz force-pushed the telemetry-adaptive-sampling branch from c2b8c37 to a8ca39f Compare September 29, 2021 16:17
@xinhaoz
Copy link
Copy Markdown
Contributor Author

xinhaoz commented Sep 29, 2021


-- commits, line 23 at r3:

Previously, knz (kena) wrote…

my thinking is that the term "query logging" implies that we log every query.
The functionality, here and even with your change, is still sampling-based.

Ah, fair enough! I've changed it back now.

Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 5 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @xinhaoz)

Copy link
Copy Markdown
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

In terms of benchmarking, you have already run the TPCC with the new change, but we also would like to see the TPCC without the new change. This way we get two sets of results, and we can compare the impact this change has on the TPCC numbers.

Reviewed 2 of 8 files at r2, 2 of 5 files at r4, 7 of 7 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @xinhaoz)


pkg/sql/exec_log.go, line 365 at r5 (raw file):

	if telemetryLoggingEnabled {
		// We only log to the telemetry channel if enough time has elapsed from the last event emission.

nit: wrap comments at 80 characters. If you're using GoLand, there is a "Wrap To Column" plugin that is pretty helpful.


pkg/sql/query_sampling.go, line 44 at r5 (raw file):

var telemetrySmoothingAlpha = settings.RegisterFloatSetting(
	"sql.telemetry.query_sampling.smoothing_alpha",

Removed settings should be added to retiredSettings map in settings/registry.go.


pkg/sql/telemetry_logging.go, line 59 at r5 (raw file):

// NewTelemetryLoggingMetrics returns a new TelemetryLoggingMetrics object.
func NewTelemetryLoggingMetrics() *TelemetryLoggingMetrics {

nit: since now this method doesn't do much other than allocate a new struct, I'd just get rid off it entirely.


pkg/sql/telemetry_logging_test.go, line 119 at r5 (raw file):

			// Even though the queries are executed within the required
			// elapsed interval, we should still see that they were all
			// logged, we log all statements that are not of type DML.

super nit: "logged, we log" -> "logged since we log".

Resolves cockroachdb#70553

Previously, telemetry logging used a configurable QPS threshold
and sampling rate, for which we would log all statements if we
were under the QPS threshold, and then start sampling at the
given rate once at the threshold. Using this technique meant
that we will often see a sharp decreaes in telemetry logging
once the sampling rate increases, as sampling rates would typically
need to be at low values to accomodate a high QPS.

This commit replaces the above technique with an adaptive sampling
rate which merely logs events to telemetry at a maximum frequency.
Rather than relying on QPS, we will simply track when we have
last logged to the telemtry channel, and decide whether or not to
log a given event accordingly.

Release note (sql change): The cluster setting
`sql.telemetry.query_sampling.qps_threshold`, and
`sql.telemetry.query_sampling.sample_rate` have been removed.
A new setting, `sql.telemetry.query_sampling.max_event_frequency`
has been introduced, with default value of 10 events per second.
@xinhaoz xinhaoz force-pushed the telemetry-adaptive-sampling branch from ca7c7e4 to ffaf876 Compare September 30, 2021 16:58
Copy link
Copy Markdown
Contributor Author

@xinhaoz xinhaoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed comments, will update message with results of TPCC with telemetry turned off once the run completes!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @yuzefovich)


pkg/sql/exec_log.go, line 365 at r5 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: wrap comments at 80 characters. If you're using GoLand, there is a "Wrap To Column" plugin that is pretty helpful.

Done.


pkg/sql/query_sampling.go, line 44 at r5 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Removed settings should be added to retiredSettings map in settings/registry.go.

Done.


pkg/sql/telemetry_logging.go, line 59 at r5 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: since now this method doesn't do much other than allocate a new struct, I'd just get rid off it entirely.

Done.


pkg/sql/telemetry_logging_test.go, line 119 at r5 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

super nit: "logged, we log" -> "logged since we log".

Done.

@xinhaoz
Copy link
Copy Markdown
Contributor Author

xinhaoz commented Sep 30, 2021

Description has been updated with a roachprod run, where telemetry is disabled.

@xinhaoz xinhaoz requested a review from yuzefovich September 30, 2021 18:53
Copy link
Copy Markdown
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! tpmC doesn't change. :lgtm:

Reviewed 5 of 5 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @xinhaoz)

Copy link
Copy Markdown
Contributor

@maryliag maryliag left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 5 of 8 files at r2, 4 of 5 files at r3, 2 of 5 files at r4, 4 of 7 files at r5, 5 of 5 files at r6, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @xinhaoz)

@xinhaoz
Copy link
Copy Markdown
Contributor Author

xinhaoz commented Sep 30, 2021

TFTR!!!
bors r=knz,yuzefovich,maryliag

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Sep 30, 2021

Build succeeded:

@craig craig bot merged commit 6f2201f into cockroachdb:master Sep 30, 2021
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Sep 30, 2021

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from ffaf876 to blathers/backport-release-21.2-70786: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve SQL query telemetry collection with an adaptive sampling rate

5 participants