sql: per-statement statistics, step two #14181

Merged
knz merged 3 commits into cockroachdb:master from knz:collect-queries
Mar 21, 2017

Conversation

@knz
Contributor

@knz knz commented Mar 15, 2017

This PR follows up on #14092 towards #13968.

sql: track per-statement statistics.

This patch extends/replaces the previously introduced mechanism that
tracks per-application statistics: statistics are now bucketed by
statement/query instead, but each application has its own set of
buckets (so we collect different stats per application).

Only the statement types that do not implement the (new) interface
parser.HiddenFromStats are collected. For example SET and SHOW are
excluded.
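A minimal sketch of how such an opt-out marker interface can work (everything here except the parser.HiddenFromStats name is hypothetical, not the PR's actual code):

```go
package main

import "fmt"

// Statement is a stand-in for a parsed SQL statement node.
type Statement interface {
	StatementTag() string
}

// HiddenFromStats marks statement types that the stats collector
// should skip (mirrors the parser.HiddenFromStats idea).
type HiddenFromStats interface {
	hiddenFromStats()
}

type setVar struct{}

func (setVar) StatementTag() string { return "SET" }
func (setVar) hiddenFromStats()     {}

type selectStmt struct{}

func (selectStmt) StatementTag() string { return "SELECT" }

// shouldRecord reports whether statistics should be collected for stmt:
// only statement types that do NOT implement HiddenFromStats count.
func shouldRecord(stmt Statement) bool {
	_, hidden := stmt.(HiddenFromStats)
	return !hidden
}

func main() {
	fmt.Println(shouldRecord(setVar{}))     // false: SET is hidden
	fmt.Println(shouldRecord(selectStmt{})) // true: SELECT is recorded
}
```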

The collected data is exposed in the new virtual table
crdb_internal.stmt_stats, which replaces the previous app_stats
virtual table.

These statistics are only enabled if the environment variable
COCKROACH_SQL_STMT_STATS_ENABLE is set and true.

Also, the data is cleared from memory periodically, at the interval
set by COCKROACH_SQL_STMT_STATS_RESET_INTERVAL, by default set to 1
day.

sql: track per-statement execution details.

This patch starts tracking the following details per statement:

  • total number of executions
  • total number of first attempts (to separate implicit retries)
  • maximum observed number of retries
  • last error encountered
  • average and variance for:
    • total number of rows
    • parse latency
    • logical plan latency
    • execution latency
    • overhead latency (total latency - parse - logical - execution)

The statements are also separately bucketed depending on:

  • whether the execution failed with an error
  • whether the query was distributed.

(This results in potentially 4 buckets per statement.)
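Concretely, the bucket key can be thought of as the statement text plus those two flags; a hypothetical sketch (names are illustrative, not from the PR):

```go
package main

import "fmt"

// stmtKey identifies one statistics bucket: the statement text plus the
// two outcome flags, giving up to 4 buckets per distinct statement.
type stmtKey struct {
	stmt        string
	failed      bool
	distributed bool
}

func main() {
	counts := make(map[stmtKey]int)
	record := func(stmt string, failed, distributed bool) {
		counts[stmtKey{stmt, failed, distributed}]++
	}
	// Two successful distributed runs and one failed local run of the
	// same statement land in two different buckets.
	record("SELECT * FROM t", false, true)
	record("SELECT * FROM t", false, true)
	record("SELECT * FROM t", true, false)
	fmt.Println(counts[stmtKey{"SELECT * FROM t", false, true}]) // 2
	fmt.Println(len(counts))                                     // 2
}
```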

cc @petermattis


This change is Reviewable

@petermattis
Collaborator

Review status: 0 of 8 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/sql/app_stats.go, line 149 at r1 (raw file):

			select {
			case <-ticker.C:
				s.resetStats()

I wonder if we should dump these to the log before resetting.


pkg/sql/crdb_internal.go, line 264 at r1 (raw file):

var crdbInternalStmtStatsTable = virtualSchemaTable{
	schema: `
CREATE TABLE crdb_internal.stmt_stats (

To abbreviate or not to abbreviate?


pkg/sql/executor.go, line 362 at r1 (raw file):

	// Until per-statement statistics are properly recorded and
	// scrubbed, we clear them periodically.
	e.stopper.RunWorker(e.sqlStats.getResetWorker(e.stopper))

That's sort of an odd construction. Normally we'd just call something like e.sqlStats.resetWorkerLoop(e.stopper) and have that create a worker goroutine.


Comments from Reviewable

@dt
Contributor

dt commented Mar 16, 2017

Reviewed 8 of 8 files at r1, 3 of 3 files at r2.
Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


pkg/sql/app_stats.go, line 35 at r1 (raw file):

// stmtStats holds per-statement statistics.
type stmtStats struct {

I'm fine doing this in a followup, but I'd like to switch to a wrapped proto-defined struct as early as possible for holding the actual stats, since separating data from behavior in a normal struct later can get trickier than building it that way from the beginning.


pkg/sql/app_stats.go, line 44 at r1 (raw file):

// statistics.
var StmtStatsEnable = envutil.EnvOrDefaultBool(
	"COCKROACH_SQL_STMT_STATS_ENABLE", false,

out of scope for this PR but...

env var is fine for things we don't trust yet and want to hide, but in general, I'd like to be able to change settings like this at runtime without editing the startup scripts and rebooting the whole cluster.

Seems like we should have a system.settings table that nodes can cache for ~1m or something.


pkg/sql/app_stats.go, line 123 at r1 (raw file):

	for _, a := range s.apps {
		a.Lock()
		a.stmts = make(map[string]*stmtStats)

we could potentially size this map based on the old one (e.g. len(a.stmtStats)/2) to avoid having to re-grow incrementally after a clear.
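A sketch of that presizing suggestion (the clearStmts helper and map types are illustrative only):

```go
package main

import "fmt"

// clearStmts resets the per-statement map, sizing the fresh map from
// the old one (half, as a steady-state guess) so it need not re-grow
// incrementally right after a clear.
func clearStmts(old map[string]int) map[string]int {
	return make(map[string]int, len(old)/2)
}

func main() {
	stmts := map[string]int{"SELECT 1": 10, "INSERT": 3, "UPDATE": 2, "DELETE": 1}
	stmts = clearStmts(stmts)
	fmt.Println(len(stmts)) // 0: cleared, but preallocated for ~2 entries
}
```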


pkg/sql/app_stats.go, line 128 at r1 (raw file):

	// Clear everything.
	s.apps = make(map[string]*appStats)

ditto


pkg/sql/app_stats.go, line 144 at r1 (raw file):

func (s *sqlStats) getResetWorker(stopper *stop.Stopper) func() {
	return func() {
		ticker := time.NewTicker(StmtStatsResetFrequency)

I'm fine landing this as-is for now, but I'll probably pull the periodic triggering of clearing into the server reporting loop, so we don't risk accidentally clearing right before reporting.


pkg/sql/app_stats.go, line 129 at r2 (raw file):

// numericStat collect the running values until the mean and variance
// are required.
type numericStat struct {

okay, yeah, now I really want to proto this.
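For reference, the running mean and variance that such a struct accumulates are typically maintained with Welford's online update; a standalone sketch (assumed field names, not the PR's code):

```go
package main

import "fmt"

// numericStat collects running values from which the mean and variance
// can be derived on demand (Welford's online algorithm).
type numericStat struct {
	count int64
	mean  float64
	m2    float64 // sum of squared deviations from the running mean
}

func (s *numericStat) record(x float64) {
	s.count++
	delta := x - s.mean
	s.mean += delta / float64(s.count)
	s.m2 += delta * (x - s.mean)
}

func (s *numericStat) variance() float64 {
	if s.count < 2 {
		return 0
	}
	return s.m2 / float64(s.count-1)
}

func main() {
	var s numericStat
	for _, x := range []float64{2, 4, 4, 4, 5, 5, 7, 9} {
		s.record(x)
	}
	fmt.Println(s.mean)       // 5
	fmt.Println(s.variance()) // sample variance, 32/7 ≈ 4.571
}
```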


pkg/sql/executor.go, line 362 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

That's sort of an odd construction. Normally we'd just call something like e.sqlStats.resetWorkerLoop(e.stopper) and have that create a worker goroutine.

ditto above -- this goes away and we pull it into the existing worker running a 24h loop in server. feel free to TODO(dt): remove this here and above.



@knz knz force-pushed the collect-queries branch from f580754 to 51da909 on March 16, 2017 at 23:17
@knz
Contributor Author

knz commented Mar 16, 2017

(will process your other comments in a 2nd round of tweaks)


Review status: 4 of 8 files reviewed at latest revision, 9 unresolved discussions.


pkg/sql/app_stats.go, line 44 at r1 (raw file):

Previously, dt (David Taylor) wrote…

out of scope for this PR but...

env var is fine for things we don't trust yet and want to hide, but in general, I'd like to be able to change settings like this at runtime without editing the startup scripts and rebooting the whole cluster.

Seems like we should have a system.settings table that nodes can cache for ~1m or something.

That's exactly it -- not trusted yet, not enabled by default. I envision that the feature will always be enabled later. Added a clarifying comment.


pkg/sql/app_stats.go, line 123 at r1 (raw file):

Previously, dt (David Taylor) wrote…

we could potentially size this map based on the old one (e.g. len(a.stmtStats)/2) to avoid having to re-grow incrementally after a clear.

Done.


pkg/sql/app_stats.go, line 128 at r1 (raw file):

Previously, dt (David Taylor) wrote…

ditto

Done.



@knz knz force-pushed the collect-queries branch from 51da909 to a12363b on March 17, 2017 at 00:51
@knz
Contributor Author

knz commented Mar 17, 2017

Review status: 2 of 10 files reviewed at latest revision, 7 unresolved discussions.


pkg/sql/app_stats.go, line 35 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I'm fine doing this in a followup, but I'd like to switch to a wrapped proto-defined struct as early as possible for holding the actual stats, since separating data from behavior in a normal struct later can get trickier than building it that way from the beginning.

Done.


pkg/sql/app_stats.go, line 144 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I'm fine landing this as-is for now, but I'll probably pull the periodic triggering of clearing into the server reporting loop, so we don't risk accidentally clearing right before reporting.

Let's talk more; I have a feeling we'll all be happier if we persist this data in a SQL table rather than our ts database. In that case, we'd issue a SQL transaction here, at this custom interval. Then the reporting loop (actually the update loop, running once a day) can pick up the data to send from this SQL table.


pkg/sql/app_stats.go, line 149 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I wonder if we should dump these to the log before resetting.

Done.


pkg/sql/app_stats.go, line 129 at r2 (raw file):

Previously, dt (David Taylor) wrote…

okay, yeah, now I really want to proto this.

Done.


pkg/sql/crdb_internal.go, line 264 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

To abbreviate or not to abbreviate?

Not to. Done.


pkg/sql/executor.go, line 362 at r1 (raw file):

Previously, dt (David Taylor) wrote…

ditto above -- this goes away and we pull it into the existing worker running a 24h loop in server. feel free to TODO(dt): remove this here and above.

Done.



@knz knz force-pushed the collect-queries branch 3 times, most recently from 4fc7eeb to e1ef3ac on March 17, 2017 at 01:14
@petermattis
Collaborator

Review status: 2 of 10 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/sql/app_stats.proto, line 42 at r5 (raw file):

  optional int64 MaxRetries = 3  [(gogoproto.nullable) = false];

  // LastError collects the last error encountered.

s/LastError/LastErr/g


pkg/sql/app_stats.proto, line 51 at r5 (raw file):

  optional NumericStat ParseLat = 6  [(gogoproto.nullable) = false];
  optional NumericStat PlanLat	= 7  [(gogoproto.nullable) = false];
  optional NumericStat RunLat	= 8  [(gogoproto.nullable) = false];

Tabs on the above lines.

Also, looking at this again, the terms Run and Execute are too similar. I'm not sure (without looking at the code) what they refer to. I think one needs to be renamed.



@dt
Contributor

dt commented Mar 17, 2017

Review status: 2 of 10 files reviewed at latest revision, 8 unresolved discussions, some commit checks pending.


pkg/sql/app_stats.proto, line 26 at r5 (raw file):

  // Count is the total number of times this statement was executed
  // since the beginning of the reporting period.
  optional int64 Count = 1  [(gogoproto.nullable) = false];

These don't need to be capitalized and usually aren't -- the generated Go code will cap them for us.


pkg/sql/app_stats.proto, line 36 at r5 (raw file):

  // The cumulative number of retries can be computed with
  // count - firstAttemptCount.
  optional int64 FirstAttemptCount = 2  [(gogoproto.nullable) = false];

we usually snake_case proto field names -- again, generated Go code will convert to more idiomatic CamelCase identifiers.


pkg/sql/app_stats.proto, line 50 at r5 (raw file):

  // phase latencies.
  optional NumericStat ParseLat = 6  [(gogoproto.nullable) = false];
  optional NumericStat PlanLat	= 7  [(gogoproto.nullable) = false];

is that a tab? spaces in protos.


pkg/sql/crdb_internal.go, line 264 at r5 (raw file):

var crdbInternalStmtStatsTable = virtualSchemaTable{
	schema: `
CREATE TABLE crdb_internal.statement_statistics (

nit: I'm slightly apprehensive of presenting the node-specific, unaggregated stats directly to users via SQL. Getting different answers depending on what node you ask (e.g. behind a load-balancer) seems like a ux-surprise.

I understand that this is a step towards some form of aggregation that can come later, it just seems that surfacing this intermediate step in SQL, a medium where we usually are consistent (thanks to being on top of a common kv) has the potential to be surprising.

my concerns would be mostly assuaged by something as simple as making it obvious what it is, e.g. node_statement_statistics.



@knz knz force-pushed the collect-queries branch from e1ef3ac to 51f2834 on March 17, 2017 at 18:31
@knz
Contributor Author

knz commented Mar 17, 2017

Review status: 2 of 10 files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.


pkg/sql/app_stats.proto, line 26 at r5 (raw file):

Previously, dt (David Taylor) wrote…

These don't need to be capitalized and usually aren't -- the generated Go code will cap them for us.

I had forgotten. Thanks for the reminder.


pkg/sql/app_stats.proto, line 36 at r5 (raw file):

Previously, dt (David Taylor) wrote…

we usually snake_case proto field names -- again, generated Go code will convert to more idiomatic CamelCase identifiers.

Done.


pkg/sql/app_stats.proto, line 42 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/LastError/LastErr/g

Done.


pkg/sql/app_stats.proto, line 50 at r5 (raw file):

Previously, dt (David Taylor) wrote…

is that a tab? spaces in protos.

Done.


pkg/sql/app_stats.proto, line 51 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Tabs on the above lines.

Also, looking at this again, the terms Run and Execute are too similar. I'm not sure (without looking at the code) what they refer to. I think one needs to be renamed.

"exec" renamed to "service latency"; added comments too.


pkg/sql/crdb_internal.go, line 264 at r5 (raw file):

Previously, dt (David Taylor) wrote…

nit: I'm slightly apprehensive of presenting the node-specific, unaggregated stats directly to users via SQL. Getting different answers depending on what node you ask (e.g. behind a load-balancer) seems like a ux-surprise.

I understand that this is a step towards some form of aggregation that can come later, it just seems that surfacing this intermediate step in SQL, a medium where we usually are consistent (thanks to being on top of a common kv) has the potential to be surprising.

my concerns would be mostly assuaged by something as simple as making it obvious what it is, e.g. node_statement_statistics.

Of the other 4 tables in crdb_internal currently, 3 are already node-local (all but tables). All of them (including tables) are guaranteed not to need access to the cluster to deliver a response, and that's a feature. I think there is a case to be made to expose node-specific data via SQL, even if we have to gate it behind a big warning sign.



@knz
Contributor Author

knz commented Mar 21, 2017

Please suggest what remains to be done here in light of today's discussion :)

@dt
Contributor

dt commented Mar 21, 2017

documenting our discussion this morning:

  • we should make it clearer these are node-specific (add node_id col, maybe prefix the tables with node_).
  • this PR is about metrics collection, not aggregation or persistence, which are still TBD.
  • IMO the value removal here serves to canonicalize queries for the purposes of bucketing, but is not sufficient sanitization to make it safe to upload them to Cockroach Labs: table and column names could easily be considered sensitive / proprietary info, so we'll want another sanitizer on the reporting side that cleans those up. There's likely some overlap with the TableDescriptor sanitizer there. This concern doesn't block merging this PR or anything relating to collection at all, just the followup reporting work.

…e_`.

Requested by @dt: the name of the internal tables that only retrieve
node-local information (i.e. not aggregated across the cluster) should
be clearly marked (in their name) as such.

This patch does this by adding a prefix `node_` to these tables.  Also
for good measure a column `node_id` is added to the tables so that a
client inspecting the table can detect when the source of the
information changes (e.g. via a load balancer, reconnect etc).
@knz knz force-pushed the collect-queries branch from 51f2834 to fb5adb8 on March 21, 2017 at 16:58
@knz
Contributor Author

knz commented Mar 21, 2017

I have renamed the other crdb_internal tables as suggested.

Note: value removal will occur in a subsequent PR. For now the bucketing includes the values. This is why the feature is currently gated.

knz added 2 commits on March 21, 2017 at 17:50
This patch extends/replaces the previously introduced mechanism that
tracks per-application statistics: statistics are now bucketed by
statement/query instead, but each application has its own set of
buckets (so we collect different stats per application).

Only the statement types that do not implement the (new) interface
`parser.HiddenFromStats` are collected. For example SET and SHOW are
excluded.

The collected data is exposed in the new virtual table
`crdb_internal.statement_statistics`, which replaces the previous
`app_stats` virtual table.

These statistics are only enabled if the environment variable
COCKROACH_SQL_STMT_STATS_ENABLE is set and true. This feature gate is
meant to be temporary, so as to not risk the stability of default
setups.

Also, the data is cleared from memory periodically, at the interval
set by COCKROACH_SQL_STMT_STATS_RESET_INTERVAL, by default set to 1
hour. When this happens, the data collected so far is dumped to the INFO
log. Later this can be replaced by a dump to some system table in the
database.
This patch starts tracking the following details per statement:

- total number of executions
- total number of first attempts (to separate implicit retries)
- maximum observed number of retries
- last error encountered
- average and variance for:
  - total number of rows
  - parse latency
  - logical plan latency
  - execution latency
  - overhead latency (total latency - parse - logical - execution)

The statements are also separately bucketed depending on:
- whether the execution failed with an error
- whether the query was distributed.

(This results in potentially 4 buckets per statement.)
@dt
Contributor

dt commented Mar 21, 2017

👍

@knz knz force-pushed the collect-queries branch from fb5adb8 to 8f4434f on March 21, 2017 at 17:50
@knz
Contributor Author

knz commented Mar 21, 2017

Thanks!
