sql: per-statement statistics, step two #14181

Merged
knz merged 3 commits into cockroachdb:master from knz:collect-queries
Mar 21, 2017

Conversation

@knz
Contributor

@knz knz commented Mar 15, 2017

This PR follows up on #14092 towards #13968.

sql: track per-statement statistics.

This patch extends/replaces the previously introduced mechanism that
tracks per-application statistics: statistics are now bucketed by
statement/query instead, but each application has its own set of
buckets (so we collect different stats per application).

Only the statement types that do not implement the (new) interface
parser.HiddenFromStats are collected. For example SET and SHOW are
excluded.
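A minimal sketch of how such an opt-out marker interface can work (everything here except the parser.HiddenFromStats name is hypothetical, not the PR's actual code):

```go
package main

import "fmt"

// Statement is a stand-in for a parsed SQL statement node.
type Statement interface {
	StatementTag() string
}

// HiddenFromStats marks statement types that the stats collector
// should skip (mirrors the parser.HiddenFromStats idea).
type HiddenFromStats interface {
	hiddenFromStats()
}

type setVar struct{}

func (setVar) StatementTag() string { return "SET" }
func (setVar) hiddenFromStats()     {}

type selectStmt struct{}

func (selectStmt) StatementTag() string { return "SELECT" }

// shouldRecord reports whether statistics should be collected for stmt:
// only statement types that do NOT implement HiddenFromStats count.
func shouldRecord(stmt Statement) bool {
	_, hidden := stmt.(HiddenFromStats)
	return !hidden
}

func main() {
	fmt.Println(shouldRecord(setVar{}))     // false: SET is hidden
	fmt.Println(shouldRecord(selectStmt{})) // true: SELECT is recorded
}
```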

The collected data is exposed in the new virtual table
crdb_internal.stmt_stats, which replaces the previous app_stats
virtual table.

These statistics are only enabled if the environment variable
COCKROACH_SQL_STMT_STATS_ENABLE is set and true.

Also, the data is cleared from memory periodically, at the interval
set by COCKROACH_SQL_STMT_STATS_RESET_INTERVAL, by default set to 1
day.

sql: track per-statement execution details.

This patch starts tracking the following details per statement:

  • total number of executions
  • total number of first attempts (to separate implicit retries)
  • maximum observed number of retries
  • last error encountered
  • average and variance for:
    • total number of rows
    • parse latency
    • logical plan latency
    • execution latency
    • overhead latency (total latency - parse - logical - execution)

The statements are also separately bucketed depending on:

  • whether the execution failed with an error
  • whether the query was distributed.

(This results in potentially 4 buckets per statement.)
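Concretely, the bucket key can be thought of as the statement text plus those two flags; a hypothetical sketch (names are illustrative, not from the PR):

```go
package main

import "fmt"

// stmtKey identifies one statistics bucket: the statement text plus the
// two outcome flags, giving up to 4 buckets per distinct statement.
type stmtKey struct {
	stmt        string
	failed      bool
	distributed bool
}

func main() {
	counts := make(map[stmtKey]int)
	record := func(stmt string, failed, distributed bool) {
		counts[stmtKey{stmt, failed, distributed}]++
	}
	// Two successful distributed runs and one failed local run of the
	// same statement land in two different buckets.
	record("SELECT * FROM t", false, true)
	record("SELECT * FROM t", false, true)
	record("SELECT * FROM t", true, false)
	fmt.Println(counts[stmtKey{"SELECT * FROM t", false, true}]) // 2
	fmt.Println(len(counts))                                     // 2
}
```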

cc @petermattis


This change is Reviewable

@petermattis
Collaborator

Review status: 0 of 8 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


pkg/sql/app_stats.go, line 149 at r1 (raw file):

			select {
			case <-ticker.C:
				s.resetStats()

I wonder if we should dump these to the log before resetting.


pkg/sql/crdb_internal.go, line 264 at r1 (raw file):

var crdbInternalStmtStatsTable = virtualSchemaTable{
	schema: `
CREATE TABLE crdb_internal.stmt_stats (

To abbreviate or not to abbreviate?


pkg/sql/executor.go, line 362 at r1 (raw file):

	// Until per-statement statistics are properly recorded and
	// scrubbed, we clear them periodically.
	e.stopper.RunWorker(e.sqlStats.getResetWorker(e.stopper))

That's sort of an odd construction. Normally we'd just call something like e.sqlStats.resetWorkerLoop(e.stopper) and have that create a worker goroutine.


Comments from Reviewable

@dt
Contributor

dt commented Mar 16, 2017

Reviewed 8 of 8 files at r1, 3 of 3 files at r2.
Review status: all files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


pkg/sql/app_stats.go, line 35 at r1 (raw file):

// stmtStats holds per-statement statistics.
type stmtStats struct {

I'm fine doing this in a followup, but I'd like to switch to a wrapped proto-defined struct as early as possible for holding the actual stats, since separating data from behavior in a normal struct later can get trickier than building it that way from the beginning.


pkg/sql/app_stats.go, line 44 at r1 (raw file):

// statistics.
var StmtStatsEnable = envutil.EnvOrDefaultBool(
	"COCKROACH_SQL_STMT_STATS_ENABLE", false,

out of scope for this PR but...

env var is fine for things we don't trust yet and want to hide, but in general, I'd like to be able to change settings like this at runtime without editing the startup scripts and rebooting the whole cluster.

Seems like we should have a system.settings table that nodes can cache for ~1m or something.


pkg/sql/app_stats.go, line 123 at r1 (raw file):

	for _, a := range s.apps {
		a.Lock()
		a.stmts = make(map[string]*stmtStats)

we could potentially size this map based on the old one (e.g. len(a.stmtStats)/2) to avoid having to re-grow incrementally after a clear.
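A sketch of that presizing suggestion (the clearStmts helper and map types are illustrative only):

```go
package main

import "fmt"

// clearStmts resets the per-statement map, sizing the fresh map from
// the old one (half, as a steady-state guess) so it need not re-grow
// incrementally right after a clear.
func clearStmts(old map[string]int) map[string]int {
	return make(map[string]int, len(old)/2)
}

func main() {
	stmts := map[string]int{"SELECT 1": 10, "INSERT": 3, "UPDATE": 2, "DELETE": 1}
	stmts = clearStmts(stmts)
	fmt.Println(len(stmts)) // 0: cleared, but preallocated for ~2 entries
}
```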


pkg/sql/app_stats.go, line 128 at r1 (raw file):

	// Clear everything.
	s.apps = make(map[string]*appStats)

ditto


pkg/sql/app_stats.go, line 144 at r1 (raw file):

func (s *sqlStats) getResetWorker(stopper *stop.Stopper) func() {
	return func() {
		ticker := time.NewTicker(StmtStatsResetFrequency)

I'm fine landing this as-is for now, but I'll probably pull the periodic triggering of clearing into the server reporting loop, so we don't risk accidentally clearing right before reporting.


pkg/sql/app_stats.go, line 129 at r2 (raw file):

// numericStat collect the running values until the mean and variance
// are required.
type numericStat struct {

okay, yeah, now I really want to proto this.
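For reference, the running mean and variance that such a struct accumulates are typically maintained with Welford's online update; a standalone sketch (assumed field names, not the PR's code):

```go
package main

import "fmt"

// numericStat collects running values from which the mean and variance
// can be derived on demand (Welford's online algorithm).
type numericStat struct {
	count int64
	mean  float64
	m2    float64 // sum of squared deviations from the running mean
}

func (s *numericStat) record(x float64) {
	s.count++
	delta := x - s.mean
	s.mean += delta / float64(s.count)
	s.m2 += delta * (x - s.mean)
}

func (s *numericStat) variance() float64 {
	if s.count < 2 {
		return 0
	}
	return s.m2 / float64(s.count-1)
}

func main() {
	var s numericStat
	for _, x := range []float64{2, 4, 4, 4, 5, 5, 7, 9} {
		s.record(x)
	}
	fmt.Println(s.mean)       // 5
	fmt.Println(s.variance()) // sample variance, 32/7 ≈ 4.571
}
```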


pkg/sql/executor.go, line 362 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

That's sort of an odd construction. Normally we'd just call something like e.sqlStats.resetWorkerLoop(e.stopper) and have that create a worker goroutine.

ditto above -- this goes away and we pull it into the existing worker running a 24h loop in server. feel free to TODO(dt): remove this here and above.



@knz knz force-pushed the collect-queries branch from f580754 to 51da909 on March 16, 2017 at 23:17
@knz
Contributor Author

knz commented Mar 16, 2017

(will process your other comments in a 2nd round of tweaks)


Review status: 4 of 8 files reviewed at latest revision, 9 unresolved discussions.


pkg/sql/app_stats.go, line 44 at r1 (raw file):

Previously, dt (David Taylor) wrote…

out of scope for this PR but...

env var is fine for things we don't trust yet and want to hide, but in general, I'd like to be able to change settings like this at runtime without editing the startup scripts and rebooting the whole cluster.

Seems like we should have a system.settings table that nodes can cache for ~1m or something.

That's exactly it -- not trusted yet, not enabled by default. I envision that the feature will always be enabled later. Added a clarifying comment.


pkg/sql/app_stats.go, line 123 at r1 (raw file):

Previously, dt (David Taylor) wrote…

we could potentially size this map based on the old one (e.g. len(a.stmtStats)/2) to avoid having to re-grow incrementally after a clear.

Done.


pkg/sql/app_stats.go, line 128 at r1 (raw file):

Previously, dt (David Taylor) wrote…

ditto

Done.



@knz knz force-pushed the collect-queries branch from 51da909 to a12363b on March 17, 2017 at 00:51
@knz
Contributor Author

knz commented Mar 17, 2017

Review status: 2 of 10 files reviewed at latest revision, 7 unresolved discussions.


pkg/sql/app_stats.go, line 35 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I'm fine doing this in a followup, but I'd like to switch to a wrapped proto-defined struct as early as possible for holding the actual stats, since separating data from behavior in a normal struct later can get trickier than building it that way from the beginning.

Done.


pkg/sql/app_stats.go, line 144 at r1 (raw file):

Previously, dt (David Taylor) wrote…

I'm fine landing this as-is for now, but I'll probably pull the periodic triggering of clearing into the server reporting loop, so we don't risk accidentally clearing right before reporting.

Let's talk more; I have a feeling we'll all be happier if we persist this data in a SQL table rather than our ts database. In that case, we'd issue a SQL transaction here, at this custom interval. Then the reporting loop (actually the update loop, running once a day) can pick up the data to send from this SQL table.


pkg/sql/app_stats.go, line 149 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I wonder if we should dump these to the log before resetting.

Done.


pkg/sql/app_stats.go, line 129 at r2 (raw file):

Previously, dt (David Taylor) wrote…

okay, yeah, now I really want to proto this.

Done.


pkg/sql/crdb_internal.go, line 264 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

To abbreviate or not to abbreviate?

Not to. Done.


pkg/sql/executor.go, line 362 at r1 (raw file):

Previously, dt (David Taylor) wrote…

ditto above -- this goes away and we pull it into the existing worker running a 24h loop in server. feel free to TODO(dt): remove this here and above.

Done.



@knz knz force-pushed the collect-queries branch 3 times, most recently from 4fc7eeb to e1ef3ac on March 17, 2017 at 01:14
@petermattis
Collaborator

Review status: 2 of 10 files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


pkg/sql/app_stats.proto, line 42 at r5 (raw file):

  optional int64 MaxRetries = 3  [(gogoproto.nullable) = false];

  // LastError collects the last error encountered.

s/LastError/LastErr/g


pkg/sql/app_stats.proto, line 51 at r5 (raw file):

  optional NumericStat ParseLat = 6  [(gogoproto.nullable) = false];
  optional NumericStat PlanLat	= 7  [(gogoproto.nullable) = false];
  optional NumericStat RunLat	= 8  [(gogoproto.nullable) = false];

Tabs on the above lines.

Also, looking at this again, the terms Run and Execute are too similar. I'm not sure (without looking at the code) what they refer to. I think one needs to be renamed.



@dt
Contributor

dt commented Mar 17, 2017

Review status: 2 of 10 files reviewed at latest revision, 8 unresolved discussions, some commit checks pending.


pkg/sql/app_stats.proto, line 26 at r5 (raw file):

  // Count is the total number of times this statement was executed
  // since the beginning of the reporting period.
  optional int64 Count = 1  [(gogoproto.nullable) = false];

These don't need to be capitalized and usually aren't -- the generated Go code will cap them for us.


pkg/sql/app_stats.proto, line 36 at r5 (raw file):

  // The cumulative number of retries can be computed with
  // count - firstAttemptCount.
  optional int64 FirstAttemptCount = 2  [(gogoproto.nullable) = false];

we usually snake_case proto field names -- again, generated Go code will convert to more idiomatic CamelCase identifiers.


pkg/sql/app_stats.proto, line 50 at r5 (raw file):

  // phase latencies.
  optional NumericStat ParseLat = 6  [(gogoproto.nullable) = false];
  optional NumericStat PlanLat	= 7  [(gogoproto.nullable) = false];

is that a tab? spaces in protos.


pkg/sql/crdb_internal.go, line 264 at r5 (raw file):

var crdbInternalStmtStatsTable = virtualSchemaTable{
	schema: `
CREATE TABLE crdb_internal.statement_statistics (

nit: I'm slightly apprehensive of presenting the node-specific, unaggregated stats directly to users via SQL. Getting different answers depending on what node you ask (e.g. behind a load-balancer) seems like a ux-surprise.

I understand that this is a step towards some form of aggregation that can come later, it just seems that surfacing this intermediate step in SQL, a medium where we usually are consistent (thanks to being on top of a common kv) has the potential to be surprising.

my concerns would be mostly assuaged by something as simple as making it obvious what it is, e.g. node_statement_statistics.



@knz knz force-pushed the collect-queries branch from e1ef3ac to 51f2834 on March 17, 2017 at 18:31
@knz
Contributor Author

knz commented Mar 17, 2017

Review status: 2 of 10 files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.


pkg/sql/app_stats.proto, line 26 at r5 (raw file):

Previously, dt (David Taylor) wrote…

These don't need to be capitalized and usually aren't -- the generated Go code will cap them for us.

I had forgotten. Thanks for the reminder.


pkg/sql/app_stats.proto, line 36 at r5 (raw file):

Previously, dt (David Taylor) wrote…

we usually snake_case proto field names -- again, generated Go code will convert to more idiomatic CamelCase identifiers.

Done.


pkg/sql/app_stats.proto, line 42 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/LastError/LastErr/g

Done.


pkg/sql/app_stats.proto, line 50 at r5 (raw file):

Previously, dt (David Taylor) wrote…

is that a tab? spaces in protos.

Done.


pkg/sql/app_stats.proto, line 51 at r5 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Tabs on the above lines.

Also, looking at this again, the terms Run and Execute are too similar. I'm not sure (without looking at the code) what they refer to. I think one needs to be renamed.

"exec" renamed to "service latency"; added comments too.


pkg/sql/crdb_internal.go, line 264 at r5 (raw file):

Previously, dt (David Taylor) wrote…

nit: I'm slightly apprehensive of presenting the node-specific, unaggregated stats directly to users via SQL. Getting different answers depending on what node you ask (e.g. behind a load-balancer) seems like a ux-surprise.

I understand that this is a step towards some form of aggregation that can come later, it just seems that surfacing this intermediate step in SQL, a medium where we usually are consistent (thanks to being on top of a common kv) has the potential to be surprising.

my concerns would be mostly assuaged by something as simple as making it obvious what it is, e.g. node_statement_statistics.

Of the other 4 tables in crdb_internal currently, 3 are already node-local (all but tables). All of them (including tables) are guaranteed not to need access to the cluster to deliver a response, and that's a feature. I think there is a case to be made to expose node-specific data via SQL, even if we have to gate it behind a big warning sign.



@knz
Contributor Author

knz commented Mar 21, 2017

Please suggest what remains to be done here in light of today's discussion :)

@dt
Contributor

dt commented Mar 21, 2017

documenting our discussion this morning:

  • we should make it clearer these are node-specific (add node_id col, maybe prefix the tables with node_).
  • this PR is about metrics collection, not aggregation or persistence, which are still TBD.
  • IMO the value removal here serves to canonicalize queries for the purposes of bucketing, but is not sufficient sanitization to make it safe to upload them to Cockroach Labs: table and column names could easily be considered sensitive / proprietary info, so we'll want another sanitizer on the reporting side that cleans those up. There's likely some overlap with the TableDescriptor sanitizer there. This concern doesn't block merging this PR or anything relating to collection at all, just the followup reporting work.

…e_`.

Requested by @dt: the name of the internal tables that only retrieve
node-local information (i.e. not aggregated across the cluster) should
be clearly marked (in their name) as such.

This patch does this by adding a prefix `node_` to these tables.  Also
for good measure a column `node_id` is added to the tables so that a
client inspecting the table can detect when the source of the
information changes (e.g. via a load balancer, reconnect etc).
@knz knz force-pushed the collect-queries branch from 51f2834 to fb5adb8 on March 21, 2017 at 16:58
@knz
Contributor Author

knz commented Mar 21, 2017

I have renamed the other crdb_internal tables as suggested.

Note: value removal will occur in a subsequent PR. For now the bucketing includes the values. This is why the feature is currently gated.

knz added 2 commits on March 21, 2017 at 17:50
This patch extends/replaces the previously introduced mechanism that
tracks per-application statistics: statistics are now bucketed by
statement/query instead, but each application has its own set of
buckets (so we collect different stats per application).

Only the statement types that do not implement the (new) interface
`parser.HiddenFromStats` are collected. For example SET and SHOW are
excluded.

The collected data is exposed in the new virtual table
`crdb_internal.statement_statistics`, which replaces the previous
`app_stats` virtual table.

These statistics are only enabled if the environment variable
COCKROACH_SQL_STMT_STATS_ENABLE is set and true. This feature gate is
meant to be temporary, so as to not risk the stability of default
setups.

Also, the data is cleared from memory periodically, at the interval
set by COCKROACH_SQL_STMT_STATS_RESET_INTERVAL, by default set to 1
hour. When this happens, the data collected so far is dumped to the INFO
log. Later this can be replaced by a dump to some system table in the
database.
This patch starts tracking the following details per statement:

- total number of executions
- total number of first attempts (to separate implicit retries)
- maximum observed number of retries
- last error encountered
- average and variance for:
  - total number of rows
  - parse latency
  - logical plan latency
  - execution latency
  - overhead latency (total latency - parse - logical - execution)

The statements are also separately bucketed depending on:
- whether the execution failed with an error
- whether the query was distributed.

(This results in potentially 4 buckets per statement.)
@dt
Contributor

dt commented Mar 21, 2017

👍

@knz knz force-pushed the collect-queries branch from fb5adb8 to 8f4434f on March 21, 2017 at 17:50
@knz
Contributor Author

knz commented Mar 21, 2017

Thanks!
