
sql: ensure SQL Stats cleanup job to not interfere with foreground traffic#80515

Merged
craig[bot] merged 1 commit into cockroachdb:master from Azhng:sqlstats-1h-guard
May 4, 2022

Conversation

@Azhng
Contributor

@Azhng Azhng commented Apr 25, 2022

Related to #79548

Previously, if a workload generated a large amount of unique
statement/transaction fingerprints, the number of distinct fingerprints
within the latest aggregation interval could become so large that it
exceeded the limit defined by sql.stats.persisted_rows.max. When this
happened, the background SQL Stats cleanup job would issue DELETE
queries that contended with the flush traffic, which could lead to
spiking query latencies.

This commit ensures that the background SQL Stats cleanup job completely
avoids scanning over the overlapping key ranges touched by the SQL Stats
flush.

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@Azhng Azhng force-pushed the sqlstats-1h-guard branch 2 times, most recently from 544d788 to 67f4b62 on April 25, 2022 23:48
@Azhng Azhng mentioned this pull request Apr 26, 2022
5 tasks
@Azhng Azhng requested review from a team and mgartner April 26, 2022 01:53
@Azhng Azhng marked this pull request as ready for review April 26, 2022 01:53
@Azhng Azhng force-pushed the sqlstats-1h-guard branch from 67f4b62 to 4fc717f on April 26, 2022 01:55
Contributor

@mgartner mgartner left a comment

Reviewed 3 of 5 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @mgartner)


pkg/sql/sqlstats/persistedsqlstats/compaction_exec.go line 168 at r1 (raw file):

		for remainToBeRemoved := rowsToRemove; remainToBeRemoved > 0; {
			if len(lastDeletedRow) > 0 {
				stmt = ops.subsequentDeleteStmt

I think it'd simplify things to make a method on cleanupOperations (or maybe just a method of StatsCompactor, do you need the cleanupOperations struct?) that returns the stmt to run, given lastDeletedRow. It would return the unconstrained query if lastDeletedRow was empty.
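The suggestion could look roughly like the following; cleanupOperations and the field names are hypothetical simplifications of the code under review, not its actual API:

```go
package main

import "fmt"

// cleanupOperations is a hypothetical simplification of the struct under
// review, holding the two variants of the paginated DELETE statement.
type cleanupOperations struct {
	initialDeleteStmt    string // unconstrained: first batch of the scan
	subsequentDeleteStmt string // keyset-constrained: resumes after lastDeletedRow
}

// deleteStmt picks the statement for this iteration, replacing the branch
// inside the loop: the unconstrained query when nothing has been deleted
// yet, the constrained one afterwards.
func (ops *cleanupOperations) deleteStmt(lastDeletedRow []interface{}) string {
	if len(lastDeletedRow) == 0 {
		return ops.initialDeleteStmt
	}
	return ops.subsequentDeleteStmt
}

func main() {
	ops := &cleanupOperations{
		initialDeleteStmt:    "DELETE ... LIMIT $1",
		subsequentDeleteStmt: "DELETE ... WHERE (aggregated_ts, fingerprint_id) > ($2, $3) LIMIT $1",
	}
	fmt.Println(ops.deleteStmt(nil))
	fmt.Println(ops.deleteStmt([]interface{}{"ts", "fp"}))
}
```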


pkg/sql/sqlstats/persistedsqlstats/compaction_exec.go line 224 at r1 (raw file):

func getQargs(shardIdx, limit int64, lastDeletedRow tree.Datums) []interface{} {
	qargs := make([]interface{}, 0, len(lastDeletedRow)+2)

nit: you could make this a field of StatsCompactor so it can be reused to avoid allocations.
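The allocation-avoidance nit might look like this sketch; statsCompactor and getQargs here are illustrative stand-ins for the reviewed code:

```go
package main

import "fmt"

// statsCompactor is an illustrative stand-in for StatsCompactor; keeping
// qargs as a field lets each iteration reuse the slice's backing array
// instead of allocating a fresh one per call.
type statsCompactor struct {
	qargs []interface{}
}

func (c *statsCompactor) getQargs(shardIdx, limit int64, lastDeletedRow []interface{}) []interface{} {
	c.qargs = c.qargs[:0] // truncate, keeping capacity from prior iterations
	c.qargs = append(c.qargs, shardIdx, limit)
	c.qargs = append(c.qargs, lastDeletedRow...)
	return c.qargs
}

func main() {
	var c statsCompactor
	fmt.Println(len(c.getQargs(1, 1024, []interface{}{"ts", "fp"}))) // 4
	fmt.Println(len(c.getQargs(2, 1024, nil)))                       // 2
}
```

The trade-off is that callers must not hold on to the returned slice across iterations, since the next call overwrites it.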


pkg/sql/sqlstats/persistedsqlstats/query_utils.go line 41 at r1 (raw file):

	deleteFromStmtStatsWithWideScan = `
		DELETE FROM system.statement_statistics

It would be nice to have some tests that verify the query plans of these statements. We typically keep those types of tests in pkg/sql/opt/exec/execbuilder/testdata/, but I'm not sure of a good way to ensure that any queries we put there stay consistent with the queries here... I suppose once we have UDFs we can define these in a single place and reference them here and in execbuilder tests.


pkg/sql/sqlstats/persistedsqlstats/query_utils.go line 105 at r1 (raw file):

}

func (c *StatsCompactor) cleanupOpForTable(tableName string) *cleanupOperations {

In most cases I think it's best to define methods of a struct in a single place. Any particular reason for creating a new file here?


pkg/sql/sqlstats/persistedsqlstats/query_utils.go line 108 at r1 (raw file):

	switch tableName {
	case "system.statement_statistics":
		return &cleanupOperations{

Might be simpler to make these singleton vars and eliminate this function altogether. You could also define the SQL strings directly in the structs so the unconstrained and constrained queries could be compared side by side.
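As a sketch of the singleton idea (all names and SQL fragments below are placeholders, not the actual queries): one package-level value per table removes the switch and puts the two query variants next to each other:

```go
package main

import "fmt"

// cleanupOperations pairs the two DELETE variants for one table. The SQL
// bodies are elided placeholders; only the structure is the point here.
type cleanupOperations struct {
	unconstrainedDeleteStmt string
	constrainedDeleteStmt   string
}

// Package-level singletons, one per table, replacing cleanupOpForTable.
var (
	stmtStatsCleanupOps = cleanupOperations{
		unconstrainedDeleteStmt: `DELETE FROM system.statement_statistics /* ... */ LIMIT $1`,
		constrainedDeleteStmt:   `DELETE FROM system.statement_statistics /* ... keyset predicate ... */ LIMIT $1`,
	}
	txnStatsCleanupOps = cleanupOperations{
		unconstrainedDeleteStmt: `DELETE FROM system.transaction_statistics /* ... */ LIMIT $1`,
		constrainedDeleteStmt:   `DELETE FROM system.transaction_statistics /* ... keyset predicate ... */ LIMIT $1`,
	}
)

func main() {
	fmt.Println(stmtStatsCleanupOps.unconstrainedDeleteStmt != stmtStatsCleanupOps.constrainedDeleteStmt)
	fmt.Println(txnStatsCleanupOps.constrainedDeleteStmt != "")
}
```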

@mgartner
Contributor

pkg/sql/sqlstats/persistedsqlstats/query_utils.go line 41 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

It would be nice to have some tests that verify the query plans of these statements. We typically keep those types of tests in pkg/sql/opt/exec/execbuilder/testdata/, but I'm not sure of a good way to ensure that any queries we put there stay consistent with the queries here... I suppose once we have UDFs we can define these in a single place and reference them here and in execbuilder tests.

We could define these as builtins that run the internal executor, like

"pg_sequence_parameters": makeBuiltin(tree.FunctionProperties{DistsqlBlocklist: true},
// pg_sequence_parameters is an undocumented Postgres builtin that returns
// information about a sequence given its OID. It's nevertheless used by
// at least one UI tool, so we provide an implementation for compatibility.
// The real implementation returns a record; we fake it by returning a
// comma-delimited string enclosed by parentheses.
// TODO(jordan): convert this to return a record type once we support that.
tree.Overload{
Types: tree.ArgTypes{{"sequence_oid", types.Oid}},
ReturnType: tree.FixedReturnType(types.String),
Fn: func(ctx *tree.EvalContext, args tree.Datums) (tree.Datum, error) {
r, err := ctx.Planner.QueryRowEx(
ctx.Ctx(), "pg_sequence_parameters",
ctx.Txn,
sessiondata.NoSessionDataOverride,
`SELECT seqstart, seqmin, seqmax, seqincrement, seqcycle, seqcache, seqtypid `+
`FROM pg_catalog.pg_sequence WHERE seqrelid=$1`, args[0])
if err != nil {
return nil, err
}
if len(r) == 0 {
return nil, pgerror.Newf(pgcode.UndefinedTable, "unknown sequence (OID=%s)", args[0])
}
seqstart, seqmin, seqmax, seqincrement, seqcycle, seqcache, seqtypid := r[0], r[1], r[2], r[3], r[4], r[5], r[6]
seqcycleStr := "t"
if seqcycle.(*tree.DBool) == tree.DBoolFalse {
seqcycleStr = "f"
}
return tree.NewDString(fmt.Sprintf("(%s,%s,%s,%s,%s,%s,%s)", seqstart, seqmin, seqmax, seqincrement, seqcycleStr, seqcache, seqtypid)), nil
},
Info: notUsableInfo,
Volatility: tree.VolatilityStable,
},
),
, but I'm skeptical that this is a pattern we want to repeat.

@mgartner
Contributor

I'm realizing that some of my comments are duplicated from the ones I made in #80341. I thought that Reviewable had lost them. Next time you have a PR stacked on top of another that's in review, can you post a link to it in a comment so it's easy to find the other one?

@Azhng Azhng force-pushed the sqlstats-1h-guard branch from 4fc717f to 662d203 on April 28, 2022 21:41
Contributor Author

@Azhng Azhng left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/sqlstats/persistedsqlstats/query_utils.go line 41 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

We could define these as builtins that run the internal executor, like

"pg_sequence_parameters": makeBuiltin(tree.FunctionProperties{DistsqlBlocklist: true},
// pg_sequence_parameters is an undocumented Postgres builtin that returns
// information about a sequence given its OID. It's nevertheless used by
// at least one UI tool, so we provide an implementation for compatibility.
// The real implementation returns a record; we fake it by returning a
// comma-delimited string enclosed by parentheses.
// TODO(jordan): convert this to return a record type once we support that.
tree.Overload{
Types: tree.ArgTypes{{"sequence_oid", types.Oid}},
ReturnType: tree.FixedReturnType(types.String),
Fn: func(ctx *tree.EvalContext, args tree.Datums) (tree.Datum, error) {
r, err := ctx.Planner.QueryRowEx(
ctx.Ctx(), "pg_sequence_parameters",
ctx.Txn,
sessiondata.NoSessionDataOverride,
`SELECT seqstart, seqmin, seqmax, seqincrement, seqcycle, seqcache, seqtypid `+
`FROM pg_catalog.pg_sequence WHERE seqrelid=$1`, args[0])
if err != nil {
return nil, err
}
if len(r) == 0 {
return nil, pgerror.Newf(pgcode.UndefinedTable, "unknown sequence (OID=%s)", args[0])
}
seqstart, seqmin, seqmax, seqincrement, seqcycle, seqcache, seqtypid := r[0], r[1], r[2], r[3], r[4], r[5], r[6]
seqcycleStr := "t"
if seqcycle.(*tree.DBool) == tree.DBoolFalse {
seqcycleStr = "f"
}
return tree.NewDString(fmt.Sprintf("(%s,%s,%s,%s,%s,%s,%s)", seqstart, seqmin, seqmax, seqincrement, seqcycleStr, seqcache, seqtypid)), nil
},
Info: notUsableInfo,
Volatility: tree.VolatilityStable,
},
),
, but I'm skeptical that this is a pattern we want to repeat.

To clarify, is this something I should implement for this commit, or should we wait for the implementation of UDFs?

@mgartner
Contributor

pkg/sql/sqlstats/persistedsqlstats/query_utils.go line 41 at r1 (raw file):

Previously, Azhng (Archer Zhang) wrote…

To clarify, is this something I should implement for this commit, or should we wait for the implementation of UDFs?

I was trying to think of ways we can notice and prevent query plan regressions for these delete queries before shipping them. I think it'd be wise to add these queries to a new file pkg/sql/opt/exec/execbuilder/testdata/sql_activity_stats_compaction so we can view the query plans and make sure they are still efficient. This should also help prevent query plan regressions as the optimizer evolves. I'd leave notes both above the definition of these queries to make it clear that the tests you add in execbuilder need to be updated if the queries change, and a note above the tests pointing to where they are defined.

@Azhng Azhng force-pushed the sqlstats-1h-guard branch from 662d203 to a822d6d on May 3, 2022 22:29
Contributor Author

@Azhng Azhng left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/opt/exec/execbuilder/testdata/sql_activity_stats_compaction line 80 at r4 (raw file):

                              estimated row count: 9 (missing stats)
                              table: statement_statistics@primary
                              spans

@mgartner for some reason the spans here are not rendered at all when I run this test with the -rewrite flag. Any thoughts on how I should fix this?


pkg/sql/sqlstats/persistedsqlstats/query_utils.go line 41 at r1 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

I was trying to think of ways we can notice and prevent query plan regressions for these delete queries before shipping them. I think it'd be wise to add these queries to a new file pkg/sql/opt/exec/execbuilder/testdata/sql_activity_stats_compaction so we can view the query plans and make sure they are still efficient. This should also help prevent query plan regressions as the optimizer evolves. I'd leave notes both above the definition of these queries to make it clear that the tests you add in execbuilder need to be updated if the queries change, and a note above the tests pointing to where they are defined.

Test added

Contributor Author

@Azhng Azhng left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/opt/exec/execbuilder/testdata/sql_activity_stats_compaction line 80 at r4 (raw file):

Previously, Azhng (Archer Zhang) wrote…

@mgartner for some reason the spans here are not rendered at all when I run this test with the -rewrite flag. Any thoughts on how I should fix this?

Or perhaps it's fine to ignore, since the unit tests already check the scan spans through KV request filters?

Contributor

@mgartner mgartner left a comment

Reviewed 5 of 5 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @mgartner)


pkg/sql/opt/exec/execbuilder/testdata/sql_activity_stats_compaction line 30 at r4 (raw file):

    AND aggregated_ts < date_trunc('hour', now())
  ORDER BY aggregated_ts ASC
  LIMIT 10

nit: Let's use the default batch size for the LIMIT here so that we mimic the default query as closely as possible.
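One way to keep the test's LIMIT in lockstep with the production query is to share a single constant between them; the name and the value 1024 below are hypothetical, not CockroachDB's actual default batch size:

```go
package main

import "fmt"

// defaultDeleteBatchSize is a hypothetical shared constant; the production
// query builder and the execbuilder test fixture would both reference it,
// so the plan under test always uses the real LIMIT.
const defaultDeleteBatchSize = 1024

func cleanupStmtWithDefaultLimit(table string) string {
	return fmt.Sprintf("DELETE FROM %s /* ... */ LIMIT %d", table, defaultDeleteBatchSize)
}

func main() {
	fmt.Println(cleanupStmtWithDefaultLimit("system.statement_statistics"))
}
```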


pkg/sql/opt/exec/execbuilder/testdata/sql_activity_stats_compaction line 80 at r4 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Or perhaps it's fine to ignore, since the unit tests already check the scan spans through KV request filters?

The spans aren't rendered because there aren't any - the optimizer has determined there is a contradiction in the query filter (and for reasons not super important here, the scan cannot be reduced to an empty values operator). So this delete statement won't delete any rows. We should not ignore this - it seems like a serious issue.

I'm trying to figure out what the contradiction is, but no luck yet. Let me know if you think of anything.

@mgartner
Contributor

mgartner commented May 4, 2022

pkg/sql/opt/exec/execbuilder/testdata/sql_activity_stats_compaction line 80 at r4 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

The spans aren't rendered because there aren't any - the optimizer has determined there is a contradiction in the query filter (and for reasons not super important here, the scan cannot be reduced to an empty values operator). So this delete statement won't delete any rows. We should not ignore this - it seems like a serious issue.

I'm trying to figure out what the contradiction is, but no luck yet. Let me know if you think of anything.

Ahh, the contradiction is from the filters on aggregated_ts — it cannot be both >= now() and < date_trunc('hour', now()) (unless now() is at the exact top of the hour).

@Azhng Azhng force-pushed the sqlstats-1h-guard branch from a822d6d to c4e105d on May 4, 2022 16:09
Contributor Author

@Azhng Azhng left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/opt/exec/execbuilder/testdata/sql_activity_stats_compaction line 30 at r4 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: Let's use the default batch size for the LIMIT here so that we mimic the default query as closely as possible.

Done.


pkg/sql/opt/exec/execbuilder/testdata/sql_activity_stats_compaction line 80 at r4 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

Ahh, the contradiction is from the filters on aggregated_ts — it cannot be both >= now() and < date_trunc('hour', now()) (unless now() is at the exact top of the hour).

Ah thanks so much! Done.

@Azhng Azhng force-pushed the sqlstats-1h-guard branch from c4e105d to 4179255 on May 4, 2022 16:13
Contributor Author

@Azhng Azhng left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/opt/exec/execbuilder/testdata/sql_activity_stats_compaction line 80 at r4 (raw file):

Previously, Azhng (Archer Zhang) wrote…

Ah thanks so much! Done.

Also hardcoded the timestamp; I just realized that using now() would make the test non-deterministic.

Contributor

@mgartner mgartner left a comment

Reviewed 1 of 4 files at r2, 2 of 4 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @Azhng and @mgartner)


pkg/sql/opt/exec/execbuilder/testdata/sql_activity_stats_compaction line 20 at r5 (raw file):

query T
EXPLAIN (VERBOSE)
DELETE FROM system.statement_statistics

nit: add the unconstrained queries for both tables here too.

@Azhng Azhng force-pushed the sqlstats-1h-guard branch from 4179255 to 2aab526 on May 4, 2022 18:40
Contributor Author

@Azhng Azhng left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @mgartner)


pkg/sql/opt/exec/execbuilder/testdata/sql_activity_stats_compaction line 20 at r5 (raw file):

Previously, mgartner (Marcus Gartner) wrote…

nit: add the unconstrained queries for both tables here too.

Done.

Contributor

@mgartner mgartner left a comment

:lgtm:

Reviewed 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mgartner)

@Azhng
Contributor Author

Azhng commented May 4, 2022

TFTR!

bors r+

@craig
Contributor

craig bot commented May 4, 2022

Build succeeded:

@blathers-crl

blathers-crl bot commented May 4, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 2aab526 to blathers/backport-release-21.2-80515: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Contributor

@andreimatei andreimatei left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/sql/sqlstats/persistedsqlstats/compaction_test.go line 271 at r6 (raw file):

}

func TestSQLStatsForegroundInterference(t *testing.T) {

please comment on what "interference" this test is concerned with, and how it tests that there's no such interference. I've read it and couldn't figure it out.


pkg/sql/sqlstats/persistedsqlstats/compaction_test.go line 303 at r6 (raw file):

	serverSQLStats.Flush(ctx)

	statsCompactor := persistedsqlstats.NewStatsCompactor(

just curious - is this statsCompactor running in parallel with the server's?

Contributor Author

@Azhng Azhng left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/sql/sqlstats/persistedsqlstats/compaction_test.go line 271 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please comment on what "interference" this test is concerned with, and how it tests that there's no such interference. I've read it and couldn't figure it out.

#81108 (comment)


pkg/sql/sqlstats/persistedsqlstats/compaction_test.go line 303 at r6 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

just curious - is this statsCompactor running in parallel with the server's?

Yes, it is.
