Add queries for collecting query store runtime and wait statistics#8465
IgorKuchmienko wants to merge 9 commits into influxdata:master
Conversation
All the comments put for SQLDB apply to MI as well. I have looked closer at SQLDB for now but haven't tested on a running workload. For all the timestamp calculations, please add comments for maintainability. I also feel we should have some "check", given we now have a cache, to NOT run this query if it has been run, say, < 15 mins ago. Think of a person who had an old config file, who hasn't added this new addition to their "exclude" list, so this would for them run by default at a 10 second interval. Min collection should be 15 mins. Also, have you done any perf testing on collecting this on a SQL with workload? And for MI, on an MI with multiple databases running a workload?
The other thing I noticed: if someone changes the interval below the collection time, as is we are not "merging" or deduping right, so queries will get counted twice.
const sqlAzureDBQueryStoreRuntimeStatistics = sqlAzureDBPartQueryPeriod + `
You may want to check Updateability (until we have Query Store on read replicas). Otherwise you will be collecting Query Store data (duplicate or the same) from two replicas if a read replica in BC/Hyperscale is configured for monitoring. Collect only if it is READ_WRITE.
https://docs.microsoft.com/en-us/sql/t-sql/functions/databasepropertyex-transact-sql?view=sql-server-ver15
We still need this check here, i.e. collect only if it is a read/write database and not a read-only replica.
…ime_stats_interval_id. Need this aggregation so that we do not get multiple rows for the same query in the same time interval. See here https://docs.microsoft.com/en-us/sql/relational-databases/system-catalog-views/sys-query-store-runtime-stats-transact-sql?view=sql-server-ver15
Yes, this is exactly what the cache is used for. Whatever the interval in telegraf is set to, by using the cached time we guarantee that nothing is returned from telegraf while there is no new data in query store, and no duplicates are returned. I edited the comment in the description of this PR to elaborate on this, and I added comments in the implementation.
These tests were run for the queries.
In both cases, a CPU load was generated with a T-SQL script on the DB.
The plug-in was configured to output the data into a socket, and it was then streamed into an Azure Log Analytics workspace by the Log Analytics agent installed on the VM. The tests showed that the queries produce the data as expected: once per aggregation interval and with no duplicate records. There are no gaps in the data, and the telegraf log file tgf.log contains no errors. Test results are here: Tests.zip.
denzilribeiro left a comment:
Need some changes: error handling in the MI part, checks that databases are online and are not a read replica, and some minor changes of some columns to tags.
DB_NAME() AS [database_name],
CONVERT(nvarchar(30), MAX(rsi.start_time), 126) AS interval_start_time,
CONVERT(nvarchar(30), MAX(rsi.end_time), 126) AS interval_end_time,
MAX(q.query_id) AS query_id,

query_id should be a TAG and it isn't (i.e. convert it to a string), as it isn't a measurement, just like plan_id was converted.
DB_NAME() AS [database_name],
CONVERT(nvarchar(30), MAX(rsi.start_time), 126) AS interval_start_time,
CONVERT(nvarchar(30), MAX(rsi.end_time), 126) AS interval_end_time,
MAX(q.query_id) AS query_id,

query_id should be a TAG (convert to string); it isn't a measurement, same as plan_id.
[database_name] nvarchar(128),
interval_start_time nvarchar(30),
interval_end_time nvarchar(30),
query_id bigint,

query_id should be a string/tag, same as plan_id, as it isn't a measurement.
SET NOCOUNT ON;

DECLARE dbCursor CURSOR LOCAL FOR
SELECT

Need other conditions, i.e. that the database is online and read/write (not a read replica); check the state column in sys.databases.
This also needs error handling; check whether Query Store is enabled, as there is no TRY...CATCH block here and it is iterating through multiple databases.
FROM sys.dm_os_schedulers AS s
`

const sqlAzureMIPartQueryPeriod = `

This is a sizable query; have we thought of limiting what we retrieve? Take, for example, an MI with 20 databases: creating a temp table with everything in Query Store could be expensive. Should you get, say, the top 50 from each?
[database_name] nvarchar(128),
interval_start_time nvarchar(30),
interval_end_time nvarchar(30),
query_id bigint,

query_id should be a TAG (string), as it is not a measurement.
/* Convert start and end times to datetime */
DECLARE @currIntervalEndDate datetime = DATEADD(second, @currIntervalEndTimestamp % @secondsInDay, DATEADD(day, @currIntervalEndTimestamp / @secondsInDay, 0));
DECLARE @currIntervalStartDate datetime = DATEADD(second, @currIntervalStartTimestamp % @secondsInDay, DATEADD(day, @currIntervalStartTimestamp / @secondsInDay, 0));

@currIntervalStartDate and @currIntervalEndDate are declared as datetime, but the start_time and end_time columns are datetimeoffset, which has higher precision, so rounding can cause unexpected results. More importantly, on MI the time zone may be something other than UTC and may have daylight savings, which won't be handled correctly with datetime.

@dimitri-furman Is the recommendation to use datetime2 in all places? datetime2 seems to have higher precision than datetime; however, it is not aware of the time-zone offset.

Query Store uses datetimeoffset in sys.query_store_runtime_stats_interval. We should stay consistent and use the same data type to avoid type-conversion problems. But you should check whether the current time math that uses datetime still works correctly with datetimeoffset.
'sqlserver_azuredb_querystore_runtime_stats' AS [measurement],
REPLACE(@@SERVERNAME, '\', ':') AS [sql_instance],
DB_NAME() AS [database_name],
CONVERT(nvarchar(30), MAX(rsi.start_time), 126) AS interval_start_time,

This partially truncates the time-zone offset. Need to use nvarchar(33) or longer.
RETURN
END;

IF OBJECT_ID ('tempdb.dbo.#ExecForeachDb') IS NOT NULL DROP PROCEDURE #ExecForeachDb;
Simplify: DROP PROCEDURE IF EXISTS #ExecForeachDb;
BEGIN
SET NOCOUNT ON;

DECLARE dbCursor CURSOR LOCAL FOR

Declaring the cursor as FAST_FORWARD will be more efficient.
FETCH NEXT FROM dbCursor INTO @dbName;

WHILE @@Fetch_Status=0 BEGIN
SET @CurrSqlText = N''USE ''+ QUOTENAME(@dbName ,''"'') + @queryText

A database may be dropped between the time the cursor is opened and the time this USE executes.
Hi @IgorKuchmienko, are you still able to work on this PR? Please let me know if there's anything we can do to help.
Yes, we'll be working on this PR. Unfortunately, the PR got delayed and the ETA is around July.
Splitting up this PR into 2 for easier review. New PR: #9150 (drafted WIP to address review comments)
@denzilribeiro: Re "Think of a person who had an old config file, who hasn't added this new addition to their 'exclude' list so this would for them run by default at a 10 second interval. Min collection should be 15 mins.": for a user with an old config and the latest telegraf bits, yes, it would start to collect QDS data at the interval specified. One possible way to avoid this is to introduce another config parameter, "query_store_collection", set to false by default. If a user wants to enable the QDS queries, they will need to set query_store_collection = true and also remove the QDS query list from exclude. If they just change query_store_collection to true without modifying the exclude list, data will still not be collected. Let me know your thoughts.
Closing in favor of #9355 |
Add queries for collecting query store runtime and wait statistics, for Azure SQL DB and MI.
Add ability to store in a hashtable a custom value associated with the query result, and use it when the query is run next time.
For query store queries, we store the time when the query was run last, and in the next run we only return the data between the last run time and the current time, rounded down to a multiple of the reporting interval.
The implementation takes care of two cases:
The telegraf collection interval is shorter than the Query Store aggregation interval. In this case, the query will return data only once per aggregation interval, and will return an empty set for each subsequent telegraf call within the same aggregation interval.
The telegraf collection interval is longer than the Query Store aggregation interval. In this case, the query will return records from several aggregation intervals, i.e. all the new data since the previous telegraf call.
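The two cases above can be sketched in Go. This is a simplified model of the caching scheme the description explains, not the PR's actual code: identifiers like `nextWindow` and `queryCache` are illustrative, and timestamps are plain Unix seconds.

```go
package main

import "fmt"

// queryCache models the hashtable the PR describes: a custom value (here the
// end of the last reported window) stored per query and reused on the next run.
type queryCache map[string]int64

// nextWindow returns the half-open window [start, end) to collect, where end
// is the current time rounded down to a multiple of the aggregation interval.
// ok=false means no full new interval has elapsed, so the query should return
// an empty set (case 1); a window spanning several intervals covers case 2.
func nextWindow(cache queryCache, query string, now, intervalSec int64) (start, end int64, ok bool) {
	end = now - now%intervalSec // round down to the interval boundary
	start = cache[query]
	if end <= start {
		return 0, 0, false // still inside the interval reported last time
	}
	cache[query] = end // remember how far we have reported
	return start, end, true
}

func main() {
	cache := queryCache{}
	const interval = int64(900) // e.g. a 15-minute aggregation interval

	s, e, ok := nextWindow(cache, "runtime_stats", 1000, interval)
	fmt.Println(s, e, ok) // 0 900 true: first call reports up to the boundary

	_, _, ok = nextWindow(cache, "runtime_stats", 1010, interval)
	fmt.Println(ok) // false: same interval, empty set, no duplicates

	s, e, ok = nextWindow(cache, "runtime_stats", 2800, interval)
	fmt.Println(s, e, ok) // 900 2700 true: one window covering two intervals
}
```

This makes the dedup guarantee concrete: whatever interval telegraf runs at, consecutive windows never overlap and never leave a gap.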