sql: automatically cancel stats jobs that are running for too long #118584
Open
Labels
- A-sql-table-stats: Table statistics (and their automatic refresh).
- C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
- O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
- P-3: Issues/test failures with no fix SLA
- T-sql-queries: SQL Queries Team
Project status: Backlog
Is your feature request related to a problem? Please describe.
We have seen cases where stats jobs get stuck for various reasons (e.g., a bug in the jobs system). Since we only allow one stats job at a time in the entire cluster, a single stuck job can block stats collection for every table in the cluster until it is canceled.
Describe the solution you'd like
We should detect when a stats job has been running for too long and cancel it automatically. For example, we might cancel any stats job that has been running for longer than one day, with that threshold being configurable.
Even if we eventually allow multiple stats jobs to run at the same time, we should probably still cancel long-running stats jobs. The only downside to canceling is some wasted work, and it ensures that no table goes too long without a stats refresh.
Jira issue: CRDB-35799