jobs: avoid crdb_internal.system_jobs in gc-jobs#108093
Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom Aug 8, 2023
Merged
jobs: avoid crdb_internal.system_jobs in gc-jobs#108093craig[bot] merged 1 commit intocockroachdb:masterfrom
craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
The crdb_internal.system_jobs is a virtual table that joins
information from the jobs table and the jobs_info table.
For the previous query,
SELECT id, payload, status FROM "".crdb_internal.system_jobs
WHERE (created < $1) AND (id > $2)
ORDER BY id
LIMIT $3
this is a little suboptimal because:
- We don't make use of the progress column so any read of that is
useless.
- While the crdb_internal.virtual table has a virtual index on job id,
and EXPLAIN will even claim that it will be used:
• limit
│ count: 100
│
└── • filter
│ filter: created < '2023-07-20 07:29:01.17001'
│
└── • virtual table
table: system_jobs@system_jobs_id_idx
spans: [/101 - ]
This is actually a lie. A virtual index can only handle single-key
spans. As a result the unconstrained query is used:
WITH
latestpayload AS (SELECT job_id, value FROM system.job_info AS payload WHERE info_key = 'legacy_payload' ORDER BY written DESC),
latestprogress AS (SELECT job_id, value FROM system.job_info AS progress WHERE info_key = 'legacy_progress' ORDER BY written DESC)
SELECT
DISTINCT(id), status, created, payload.value AS payload, progress.value AS progress,
created_by_type, created_by_id, claim_session_id, claim_instance_id, num_runs, last_run, job_type
FROM system.jobs AS j
INNER JOIN latestpayload AS payload ON j.id = payload.job_id
LEFT JOIN latestprogress AS progress ON j.id = progress.job_id
which has a full scan of the jobs table and 2 full scans of the info
table:
• distinct
│ distinct on: id, value, value
│
└── • merge join
│ equality: (job_id) = (id)
│
├── • render
│ │
│ └── • filter
│ │ estimated row count: 7,318
│ │ filter: info_key = 'legacy_payload'
│ │
│ └── • scan
│ estimated row count: 14,648 (100% of the table; stats collected 39 minutes ago; using stats forecast for 2 hours in the future)
│ table: job_info@primary
│ spans: FULL SCAN
│
└── • merge join (right outer)
│ equality: (job_id) = (id)
│ right cols are key
│
├── • render
│ │
│ └── • filter
│ │ estimated row count: 7,317
│ │ filter: info_key = 'legacy_progress'
│ │
│ └── • scan
│ estimated row count: 14,648 (100% of the table; stats collected 39 minutes ago; using stats forecast for 2 hours in the future)
│ table: job_info@primary
│ spans: FULL SCAN
│
└── • scan
missing stats
table: jobs@primary
spans: FULL SCAN
Because of the limit, I don't think this ends up being as bad as it
looks. But it still isn't great.
In this PR, we replace crdb_internal.jobs with a query that removes
the join on the unused progress field and also constrains the query of
the job_info table.
• distinct
│ distinct on: id, value
│
└── • merge join
│ equality: (job_id) = (id)
│ right cols are key
│
├── • render
│ │
│ └── • filter
│ │ estimated row count: 7,318
│ │ filter: info_key = 'legacy_payload'
│ │
│ └── • scan
│ estimated row count: 14,646 (100% of the table; stats collected 45 minutes ago; using stats forecast for 2 hours in the future)
│ table: job_info@primary
│ spans: [/101/'legacy_payload' - ]
│
└── • render
│
└── • limit
│ count: 100
│
└── • filter
│ filter: created < '2023-07-20 07:29:01.17001'
│
└── • scan
missing stats
table: jobs@primary
spans: [/101 - ]
In a local example, this does seem faster:
> SELECT id, payload, status, created
> FROM "".crdb_internal.system_jobs
> WHERE (created < '2023-07-20 07:29:01.17001') AND (id > 100) ORDER BY id LIMIT 100;
id | payload | status | created
-----+---------+--------+----------
(0 rows)
Time: 183ms total (execution 183ms / network 0ms)
> WITH
> latestpayload AS (
> SELECT job_id, value
> FROM system.job_info AS payload
> WHERE job_id > 100 AND info_key = 'legacy_payload'
> ORDER BY written desc
> ),
> jobpage AS (
> SELECT id, status, created
> FROM system.jobs
> WHERE (created < '2023-07-20 07:29:01.17001') and (id > 100)
> ORDER BY id
> LIMIT 100
> )
> SELECT distinct (id), latestpayload.value AS payload, status
> FROM jobpage AS j
> INNER JOIN latestpayload ON j.id = latestpayload.job_id;
id | payload | status
-----+---------+---------
(0 rows)
Time: 43ms total (execution 42ms / network 0ms)
Release note: None
Epic: none
Member
adityamaru
approved these changes
Aug 3, 2023
|
|
||
| const expiredJobsQueryWithJobInfoTable = ` | ||
| WITH | ||
| latestpayload AS ( |
Contributor
There was a problem hiding this comment.
having to unmarshal the payload just to read FinishedMicros is 😢
Contributor
|
Any plans on backporting this? |
Collaborator
Author
|
@miretskiy Yes, I think we should backport this. |
Collaborator
Author
|
bors r=adityamaru |
Contributor
|
Build succeeded: |
7 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The crdb_internal.system_jobs is a virtual table that joins information from the jobs table and the jobs_info table.
For the previous query,
this is a little suboptimal because:
We don't make use of the progress column so any read of that is useless.
While the crdb_internal.virtual table has a virtual index on job id, and EXPLAIN will even claim that it will be used:
This is actually a lie. A virtual index can only handle single-key spans. As a result the unconstrained query is used:
which has a full scan of the jobs table and 2 full scans of the info table:
Because of the limit, I don't think this ends up being as bad as it looks. But it still isn't great.
In this PR, we replace crdb_internal.jobs with a query that removes the join on the unused progress field and also constrains the query of the job_info table.
In a local example, this does seem faster:
Release note: None
Epic: none