distsql: avoid planning on a draining node.#128283
distsql: avoid planning on a draining node.#128283craig[bot] merged 2 commits intocockroachdb:masterfrom
Conversation
|
The test for external-process mode is failing because the plan still includes the drained 5th node. This happens because setting Code quote: testutils.SucceedsWithin(t, func() error {
// TODO: REMOVE
res := r.QueryStr(t, "SELECT now(), id, encode(session_id, 'hex'), is_draining FROM system.sql_instances WHERE addr IS NOT NULL")
t.Logf("res : %v", res)
return checkQueryPlan(t, r, expectedPlans)
}, 4*time.Second) |
cc10bf1 to
661e2dc
Compare
shubhamdhama
left a comment
There was a problem hiding this comment.
This is also ready for review.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dt, @jeffswenson, @stevendanna, and @yuzefovich)
pkg/sql/distsql_physical_planner_test.go line 832 at r2 (raw file):
Previously, shubhamdhama (Shubham Dhama) wrote…
The test for external-process mode is failing because the plan still includes the drained 5th node. This happens because setting
is_drainingto true isn't being reflected on the gateway node. However, this works correctly on a roachprod cluster. Despite extensive investigation, I couldn't figure out why this is happening, so I would really appreciate any help.
Oh btw, this was fixed. Earlier, I was draining system tenant instance which I now changed with the correct sql instance.
jeffswenson
left a comment
There was a problem hiding this comment.
LGTM
I think you should also get a review from someone on sql foundations or sql queries. The functionality straddles the two teams. Historically sql foundations owned the sql_instances infrastructure and sql queries still owns distsql planning which is impacted by this change.
85f61c1 to
29d22c9
Compare
29d22c9 to
2f8894c
Compare
DrewKimball
left a comment
There was a problem hiding this comment.
Reviewed 6 of 22 files at r1, 6 of 18 files at r3, 13 of 13 files at r5, 4 of 4 files at r6, 4 of 4 files at r7, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @dt, @jeffswenson, @shubhamdhama, and @stevendanna)
pkg/sql/distsql_physical_planner.go line 1725 at r7 (raw file):
} // GetAvailableInstances returns mostly healthy nodes, except those that have
[nit] Why the comment change?
pkg/sql/sqlinstance/instancestorage/instancereader.go line 138 at r5 (raw file):
Locality: row.locality, BinaryVersion: row.binaryVersion, Region: row.region,
Is it problematic that we weren't filling in Region before?
pkg/upgrade/upgrades/v24_3_sql_instances_add_draining.go line 39 at r6 (raw file):
mutableDesc.TableDescriptor = *expectedDesc mutableDesc.Version = version return txn.Descriptors().WriteDesc(ctx, false, mutableDesc, txn.KV())
Will WriteDesc increment the descriptor version?
shubhamdhama
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained (waiting on @DrewKimball, @dt, @jeffswenson, and @stevendanna)
pkg/sql/distsql_physical_planner.go line 1725 at r7 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
[nit] Why the comment change?
Done. (Nice catch, in my previous revision I removed this function).
pkg/sql/sqlinstance/instancestorage/instancereader.go line 138 at r5 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Is it problematic that we weren't filling in
Regionbefore?
It wasn't problematic earlier, but I figured it's most likely a miss.
pkg/upgrade/upgrades/v24_3_sql_instances_add_draining.go line 39 at r6 (raw file):
Previously, DrewKimball (Drew Kimball) wrote…
Will
WriteDescincrement the descriptor version?
Yes,
cockroach/pkg/sql/catalog/descs/collection.go
Lines 559 to 566 in 8551145
cockroach/pkg/sql/catalog/descs/collection.go
Lines 276 to 289 in 8551145
2f8894c to
6337d93
Compare
|
@fqazi: Please could you take a look from the SQL Foundations side? Thanks. |
fqazi
left a comment
There was a problem hiding this comment.
One minor nit
Reviewed 6 of 22 files at r1, 4 of 18 files at r3, 3 of 13 files at r5, 2 of 4 files at r6, 10 of 10 files at r8, 4 of 4 files at r9, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @DrewKimball, @dt, @jeffswenson, @shubhamdhama, and @stevendanna)
pkg/upgrade/upgrades/v24_3_sql_instances_add_draining.go line 37 at r8 (raw file):
} version := mutableDesc.Version mutableDesc.TableDescriptor = *expectedDesc
Use protoutil.Clone here
6337d93 to
f1046f1
Compare
|
Previously, fqazi (Faizan Qazi) wrote…
Done. |
f1046f1 to
bebbfa5
Compare
|
Thanks everyone for the review. bors r=jeffswenson,DrewKimball,fqazi |
stevendanna
left a comment
There was a problem hiding this comment.
Thanks for working on this. Overall this looks reasonable to me.
|
bors r- |
|
Canceled. |
This is a preliminary commit toward improving the DistSQL to avoid planning on an instance that is draining. The idea here is to set this column to true as soon as the draining request is received by the server and use that through instance reader in the planner at gateway. Adding this column requires writing a migration for the upgrade. And since this new column is nullable we don't need to backfill and we are good with just swapping the descriptors. Informs: cockroachdb#100578 Release note: None Development notes: To update `pkg/sql/catalog/bootstrap/testdata/testdata` I had to manually update the hash values for `system`/`tenant` in this file and pass the `--rewrite` option to do the rest: ``` dev test //pkg/sql/catalog/bootstrap:bootstrap_test \ --filter=TestInitialValuesToString/testdata --rewrite ``` To update tests in `pkg/sql/catalog/systemschema_test/testdata/bootstrap_{system,tenant}` ``` dev test \ //pkg/sql/catalog/systemschema_test:systemschema_test_test --rewrite ```
bebbfa5 to
95e3751
Compare
This change would prevent distsql planner to include a draining node in a plan by leveraging the system.sql_instances table's `is_draining` field. Somewhat related work to avoid planning on an unavailable node: cockroachdb#124986 Fixes: cockroachdb#100578 Epic: CRDB-31575 Release note: None
95e3751 to
9ef988d
Compare
|
TFTR. bors r=jeffswenson,DrewKimball,fqazi,stevendanna |
Commit-wise description
Commit 1:
sql: add is_draining to system.sql_instances table.This is a preliminary commit toward improving the DistSQL to avoid planning
on an instance that is draining. The idea here is to set this column to
true as soon as the draining request is received by the server and use that
through instance reader in the planner at gateway.
Adding this column requires writing a migration for the upgrade. And since
this new column is nullable we don't need to backfill and we are good with
just swapping the descriptors.
Informs: #100578
Release note: None
Development notes:
To update
pkg/sql/catalog/bootstrap/testdata/testdataI had to manuallyupdate the hash values for
system/tenantin this file and pass the--rewriteoption to do the rest:To update tests in
pkg/sql/catalog/systemschema_test/testdata/bootstrap_{system,tenant}Commit 2:
distsql: avoid planning on a draining node.This change would prevent distsql planner to include a draining node in a
plan by leveraging the system.sql_instances table's
is_drainingfield.Somewhat related work to avoid planning on an unavailable node: #124986
Fixes: #100578
Epic: CRDB-31575
Release note: None