distsql: improve unhealthy node detection in SQL instances-based planning #124986
Conversation
Hey @shubhamdhama, is this ready for a look? I see that there are a couple of check boxes still. In any case, I'll take a look tomorrow.

Hi @yuzefovich, yes, I'm still working on the above pending tasks and some unit test failures. However, I would appreciate your review anyway.
yuzefovich
left a comment
Nice! This is definitely on the right track. I have some nits, but my main question is when should we be performing the filtering of unhealthy instances and whether we should cache that information. Currently, in this PR we perform it "on demand", whenever a particular instance might be considered for physical planning, and in some cases we will perform it many times for the same instance. In the single-tenant setup we maintain a map from the NodeID to health status (stored in PlanningCtx.nodeStatuses) where we perform the health check at most once for a particular node and then use the result of that check going forward, throughout the whole physical planning for this query.
I think we might want to replicate that - i.e. rather than filtering out instances from the list, we'd maintain nodeStatuses for instances and use that as the cache for the health of instances. If we haven't tried dialing an instance, then nodeStatuses[instanceID] will be empty; if we have dialed, then we'll get the health status from the map. Only the gateway instance is considered healthy right away (this is already done in NewPlanningCtxWithOracle). If we go with this approach, then some of my inlined comments on the PR will become irrelevant.
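The suggested at-most-once dialing could be sketched roughly as below. This is a minimal illustration, not CockroachDB's actual code: `NodeStatus`, `healthChecker`, and the `dial` callback are hypothetical stand-ins for `PlanningCtx.nodeStatuses` and `ConnHealthTryDialInstance`.

```go
package main

import "fmt"

// NodeStatus mirrors the idea of a per-instance health cache entry
// (names are illustrative, not CockroachDB's actual types).
type NodeStatus int

const (
	NodeStatusUnknown NodeStatus = iota
	NodeStatusHealthy
	NodeStatusUnhealthy
)

// healthChecker memoizes dial results so each instance is dialed at
// most once per physical-planning pass.
type healthChecker struct {
	statuses map[int32]NodeStatus         // instanceID -> cached status
	dial     func(instanceID int32) error // stand-in for ConnHealthTryDialInstance
	dials    int                          // dial attempts, for illustration
}

func (h *healthChecker) isHealthy(instanceID int32) bool {
	if s, ok := h.statuses[instanceID]; ok {
		return s == NodeStatusHealthy // cache hit: no second dial
	}
	h.dials++
	if err := h.dial(instanceID); err != nil {
		h.statuses[instanceID] = NodeStatusUnhealthy
		return false
	}
	h.statuses[instanceID] = NodeStatusHealthy
	return true
}

func main() {
	hc := &healthChecker{
		// The gateway (instance 1) is pre-marked healthy, as in
		// NewPlanningCtxWithOracle.
		statuses: map[int32]NodeStatus{1: NodeStatusHealthy},
		dial: func(id int32) error {
			if id == 3 {
				return fmt.Errorf("connection refused")
			}
			return nil
		},
	}
	for _, id := range []int32{1, 2, 3, 2, 3} {
		fmt.Println(id, hc.isHealthy(id))
	}
	fmt.Println("dials:", hc.dials) // only 2: instances 2 and 3, once each
}
```

Repeated lookups for the same instance hit the map instead of re-dialing, which is the point of caching the status for the duration of planning a single query.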
Additionally, there is a second aspect of health that we check in the system tenant - isAvailable is also called. In the single-tenant setup it performs the "node liveness" check (which in the case of DistSQL means that the node is "alive and connected"). We currently hard-code return true for the multi-tenant setup only because at the time of writing that code (and actually still at the moment) DistSQL is disabled on serverless (i.e. in the external process mode). I feel like the ConnHealthTryDialInstance approach actually takes care of this "isAvailable" too, so perhaps we just want to update some comments in server_sql.go where the isAvailable function is created.
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @shubhamdhama)
pkg/rpc/nodedialer/nodedialer.go line 239 at r1 (raw file):
// heartbeat. If no healthy connection is found, it will attempt to dial the
// instance.
func (n *Dialer) ConnHealthTryDialInstance(instanceInfo sqlinstance.InstanceInfo) error {
nit: rather than plumb the whole InstanceInfo struct (which makes rpc/nodedialer depend on sqlinstance package, which seems somewhat wrong), we could pass addr and id arguments explicitly.
pkg/sql/distsql_physical_planner.go line 1567 at r1 (raw file):
ctx context.Context,
) func(nodeID roachpb.NodeID) (base.SQLInstanceID, SpanPartitionReason) {
	allHealthy, err := dsp.sqlAddressResolver.GetAllInstances(ctx)
nit: let's rename allHealthy to allInstances.
pkg/sql/distsql_physical_planner.go line 1588 at r1 (raw file):
nodeID, rpc.DefaultClass); err == nil {
	return sqlInstance, SpanPartitionReason_TARGET_HEALTHY
}
If we fail to dial here, should we remove sqlInstance from healthyNodes? This way we can avoid repeated dials for the same instance. The downside is that if the instance does come up in the middle of the physical planning, we won't use it (which seems acceptable to me).
pkg/sql/distsql_physical_planner.go line 1675 at r1 (raw file):
// cache. The filtering out of these nodes is deferred to the resolver using
// the filterUnhealthyNodes function.
instances, err := dsp.sqlAddressResolver.GetAllInstances(ctx)
There are a couple other places (GetAllInstancesByLocality and setupAllNodesPlanningTenant) where we call GetAllInstances and probably should add filtering too.
pkg/sql/distsql_physical_planner.go line 1732 at r1 (raw file):
// sql_instances cache. Therefore, we filter out these unhealthy nodes by
// dialing them.
instances = dsp.filterUnhealthyInstances(instances)
resolver function can be called many times during physical planning of a single statement, so filtering (which includes possibly dialing the instances) also will be performed many times, and I don't think that's desirable. I think we should perform filtering only once unconditionally, right above "healthy SQL instances ..." log message. (Perhaps we could add another log message to highlight which instances were deemed unhealthy.)
pkg/sql/distsql_physical_planner_test.go line 486 at r1 (raw file):
}

func TestDistSQLDeadHostsInSharedMode(t *testing.T) {
nit: there is a lot of duplication with TestDistSQLDeadHosts. I think it'd be nice to extract the shared code into a helper function that performs the necessary set up.
pkg/sql/distsql_physical_planner_test.go line 490 at r1 (raw file):
defer log.Scope(t).Close(t)
skip.UnderShort(t, "takes 20s")
nit: we might want to do skip.UnderDuress since this test seems heavy and could be flaky under special configs (race or deadlock).
pkg/sql/distsql_physical_planner_test.go line 541 at r1 (raw file):
r.Exec(t, fmt.Sprintf("INSERT INTO t SELECT i, i*i FROM generate_series(1, %d) AS g(i)", n))
r.Exec(t, "SET DISTSQL = ON")
nit: do we need distsql = on? I'd expect that the aggregation query would be distributed even with the default auto.
pkg/sql/distsql_physical_planner_test.go line 563 at r1 (raw file):
// Verify that the plan includes all 5 nodes.
r.CheckQueryResults(t,
	"SELECT info FROM [EXPLAIN (DISTSQL, SHAPE) SELECT sum(xsquared) FROM t] WHERE info LIKE 'Diagram%'",
nit: I think SHAPE doesn't do anything here (i.e. it's only supported for EXPLAIN (DDL, SHAPE)).
Force-pushed from d6f0e57 to 39aa31b.
Previously, yuzefovich (Yahor Yuzefovich) wrote…
I think this is resolved with

Previously, yuzefovich (Yahor Yuzefovich) wrote…
Done.

Previously, yuzefovich (Yahor Yuzefovich) wrote…
Partially done. I have kept the filtering within the resolver, as we do with the resolver for the mixed-process case. As discussed, it's probably a good idea to let it detect unhealthy nodes when the resolver is called, so that it reacts to nodes that became unhealthy between the creation of the resolver and its invocation.

Previously, yuzefovich (Yahor Yuzefovich) wrote…
I have gathered three related tests for three modes (single tenant, shared, and external) into one

Previously, yuzefovich (Yahor Yuzefovich) wrote…
Right, we don't.

Previously, yuzefovich (Yahor Yuzefovich) wrote…
Actually, it does do something. I'm not sure why, but it gives a consistent URL for plans that otherwise look completely the same; I had two plans that looked identical but had different URLs. (I lost the URLs, but I can reproduce and share them if you are interested.)

Btw, I hope it's fine to use SharedTestTenantAlwaysEnabled directly as the value for DefaultTestTenant. Asking because of the note at pkg/base/test_server_args.go, lines 261 to 264 (commit 7b8f732).
shubhamdhama
left a comment
Thanks for the detailed review!
Now some other tests are failing (in a non-sql_test target); I'll figure out why. In the meantime, I could use another round of review to check that this is still headed in the right direction. And sorry for the delay in updating this PR; I was thinking of fixing the draining case in this one, but that may take a bit longer, so we can get this one going. I'll try to follow up quickly for the next review iterations.
On a side note, if I stress-run the shared tenant mode test, it unexpectedly fails with fewer than 5 nodes in the plan (even before killing the 5th node). Again, I have yet to dig deeper into this one too.
Wraps: (2) query 'SELECT info FROM [EXPLAIN (DISTSQL, SHAPE) SELECT sum(xsquared) FROM t] WHERE info LIKE 'Diagram%'':
| expected:
| Diagram: https://cockroachdb.github.io/distsqlplan/decode.html#eJyklV9vmzAUxd_3KdCVpraSq2AghPDUqmFqtPTPQqZNqqLKC7cMleDUNmqrKt99IjQNYQVB4CERxj7n8Lv25Q3kUwwu-N7Eu5hpUfLAtW_Tmyvtzvt9OzkfX2vHo7E_839MiOZfnt96J9r7VJkuj1_kU8oEBif5GjXXfl16Uy-XmYy_e9rRKGKhYMuvR0Ag4QFesyVKcO-AAgEDCJhAwAICfZgTWAm-QCm5yKa8bRaMgxdwdQJRskpVNjwnsOACwX0DFakYwYUZ-xPjFFmAoqcDgQAVi-KNjTpT96tHfAUCFzxOl4l0tW1sIOCvWDbSo7rdo6f5X8_QYb4mwFO1s5SKhQguLWQcj8DV16R5zPMwFBgyxUWvv5_S_3l1fEZPKm2Nkm2_0nbnliZcBJi_5s5qvq4PRvV2ycxSMrpPhDYvHD28cD1D39bOal472iZpAZHdrXb2nq3RHJDRAZD1AchuDshok7QAaNAN0GDP1mwOyOwAyP4A5DQHZLZJWgDkdAPk7NlazQFZHQA574CMxnSsNjELdIbd6AzbtOQpyhVPJJZ65OdOesnplGbNFIMQ884reSoWeCv4YjM3v73ZCG0GApQqf0rzm3GyfSSVQLb8-KIUlWitkrGnRItK_bKSUZ-pTSizVsqqVqJlJevQ17PLSv1aJbs6k1FWsg_NNCgrDWqVnOpMZlnJOTSTU1Ya1m8DvTqU9d_erN_mNamG2dF5iPnzfRSAC_r7dfrJz_aCbAELZXZ-_b_8eSM7e11lp--BxRIJXLFHHKFCsYySSKpoAa4SKa7XX_4FAAD__5H-gCw=
| got:
| Diagram: https://cockroachdb.github.io/distsqlplan/decode.html#eJyklF1ro04Uxu__n2I48KctTImjxmS9amlcGjZ92ZhlF0oos_HUSo1jZ0baUvLdF7VpjDSiiRcJzsvveeY543kH9RyDC7438S5mJEoeBPk-vbkid96f28n5-Jocj8b-zP85ocS_PL_1TsjHUpUtj1_Vc8YlBiflHj0nvy-9qVdiJuMfHjkaRTyUfPn_EVBIRIDXfIkK3DtgQMEEChZQsGFOIZVigUoJmU-_F4vHwSu4BoUoSTOdD88pLIREcN9BRzpGcGHG_8Y4RR6g7BlAIUDNo7iQ0Gf6Pn3CN6BwIeJsmSiXrC0DBT_l-UiPGU6PnZZ_PdMgPAkII0I_ooT5ioLI9EZeaR4iuKzidzwC11jR9pbPw1BiyLWQPXvbsf_r6viMneyUNWuy9k7ZjVqWCBlgeeSN1HzVbGzYzZhVMzbcMsbal5DtX8KeaayraButK8e6OK0E1D-scv0tWbN9QOYBAdmfATntAzK7OK0E5BwWkLMla7UPyDogIOczoGH7gKwuTisBDQ4LaNCl5UxRpSJRWOsBXysZNaVTljcLDEIsO4sSmVzgrRSLYm35elOAioEAlS5nWfkyTtZTSkvky8-OWSWxRpK5RWJVkl0nmY2kbx08WY0kezeJ1Un2vqfr10n9RpKz25NZJzn7enLqpEEjabjbk1UnDff1NMjv6EMsXu6jAFwwPp7TL37WD-QbeKjyD8V_FC8FdvaW5tf8gccKKVzxJxyhRrmMkkjpaAGulhmuVv_9CwAA__-0ofXg
PS: Oops, I guess I wasn't supposed to hit "send" button for each comment on Reviewable :)
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
yuzefovich
left a comment
Nice! I think this is almost ready to go, I have a few more comments and nits.
Reviewed 7 of 7 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @shubhamdhama)
pkg/sql/distsql_physical_planner.go line 1732 at r1 (raw file):
Previously, shubhamdhama (Shubham Dhama) wrote…
Partially done. I have kept the filtering within the resolver, as we do with the resolver for the mixed-process case. As discussed, it's probably a good idea to let it detect unhealthy nodes when the resolver is called, so that it reacts to nodes that became unhealthy between the creation of the resolver and its invocation.
SGTM.
pkg/sql/distsql_physical_planner.go line 1606 at r2 (raw file):
if log.ExpensiveLogEnabled(ctx, 2) {
	log.VEventf(ctx, 2, "healthy SQL instances available for distributed planning: %v", allInstances)
nit: let's reword the comment a bit: s/healthy/all/.
pkg/sql/distsql_physical_planner.go line 1676 at r2 (raw file):
instances []sqlinstance.InstanceInfo, nodeStatusesCache map[base.SQLInstanceID]NodeStatus,
) (healthyInstances []sqlinstance.InstanceInfo, unhealthyInstances []sqlinstance.InstanceInfo) {
	healthyInstances = make([]sqlinstance.InstanceInfo, 0, len(instances))
Let's avoid this allocation by modifying the passed-in argument instances in-place to become the healthy instances.
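The in-place variant being suggested might look like the following sketch. `InstanceInfo` and `filterHealthy` here are simplified stand-ins, not the actual CockroachDB signatures; the point is reusing the input slice's backing array instead of allocating a new one.

```go
package main

import "fmt"

// InstanceInfo is a simplified stand-in for sqlinstance.InstanceInfo.
type InstanceInfo struct {
	InstanceID int32
}

// filterHealthy partitions instances in place: healthy entries are
// compacted into the front of the passed-in slice (no new allocation
// for the healthy result), unhealthy ones are returned separately.
func filterHealthy(
	instances []InstanceInfo, healthy func(InstanceInfo) bool,
) (healthyOut, unhealthyOut []InstanceInfo) {
	n := 0
	for _, inst := range instances {
		if healthy(inst) {
			instances[n] = inst // reuse the input slice's backing array
			n++
		} else {
			unhealthyOut = append(unhealthyOut, inst)
		}
	}
	return instances[:n], unhealthyOut
}

func main() {
	instances := []InstanceInfo{{1}, {2}, {3}, {4}}
	h, u := filterHealthy(instances, func(i InstanceInfo) bool {
		return i.InstanceID != 3 // pretend instance 3 failed its dial
	})
	fmt.Println(len(h), len(u)) // 3 1
}
```

The caller must treat the original slice as consumed after the call, since its prefix has been overwritten with the healthy subset.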
pkg/sql/distsql_physical_planner.go line 1776 at r2 (raw file):
if len(instances) == 0 {
	log.Eventf(ctx, "no healthy sql instances available for planning, using the gateway")
	return dsp.gatewaySQLInstanceID, SpanPartitionReason_GATEWAY_ON_ERROR
nit: I think this (and below) should be SpanPartitionReason_GATEWAY_TARGET_UNHEALTHY.
pkg/sql/distsql_plan_bulk.go line 110 at r2 (raw file):
return nil, nil, err
}
pods, _ = dsp.filterUnhealthyInstances(pods, nil /* nodeStatusesCache */)
nit: even though it doesn't matter right now, it seems cleaner to pass planCtx.nodeStatuses as nodeStatusesCache argument.
pkg/sql/distsql_physical_planner_test.go line 486 at r1 (raw file):
Previously, shubhamdhama (Shubham Dhama) wrote…
I have gathered three related tests for three modes (single tenant, shared and external) into one
TestDistSQLUnavailableHosts. I have named it this way as I'm thinking of moving tests related to draining into this one too (the one that's currently present for single tenant, plus other ones I'll write in a follow-up PR).
SGTM
pkg/sql/distsql_physical_planner_test.go line 563 at r1 (raw file):
Previously, shubhamdhama (Shubham Dhama) wrote…
Actually it does something, I'm not sure why, but it gives consistent URL for plans that looks completely same otherwise I have two plans that looks the same but different URL (I lost the URLs but I can reproduce and share them if you are interested).
Yeah, good point. In explain.MakeFlags(), if SHAPE option is used, we set a few flags to hide values (to make some things deterministic). It's not used widely, so I was surprised about SHAPE being used in a very old test, but the test was recently adjusted a bit.
pkg/sql/distsql_physical_planner_test.go line 523 at r2 (raw file):
Previously, shubhamdhama (Shubham Dhama) wrote…
Btw, I hope it's fine to use this directly as the value for DefaultTestTenant. Asking because of this note: pkg/base/test_server_args.go, lines 261 to 264 (commit 7b8f732).
Yeah, the idea behind SharedTestTenantAlwaysEnabled and ExternalTestTenantAlwaysEnabled was that we wouldn't use them in the tests like this. The proposed approach is to use base.TestControlsTenantsExplicitly and then manually start the correct tenant in the sub-tests. It is a bit more tedious, but take a look at TestClusterInflightTraces in util/tracing/collector/collector_test.go for an example.
(The thinking behind this is that tenants being started automatically by the testserver are random, so if you need a tenant for a sub-test, you shouldn't override the randomized behavior and, instead, you should opt out of it completely and manually control the tenant creation.)
No problem! |
Force-pushed from 39aa31b to b8f6572.
shubhamdhama
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @msbutler and @yuzefovich)
pkg/sql/distsql_physical_planner.go line 1776 at r2 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: I think this (and below) should be SpanPartitionReason_GATEWAY_TARGET_UNHEALTHY.
Good catch, it seems there is a better option: SpanPartitionReason_GATEWAY_NO_HEALTHY_INSTANCES
126976: distsql: Fix flaky TestDistSQLUnavailableHosts. r=yuzefovich a=shubhamdhama This PR fixes the two flaky tests reported in the linked issue: **TestDistSQLUnavailableHosts/unhealthy-nodes-in-single-tenant** The issue was that the first plan of the `plans` slice was incorrect and included the 5th node. Some more context: although this test shouldn't be affected by the core changes in #124986, I made slight improvements during refactoring. I added a plan check after stopping the 5th node and before running the query, to avoid updating the lease and to understand how the query is planned. This issue wasn't detected even in local stress testing. My hypothesis is that leases are usually updated quickly enough that the 5th node's ranges are moved to the 2nd or 3rd node most of the time, but on TeamCity it hits the case where the leases haven't moved and the range is planned on the gateway. **TestDistSQLUnavailableHosts/unhealthy-nodes-in-shared-mode** The test runs fast enough that if a heartbeat hasn't occurred yet, `ConnHealthTryDialInstance` returns `ErrNotHeartbeated`. The fix: in most cases, this error can be ignored and the node can be considered healthy. If the node is indeed unhealthy, we would retry as local anyway. Fixes: #126908 Epic: CRDB-40179 Release note: None Co-authored-by: Shubham Dhama <shubham.dhama@cockroachlabs.com>
I just noticed this PR merged, but I'm not sure if this is the right approach. The goal was to centralize "all" liveness checks into the NodeVitality calls. With the upcoming work on leader leases, we are going to make an even more robust version of this. Specifically I don't mind the addition of the
The problem right now is that NodeVitality may not "function correctly" on SQL pods; although I don't think that is easily solvable, it just can't look at the liveness records from gossip. Ideally we shouldn't need the
I know this is already merged, so I'm not sure if it is worth changing now, but it might be a lot cleaner to not have to split this out the way it is done.
@andrewbaptist IIUC, what you're saying is that this code (pkg/server/server_sql.go, lines 919 to 931, commit 7665495) ideally should become the only necessary function for the DistSQL planner to decide whether remote nodes / instances are healthy and available for DistSQL work, right? It sounds like this
If my understanding is correct, then I don't see a difficulty with this PR being merged - it improved the situation in the multi-tenant environment, especially so in the external process mode (i.e. serverless). Given that the KV layer doesn't really control when SQL pods go up or down, I imagine that we'll always need some mechanism at the SQL layer to ensure that SQL pods are healthy, and after this change that mechanism is "try dialing the pod". It sounds like shortly we could simplify the DistSQL physical planner health check in the system tenant, but some folks think that we should fully deprecate the system tenant code path altogether and only keep the "instance-based DistSQL planning", so I'm curious about your thoughts on how we can improve the situation for SQL pods where we don't have access to gossip-based info.
I was a little curious about whether this was possible, so I did a small refactoring of this code to make the difference between checking SQL vs KV node health clearer. I moved things around but should not have changed the logic at all in #127293. I think after this change it should be possible to move the code from
I agree that the
Longer term, I'd like to think about what
Thanks for the patch! I'll take a look tomorrow.
Indeed, this is a known limitation (#100578), and Shubham will be tackling it shortly.
This change prevents the distsql planner from including a draining node in a plan by leveraging the system.sql_instances table's `is_draining` field added in cockroachdb#126765. Somewhat related work to avoid planning on an unavailable node: cockroachdb#124986 Fixes: cockroachdb#100578 Epic: CRDB-31575 Release note: None
128283: distsql: avoid planning on a draining node. r=jeffswenson,DrewKimball,fqazi,stevendanna a=shubhamdhama Commit-wise description --- **Commit 1: [`sql: add is_draining to system.sql_instances table.`](https://github.com/cockroachdb/cockroach/pull/128283/commits/4b0de41bc39)** This is a preliminary commit toward improving the DistSQL to avoid planning on an instance that is draining. The idea here is to set this column to true as soon as the draining request is received by the server and use that through instance reader in the planner at gateway. Adding this column requires writing a migration for the upgrade. And since this new column is nullable we don't need to backfill and we are good with just swapping the descriptors. Informs: #100578 Release note: None Development notes: To update `pkg/sql/catalog/bootstrap/testdata/testdata` I had to manually update the hash values for `system`/`tenant` in this file and pass the `--rewrite` option to do the rest: ``` dev test //pkg/sql/catalog/bootstrap:bootstrap_test \ --filter=TestInitialValuesToString/testdata --rewrite ``` To update tests in `pkg/sql/catalog/systemschema_test/testdata/bootstrap_{system,tenant}` ``` dev test \ //pkg/sql/catalog/systemschema_test:systemschema_test_test --rewrite ``` --- **Commit 2: [`distsql: avoid planning on a draining node.`](https://github.com/cockroachdb/cockroach/pull/128283/commits/6337d932854)** This change would prevent distsql planner to include a draining node in a plan by leveraging the system.sql_instances table's `is_draining` field. Somewhat related work to avoid planning on an unavailable node: #124986 Fixes: #100578 Epic: CRDB-31575 Release note: None Co-authored-by: Shubham Dhama <shubham.dhama@cockroachlabs.com>
The distsql planner currently uses different strategies to avoid planning
on unhealthy nodes in single and multi-tenant deployments. For
single-tenant deployments, we check for a healthy gRPC connection and use
information from the gossip protocol, which allows for quick detection. For
multi-tenant deployments, we rely on the sql_instances system table cache
to check if an instance is healthy. However, this cache may not update for
over half a minute (40 seconds in my tests) after a node goes down, leading
to planning on unhealthy instances; such queries are later retried as local at the gateway.
To prevent planning on dead SQL instances, the resolver now ensures it does
not consider such instances by first establishing a healthy gRPC
connection. This means that in shared-process mode, if an instance is found
to be unhealthy, the gateway will be used. In external mode, such a node
would be excluded from the candidates for locality-based matching or
round-robin selection.
Fixes: #120774
Epic: CRDB-39091
Release note: None