distsql: Fix flaky TestDistSQLUnavailableHosts.#126976
distsql: Fix flaky TestDistSQLUnavailableHosts.#126976craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
|
dafb126 to
3998406
Compare
|
I was thinking of stressing using the way mentioned here but found out some I didn't have a concrete answer to why a particular replica should be a leaseholder so I explicitly added r.Exec(t, fmt.Sprintf("ALTER TABLE t RELOCATE LEASE VALUES (%d, %d)", i+1, n*i/5))but now it's failing with the same error for another subtest, I'm out of ideas what's causing this? Or is this expected and is the reason why Yahor suggested to mark these as |
|
Adding @yuzefovich as a reviewer since it looks like you reviewed the most recent change to this test. If you don't have time let me know and I can take a look. Thanks! |
yuzefovich
left a comment
There was a problem hiding this comment.
race builds impose performance penalty, stressrace builds even larger one. This test has 5 nodes (which obviously requires more resources than a single node), and when we have tenants (either shared or external mode), then performance overhead of the test is even larger. Generally speaking, commands that modify the KV range allocation decisions (like relocating leases, etc) are kinda best-effort, and they generally work, but they could fail / take long time under overload (which is the case in this test). In short, yes, all these observations are the reason why I suggested skip.UnderDuress since this test is destined to be flaky under overload.
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @shubhamdhama)
pkg/sql/distsql_physical_planner.go line 1578 at r1 (raw file):
status := NodeOK err := dsp.nodeHealth.connHealthInstance(instanceID, instanceRPCAddr) if errors.Is(err, rpc.ErrNotHeartbeated) {
nit: perhaps move this check inside the err != nil block below.
3998406 to
8ee7e42
Compare
This PR fixes the two flaky tests as reported in the linked issue: *TestDistSQLUnavailableHosts/unhealthy-nodes-in-single-tenant* The issue was that the first plan of `plans` slice was incorrect and included the 5th node. Although this test shouldn't be affected by the changes in cockroachdb#124986, I made slight improvements during refactoring. I added a plan check after stopping the 5th node and before running the query to avoid updating the lease and to understand how the query is planned. This issue wasn't detected even in local stress testing. My hypothesis is that leases are usually updated quickly enough that the 5th node's ranges are moved to the 2nd or 3rd node most of the time. But on teamcity it hits the case where the leases haven't moved and they are planned on the gateway. *TestDistSQLUnavailableHosts/unhealthy-nodes-in-shared-mode* The test runs fast enough that if a heartbeat hasn't occurred yet, `ConnHealthTryDialInstance` returns `ErrNotHeartbeated`. The fix here is in most cases, this error can be ignored, and the node can be considered healthy. If the node is indeed unhealthy, we would anyway retry-as-local. Fixes: cockroachdb#126908 Epic: CRDB-40179 Release note: None
8ee7e42 to
3277360
Compare
shubhamdhama
left a comment
There was a problem hiding this comment.
Got it, thanks for the explanation.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/sql/distsql_physical_planner.go line 1578 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
nit: perhaps move this check inside the
err != nilblock below.
Done
pkg/sql/distsql_physical_planner_test.go line 467 at r1 (raw file):
i+1, (i+1)%4+1, (i+2)%4+1, n*i/5, )) r.Exec(t, fmt.Sprintf("ALTER TABLE t RELOCATE LEASE VALUES (%d, %d)", i+1, n*i/5))
I'm removing this line then because I'm sure the execute query one line above does this anyway, and it would just increase the test time.
yuzefovich
left a comment
There was a problem hiding this comment.
Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained
|
Thanks for the review! Stress and race tests on my local and gceworkers have passed (many times), so I'm merging this. I will take care of any other failures in follow-up PRs. bors r=yuzefovich |
|
bors r=yuzefovich |
This PR fixes the two flaky tests as reported in the linked issue:
TestDistSQLUnavailableHosts/unhealthy-nodes-in-single-tenant
The issue was that the first plan of
plansslice was incorrect and included the 5th node.Some more context: Although this test shouldn't be affected by the core changes in #124986, I made slight improvements during refactoring. I added a plan check after stopping the 5th node and before running the query to avoid updating the lease and to understand how the query is planned.
This issue wasn't detected even in local stress testing. My hypothesis is that leases are usually updated quickly enough that the 5th node's ranges are moved to the 2nd or 3rd node most of the time. But on teamcity it hits the case where the leases haven't moved and range is planned on the gateway.
TestDistSQLUnavailableHosts/unhealthy-nodes-in-shared-mode
The test runs fast enough that if a heartbeat hasn't occurred yet,
ConnHealthTryDialInstancereturnsErrNotHeartbeated. The fix here is in most cases, this error can be ignored, and the node can be considered healthy. If the node is indeed unhealthy, we would anyway retry-as-local.Fixes: #126908
Epic: CRDB-40179
Release note: None