distsql: Fix flaky TestDistSQLUnavailableHosts. by shubhamdhama · Pull Request #126976 · cockroachdb/cockroach

shubhamdhama · 2024-07-11T06:07:08Z

This PR fixes the two flaky tests as reported in the linked issue:

TestDistSQLUnavailableHosts/unhealthy-nodes-in-single-tenant

The issue was that the first plan of plans slice was incorrect and included the 5th node.

Some more context: Although this test shouldn't be affected by the core changes in #124986, I made slight improvements during refactoring. I added a plan check after stopping the 5th node and before running the query to avoid updating the lease and to understand how the query is planned.

This issue wasn't detected even in local stress testing. My hypothesis is that leases are usually updated quickly enough that the 5th node's ranges are moved to the 2nd or 3rd node most of the time. But on teamcity it hits the case where the leases haven't moved and range is planned on the gateway.

TestDistSQLUnavailableHosts/unhealthy-nodes-in-shared-mode

The test runs fast enough that if a heartbeat hasn't occurred yet, ConnHealthTryDialInstance returns ErrNotHeartbeated. The fix here is in most cases, this error can be ignored, and the node can be considered healthy. If the node is indeed unhealthy, we would anyway retry-as-local.

Fixes: #126908

Epic: CRDB-40179

Release note: None

cockroach-teamcity · 2024-07-11T06:07:18Z

This change is

shubhamdhama · 2024-07-11T06:08:39Z

Revert skip.UnderShort back to skip.UnderDuress. It's here right now to stress on teamcity.

shubhamdhama · 2024-07-11T12:56:15Z

I was thinking of stressing using the way mentioned here but found out some Bazel Extended CI was failing with the following error

Link to the test

Failed
=== RUN   TestDistSQLUnavailableHosts/unhealthy-nodes-in-shared-mode
    distsql_physical_planner_test.go:563: query 'SELECT IF(substring(start_key for 1)='…',start_key,NULL),
                      IF(substring(end_key for 1)='…',end_key,NULL),
                      lease_holder, replicas FROM [SHOW RANGES FROM TABLE t WITH DETAILS]': expected:
        NULL, …/Table/106/1/0, 1, {1}
        …/Table/106/1/0, …/Table/106/1/20, 1, {1,2,3}
        …/Table/106/1/20, …/Table/106/1/40, 2, {2,3,4}
        …/Table/106/1/40, …/Table/106/1/60, 3, {1,3,4}
        …/Table/106/1/60, …/Table/106/1/80, 4, {1,2,4}
        …/Table/106/1/80, NULL, 5, {2,3,5}
        got:
        NULL, …/Table/106/1/0, 1, {1}
        …/Table/106/1/0, …/Table/106/1/20, 3, {1,2,3}
        …/Table/106/1/20, …/Table/106/1/40, 2, {2,3,4}
        …/Table/106/1/40, …/Table/106/1/60, 3, {1,3,4}
        …/Table/106/1/60, …/Table/106/1/80, 4, {1,2,4}
        …/Table/106/1/80, NULL, 5, {2,3,5}
    --- FAIL: TestDistSQLUnavailableHosts/unhealthy-nodes-in-shared-mode (98.59s)

I didn't have a concrete answer to why a particular replica should be a leaseholder so I explicitly added

			r.Exec(t, fmt.Sprintf("ALTER TABLE t RELOCATE LEASE VALUES (%d, %d)", i+1, n*i/5))

but now it's failing with the same error for another subtest,

Failed
=== RUN   TestDistSQLUnavailableHosts/unhealthy-nodes-in-external-mode
    distsql_physical_planner_test.go:614: query 'SELECT IF(substring(start_key for 1)='…',start_key,NULL),
                      IF(substring(end_key for 1)='…',end_key,NULL),
                      lease_holder, replicas FROM [SHOW RANGES FROM TABLE t WITH DETAILS]': expected:
        NULL, …/Table/106/1/0, 1, {1}
        …/Table/106/1/0, …/Table/106/1/20, 1, {1,2,3}
        …/Table/106/1/20, …/Table/106/1/40, 2, {2,3,4}
        …/Table/106/1/40, …/Table/106/1/60, 3, {1,3,4}
        …/Table/106/1/60, …/Table/106/1/80, 4, {1,2,4}
        …/Table/106/1/80, NULL, 5, {2,3,5}
        got:
        NULL, …/Table/106/1/0, 1, {1}
        …/Table/106/1/0, …/Table/106/1/20, 3, {1,2,3}
        …/Table/106/1/20, …/Table/106/1/40, 2, {2,3,4}
        …/Table/106/1/40, …/Table/106/1/60, 3, {1,3,4}
        …/Table/106/1/60, …/Table/106/1/80, 4, {1,2,4}
        …/Table/106/1/80, NULL, 5, {2,3,5}
    --- FAIL: TestDistSQLUnavailableHosts/unhealthy-nodes-in-external-mode (91.26s)

I'm out of ideas what's causing this? Or is this expected and is the reason why Yahor suggested to mark these as skip.UnderDuress?

rytaft · 2024-07-11T14:25:29Z

Adding @yuzefovich as a reviewer since it looks like you reviewed the most recent change to this test. If you don't have time let me know and I can take a look. Thanks!

yuzefovich

race builds impose performance penalty, stressrace builds even larger one. This test has 5 nodes (which obviously requires more resources than a single node), and when we have tenants (either shared or external mode), then performance overhead of the test is even larger. Generally speaking, commands that modify the KV range allocation decisions (like relocating leases, etc) are kinda best-effort, and they generally work, but they could fail / take long time under overload (which is the case in this test). In short, yes, all these observations are the reason why I suggested skip.UnderDuress since this test is destined to be flaky under overload.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @shubhamdhama)

pkg/sql/distsql_physical_planner.go line 1578 at r1 (raw file):

	status := NodeOK
	err := dsp.nodeHealth.connHealthInstance(instanceID, instanceRPCAddr)
	if errors.Is(err, rpc.ErrNotHeartbeated) {

nit: perhaps move this check inside the err != nil block below.

This PR fixes the two flaky tests as reported in the linked issue: *TestDistSQLUnavailableHosts/unhealthy-nodes-in-single-tenant* The issue was that the first plan of `plans` slice was incorrect and included the 5th node. Although this test shouldn't be affected by the changes in cockroachdb#124986, I made slight improvements during refactoring. I added a plan check after stopping the 5th node and before running the query to avoid updating the lease and to understand how the query is planned. This issue wasn't detected even in local stress testing. My hypothesis is that leases are usually updated quickly enough that the 5th node's ranges are moved to the 2nd or 3rd node most of the time. But on teamcity it hits the case where the leases haven't moved and they are planned on the gateway. *TestDistSQLUnavailableHosts/unhealthy-nodes-in-shared-mode* The test runs fast enough that if a heartbeat hasn't occurred yet, `ConnHealthTryDialInstance` returns `ErrNotHeartbeated`. The fix here is in most cases, this error can be ignored, and the node can be considered healthy. If the node is indeed unhealthy, we would anyway retry-as-local. Fixes: cockroachdb#126908 Epic: CRDB-40179 Release note: None

shubhamdhama

Got it, thanks for the explanation.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)

pkg/sql/distsql_physical_planner.go line 1578 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: perhaps move this check inside the err != nil block below.

Done

pkg/sql/distsql_physical_planner_test.go line 467 at r1 (raw file):

				i+1, (i+1)%4+1, (i+2)%4+1, n*i/5,
			))
			r.Exec(t, fmt.Sprintf("ALTER TABLE t RELOCATE LEASE VALUES (%d, %d)", i+1, n*i/5))

I'm removing this line then because I'm sure the execute query one line above does this anyway, and it would just increase the test time.

yuzefovich

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained

shubhamdhama · 2024-07-11T19:30:58Z

Thanks for the review!

Stress and race tests on my local and gceworkers have passed (many times), so I'm merging this. I will take care of any other failures in follow-up PRs.

bors r=yuzefovich

yuzefovich · 2024-07-12T02:23:56Z

bors r=yuzefovich

craig · 2024-07-12T02:56:47Z

Build succeeded:

shubhamdhama requested a review from a team as a code owner July 11, 2024 06:07

shubhamdhama requested review from rytaft and removed request for a team July 11, 2024 06:07

shubhamdhama force-pushed the fix-test-unavailable-hosts-126908 branch 2 times, most recently from dafb126 to 3998406 Compare July 11, 2024 12:48

rytaft requested review from yuzefovich and removed request for rytaft July 11, 2024 14:24

yuzefovich reviewed Jul 11, 2024

View reviewed changes

shubhamdhama force-pushed the fix-test-unavailable-hosts-126908 branch from 3998406 to 8ee7e42 Compare July 11, 2024 18:50

shubhamdhama force-pushed the fix-test-unavailable-hosts-126908 branch from 8ee7e42 to 3277360 Compare July 11, 2024 18:52

shubhamdhama requested a review from yuzefovich July 11, 2024 18:56

shubhamdhama commented Jul 11, 2024

View reviewed changes

yuzefovich added the backport-24.2.x label Jul 11, 2024

yuzefovich approved these changes Jul 11, 2024

View reviewed changes

craig bot merged commit 2ee6332 into cockroachdb:master Jul 12, 2024

blathers-crl bot mentioned this pull request Jul 12, 2024

release-24.2: distsql: Fix flaky TestDistSQLUnavailableHosts. #127044

Merged

shubhamdhama mentioned this pull request Jul 31, 2024

sql: clean up health checks #127293

Open

shubhamdhama deleted the fix-test-unavailable-hosts-126908 branch March 1, 2025 10:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

distsql: Fix flaky TestDistSQLUnavailableHosts.#126976

distsql: Fix flaky TestDistSQLUnavailableHosts.#126976
craig[bot] merged 1 commit intocockroachdb:masterfrom
shubhamdhama:fix-test-unavailable-hosts-126908

shubhamdhama commented Jul 11, 2024

Uh oh!

cockroach-teamcity commented Jul 11, 2024

Uh oh!

shubhamdhama commented Jul 11, 2024 •

edited

Loading

Uh oh!

shubhamdhama commented Jul 11, 2024

Uh oh!

rytaft commented Jul 11, 2024

Uh oh!

yuzefovich left a comment

Uh oh!

shubhamdhama left a comment

Uh oh!

yuzefovich left a comment

Uh oh!

shubhamdhama commented Jul 11, 2024

Uh oh!

yuzefovich commented Jul 12, 2024

Uh oh!

craig bot commented Jul 12, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

shubhamdhama commented Jul 11, 2024

Uh oh!

cockroach-teamcity commented Jul 11, 2024

Uh oh!

shubhamdhama commented Jul 11, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shubhamdhama commented Jul 11, 2024

Uh oh!

rytaft commented Jul 11, 2024

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

shubhamdhama left a comment

Choose a reason for hiding this comment

Uh oh!

yuzefovich left a comment

Choose a reason for hiding this comment

Uh oh!

shubhamdhama commented Jul 11, 2024

Uh oh!

yuzefovich commented Jul 12, 2024

Uh oh!

craig bot commented Jul 12, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

shubhamdhama commented Jul 11, 2024 •

edited

Loading