Skip to content

distsql: Fix flaky TestDistSQLUnavailableHosts.#126976

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
shubhamdhama:fix-test-unavailable-hosts-126908
Jul 12, 2024
Merged

distsql: Fix flaky TestDistSQLUnavailableHosts.#126976
craig[bot] merged 1 commit intocockroachdb:masterfrom
shubhamdhama:fix-test-unavailable-hosts-126908

Conversation

@shubhamdhama
Copy link
Copy Markdown
Contributor

This PR fixes the two flaky tests as reported in the linked issue:

TestDistSQLUnavailableHosts/unhealthy-nodes-in-single-tenant

The issue was that the first plan of plans slice was incorrect and included the 5th node.

Some more context: Although this test shouldn't be affected by the core changes in #124986, I made slight improvements during refactoring. I added a plan check after stopping the 5th node and before running the query to avoid updating the lease and to understand how the query is planned.

This issue wasn't detected even in local stress testing. My hypothesis is that leases are usually updated quickly enough that the 5th node's ranges are moved to the 2nd or 3rd node most of the time. But on teamcity it hits the case where the leases haven't moved and range is planned on the gateway.

TestDistSQLUnavailableHosts/unhealthy-nodes-in-shared-mode

The test runs fast enough that if a heartbeat hasn't occurred yet, ConnHealthTryDialInstance returns ErrNotHeartbeated. The fix here is in most cases, this error can be ignored, and the node can be considered healthy. If the node is indeed unhealthy, we would anyway retry-as-local.

Fixes: #126908

Epic: CRDB-40179

Release note: None

@shubhamdhama shubhamdhama requested a review from a team as a code owner July 11, 2024 06:07
@shubhamdhama shubhamdhama requested review from rytaft and removed request for a team July 11, 2024 06:07
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@shubhamdhama
Copy link
Copy Markdown
Contributor Author

shubhamdhama commented Jul 11, 2024

  • Revert skip.UnderShort back to skip.UnderDuress. It's here right now to stress on teamcity.

@shubhamdhama shubhamdhama force-pushed the fix-test-unavailable-hosts-126908 branch 2 times, most recently from dafb126 to 3998406 Compare July 11, 2024 12:48
@shubhamdhama
Copy link
Copy Markdown
Contributor Author

I was thinking of stressing using the way mentioned here but found out some Bazel Extended CI was failing with the following error

Link to the test

Failed
=== RUN   TestDistSQLUnavailableHosts/unhealthy-nodes-in-shared-mode
    distsql_physical_planner_test.go:563: query 'SELECT IF(substring(start_key for 1)='…',start_key,NULL),
                      IF(substring(end_key for 1)='…',end_key,NULL),
                      lease_holder, replicas FROM [SHOW RANGES FROM TABLE t WITH DETAILS]': expected:
        NULL, …/Table/106/1/0, 1, {1}
        …/Table/106/1/0, …/Table/106/1/20, 1, {1,2,3}
        …/Table/106/1/20, …/Table/106/1/40, 2, {2,3,4}
        …/Table/106/1/40, …/Table/106/1/60, 3, {1,3,4}
        …/Table/106/1/60, …/Table/106/1/80, 4, {1,2,4}
        …/Table/106/1/80, NULL, 5, {2,3,5}
        got:
        NULL, …/Table/106/1/0, 1, {1}
        …/Table/106/1/0, …/Table/106/1/20, 3, {1,2,3}
        …/Table/106/1/20, …/Table/106/1/40, 2, {2,3,4}
        …/Table/106/1/40, …/Table/106/1/60, 3, {1,3,4}
        …/Table/106/1/60, …/Table/106/1/80, 4, {1,2,4}
        …/Table/106/1/80, NULL, 5, {2,3,5}
    --- FAIL: TestDistSQLUnavailableHosts/unhealthy-nodes-in-shared-mode (98.59s)

I didn't have a concrete answer to why a particular replica should be a leaseholder so I explicitly added

			r.Exec(t, fmt.Sprintf("ALTER TABLE t RELOCATE LEASE VALUES (%d, %d)", i+1, n*i/5))

but now it's failing with the same error for another subtest,

Failed
=== RUN   TestDistSQLUnavailableHosts/unhealthy-nodes-in-external-mode
    distsql_physical_planner_test.go:614: query 'SELECT IF(substring(start_key for 1)='…',start_key,NULL),
                      IF(substring(end_key for 1)='…',end_key,NULL),
                      lease_holder, replicas FROM [SHOW RANGES FROM TABLE t WITH DETAILS]': expected:
        NULL, …/Table/106/1/0, 1, {1}
        …/Table/106/1/0, …/Table/106/1/20, 1, {1,2,3}
        …/Table/106/1/20, …/Table/106/1/40, 2, {2,3,4}
        …/Table/106/1/40, …/Table/106/1/60, 3, {1,3,4}
        …/Table/106/1/60, …/Table/106/1/80, 4, {1,2,4}
        …/Table/106/1/80, NULL, 5, {2,3,5}
        got:
        NULL, …/Table/106/1/0, 1, {1}
        …/Table/106/1/0, …/Table/106/1/20, 3, {1,2,3}
        …/Table/106/1/20, …/Table/106/1/40, 2, {2,3,4}
        …/Table/106/1/40, …/Table/106/1/60, 3, {1,3,4}
        …/Table/106/1/60, …/Table/106/1/80, 4, {1,2,4}
        …/Table/106/1/80, NULL, 5, {2,3,5}
    --- FAIL: TestDistSQLUnavailableHosts/unhealthy-nodes-in-external-mode (91.26s)

I'm out of ideas what's causing this? Or is this expected and is the reason why Yahor suggested to mark these as skip.UnderDuress?

@rytaft rytaft requested review from yuzefovich and removed request for rytaft July 11, 2024 14:24
@rytaft
Copy link
Copy Markdown
Collaborator

rytaft commented Jul 11, 2024

Adding @yuzefovich as a reviewer since it looks like you reviewed the most recent change to this test. If you don't have time let me know and I can take a look. Thanks!

Copy link
Copy Markdown
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

race builds impose performance penalty, stressrace builds even larger one. This test has 5 nodes (which obviously requires more resources than a single node), and when we have tenants (either shared or external mode), then performance overhead of the test is even larger. Generally speaking, commands that modify the KV range allocation decisions (like relocating leases, etc) are kinda best-effort, and they generally work, but they could fail / take long time under overload (which is the case in this test). In short, yes, all these observations are the reason why I suggested skip.UnderDuress since this test is destined to be flaky under overload.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @shubhamdhama)


pkg/sql/distsql_physical_planner.go line 1578 at r1 (raw file):

	status := NodeOK
	err := dsp.nodeHealth.connHealthInstance(instanceID, instanceRPCAddr)
	if errors.Is(err, rpc.ErrNotHeartbeated) {

nit: perhaps move this check inside the err != nil block below.

@shubhamdhama shubhamdhama force-pushed the fix-test-unavailable-hosts-126908 branch from 3998406 to 8ee7e42 Compare July 11, 2024 18:50
This PR fixes the two flaky tests as reported in the linked issue:

*TestDistSQLUnavailableHosts/unhealthy-nodes-in-single-tenant*

The issue was that the first plan of `plans` slice was incorrect and
included the 5th node.

Although this test shouldn't be affected by the changes in cockroachdb#124986, I made
slight improvements during refactoring. I added a plan check after stopping
the 5th node and before running the query to avoid updating the lease and
to understand how the query is planned.

This issue wasn't detected even in local stress testing. My hypothesis is
that leases are usually updated quickly enough that the 5th node's ranges
are moved to the 2nd or 3rd node most of the time. But on teamcity it hits
the case where the leases haven't moved and they are planned on the
gateway.

*TestDistSQLUnavailableHosts/unhealthy-nodes-in-shared-mode*

The test runs fast enough that if a heartbeat hasn't occurred yet,
`ConnHealthTryDialInstance` returns `ErrNotHeartbeated`. The fix here is in
most cases, this error can be ignored, and the node can be considered
healthy. If the node is indeed unhealthy, we would anyway retry-as-local.

Fixes: cockroachdb#126908

Epic: CRDB-40179

Release note: None
@shubhamdhama shubhamdhama force-pushed the fix-test-unavailable-hosts-126908 branch from 8ee7e42 to 3277360 Compare July 11, 2024 18:52
@shubhamdhama shubhamdhama requested a review from yuzefovich July 11, 2024 18:56
Copy link
Copy Markdown
Contributor Author

@shubhamdhama shubhamdhama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks for the explanation.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/sql/distsql_physical_planner.go line 1578 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: perhaps move this check inside the err != nil block below.

Done


pkg/sql/distsql_physical_planner_test.go line 467 at r1 (raw file):

				i+1, (i+1)%4+1, (i+2)%4+1, n*i/5,
			))
			r.Exec(t, fmt.Sprintf("ALTER TABLE t RELOCATE LEASE VALUES (%d, %d)", i+1, n*i/5))

I'm removing this line then because I'm sure the execute query one line above does this anyway, and it would just increase the test time.

Copy link
Copy Markdown
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@shubhamdhama
Copy link
Copy Markdown
Contributor Author

Thanks for the review!

Stress and race tests on my local and gceworkers have passed (many times), so I'm merging this. I will take care of any other failures in follow-up PRs.

bors r=yuzefovich

@yuzefovich
Copy link
Copy Markdown
Member

bors r=yuzefovich

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jul 12, 2024

@craig craig bot merged commit 2ee6332 into cockroachdb:master Jul 12, 2024
@shubhamdhama shubhamdhama deleted the fix-test-unavailable-hosts-126908 branch March 1, 2025 10:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sql: TestDistSQLUnavailableHosts failed

4 participants