roachtest: avoid decommissioning suspect nodes in mixed version test#106859

Merged
craig[bot] merged 1 commit into cockroachdb:master from AlexTalks:fix_rt_decom_precheck_suspect on Jul 17, 2023

@AlexTalks AlexTalks commented Jul 14, 2023

After decommission pre-checks were introduced in #98113, the roachtests were updated to handle the new output, but only in some cases. In the `decommission/mixed-versions` test, which upgrades and restarts nodes, decommission requests issued shortly after a restart could fail the pre-checks because nodes are considered "suspect" for 30s after being unavailable. This change decreases the suspect time limit and ensures that nodes are considered fully available before they are decommissioned with pre-checks enabled.

Fixes: #101620

Release note: None

@AlexTalks AlexTalks requested a review from a team as a code owner July 14, 2023 19:18
@AlexTalks AlexTalks requested review from erikgrinaker, renatolabs and smg260 and removed request for a team July 14, 2023 19:18
@cockroach-teamcity
This change is Reviewable

@AlexTalks AlexTalks added the backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only label Jul 14, 2023
@AlexTalks AlexTalks force-pushed the fix_rt_decom_precheck_suspect branch from 949cffe to 38c9f0b Compare July 14, 2023 19:32
@AlexTalks

bors r+

craig bot commented Jul 17, 2023

Build succeeded:

@craig craig bot merged commit c07cb84 into cockroachdb:master Jul 17, 2023
@AlexTalks AlexTalks deleted the fix_rt_decom_precheck_suspect branch July 18, 2023 00:01
AlexTalks added a commit to AlexTalks/cockroach that referenced this pull request Aug 9, 2023
In cockroachdb#106859, the `decommission/mixed-versions` test was updated to
properly support the decommission pre-checks introduced in 23.1. In doing
so, however, it inadvertently introduced a bug in the test involving the
`server.time_after_store_suspect` setting. While this setting can be used
to shorten the time a store is considered suspect after a node restart,
its handling differs between 23.1 (the current predecessor major version)
and 23.2: 23.2 requires the setting to be at least 10s and otherwise
reverts to the default of 30s, even though this validation is not
performed when the setting is overridden on the predecessor version.

This change corrects that mistake by setting the value to the correct
minimum and waiting out the "suspect" time after restart before
attempting decommission.

Fixes: cockroachdb#107150.

Release note: None
craig bot pushed a commit that referenced this pull request Aug 10, 2023
108408: roachtest: ensure valid suspect duration in mixed version decommission r=AlexTalks a=AlexTalks

Co-authored-by: Alex Sarkesian <sarkesian@cockroachlabs.com>