Skip to content

roachtest: de-flake 'inconsistent'#54019

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
tbg:rt-incons
Sep 10, 2020
Merged

roachtest: de-flake 'inconsistent'#54019
craig[bot] merged 1 commit intocockroachdb:masterfrom
tbg:rt-incons

Conversation

@tbg
Copy link
Copy Markdown
Member

@tbg tbg commented Sep 8, 2020

This test sets up an intentionally corrupted replica and wants its node
to shut down as a result of its detection. When only two of the three
nodes were included in the consistency check, either one of them could
end up terminating (as no obvious majority of healthy replicas can be
determined). Change the test so that we wait for the cluster to come
fully together before setting a low consistency check interval.

Closes #54005.

Release justification: testing
Release note: None

@tbg tbg requested a review from knz September 8, 2020 08:34
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

// consistency check interval. This makes sure that the consistency check runs
// against all three nodes. If it targeted only two nodes, a random one would
// fatal - not what we want.
time.Sleep(20 * time.Second)
Copy link
Copy Markdown
Contributor

@knz knz Sep 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stand behind moving the cluster setting change to after the call for waitForFullReplication()

However I do not understand the 20s wait - isn't waitForFullReplication() sufficient to "wait for the cluster to come together"?

Also, separately, a question: the comment says "this makes sure that the consistency check runs against all three nodes". However the cluster setting change is onyl effected against node 2. Given that cluster settings take a little while to propagate, it looks to me like node 2 is the one that's going to predominently fall over, as it will start its checks before the other two nodes get the setting update. Is that intentional?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However I do not understand the 20s wait - isn't waitForFullReplication() sufficient to "wait for the cluster to come together"?

You're right that this should not be necessary. If replicas are moved to the node, it is likely (though not guaranteed, but alas) that it is considered live by all. Removed.

Also, separately, a question: the comment says "this makes sure that the consistency check runs against all three nodes". However the cluster setting change is onyl effected against node 2. Given that cluster settings take a little while to propagate, it looks to me like node 2 is the one that's going to predominently fall over, as it will start its checks before the other two nodes get the setting update. Is that intentional?

The consistency check runs against all members of the range. It does not matter which node it gets run on, other than that that node's view of liveness determines who is probed as part of the check. I am choosing n2 here since n1 is the one we expect to die; it may already be dead (though it usually isn't).

This test sets up an intentionally corrupted replica and wants its node
to shut down as a result of its detection. When only two of the three
nodes were included in the consistency check, either one of them could
end up terminating (as no obvious majority of healthy replicas can be
determined). Change the test so that we wait for the cluster to come
fully together before setting a low consistency check interval.

Closes cockroachdb#54005.

Release justification: testing
Release note: None
Copy link
Copy Markdown
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for explaining.

@tbg
Copy link
Copy Markdown
Member Author

tbg commented Sep 10, 2020

bors r=knz

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Sep 10, 2020

Build succeeded:

@craig craig bot merged commit eab6bcb into cockroachdb:master Sep 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: inconsistency failed

3 participants