Skip to content

kvserver: add mixed version race condition test for policy refresher#144106

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
wenyihu6:newmixedversion
Apr 12, 2025
Merged

kvserver: add mixed version race condition test for policy refresher#144106
craig[bot] merged 2 commits intocockroachdb:masterfrom
wenyihu6:newmixedversion

Conversation

@wenyihu6
Copy link
Copy Markdown
Contributor

@wenyihu6 wenyihu6 commented Apr 8, 2025

kvserver: add PolicyRefresher.TestingKnobs

This commit adds PolicyRefresher.TestingKnobs, which enables testing race
scenarios where the sender hasn't observed a cluster version upgrade when
creating the policy map, but the PolicyRefresher later does.

Epic: none
Release note: none


kvserver: add mixed version race condition test for policy refresher

This commit adds TestReplicaClosedTSPolicyWithPolicyRefresherInMixedVersionCluster.

TestReplicaClosedTSPolicyWithPolicyRefresherInMixedVersionCluster verifies
that the closed timestamp policy refresher behaves correctly in a
mixed-version cluster.

Particularly, a race condition like below might occur:

  1. Side transport prepares a policy map before cluster upgrade is complete.
  2. Cluster upgrade completes.
  3. Policy refresher sees the upgrade and quickly updates replica policies to
    use latency-based policies.
  4. Replica tries to use a latency-based policy but the policy map from step 1
    doesn't include it yet.

The logic in replica.getTargetByPolicy handles this race condition by falling
back to no-latency based policies if no-latency based policies were included
from the map provided by the side transport sender.

This test simulates a race condition by using a testing knob to allow the
policy refresher to use latency based policies on replicas while the rest of
the system is still on an older version. It verifies that even if the policy
refresher sees an upgrade to V25_2 and replicas starts holding latency based
policies, replicas will correctly fall back to non-latency-based policies if
the sender hasn’t yet sent the updated latency-based policies.

Part of: #143888
Release note: none

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 8, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@wenyihu6 wenyihu6 force-pushed the newmixedversion branch 4 times, most recently from 9d8eeaf to 9ecbe83 Compare April 8, 2025 21:52
@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Apr 8, 2025

I verified that this test crashes without the fallback logic added in https://github.com/cockroachdb/cockroach/blob/de50076c03cb342a57c972705dcb860651416ba5/pkg/kv/kvserver/replica_closedts.go#L32-L49.

@wenyihu6 wenyihu6 marked this pull request as ready for review April 8, 2025 21:56
@wenyihu6 wenyihu6 requested review from a team as code owners April 8, 2025 21:56
@wenyihu6 wenyihu6 requested review from arulajmani and removed request for a team April 8, 2025 21:56
@wenyihu6 wenyihu6 force-pushed the newmixedversion branch 2 times, most recently from 8705874 to 27647a6 Compare April 8, 2025 22:02
@wenyihu6 wenyihu6 changed the title kvserver: add mixed version test kvserver: add mixed version race condition test for policy refresher Apr 8, 2025
@wenyihu6 wenyihu6 force-pushed the newmixedversion branch 3 times, most recently from 9787219 to d71deb9 Compare April 9, 2025 11:08
Copy link
Copy Markdown
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 8 of 8 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)


pkg/ccl/multiregionccl/multiregion_test.go line 272 at r2 (raw file):

//
// Particularly, a race condition like below might occur:
// 1. Side transport prepares a policy map before cluster upgrade is complete.

nit: bullets on separate lines for the aesthetic!


pkg/kv/kvserver/closedts/policyrefresher/config.go line 16 at r1 (raw file):

	// timestamp policies. When set to true, auto-tuning will be enabled regardless
	// of the actual cluster version.
	ClusterUpgradeOverride atomic.Bool

This testing knob's a bit weird because it skips a cluster version check. Maybe instead of this, you could inject latencies directly and say injected latencies are returned regardless of the autotuning cluster setting?


pkg/kv/kvserver/replica.go line 2844 at r1 (raw file):

// TestingGetCachedClosedTimestampPolicy returns the closed timestamp policy
// held by r. This is a testing-only helper.
func (r *Replica) TestingGetCachedClosedTimestampPolicy() ctpb.RangeClosedTimestampPolicy {

Does this belong in this commit?


EDIT: (I think you accidentally picked this up in the first commit and removed it in the second)

@wenyihu6 wenyihu6 force-pushed the newmixedversion branch 2 times, most recently from 4bbefc0 to 05ca93e Compare April 10, 2025 09:44
Copy link
Copy Markdown
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)


pkg/kv/kvserver/closedts/policyrefresher/config.go line 16 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

This testing knob's a bit weird because it skips a cluster version check. Maybe instead of this, you could inject latencies directly and say injected latencies are returned regardless of the autotuning cluster setting?

Good idea, done.


pkg/ccl/multiregionccl/multiregion_test.go line 272 at r2 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: bullets on separate lines for the aesthetic!

Prettier now :D


pkg/kv/kvserver/replica.go line 2844 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Does this belong in this commit?


EDIT: (I think you accidentally picked this up in the first commit and removed it in the second)

Oops, removed.

Copy link
Copy Markdown
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)


pkg/kv/kvserver/closedts/policyrefresher/config.go line 16 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Good idea, done.

While making this change, I realized that using these testing knobs might be a better approach than setting up multi-region tables as we currently do. I plan to do a similar to thing to the other test I added TestReplicaClosedTSPolicyWithPolicyRefresher earlier and move it back into the kvserver package. wdyt?

Copy link
Copy Markdown
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)


pkg/kv/kvserver/closedts/policyrefresher/config.go line 16 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

While making this change, I realized that using these testing knobs might be a better approach than setting up multi-region tables as we currently do. I plan to do a similar to thing to the other test I added TestReplicaClosedTSPolicyWithPolicyRefresher earlier and move it back into the kvserver package. wdyt?

On second thought, I’d prefer to keep one unit test that uses an actual multi-region table. I’ll leave TestReplicaClosedTSPolicyWithPolicyRefresher as is, but I’ve updated the new test I’m adding here to follow the approach I described above.

Copy link
Copy Markdown
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 6 of 6 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)


pkg/kv/kvserver/closedts/policyrefresher/config.go line 16 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

On second thought, I’d prefer to keep one unit test that uses an actual multi-region table. I’ll leave TestReplicaClosedTSPolicyWithPolicyRefresher as is, but I’ve updated the new test I’m adding here to follow the approach I described above.

This sounds good to me!


pkg/kv/kvserver/closedts/policyrefresher/policy_refresher.go line 134 at r3 (raw file):

func (pr *PolicyRefresher) getCurrentLatencies() map[roachpb.NodeID]time.Duration {
	// Testing knobs only.
	if pr.knobs != nil {

Should we also be checking pr.knobs.InjectedLatencies != nil? In the future, we may add more testing knobs here.

This commit adds PolicyRefresher.TestingKnobs, which enables testing race
scenarios where the sender hasn't observed a cluster version upgrade when
creating the policy map, but the PolicyRefresher later does.

Epic: none
Release note: none
This commit adds TestReplicaClosedTSPolicyWithPolicyRefresherInMixedVersionCluster.

TestReplicaClosedTSPolicyWithPolicyRefresherInMixedVersionCluster verifies
that the closed timestamp policy refresher behaves correctly in a
mixed-version cluster.

Particularly, a race condition like below might occur:
1. Side transport prepares a policy map before cluster upgrade is complete.
2. Cluster upgrade completes.
3. Policy refresher sees the upgrade and quickly updates replica policies to
use latency-based policies.
4. Replica tries to use a latency-based policy but the policy map from step 1
doesn't include it yet.

The logic in replica.getTargetByPolicy handles this race condition by falling
back to no-latency based policies if no-latency based policies were included
from the map provided by the side transport sender.

This test simulates a race condition by using a testing knob to allow the
policy refresher to use latency based policies on replicas while the rest of
the system is still on an older version. It verifies that even if the policy
refresher sees an upgrade to V25_2 and replicas starts holding latency based
policies, replicas will correctly fall back to non-latency-based policies if
the sender hasn’t yet sent the updated latency-based policies.

Part of: cockroachdb#143888
Release note: none
Copy link
Copy Markdown
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)


pkg/kv/kvserver/closedts/policyrefresher/policy_refresher.go line 134 at r3 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we also be checking pr.knobs.InjectedLatencies != nil? In the future, we may add more testing knobs here.

Good call, done.

@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Apr 12, 2025

Extended CI: known flake TestLeaseQueueShedsOnIOOverload.

TFTR!
bors r=arulajmani

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Apr 12, 2025

@craig craig bot merged commit 13866fc into cockroachdb:master Apr 12, 2025
23 of 24 checks passed
@wenyihu6 wenyihu6 deleted the newmixedversion branch April 12, 2025 19:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants