kvserver: add mixed version race condition test for policy refresher #144106
craig[bot] merged 2 commits into cockroachdb:master from
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
force-pushed from 9d8eeaf to 9ecbe83
I verified that this test crashes without the fallback logic added in https://github.com/cockroachdb/cockroach/blob/de50076c03cb342a57c972705dcb860651416ba5/pkg/kv/kvserver/replica_closedts.go#L32-L49.
force-pushed from 8705874 to 27647a6
force-pushed from 9787219 to d71deb9
arulajmani
left a comment
Reviewed 8 of 8 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/ccl/multiregionccl/multiregion_test.go line 272 at r2 (raw file):
//
// Particularly, a race condition like below might occur:
// 1. Side transport prepares a policy map before cluster upgrade is complete.
nit: bullets on separate lines for the aesthetic!
pkg/kv/kvserver/closedts/policyrefresher/config.go line 16 at r1 (raw file):
// timestamp policies. When set to true, auto-tuning will be enabled regardless
// of the actual cluster version.
ClusterUpgradeOverride atomic.Bool
This testing knob's a bit weird because it skips a cluster version check. Maybe instead of this, you could inject latencies directly and say injected latencies are returned regardless of the autotuning cluster setting?
pkg/kv/kvserver/replica.go line 2844 at r1 (raw file):
// TestingGetCachedClosedTimestampPolicy returns the closed timestamp policy
// held by r. This is a testing-only helper.
func (r *Replica) TestingGetCachedClosedTimestampPolicy() ctpb.RangeClosedTimestampPolicy {
Does this belong in this commit?
EDIT: (I think you accidentally picked this up in the first commit and removed it in the second)
force-pushed from 4bbefc0 to 05ca93e
wenyihu6
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)
pkg/kv/kvserver/closedts/policyrefresher/config.go line 16 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
This testing knob's a bit weird because it skips a cluster version check. Maybe instead of this, you could inject latencies directly and say injected latencies are returned regardless of the autotuning cluster setting?
Good idea, done.
pkg/ccl/multiregionccl/multiregion_test.go line 272 at r2 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit: bullets on separate lines for the aesthetic!
Prettier now :D
pkg/kv/kvserver/replica.go line 2844 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Does this belong in this commit?
EDIT: (I think you accidentally picked this up in the first commit and removed it in the second)
Oops, removed.
force-pushed from 05ca93e to dd1651c
wenyihu6
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)
pkg/kv/kvserver/closedts/policyrefresher/config.go line 16 at r1 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
Good idea, done.
While making this change, I realized that using these testing knobs might be a better approach than setting up multi-region tables as we currently do. I plan to do a similar thing to the other test I added earlier, TestReplicaClosedTSPolicyWithPolicyRefresher, and move it back into the kvserver package. wdyt?
force-pushed from dd1651c to 88b9b02
wenyihu6
left a comment
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani)
pkg/kv/kvserver/closedts/policyrefresher/config.go line 16 at r1 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
While making this change, I realized that using these testing knobs might be a better approach than setting up multi-region tables as we currently do. I plan to do a similar thing to the other test I added earlier, TestReplicaClosedTSPolicyWithPolicyRefresher, and move it back into the kvserver package. wdyt?
On second thought, I’d prefer to keep one unit test that uses an actual multi-region table. I’ll leave TestReplicaClosedTSPolicyWithPolicyRefresher as is, but I’ve updated the new test I’m adding here to follow the approach I described above.
force-pushed from 88b9b02 to d7fc2a4
arulajmani
left a comment
Reviewed 6 of 6 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/closedts/policyrefresher/config.go line 16 at r1 (raw file):
Previously, wenyihu6 (Wenyi Hu) wrote…
On second thought, I’d prefer to keep one unit test that uses an actual multi-region table. I’ll leave TestReplicaClosedTSPolicyWithPolicyRefresher as is, but I’ve updated the new test I’m adding here to follow the approach I described above.
This sounds good to me!
pkg/kv/kvserver/closedts/policyrefresher/policy_refresher.go line 134 at r3 (raw file):
func (pr *PolicyRefresher) getCurrentLatencies() map[roachpb.NodeID]time.Duration {
	// Testing knobs only.
	if pr.knobs != nil {
Should we also be checking pr.knobs.InjectedLatencies != nil? In the future, we may add more testing knobs here.
force-pushed from d7fc2a4 to 8f6c489
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)
pkg/kv/kvserver/closedts/policyrefresher/policy_refresher.go line 134 at r3 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
Should we also be checking pr.knobs.InjectedLatencies != nil? In the future, we may add more testing knobs here.
Good call, done.
Extended CI: known flake TestLeaseQueueShedsOnIOOverload. TFTR!
kvserver: add PolicyRefresher.TestingKnobs
This commit adds PolicyRefresher.TestingKnobs, which enables testing race
scenarios where the sender hasn't observed a cluster version upgrade when
creating the policy map, but the PolicyRefresher later does.
Epic: none
Release note: none
kvserver: add mixed version race condition test for policy refresher
This commit adds TestReplicaClosedTSPolicyWithPolicyRefresherInMixedVersionCluster.
TestReplicaClosedTSPolicyWithPolicyRefresherInMixedVersionCluster verifies
that the closed timestamp policy refresher behaves correctly in a
mixed-version cluster.
Particularly, a race condition like below might occur:
1. Side transport prepares a policy map before cluster upgrade is complete.
2. Cluster upgrade completes.
3. Policy refresher sees the upgrade and quickly updates replica policies to
use latency-based policies.
4. Replica tries to use a latency-based policy but the policy map from step 1
doesn't include it yet.
The logic in replica.getTargetByPolicy handles this race condition by falling
back to non-latency-based policies if no latency-based policies were included
in the map provided by the side transport sender.
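A minimal sketch of that fallback, assuming simplified stand-in names (the real logic lives in replica_closedts.go and uses the ctpb policy enums):

```go
package main

// Policy stands in for ctpb.RangeClosedTimestampPolicy in this sketch.
type Policy int

const (
	// LagByClusterSetting stands in for the non-latency-based policy
	// that exists on both sides of the version gate.
	LagByClusterSetting Policy = iota
	// LeadForGlobalReads stands in for a latency-based policy added
	// behind the upgrade gate.
	LeadForGlobalReads
)

// targetByPolicy sketches the fallback: when a replica holds a
// latency-based policy but the sender's map was built before the sender
// observed the upgrade, the latency-based entry is absent, so the lookup
// falls back to the non-latency-based target instead of failing.
func targetByPolicy(policy Policy, senderMap map[Policy]int64) (int64, bool) {
	if target, ok := senderMap[policy]; ok {
		return target, true
	}
	target, ok := senderMap[LagByClusterSetting]
	return target, ok
}
```

The test added in this PR exercises exactly the fallback case: a replica holding a latency-based policy reading from a map that predates the upgrade.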
This test simulates the race condition by using a testing knob to allow the
policy refresher to use latency-based policies on replicas while the rest of
the system is still on an older version. It verifies that even if the policy
refresher sees an upgrade to V25_2 and replicas start holding latency-based
policies, replicas will correctly fall back to non-latency-based policies if
the sender hasn’t yet sent the updated latency-based policies.
Part of: #143888
Release note: none