kvserver: deflake TestReplicaTombstone#99099
Conversation
ec2fd7d to
a798f3d
Compare
a798f3d to
d809c2c
Compare
andrewbaptist
left a comment
There was a problem hiding this comment.
This PR looks ok, but it would have been simpler to only handle two types of snapshots "initial" or "catchup" (as a bool).
For initial snapshots, the constraint is used, but for catchup, it is not. This may have become a little more flakey due to delegated snapshots since they hold this constraint longer than before, but hopefully, these aren't "new" flakes".
This PR makes it seem like there are "various types" of snapshots that might be sent when really there are only two, and we will never add more.
I am still approving this, but consider changing it to a bool and passing that through based on whether it is initial or not. I think that will make this easier to follow and understand why this is done.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/raft_log_queue.go line 272 at r1 (raw file):
const anyRecipientStore roachpb.StoreID = 0 const anySnapType kvserverpb.SnapshotRequest_Type = -1
nit: I was hoping to remove this type/field completely "soon". I will likely do this during stability period, but only put it on master. The field doesn't need to be transmitted over the wire as it is only used for logging on the receiver side, so it will be easy enough to replace this with a non-proto bool. This logic looks correct though and it might be cleaner when there are really only 2 types (initial, catchup)
pkg/kv/kvserver/replica_raft.go line 1826 at r1 (raw file):
// recipient or 0 if there isn't one. Passing 0 for recipientStore means any // recipient. func (r *Replica) getSnapshotLogTruncationConstraints(
nit: Consider just inlining this method into hasOutstandingSnapshotInFlightToStore as this is the only usage of it.
pkg/kv/kvserver/replica_raft.go line 1854 at r1 (raw file):
} // errOnOutstandingLearnerSnapshotInflight returns true if there is a snapshot in
nit: Fix comment to say this now returns an error vs true/false.
d809c2c to
6932508
Compare
tbg
left a comment
There was a problem hiding this comment.
Done, PTAL.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)
andrewbaptist
left a comment
There was a problem hiding this comment.
Thank you!
Looking at this again I think it might be worse now since it mixes the bool and the Type. Can you also change addSnapshotLogTruncationConstraint and snapTruncationInfo to use initialOnly bool. Then you won't need to use the SnapshotRequest_Type anywhere in this code.
Reviewable status:
complete! 0 of 0 LGTMs obtained
6932508 to
f2bcc50
Compare
Like many other tests, this test could flake because we'd sometimes catch a "cannot remove learner while snapshot is in flight" error. I think the root cause is that sometimes there are errant Raft snapshots in the system[^1] and these get mistaken for LEARNERs that are still being caught up by the replicate queue. I tried to address this general class of issues by making the check for in-flight learner snapshots not care about *raft* snapshots. I was able to stress TestReplicaTombstone for 30+ minutes without a failure using that approach, whereas previously it usually failed within a few minutes. ``` ./dev test --stress pkg/kv/kvserver/ --filter TestReplicaTombstone 2>&1 | tee stress.log [...] 2461 runs so far, 0 failures, over 35m45s ``` [^1]: cockroachdb#87553 Fixes cockroachdb#98883. Epic: none Release note: None
f2bcc50 to
6444df5
Compare
|
Sorry, I meant to have removed SnapType usage entirely. Fixed up now. |
andrewbaptist
left a comment
There was a problem hiding this comment.
Reviewable status:
complete! 1 of 0 LGTMs obtained
|
bors r=andrewbaptist |
|
Build failed (retrying...): |
|
Build succeeded: |
|
I think this should be backported? |
|
blathers backport 23.1 |
|
blathers backport 22.2 |
|
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool. error creating merge commit from 6444df5 to blathers/backport-release-22.2-99099: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool. Backport to branch 22.2 failed. See errors above. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
They seem fixable, but annoying to fix. v22.2 and before also tracked the |
|
They're pretty widespread, but I guess it's ok. Especially as we're taking unit test failures into the test triage rotation, presumably people will get pretty used to closing them out. |
|
My comment predates our discussion in standup (I think) yesterday, I'm planning on giving the backport another push (22.2 only). |
Like many other tests, this test could flake because we'd sometimes
catch a "cannot remove learner while snapshot is in flight" error.
I think the root cause is that sometimes there are errant Raft snapshots
in the system1 and these get mistaken for LEARNERs that are still
being caught up by the replicate queue. I tried to address this general
class of issues by making the check for in-flight learner snapshots not
care about raft snapshots.
I was able to stress TestReplicaTombstone for 30+ minutes without a
failure using that approach, whereas previously it usually failed within
a few minutes.
Fixes #98883.
Epic: none
Release note: None
Footnotes
https://github.com/cockroachdb/cockroach/issues/87553 ↩