Skip to content

kvserver: deflake TestReplicaTombstone#99099

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
tbg:TestReplicaTombstone
Mar 22, 2023
Merged

kvserver: deflake TestReplicaTombstone#99099
craig[bot] merged 1 commit intocockroachdb:masterfrom
tbg:TestReplicaTombstone

Conversation

@tbg
Copy link
Copy Markdown
Member

@tbg tbg commented Mar 21, 2023

Like many other tests, this test could flake because we'd sometimes
catch a "cannot remove learner while snapshot is in flight" error.

I think the root cause is that sometimes there are errant Raft snapshots
in the system1 and these get mistaken for LEARNERs that are still
being caught up by the replicate queue. I tried to address this general
class of issues by making the check for in-flight learner snapshots not
care about raft snapshots.

I was able to stress TestReplicaTombstone for 30+ minutes without a
failure using that approach, whereas previously it usually failed within
a few minutes.

./dev test --stress pkg/kv/kvserver/ --filter TestReplicaTombstone 2>&1 | tee stress.log
[...]
2461 runs so far, 0 failures, over 35m45s

Fixes #98883.

Epic: none
Release note: None

Footnotes

  1. https://github.com/cockroachdb/cockroach/issues/87553

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@tbg tbg force-pushed the TestReplicaTombstone branch 3 times, most recently from ec2fd7d to a798f3d Compare March 21, 2023 10:02
@tbg tbg requested a review from andrewbaptist March 21, 2023 10:03
@tbg tbg marked this pull request as ready for review March 21, 2023 10:03
@tbg tbg requested a review from a team March 21, 2023 10:03
@tbg tbg requested a review from a team as a code owner March 21, 2023 10:03
@tbg tbg force-pushed the TestReplicaTombstone branch from a798f3d to d809c2c Compare March 21, 2023 12:10
Copy link
Copy Markdown

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks ok, but it would have been simpler to only handle two types of snapshots "initial" or "catchup" (as a bool).

For initial snapshots, the constraint is used, but for catchup, it is not. This may have become a little more flakey due to delegated snapshots since they hold this constraint longer than before, but hopefully, these aren't "new" flakes".

This PR makes it seem like there are "various types" of snapshots that might be sent when really there are only two, and we will never add more.

I am still approving this, but consider changing it to a bool and passing that through based on whether it is initial or not. I think that will make this easier to follow and understand why this is done.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvserver/raft_log_queue.go line 272 at r1 (raw file):

	const anyRecipientStore roachpb.StoreID = 0
	const anySnapType kvserverpb.SnapshotRequest_Type = -1

nit: I was hoping to remove this type/field completely "soon". I will likely do this during stability period, but only put it on master. The field doesn't need to be transmitted over the wire as it is only used for logging on the receiver side, so it will be easy enough to replace this with a non-proto bool. This logic looks correct though and it might be cleaner when there are really only 2 types (initial, catchup)


pkg/kv/kvserver/replica_raft.go line 1826 at r1 (raw file):

// recipient or 0 if there isn't one. Passing 0 for recipientStore means any
// recipient.
func (r *Replica) getSnapshotLogTruncationConstraints(

nit: Consider just inlining this method into hasOutstandingSnapshotInFlightToStore as this is the only usage of it.


pkg/kv/kvserver/replica_raft.go line 1854 at r1 (raw file):

}

// errOnOutstandingLearnerSnapshotInflight returns true if there is a snapshot in

nit: Fix comment to say this now returns an error vs true/false.

@tbg tbg force-pushed the TestReplicaTombstone branch from d809c2c to 6932508 Compare March 22, 2023 12:11
Copy link
Copy Markdown
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist)

Copy link
Copy Markdown

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Looking at this again I think it might be worse now since it mixes the bool and the Type. Can you also change addSnapshotLogTruncationConstraint and snapTruncationInfo to use initialOnly bool. Then you won't need to use the SnapshotRequest_Type anywhere in this code.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@tbg tbg force-pushed the TestReplicaTombstone branch from 6932508 to f2bcc50 Compare March 22, 2023 13:48
Like many other tests, this test could flake because we'd sometimes
catch a "cannot remove learner while snapshot is in flight" error.

I think the root cause is that sometimes there are errant Raft snapshots
in the system[^1] and these get mistaken for LEARNERs that are still
being caught up by the replicate queue. I tried to address this general
class of issues by making the check for in-flight learner snapshots not
care about *raft* snapshots.

I was able to stress TestReplicaTombstone for 30+ minutes without a
failure using that approach, whereas previously it usually failed within
a few minutes.

```
./dev test --stress pkg/kv/kvserver/ --filter TestReplicaTombstone 2>&1 | tee stress.log
[...]
2461 runs so far, 0 failures, over 35m45s
```

[^1]: cockroachdb#87553

Fixes cockroachdb#98883.

Epic: none
Release note: None
@tbg tbg force-pushed the TestReplicaTombstone branch from f2bcc50 to 6444df5 Compare March 22, 2023 13:49
@tbg
Copy link
Copy Markdown
Member Author

tbg commented Mar 22, 2023

Sorry, I meant to have removed SnapType usage entirely. Fixed up now.

Copy link
Copy Markdown

@andrewbaptist andrewbaptist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Thanks for the change!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@tbg
Copy link
Copy Markdown
Member Author

tbg commented Mar 22, 2023

bors r=andrewbaptist
TFTR!

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 22, 2023

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 22, 2023

Build succeeded:

@craig craig bot merged commit c2460f1 into cockroachdb:master Mar 22, 2023
@knz
Copy link
Copy Markdown
Contributor

knz commented Apr 2, 2023

I think this should be backported?

@tbg
Copy link
Copy Markdown
Member Author

tbg commented Apr 25, 2023

blathers backport 23.1

@tbg
Copy link
Copy Markdown
Member Author

tbg commented Apr 25, 2023

blathers backport 22.2

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 25, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 6444df5 to blathers/backport-release-22.2-99099: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@tbg
Copy link
Copy Markdown
Member Author

tbg commented Apr 25, 2023

Encountered an error creating backports. Some common things that can go wrong:

They seem fixable, but annoying to fix. v22.2 and before also tracked the index and there was various code movement related to, I think, delegated snaps - I'd have to sort of re-do the PR, I think, as I'd be anxious to slip in a bug otherwise. I think it would be better to live with the flakes which are getting more infrequent as these releases age. It's not ideal, though. @erikgrinaker wdyt?

@erikgrinaker
Copy link
Copy Markdown
Contributor

They're pretty widespread, but I guess it's ok. Especially as we're taking unit test failures into the test triage rotation, presumably people will get pretty used to closing them out.

@tbg
Copy link
Copy Markdown
Member Author

tbg commented Apr 27, 2023

My comment predates our discussion in standup (I think) yesterday, I'm planning on giving the backport another push (22.2 only).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kvserver: TestReplicaTombstone test failure

5 participants