roachtest: run consistency checks after tests #116218
craig[bot] merged 1 commit into cockroachdb:master
Conversation
@renatolabs Submitting an initial draft to get test-eng's thoughts on this. We'd like to run actual consistency checks across all roachtests, to detect data corruption bugs like #114421. We currently only do cheap stats checks, which are insufficient for this purpose. A final PR would likely include:
Wdyt?
Ah right, there is more. Disregard my LGTM-in-principle until other questions are addressed.
Alternatively, export just the checkpoints (which are "narrow" now, and take O(range size) space). Concern: if many roachtests fail the consistency check simultaneously, exporting all the corresponding stores can be expensive. Or even a single test, but with particularly large stores.
Another option is to keep the cluster(s) running up to some reasonable upper bound; e.g., at most 3 clusters are retained upon failing the consistency check.
What's the actual average throughput relative to
Why is this useful when most of the runs are unattended? Is this to help with debugging the consistency checker itself?
Why are we not choosing to run these more aggressively during other non-benchmark workloads?
Yeah, we should take narrow checkpoints. We'll have to add support for that to
I think we want to keep narrow checkpoints in the artifacts in any case, for posterity and because we often may not get around to investigating these for a week or two. Keeping the clusters around would be helpful to retain the entire store, which could help with debugging beyond the checkpoints. How would this work? Do we have any prior art for keeping failed clusters around?
I'll do a couple of manual runs to get some rough figures. Should be close to the disk's read bandwidth in the common case, unless we saturate CPU (this is single-threaded).
For attended tests. If I'm running a random roachtest manually, I'd like to know why it's "hanging" for 10 minutes at the end. It would also give us some idea about how much progress it's making for unattended runs, if we need visibility into it.
It may introduce a lot of flake, and the test/workload may trample the evidence before we get around to checkpointing it. We also already have post-test assertion infrastructure in place (including opt-out settings), while we'd have to build additional infrastructure for this otherwise. The consistency queue does run in all roachtests as usual though, but it's very lax (targets a scan every 24 hours at 8 MB/s). We could consider running regular background consistency checks more aggressively. I'm inclined to start with the straightforward option of post-test assertions, and we can consider expanding it later. That should get us 90% of the way there, with very little effort.
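For a sense of scale on how lax the default background queue is, here is a quick back-of-the-envelope sketch. The 8 MB/s figure comes from the comment above; the helper name and the 92 GB store size (taken from the manual run reported further down) are illustrative:

```go
package main

import "fmt"

// laxScanSeconds estimates how long the background consistency queue
// would take to scan a store of the given size (in decimal MB) at the
// given rate (MB/s).
func laxScanSeconds(storeMB, rateMBps float64) float64 {
	return storeMB / rateMBps
}

func main() {
	// A 92 GB store at the lax default of 8 MB/s.
	secs := laxScanSeconds(92_000, 8)
	fmt.Printf("lax queue: %.0f s (~%.1f h) to scan 92 GB\n", secs, secs/3600)
}
```

At 8 MB/s, even a modest 92 GB store takes over three hours to scan, so a single pass per 24 hours gives little chance of catching corruption before a test cluster is torn down.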
On second thought, I think we want to be able to do this via a SQL builtin, so that we take the snapshots at the same Raft index -- otherwise, we're going to get spurious diffs if there are any writes to the range.
Force-pushed 03f13a0 to 4bbe75c.
Checked 92 GB across 180 ranges in 7m14s. That works out to 212 MB/s.
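As a sanity check on that arithmetic (the helper is hypothetical; decimal units assumed):

```go
package main

import "fmt"

// throughputMBps computes average throughput in MB/s given a data size
// in decimal GB and a duration in seconds.
func throughputMBps(gb, seconds float64) float64 {
	return gb * 1000 / seconds
}

func main() {
	// 92 GB in 7m14s (434 s), as reported above.
	fmt.Printf("%.0f MB/s\n", throughputMBps(92, 7*60+14)) // prints "212 MB/s"
}
```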
Force-pushed 4bbe75c to 6b3302c.
This should be ready for review now. I'll add checkpointing separately, since it requires changes to the
Force-pushed ac6e115 to 8262b68.
renatolabs left a comment:
- Thanks for doing this! I agree that this is a good first step, and we can improve the debugging experience and scope of our consistency checks in follow-up work.
- Currently, our nightly builds on master have a ~20h duration. I'm wondering how much of an impact this will have on the overall build time; it could easily start taking over 24h if a non-negligible percentage of tests start running 20 mins longer. Thoughts on running consistency checks 50% (or so) of the time, and running the previous stats-only check otherwise?
Depends on how large the datasets are, I suppose. I don't expect this to add significant runtime for most roachtests, which have a small dataset. We could also bound the check time to 10 minutes instead -- I think I prefer that to a 50% coin toss, since we get better coverage across all tests, and we'll still be able to check up to 100 GB of data for large tests. Another option would be to disable this assertion for some very large roachtests.
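To make the trade-off concrete, assuming the ~212 MB/s measured in the manual run above (the helper name is mine):

```go
package main

import "fmt"

// coverageGB estimates how much data (decimal GB) a full-speed
// consistency check can cover within a given time budget at the given
// rate (MB/s).
func coverageGB(rateMBps, minutes float64) float64 {
	return rateMBps * minutes * 60 / 1000
}

func main() {
	for _, m := range []float64{10, 20} {
		fmt.Printf("%2.0f min budget: ~%.0f GB\n", m, coverageGB(212, m))
	}
}
```

A 10-minute budget covers roughly 127 GB at that rate, and 20 minutes roughly 254 GB, so the budgets discussed here are in the right ballpark for the store sizes involved.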
Yes, that's also my intuition, although it's hard to predict. I think it's fine to merge as-is, observe the impact on the next nightly build, and adapt accordingly if necessary.
I am confused...

We set the rate to 1 GB/s here, basically unbounded.
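For context, the rate in question is governed by a CockroachDB cluster setting (`server.consistency_check.max_rate`). A sketch of how a harness might build the statement to raise it; the helper and the exact value are illustrative, not the actual roachtest code:

```go
package main

import "fmt"

// consistencyCheckSetting builds the SQL statement a test harness could
// run to raise the consistency checker's rate limit. The setting name is
// a CockroachDB cluster setting; the value passed in is illustrative.
func consistencyCheckSetting(maxRate string) string {
	return fmt.Sprintf(
		"SET CLUSTER SETTING server.consistency_check.max_rate = '%s'", maxRate)
}

func main() {
	// "1GB" makes the check effectively unbounded, as described above.
	fmt.Println(consistencyCheckSetting("1GB"))
}
```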
Force-pushed 8262b68 to ab93b38.
Force-pushed e54737a to c45a875.
I think this should be good to go now, @renatolabs.
This patch runs replica consistency checks during post-test assertions. Previously, this only checked for stats inconsistencies; it now checks for actual data inconsistencies. This is useful to detect data corruption bugs, e.g. in Pebble or Raft snapshots. The consistency check runs at full speed for up to 20 minutes before timing out, and will report any inconsistencies found during a partial run. This is sufficient to check about 200 GB of data. Tests can opt out by skipping `registry.PostValidationReplicaDivergence`. Storage checkpoints are not yet taken when inconsistencies are detected, since it requires additional support in SQL builtins to take them at the same Raft log index across nodes. This will be addressed separately. Epic: none Release note: None
Force-pushed c45a875 to e675216.
TFTR! Let's see how the nightlies go.

bors r+
Build succeeded:

blathers backport 23.2
Encountered an error creating backports. Some common things that can go wrong:

You might need to create your backport manually using the backport tool.

error creating merge commit from e675216 to blathers/backport-release-23.2-116218: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

You may need to manually resolve merge conflicts with the backport tool. Backport to branch 23.2 failed. See errors above.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
This patch runs replica consistency checks during post-test assertions. Previously, this only checked for stats inconsistencies; it now checks for actual data inconsistencies. This is useful to detect data corruption bugs, e.g. in Pebble or Raft snapshots.

The consistency check runs at full speed for up to 20 minutes before timing out, and will report any inconsistencies found during a partial run. This is sufficient to check about 200 GB of data. Tests can opt out by skipping `registry.PostValidationReplicaDivergence`.

Storage checkpoints are not yet taken when inconsistencies are detected, since it requires additional support in SQL builtins to take them at the same Raft log index across nodes. This will be addressed separately.
Touches #115770.
Epic: none
Release note: None