
roachtest: run consistency checks after tests #116218

Merged
craig[bot] merged 1 commit into cockroachdb:master from erikgrinaker:roachtest-consistency-check
Dec 18, 2023

Conversation

@erikgrinaker
Contributor

@erikgrinaker erikgrinaker commented Dec 12, 2023

This patch runs replica consistency checks during post-test assertions. Previously, this only checked for stats inconsistencies; it now checks for actual data inconsistencies as well. This is useful for detecting data corruption bugs, e.g. in Pebble or in Raft snapshot application.

The consistency check runs at full speed for up to 20 minutes before timing out, and will report any inconsistencies found during a partial run. This is sufficient to check about 200 GB of data. Tests can opt out by skipping `registry.PostValidationReplicaDivergence`.

Storage checkpoints are not yet taken when inconsistencies are detected, since it requires additional support in SQL builtins to take them at the same Raft log index across nodes. This will be addressed separately.

Touches #115770.
Epic: none
Release note: None
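The opt-out mechanism described above can be sketched as a bit set on the test spec. The following is a minimal, self-contained Go sketch; the type and field names mirror the roachtest registry pattern but are illustrative assumptions, not the exact API:

```go
package main

import "fmt"

// PostValidation is a bit set of post-test validations, mirroring the
// opt-out pattern described above. These names are illustrative
// assumptions, not the exact roachtest registry API.
type PostValidation int

const (
	PostValidationReplicaDivergence PostValidation = 1 << iota
	PostValidationInvalidDescriptors
)

// TestSpec is a minimal stand-in for a roachtest registration.
type TestSpec struct {
	Name                string
	SkipPostValidations PostValidation
}

// shouldRunDivergenceCheck reports whether the consistency check would
// run for a given spec, i.e. whether the test has not opted out.
func shouldRunDivergenceCheck(s TestSpec) bool {
	return s.SkipPostValidations&PostValidationReplicaDivergence == 0
}

func main() {
	optedOut := TestSpec{Name: "huge-import", SkipPostValidations: PostValidationReplicaDivergence}
	normal := TestSpec{Name: "kv0"}
	fmt.Println(shouldRunDivergenceCheck(optedOut)) // false
	fmt.Println(shouldRunDivergenceCheck(normal))   // true
}
```

A bitmask keeps the registration terse: a test that wants to skip several validations ORs the flags together in one field.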

@cockroach-teamcity
Member

This change is Reviewable

@erikgrinaker
Contributor Author

erikgrinaker commented Dec 12, 2023

@renatolabs Submitting an initial draft to get test-eng's thoughts on this. We'd like to run actual consistency checks across all roachtests, to detect data corruption bugs like #114421. We currently only do cheap stats checks, which are insufficient for this purpose.

A final PR would likely include:

  • Logic to export corrupt stores to test artifacts for debugging.
  • Bounding the consistency check time (20 minutes?), while still reporting inconsistencies best-effort for ranges we did manage to scan.
  • Emit progress info to stdout for long-running checks.
  • Anything else?

Wdyt?

@pav-kv
Collaborator

pav-kv commented Dec 12, 2023

Ah right, there is more. Disregard my LGTM-in-principle until other questions are addressed.

  • Logic to export corrupt stores to test artifacts for debugging.

Alternatively, export just the checkpoints (which are "narrow" now, and take O(range size) space).

Concern: if many roachtests fail with a consistency check simultaneously, exporting all the corresponding stores can be expensive. Or even a single test, but with particularly large stores.

@srosenberg
Member

Concern: if many roachtests fail with a consistency check simultaneously, exporting all the corresponding stores can be expensive. Or even a single test, but with particularly large stores.

Another option is to keep the cluster(s) running up to some reasonable upper bound; e.g., at most 3 clusters are retained upon failing the consistency check.

@srosenberg
Member

  • Bounding the consistency check time (20 minutes?), while still reporting inconsistencies best-effort for ranges we did manage to scan.

What's the actual average throughput relative to server.consistency_check.max_rate? What's the highest we can hit in, say, a 3-node, 8 vCPU cluster? I'm trying to get a sense of how quickly this scan can complete, assuming nothing else is running.

  • Emit progress info to stdout for long-running checks.

Why is this useful when most of the runs are unattended? Is this to help with debugging the consistency checker itself?

  • Anything else?

Why are we not choosing to run these more aggressively during other non-benchmark workloads?

@erikgrinaker
Contributor Author

erikgrinaker commented Dec 14, 2023

Alternatively, export just the checkpoints (which are "narrow" now, and take O(range size) space).

Yeah, we should take narrow checkpoints. We'll have to add support for that to cockroach debug pebble db checkpoint. We can limit the number of inconsistent ranges per test to 5 or something, in case we see widespread inconsistencies. The per-test cost should be fairly low in that case (about a GB).

Another option is to keep the cluster(s) running up to some reasonable upper bound; e.g., at most 3 clusters are retained upon failing the consistency check.

I think we want to keep narrow checkpoints in the artifacts in any case, for posterity and because we may not get around to investigating these for a week or two.

Keeping the clusters around would be helpful to retain the entire store, which could help with debugging beyond the checkpoints. How would this work? Do we have any prior art for keeping failed clusters around?

What's the actual average throughput relative to server.consistency_check.max_rate? What's the highest we can hit in, say, a 3-node, 8 vCPU cluster? I'm trying to get a sense of how quickly this scan can complete, assuming nothing else is running.

I'll do a couple of manual runs to get some rough figures. Should be close to the disk's read bandwidth in the common case, unless we saturate CPU (this is single-threaded).

Emit progress info to stdout for long-running checks.

Why is this useful when most of the runs are unattended? Is this to help with debugging the consistency checker itself?

For attended tests. If I'm running a random roachtest manually, I'd like to know why it's "hanging" for 10 minutes at the end. It would also give us some idea about how much progress it's making for unattended runs, if we need visibility into it.

Why are we not choosing to run these more aggressively during other non-benchmark workloads?

It may introduce a lot of flakiness, and the test/workload may trample the evidence before we get around to checkpointing it. We also already have post-test assertion infrastructure in place (including opt-out settings), while we'd have to build additional infrastructure for this otherwise.

The consistency queue does run in all roachtests as usual though, but it's very lax (targets a scan every 24 hours at 8 MB/s). We could consider running regular background consistency checks more aggressively.
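For a rough sense of how lax that background pacing is, this back-of-the-envelope Go sketch computes how much data an 8 MB/s per-store rate limit can cover in a day (decimal units assumed):

```go
package main

import "fmt"

func main() {
	// Back-of-the-envelope check of the background consistency queue
	// pacing mentioned above: at the default 8 MB/s rate limit, how
	// much data can a full daily scan cover?
	const rateMBps = 8.0
	const secondsPerDay = 24 * 60 * 60
	gbPerDay := rateMBps * secondsPerDay / 1000 // decimal GB
	fmt.Printf("%.0f GB/day\n", gbPerDay)       // 691 GB/day
}
```

So a daily background scan only keeps up for stores well under ~700 GB, which is why an explicit full-speed post-test check adds value.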

I'm inclined to start with the straightforward option of post-test assertions, and we can consider expanding it later. That should get us 90% of the way there, with very little effort.

@erikgrinaker
Contributor Author

Yeah, we should take narrow checkpoints. We'll have to add support for that to cockroach debug pebble db checkpoint.

On second thought, I think we want to be able to do this via a SQL builtin, so that we take the checkpoints at the same Raft index -- otherwise, we're going to get spurious diffs if there are any writes to the range.

@erikgrinaker force-pushed the roachtest-consistency-check branch from 03f13a0 to 4bbe75c (December 14, 2023 12:47)
@erikgrinaker
Contributor Author

What's the actual average throughput relative to server.consistency_check.max_rate?

Checked 92 GB across 180 ranges in 7m14s. That works out to 212 MB/s.

@erikgrinaker force-pushed the roachtest-consistency-check branch from 4bbe75c to 6b3302c (December 14, 2023 13:44)
@erikgrinaker marked this pull request as ready for review (December 14, 2023 13:44)
@erikgrinaker requested a review from a team as a code owner (December 14, 2023 13:44)
@erikgrinaker requested review from DarrylWong and removed the request for a team (December 14, 2023 13:44)
@erikgrinaker
Contributor Author

This should be ready for review now. I'll add checkpointing separately, since it requires changes to the crdb_internal.check_consistency() builtin that we may not want to backport. The current PR will at least detect whether we have a bug to find.

@erikgrinaker force-pushed the roachtest-consistency-check branch 2 times, most recently from ac6e115 to 8262b68 (December 14, 2023 14:59)
@erikgrinaker requested a review from pav-kv (December 14, 2023 15:18)

@renatolabs renatolabs left a comment


Comments:

  • Thanks for doing this! I agree that this is a good first step, and we can improve the debugging experience and scope of our consistency checks in follow-up work.
  • Currently, our nightly builds on master take ~20h. I'm wondering how much of an impact this will have on the overall build time; it could easily exceed 24h if a non-negligible percentage of tests start running 20 minutes longer. Thoughts on running consistency checks 50% (or so) of the time, and running the previous stats-only check otherwise?

@erikgrinaker
Contributor Author

Currently, our nightly builds on master take ~20h. I'm wondering how much of an impact this will have on the overall build time; it could easily exceed 24h if a non-negligible percentage of tests start running 20 minutes longer. Thoughts on running consistency checks 50% (or so) of the time, and running the previous stats-only check otherwise?

Depends on how large the datasets are, I suppose. I don't expect this to add significant runtime for most roachtests, which have a small dataset.

We could also bound the check time to 10 minutes instead -- I think I prefer that to a 50% coin toss, since we get better coverage across all tests, and we'll still be able to check up to 100 GB of data for large tests. Another option would be to disable this assertion for some very large roachtests.

@renatolabs

I don't expect this to add significant runtime for most roachtests

Yes, that's also my intuition, although it's hard to predict. I think it's fine to merge as-is, observe the impact on the next nightly build, and adapt if necessary.

@srosenberg
Member

What's the actual average throughput relative to server.consistency_check.max_rate?

Checked 92 GB across 180 ranges in 7m14s. That works out to 212 MB/s.

I am confused... server.consistency_check.max_rate controls the maximum (scan) throughput per range/replica; it maxes out at 8 MB/s. Is there another knob that controls concurrency of range scans?

@erikgrinaker
Contributor Author

We set the rate to 1 GB/s here, which is effectively unbounded.

@erikgrinaker force-pushed the roachtest-consistency-check branch from 8262b68 to ab93b38 (December 15, 2023 11:13)
@erikgrinaker force-pushed the roachtest-consistency-check branch 2 times, most recently from e54737a to c45a875 (December 18, 2023 09:58)
@erikgrinaker
Contributor Author

I think this should be good to go now, @renatolabs.

This patch runs replica consistency checks during post-test assertions.
Previously, this only checked for stats inconsistencies; it now checks
for actual data inconsistencies as well. This is useful for detecting
data corruption bugs, e.g. in Pebble or in Raft snapshot application.

The consistency check runs at full speed for up to 20 minutes before
timing out, and will report any inconsistencies found during a partial
run. This is sufficient to check about 200 GB of data. Tests can opt out
by skipping `registry.PostValidationReplicaDivergence`.

Storage checkpoints are not yet taken when inconsistencies are detected,
since it requires additional support in SQL builtins to take them at
the same Raft log index across nodes. This will be addressed separately.

Epic: none
Release note: None
@erikgrinaker force-pushed the roachtest-consistency-check branch from c45a875 to e675216 (December 18, 2023 20:00)
@erikgrinaker
Contributor Author

erikgrinaker commented Dec 18, 2023

TFTR! Let's see how the nightlies go.

bors r+

@craig
Contributor

craig bot commented Dec 18, 2023

Build succeeded.

@craig craig bot merged commit fec5213 into cockroachdb:master Dec 18, 2023
@erikgrinaker
Contributor Author

blathers backport 23.2

@blathers-crl

blathers-crl bot commented Jan 10, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from e675216 to blathers/backport-release-23.2-116218: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

