
roachtest: run consistency checks after tests #116218

Merged
craig[bot] merged 1 commit into cockroachdb:master from erikgrinaker:roachtest-consistency-check
Dec 18, 2023

Conversation

@erikgrinaker
Contributor

@erikgrinaker erikgrinaker commented Dec 12, 2023

This patch runs replica consistency checks during post-test assertions. Previously, this only checked for stats inconsistencies; it now checks for actual data inconsistencies as well. This is useful for detecting data corruption bugs, e.g. in Pebble or in Raft snapshot application.

The consistency check runs at full speed for up to 20 minutes before timing out, and will report any inconsistencies found during a partial run. This is sufficient to check about 200 GB of data. Tests can opt out by skipping `registry.PostValidationReplicaDivergence`.

Storage checkpoints are not yet taken when inconsistencies are detected, since it requires additional support in SQL builtins to take them at the same Raft log index across nodes. This will be addressed separately.

Touches #115770.
Epic: none
Release note: None
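The opt-out mechanism described above can be sketched as a bit set on the test spec. The following is a minimal, self-contained Go sketch; the type and field names mirror the roachtest registry pattern but are illustrative assumptions, not the exact API:

```go
package main

import "fmt"

// PostValidation is a bit set of post-test validations, mirroring the
// opt-out pattern described above. These names are illustrative
// assumptions, not the exact roachtest registry API.
type PostValidation int

const (
	PostValidationReplicaDivergence PostValidation = 1 << iota
	PostValidationInvalidDescriptors
)

// TestSpec is a minimal stand-in for a roachtest registration.
type TestSpec struct {
	Name                string
	SkipPostValidations PostValidation
}

// shouldRunDivergenceCheck reports whether the consistency check would
// run for a given spec, i.e. whether the test has not opted out.
func shouldRunDivergenceCheck(s TestSpec) bool {
	return s.SkipPostValidations&PostValidationReplicaDivergence == 0
}

func main() {
	optedOut := TestSpec{Name: "huge-import", SkipPostValidations: PostValidationReplicaDivergence}
	normal := TestSpec{Name: "kv0"}
	fmt.Println(shouldRunDivergenceCheck(optedOut)) // false
	fmt.Println(shouldRunDivergenceCheck(normal))   // true
}
```

A bitmask keeps the registration terse: a test that wants to skip several validations ORs the flags together in one field.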

@cockroach-teamcity
Member

This change is Reviewable

@erikgrinaker
Contributor Author

erikgrinaker commented Dec 12, 2023

@renatolabs Submitting an initial draft to get test-eng's thoughts on this. We'd like to run actual consistency checks across all roachtests, to detect data corruption bugs like #114421. We currently only do cheap stats checks, which are insufficient for this purpose.

A final PR would likely include:

  • Logic to export corrupt stores to test artifacts for debugging.
  • Bounding the consistency check time (20 minutes?), while still reporting inconsistencies best-effort for ranges we did manage to scan.
  • Emit progress info to stdout for long-running checks.
  • Anything else?

Wdyt?

@pav-kv
Collaborator

pav-kv commented Dec 12, 2023

Ah right, there is more. Disregard my LGTM-in-principle until other questions are addressed.

  • Logic to export corrupt stores to test artifacts for debugging.

Alternatively, export just the checkpoints (which are "narrow" now, and take O(range size) space).

Concern: if many roachtests fail with a consistency check simultaneously, exporting all the corresponding stores can be expensive. Or even a single test, but with particularly large stores.

@srosenberg
Member

Concern: if many roachtests fail with a consistency check simultaneously, exporting all the corresponding stores can be expensive. Or even a single test, but with particularly large stores.

Another option is to keep the cluster(s) running up to some reasonable upper bound; e.g., at most 3 clusters are retained upon failing the consistency check.

@srosenberg
Member

  • Bounding the consistency check time (20 minutes?), while still reporting inconsistencies best-effort for ranges we did manage to scan.

What's the actual average throughput relative to server.consistency_check.max_rate? What's the highest we can hit in, say, a 3-node, 8 vCPU cluster? I'm trying to get a sense of how quickly this scan can complete, assuming nothing else is running.

  • Emit progress info to stdout for long-running checks.

Why is this useful when most of the runs are unattended? Is this to help with debugging the consistency checker itself?

  • Anything else?

Why are we not choosing to run these more aggressively during other non-benchmark workloads?

@erikgrinaker
Contributor Author

erikgrinaker commented Dec 14, 2023

Alternatively, export just the checkpoints (which are "narrow" now, and take O(range size) space).

Yeah, we should take narrow checkpoints. We'll have to add support for that to cockroach debug pebble db checkpoint. We can limit the number of inconsistent ranges per test to 5 or something, in case we see widespread inconsistencies. The per-test cost should be fairly low in that case (about a GB).

Another option is to keep the cluster(s) running up to some reasonable upper bound; e.g., at most 3 clusters are retained upon failing the consistency check.

I think we want to keep narrow checkpoints in the artifacts in any case, for posterity and because we may not get around to investigating these for a week or two.

Keeping the clusters around would be helpful to retain the entire store, which could help with debugging beyond the checkpoints. How would this work? Do we have any prior art for keeping failed clusters around?

What's the actual average throughput relative to server.consistency_check.max_rate? What's the highest we can hit in, say, a 3-node, 8 vCPU cluster? I'm trying to get a sense of how quickly this scan can complete, assuming nothing else is running.

I'll do a couple of manual runs to get some rough figures. Should be close to the disk's read bandwidth in the common case, unless we saturate CPU (this is single-threaded).

Emit progress info to stdout for long-running checks.

Why is this useful when most of the runs are unattended? Is this to help with debugging the consistency checker itself?

For attended tests. If I'm running a random roachtest manually, I'd like to know why it's "hanging" for 10 minutes at the end. It would also give us some idea about how much progress it's making for unattended runs, if we need visibility into it.

Why are we not choosing to run these more aggressively during other non-benchmark workloads?

It may introduce a lot of flakiness, and the test/workload may trample the evidence before we get around to checkpointing it. We also already have post-test assertion infrastructure in place (including opt-out settings), while we'd have to build additional infrastructure for this otherwise.

The consistency queue does run in all roachtests as usual though, but it's very lax (targets a scan every 24 hours at 8 MB/s). We could consider running regular background consistency checks more aggressively.
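For a rough sense of how lax that background pacing is, this back-of-the-envelope Go sketch computes how much data an 8 MB/s per-store rate limit can cover in a day (decimal units assumed):

```go
package main

import "fmt"

func main() {
	// Back-of-the-envelope check of the background consistency queue
	// pacing mentioned above: at the default 8 MB/s rate limit, how
	// much data can a full daily scan cover?
	const rateMBps = 8.0
	const secondsPerDay = 24 * 60 * 60
	gbPerDay := rateMBps * secondsPerDay / 1000 // decimal GB
	fmt.Printf("%.0f GB/day\n", gbPerDay)       // 691 GB/day
}
```

So a daily background scan only keeps up for stores well under ~700 GB, which is why an explicit full-speed post-test check adds value.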

I'm inclined to start with the straightforward option of post-test assertions, and we can consider expanding it later. That should get us 90% of the way there, with very little effort.

@erikgrinaker
Contributor Author

Yeah, we should take narrow checkpoints. We'll have to add support for that to cockroach debug pebble db checkpoint.

On second thought, I think we want to be able to do this via a SQL builtin, so that we take the checkpoints at the same Raft index -- otherwise, we're going to get spurious diffs if there are any writes to the range.

@erikgrinaker force-pushed the roachtest-consistency-check branch from 03f13a0 to 4bbe75c (December 14, 2023 12:47)
@erikgrinaker
Contributor Author

What's the actual average throughput relative to server.consistency_check.max_rate?

Checked 92 GB across 180 ranges in 7m14s. That works out to 212 MB/s.

@erikgrinaker force-pushed the roachtest-consistency-check branch from 4bbe75c to 6b3302c (December 14, 2023 13:44)
@erikgrinaker marked this pull request as ready for review (December 14, 2023 13:44)
@erikgrinaker requested a review from a team as a code owner (December 14, 2023 13:44)
@erikgrinaker requested review from DarrylWong and removed the request for a team (December 14, 2023 13:44)
@erikgrinaker
Contributor Author

This should be ready for review now. I'll add checkpointing separately, since it requires changes to the crdb_internal.check_consistency() builtin that we may not want to backport. The current PR will at least detect whether we have a bug to find.

@erikgrinaker force-pushed the roachtest-consistency-check branch 2 times, most recently from ac6e115 to 8262b68 (December 14, 2023 14:59)
@erikgrinaker requested a review from pav-kv (December 14, 2023 15:18)

@renatolabs renatolabs left a comment


Comments:

  • Thanks for doing this! I agree that this is a good first step, and we can improve the debugging experience and scope of our consistency checks in follow-up work.
  • Currently, our nightly builds on master take ~20h. I'm wondering how much of an impact this will have on the overall build time; it could easily exceed 24h if a non-negligible percentage of tests start running 20 minutes longer. Thoughts on running consistency checks 50% (or so) of the time, and running the previous stats-only check otherwise?

@erikgrinaker
Contributor Author

Currently, our nightly builds on master take ~20h. I'm wondering how much of an impact this will have on the overall build time; it could easily exceed 24h if a non-negligible percentage of tests start running 20 minutes longer. Thoughts on running consistency checks 50% (or so) of the time, and running the previous stats-only check otherwise?

Depends on how large the datasets are, I suppose. I don't expect this to add significant runtime for most roachtests, which have a small dataset.

We could also bound the check time to 10 minutes instead -- I think I prefer that to a 50% coin toss, since we get better coverage across all tests, and we'll still be able to check up to 100 GB of data for large tests. Another option would be to disable this assertion for some very large roachtests.

@renatolabs

I don't expect this to add significant runtime for most roachtests

Yes, that's also my intuition, although it's hard to predict. I think it's fine to merge as-is, observe the impact on the next nightly build, and adapt if necessary.

@srosenberg
Member

What's the actual average throughput relative to server.consistency_check.max_rate?

Checked 92 GB across 180 ranges in 7m14s. That works out to 212 MB/s.

I am confused... server.consistency_check.max_rate controls the maximum (scan) throughput per range/replica; it maxes out at 8 MB/s. Is there another knob that controls concurrency of range scans?

@erikgrinaker
Contributor Author

We set the rate to 1 GB/s here, which is effectively unbounded.

@erikgrinaker force-pushed the roachtest-consistency-check branch from 8262b68 to ab93b38 (December 15, 2023 11:13)
@erikgrinaker force-pushed the roachtest-consistency-check branch 2 times, most recently from e54737a to c45a875 (December 18, 2023 09:58)
@erikgrinaker
Contributor Author

I think this should be good to go now, @renatolabs.

This patch runs replica consistency checks during post-test assertions.
Previously, this only checked for stats inconsistencies; it now checks
for actual data inconsistencies as well. This is useful for detecting
data corruption bugs, e.g. in Pebble or in Raft snapshot application.

The consistency check runs at full speed for up to 20 minutes before
timing out, and will report any inconsistencies found during a partial
run. This is sufficient to check about 200 GB of data. Tests can opt out
by skipping `registry.PostValidationReplicaDivergence`.

Storage checkpoints are not yet taken when inconsistencies are detected,
since it requires additional support in SQL builtins to take them at
the same Raft log index across nodes. This will be addressed separately.

Epic: none
Release note: None
@erikgrinaker force-pushed the roachtest-consistency-check branch from c45a875 to e675216 (December 18, 2023 20:00)
@erikgrinaker
Contributor Author

erikgrinaker commented Dec 18, 2023

TFTR! Let's see how the nightlies go.

bors r+

@craig
Contributor

craig bot commented Dec 18, 2023

Build succeeded.

@craig craig bot merged commit fec5213 into cockroachdb:master Dec 18, 2023
@erikgrinaker
Contributor Author

blathers backport 23.2

@blathers-crl

blathers-crl bot commented Jan 10, 2024

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from e675216 to blathers/backport-release-23.2-116218: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 23.2 failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

