roachtest: add operation to probe ranges#144781
roachtest: add operation to probe ranges#144781craig[bot] merged 1 commit intocockroachdb:masterfrom
Conversation
4264428 to
762fdf9
Compare
762fdf9 to
a18afdd
Compare
DarrylWong
left a comment
There was a problem hiding this comment.
LGTM, although I'd get a stamp from someone in DRP or SRE before merging
a18afdd to
7b239d6
Compare
45a5285 to
5007b9e
Compare
|
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
5007b9e to
b94bab7
Compare
| func runProbeRanges( | ||
| ctx context.Context, o operation.Operation, c cluster.Cluster, writeProbe bool, | ||
| ) registry.OperationCleanup { | ||
| rng, _ := randutil.NewPseudoRand() |
There was a problem hiding this comment.
an aside: it doesn't seem ideal for operations to be managing their own RNG. e.g. if we want to reproduce a specific operation run, it's up to each individual operation to make sure to log the seed which we aren't doing here (or in any of the operations).
Instead, the top level operation runner should be calling randutil.NewPseudoRand() which uses that to generate+log a new seed for each individual operation.
Anyway, outside the scope of this PR, just mentioning it.
b94bab7 to
b9bf366
Compare
54cda2d to
ec5187e
Compare
|
@shailendra-patel Thank you for review, and apologies for not providing enough context in description. I have responded to questions and added some more background info to the PR, as well as making the suggested change to fail if we detect range errors. PTAL and let me know if any general questions still remain on the overall approach. |
Since SRE uses crdb_internal.probe_ranges to test for prod cluster health, we would like to add this as a roach operation to make the DRT cluster as realistic as possible, and test for potential issues with crdb_internal.probe_ranges. Fixes: cockroachdb#102034 Release note: None Epic: None
ec5187e to
14598a5
Compare
|
bors r+ |
|
This PR was included in a batch that was canceled, it will be automatically retried |
Since SRE uses crdb_internal.probe_ranges to test for prod cluster health, we would like to add this as a roach operation
to make the DRT cluster as realistic as possible, and test for potential issues with crdb_internal.probe_ranges, so we know ASAP if our alerting coverage drops.
Background
crdb_internal.probe_ranges is a virtual table that quickly probes the entire keyspace of the KV layer to return a table of schema (range_id | error | end_to_end_latency_ms). It has minimal dependencies, so it functions even when a cluster is quite broken. And since it probes the entire keyspace, it is useful when something has already gone wrong, in narrowing down an issue to specific ranges.
What will this roach operation can catch?
If this roach operation fails, there is either a bug in
crdb_internal.probe_ranges, so SRE is short a critical tool, or there is a serious bug is present in KV layer, and KV team will need to know asap. Ideally SRE would be first to know if there is an issue, and can hand off to KV if necessary.Testing PR
Tested that it works on roachtest cluster with
roachtest run-operation noahthompsoncockroachlabscom-test probe-ranges, and also was able to test that it successfully failed by forcing a range_error in DB, w/Future Work
We would like to also enable KVProber cluster setting to test this from a different angle, this should be a very easy change.
Fixes: #102034
Release note: None
Epic: None