roachtest: add operation to probe ranges #144781

Merged

craig[bot] merged 1 commit into cockroachdb:master from noahstho:noahstho/roach-operation-probe-ranges

May 6, 2025

Contributor

noahstho commented Apr 21, 2025 (edited)

SRE uses crdb_internal.probe_ranges to check production cluster health, so we would like to add it as a roach operation.
This makes the DRT cluster as realistic as possible and tests for potential issues with crdb_internal.probe_ranges itself, so we know ASAP if our alerting coverage drops.

Background
crdb_internal.probe_ranges is a virtual table that quickly probes the entire keyspace of the KV layer and returns rows with the schema (range_id | error | end_to_end_latency_ms). It has minimal dependencies, so it functions even when a cluster is quite broken. Because it probes the entire keyspace, it is also useful once something has already gone wrong, for narrowing an issue down to specific ranges.

What can this roach operation catch?
If this roach operation fails, either there is a bug in crdb_internal.probe_ranges, which leaves SRE short a critical tool, or there is a serious bug in the KV layer that the KV team needs to know about ASAP. Ideally SRE would be the first to know about an issue and can hand off to KV if necessary.

Testing PR
Tested that it works on a roachtest cluster with
roachtest run-operation noahthompsoncockroachlabscom-test probe-ranges, and also verified that it fails as expected when a range_error is forced in the DB, w/


Running operation probe-ranges on noahthompsoncockroachlabscom-test2.

2025/04/29 20:00:46 run_operation.go:145: [1] operation status: checking if operation probe-ranges/read dependencies are met
2025/04/29 20:00:47 run_operation.go:145: [1] operation status: running operation probe-ranges/read with run id 12821170976295052991
2025/04/29 20:00:47 probe_ranges.go:92: [1] operation status: executing crdb_internal.probe-ranges read statement against node 3
2025/04/29 20:00:47 probe_ranges.go:92: [1] operation status: found 1 errors while executing crdb_internal.probe-ranges read statement against node 3
2025/04/29 20:00:47 probe_ranges.go:92: [1] operation status: error on node 3 on range 4: test range error
2025/04/29 20:00:47 operation_impl.go:138: [1] operation failure #1: Found range errors when probing via crdb_internal.probe-ranges read statement against node 3
2025/04/29 20:00:47 run_operation.go:229: recovered from panic: o.Fatal() was called
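The failure path in the log above (a summary line, one line per failing range, then a fatal) can be sketched roughly as follows. This is an illustration only, not the PR's implementation: `rangeError`, `statusLines`, and `shouldFail` are hypothetical names, and the message formats are copied from the log lines.

```go
package main

import "fmt"

// rangeError is a hypothetical pairing of a range ID with its probe error.
type rangeError struct {
	RangeID int64
	Msg     string
}

// statusLines mirrors the shape of the status messages in the log above:
// one summary line, then one line per failing range.
func statusLines(node int, probeType string, errs []rangeError) []string {
	lines := []string{fmt.Sprintf(
		"found %d errors while executing crdb_internal.probe-ranges %s statement against node %d",
		len(errs), probeType, node)}
	for _, e := range errs {
		lines = append(lines,
			fmt.Sprintf("error on node %d on range %d: %s", node, e.RangeID, e.Msg))
	}
	return lines
}

// shouldFail reports whether the operation should be marked as failed:
// any range error at all is treated as a failure.
func shouldFail(errs []rangeError) bool {
	return len(errs) > 0
}

func main() {
	errs := []rangeError{{RangeID: 4, Msg: "test range error"}}
	for _, l := range statusLines(3, "read", errs) {
		fmt.Println(l)
	}
	if shouldFail(errs) {
		fmt.Println("operation would be failed here (o.Fatal in the log above)")
	}
}
```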

Future Work
We would also like to enable the KVProber cluster setting to test this from a different angle; this should be a very easy change.

Fixes: #102034
Release note: None
Epic: None

noahstho force-pushed the noahstho/roach-operation-probe-ranges branch from 4264428 to 762fdf9

April 21, 2025 16:49

noahstho requested a review from DarrylWong

April 21, 2025 16:54

DarrylWong reviewed

pkg/cmd/roachtest/registry/owners.go (outdated, resolved)

pkg/cmd/roachtest/operations/probe_ranges.go (outdated, resolved)

pkg/cmd/roachtest/operations/probe_ranges.go (outdated, resolved)

DarrylWong reviewed

pkg/cmd/roachtest/operations/probe_ranges.go (outdated, resolved)

pkg/cmd/roachtest/operations/probe_ranges.go (outdated, resolved)

noahstho force-pushed the noahstho/roach-operation-probe-ranges branch from 762fdf9 to a18afdd

April 24, 2025 17:23

DarrylWong approved these changes

Contributor

DarrylWong left a comment

LGTM, although I'd get a stamp from someone in DRP or SRE before merging

pkg/cmd/roachtest/operations/probe_ranges.go (resolved)

pkg/cmd/roachtest/operations/probe_ranges.go (outdated, resolved)

DarrylWong reviewed

pkg/cmd/roachtest/operations/probe_ranges.go (outdated, resolved)

noahstho force-pushed the noahstho/roach-operation-probe-ranges branch from a18afdd to 7b239d6

April 24, 2025 18:47

noahstho commented

pkg/cmd/roachtest/operations/probe_ranges.go (outdated, resolved)

noahstho force-pushed the noahstho/roach-operation-probe-ranges branch from 45a5285 to 5007b9e

April 24, 2025 19:01

blathers-crl bot commented Apr 24, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

noahstho force-pushed the noahstho/roach-operation-probe-ranges branch from 5007b9e to b94bab7

April 24, 2025 19:09

noahstho marked this pull request as ready for review

April 24, 2025 19:12

noahstho requested review from a team, sambhav-jain-16 and vidit-bhat and removed request for a team

April 24, 2025 19:12

DarrylWong reviewed

pkg/cmd/roachtest/operations/probe_ranges.go

+func runProbeRanges(
+	ctx context.Context, o operation.Operation, c cluster.Cluster, writeProbe bool,
+) registry.OperationCleanup {
+	rng, _ := randutil.NewPseudoRand()

Contributor

DarrylWong Apr 24, 2025 (edited)

An aside: it doesn't seem ideal for operations to manage their own RNG. For example, if we want to reproduce a specific operation run, it's up to each individual operation to log the seed, which we aren't doing here (or in any of the operations).

Instead, the top-level operation runner should call randutil.NewPseudoRand() and use it to generate and log a new seed for each individual operation.

Anyway, outside the scope of this PR, just mentioning it.

noahstho force-pushed the noahstho/roach-operation-probe-ranges branch from b94bab7 to b9bf366

April 24, 2025 20:57

shailendra-patel requested changes

pkg/cmd/roachtest/operations/probe_ranges.go (outdated, resolved)

pkg/cmd/roachtest/operations/probe_ranges.go (outdated, resolved)

pkg/cmd/roachtest/operations/probe_ranges.go (resolved)

pkg/cmd/roachtest/operations/probe_ranges.go (outdated, resolved)

pkg/cmd/roachtest/operations/probe_ranges.go (resolved)

noahstho force-pushed the noahstho/roach-operation-probe-ranges branch 3 times, most recently from 54cda2d to ec5187e

April 29, 2025 20:58

Contributor Author

noahstho commented Apr 29, 2025

@shailendra-patel Thank you for the review, and apologies for not providing enough context in the description. I have responded to the questions, added more background info to the PR, and made the suggested change to fail if we detect range errors.

PTAL and let me know if any general questions still remain on the overall approach.

noahstho requested a review from shailendra-patel

April 29, 2025 21:02

shailendra-patel reviewed

pkg/cmd/roachtest/operations/probe_ranges.go (outdated, resolved)

shailendra-patel approved these changes

roachtest: add operation to query crdb_internal.probe_ranges

14598a5

Since SRE uses crdb_internal.probe_ranges to test for prod
cluster health, we would like to add this as a roach operation
to make the DRT cluster as realistic as possible, and test for
potential issues with crdb_internal.probe_ranges.

Fixes: cockroachdb#102034
Release note: None
Epic: None

noahstho force-pushed the noahstho/roach-operation-probe-ranges branch from ec5187e to 14598a5

May 6, 2025 17:07

noahstho removed request for sambhav-jain-16 and vidit-bhat

May 6, 2025 17:29

Contributor Author

noahstho commented May 6, 2025

bors r+

Contributor

craig bot commented May 6, 2025

This PR was included in a batch that was canceled, it will be automatically retried

Contributor

craig bot commented May 6, 2025

Build succeeded:

craig bot merged commit a26e4d8 into cockroachdb:master

22 checks passed

celeste-cockroachdb bot added the target-release-25.3.0 label

celeste-cockroachdb bot added v25.3.0-prerelease and removed target-release-25.3.0 labels

Labels

v25.3.0-prerelease