Skip to content

roachtest: add operation to probe ranges#144781

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
noahstho:noahstho/roach-operation-probe-ranges
May 6, 2025
Merged

roachtest: add operation to probe ranges#144781
craig[bot] merged 1 commit intocockroachdb:masterfrom
noahstho:noahstho/roach-operation-probe-ranges

Conversation

@noahstho
Copy link
Copy Markdown
Contributor

@noahstho noahstho commented Apr 21, 2025

Since SRE uses crdb_internal.probe_ranges to test for prod cluster health, we would like to add this as a roach operation
to make the DRT cluster as realistic as possible, and test for potential issues with crdb_internal.probe_ranges, so we know ASAP if our alerting coverage drops.

Background
crdb_internal.probe_ranges is a virtual table that quickly probes the entire keyspace of the KV layer to return a table of schema (range_id | error | end_to_end_latency_ms). It has minimal dependencies, so it functions even when a cluster is quite broken. And since it probes the entire keyspace, it is useful when something has already gone wrong, in narrowing down an issue to specific ranges.

What will this roach operation can catch?
If this roach operation fails, there is either a bug in crdb_internal.probe_ranges, so SRE is short a critical tool, or there is a serious bug is present in KV layer, and KV team will need to know asap. Ideally SRE would be first to know if there is an issue, and can hand off to KV if necessary.

Testing PR
Tested that it works on roachtest cluster with
roachtest run-operation noahthompsoncockroachlabscom-test probe-ranges, and also was able to test that it successfully failed by forcing a range_error in DB, w/


Running operation probe-ranges on noahthompsoncockroachlabscom-test2.

2025/04/29 20:00:46 run_operation.go:145: [1] operation status: checking if operation probe-ranges/read dependencies are met
2025/04/29 20:00:47 run_operation.go:145: [1] operation status: running operation probe-ranges/read with run id 12821170976295052991
2025/04/29 20:00:47 probe_ranges.go:92: [1] operation status: executing crdb_internal.probe-ranges read statement against node 3
2025/04/29 20:00:47 probe_ranges.go:92: [1] operation status: found 1 errors while executing crdb_internal.probe-ranges read statement against node 3
2025/04/29 20:00:47 probe_ranges.go:92: [1] operation status: error on node 3 on range 4: test range error
2025/04/29 20:00:47 operation_impl.go:138: [1] operation failure #1: Found range errors when probing via crdb_internal.probe-ranges read statement against node 3
2025/04/29 20:00:47 run_operation.go:229: recovered from panic: o.Fatal() was called

Future Work
We would like to also enable KVProber cluster setting to test this from a different angle, this should be a very easy change.

Fixes: #102034
Release note: None
Epic: None

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@noahstho noahstho force-pushed the noahstho/roach-operation-probe-ranges branch from 4264428 to 762fdf9 Compare April 21, 2025 16:49
@noahstho noahstho requested a review from DarrylWong April 21, 2025 16:54
@noahstho noahstho force-pushed the noahstho/roach-operation-probe-ranges branch from 762fdf9 to a18afdd Compare April 24, 2025 17:23
Copy link
Copy Markdown
Contributor

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although I'd get a stamp from someone in DRP or SRE before merging

@noahstho noahstho force-pushed the noahstho/roach-operation-probe-ranges branch from a18afdd to 7b239d6 Compare April 24, 2025 18:47
@noahstho noahstho force-pushed the noahstho/roach-operation-probe-ranges branch from 45a5285 to 5007b9e Compare April 24, 2025 19:01
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 24, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@noahstho noahstho force-pushed the noahstho/roach-operation-probe-ranges branch from 5007b9e to b94bab7 Compare April 24, 2025 19:09
@noahstho noahstho marked this pull request as ready for review April 24, 2025 19:12
@noahstho noahstho requested review from a team, sambhav-jain-16 and vidit-bhat and removed request for a team April 24, 2025 19:12
func runProbeRanges(
ctx context.Context, o operation.Operation, c cluster.Cluster, writeProbe bool,
) registry.OperationCleanup {
rng, _ := randutil.NewPseudoRand()
Copy link
Copy Markdown
Contributor

@DarrylWong DarrylWong Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an aside: it doesn't seem ideal for operations to be managing their own RNG. e.g. if we want to reproduce a specific operation run, it's up to each individual operation to make sure to log the seed which we aren't doing here (or in any of the operations).

Instead, the top level operation runner should be calling randutil.NewPseudoRand() which uses that to generate+log a new seed for each individual operation.

Anyway, outside the scope of this PR, just mentioning it.

@noahstho noahstho force-pushed the noahstho/roach-operation-probe-ranges branch from b94bab7 to b9bf366 Compare April 24, 2025 20:57
@noahstho noahstho force-pushed the noahstho/roach-operation-probe-ranges branch 3 times, most recently from 54cda2d to ec5187e Compare April 29, 2025 20:58
@noahstho
Copy link
Copy Markdown
Contributor Author

@shailendra-patel Thank you for review, and apologies for not providing enough context in description. I have responded to questions and added some more background info to the PR, as well as making the suggested change to fail if we detect range errors.

PTAL and let me know if any general questions still remain on the overall approach.

Since SRE uses crdb_internal.probe_ranges to test for prod
cluster health, we would like to add this as a roach operation
to make the DRT cluster as realistic as possible, and test for
potential issues with crdb_internal.probe_ranges.

Fixes: cockroachdb#102034
Release note: None
Epic: None
@noahstho noahstho force-pushed the noahstho/roach-operation-probe-ranges branch from ec5187e to 14598a5 Compare May 6, 2025 17:07
@noahstho
Copy link
Copy Markdown
Contributor Author

noahstho commented May 6, 2025

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 6, 2025

This PR was included in a batch that was canceled, it will be automatically retried

@craig
Copy link
Copy Markdown
Contributor

craig bot commented May 6, 2025

@craig craig bot merged commit a26e4d8 into cockroachdb:master May 6, 2025
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

roachtest: test kvprober including crdb_internal.probe_ranges

4 participants