kvprober: implement "shadow write" probes #67112

@joshimhoff

Is your feature request related to a problem? Please describe.
We have a kvprober that sends point read requests to "random" ranges. We should extend that prober to test the availability of a range at the write level. We can call this a "shadow write".

Describe the solution you'd like
Strawman proposal:

  1. Implement a raft command called Probe / ShadowWrite and make it available via the kvclient public API.
  2. The MVP implementation of the command does nothing.
  3. Extend kvprober to make Probe / ShadowWrite requests to "random" ranges.

This tests kv reasonably well. The Probe / ShadowWrite command needs to get proposed, agreed upon, applied, etc. (Am I using these words correctly?) A write to the raft log will happen, so availability of the disk is checked.

The test of pebble is minimal, as no actual write happens at Probe command apply time. Note though that we could change this in future CRDB versions. One can imagine writing to pebble in a way that doesn't lead to user-visible side effects, in order to make the probe more realistic (that is, to match the actual CRDB write codepath more closely).
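
To make the strawman concrete, here is a minimal Go sketch of step 3 only. Everything in it is hypothetical: `ProbeSender`, `SendProbe`, `fakeSender`, `runProbes`, and `probeTimeout` are illustrative names, not existing kvclient or kvprober APIs, and the fake sender merely simulates an unavailable range by blocking until the probe's deadline expires.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// RangeID is a hypothetical stand-in for the real range identifier type.
type RangeID int64

// ProbeSender is a hypothetical interface for whatever the kvclient API
// would expose. SendProbe should return nil once the no-op command has
// been proposed and applied, or an error if ctx expires first.
type ProbeSender interface {
	SendProbe(ctx context.Context, id RangeID) error
}

// fakeSender simulates a cluster: probes to ranges marked unavailable
// block until the context deadline, all other probes succeed at once.
type fakeSender struct {
	unavailable map[RangeID]bool
}

func (f fakeSender) SendProbe(ctx context.Context, id RangeID) error {
	if f.unavailable[id] {
		<-ctx.Done()
		return ctx.Err()
	}
	return nil
}

// Result counts probe attempts and failures.
type Result struct {
	Attempts int
	Failures int
}

// probeTimeout is an arbitrary value chosen for this sketch.
const probeTimeout = 10 * time.Millisecond

// runProbes sends n probes, each to a randomly chosen range, and
// counts how many fail or time out.
func runProbes(s ProbeSender, ranges []RangeID, n int) Result {
	var res Result
	for i := 0; i < n; i++ {
		id := ranges[rand.Intn(len(ranges))]
		ctx, cancel := context.WithTimeout(context.Background(), probeTimeout)
		err := s.SendProbe(ctx, id)
		cancel()
		res.Attempts++
		if err != nil {
			res.Failures++
		}
	}
	return res
}

func main() {
	s := fakeSender{unavailable: map[RangeID]bool{2: true}}
	res := runProbes(s, []RangeID{1, 2, 3}, 20)
	fmt.Printf("attempts=%d failures=%d\n", res.Attempts, res.Failures)
}
```

The per-probe timeout matters in this sketch because a probe to an unavailable range would presumably hang rather than fail fast; treating a timeout as a failure is an assumption here, not something settled by the proposal.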

CC @tbg @andreimatei @knz @bdarnell @jreut @logston for review of the strawman proposal. I hope for a naming bikeshed.

Also, KV folks: How hard of a time do you think I will have implementing this? It's hard for me to scope the part where we add the Probe / ShadowWrite command. My sense from talking with Ben a while back is that it's not technically hard, but it involves lots of boilerplate, and since a new command hasn't been added in a while, it may be tricky to figure out all the places to make changes.

Describe alternatives you've considered

  • We should also implement the stuck applied index + failing probe alert, which has faster mean time to detect, so long as the symptom experienced is a stuck applied index. Can't find a link to an issue for that but it has been discussed.
  • We should consider other similar approaches to above, where some internal detail that is suspect (a stuck applied index) leads us to probe a specific range, leading to faster mean time to detect.

These aren't really alternatives tho. Blackbox approaches like this one are complemented by whitebox approaches.
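
As a rough illustration of how the whitebox signal could be combined with a probe, here is a small Go sketch. All names (`StuckDetector`, `Observe`, `replay`, `shouldAlert`, `defaultThreshold`) are hypothetical and the threshold is arbitrary. It assumes the applied index of a range can be sampled periodically, and it only alerts when the index is stuck and a probe to that range also fails, on the assumption that an unchanged index alone may just mean the range is idle.

```go
package main

import (
	"fmt"
	"time"
)

// RangeID is a hypothetical stand-in for the real range identifier type.
type RangeID int64

// observation remembers the last applied index seen for a range and
// the time at which that index was first observed.
type observation struct {
	index     uint64
	changedAt time.Time
}

// StuckDetector flags ranges whose applied index has not advanced
// for at least threshold.
type StuckDetector struct {
	threshold time.Duration
	last      map[RangeID]observation
}

// defaultThreshold is an arbitrary value chosen for this sketch.
const defaultThreshold = 2 * time.Second

func NewStuckDetector(threshold time.Duration) *StuckDetector {
	return &StuckDetector{threshold: threshold, last: make(map[RangeID]observation)}
}

// Observe records a sample and reports whether the applied index has
// been unchanged for at least the threshold.
func (d *StuckDetector) Observe(id RangeID, index uint64, now time.Time) bool {
	prev, ok := d.last[id]
	if !ok || index != prev.index {
		d.last[id] = observation{index: index, changedAt: now}
		return false
	}
	return now.Sub(prev.changedAt) >= d.threshold
}

// replay feeds one sample per second for a single range and returns
// the verdict after each sample. It exists only to drive the demo.
func replay(d *StuckDetector, id RangeID, indexes []uint64) []bool {
	start := time.Unix(0, 0)
	out := make([]bool, 0, len(indexes))
	for i, idx := range indexes {
		now := start.Add(time.Duration(i) * time.Second)
		out = append(out, d.Observe(id, idx, now))
	}
	return out
}

// shouldAlert fires only when both signals agree.
func shouldAlert(stuck, probeFailed bool) bool {
	return stuck && probeFailed
}

func main() {
	d := NewStuckDetector(defaultThreshold)
	verdicts := replay(d, 1, []uint64{10, 11, 11, 11, 11})
	fmt.Println(verdicts)
	fmt.Println(shouldAlert(verdicts[len(verdicts)-1], true))
}
```

In this sketch a stuck verdict would be the trigger for sending a targeted probe to that specific range, rather than waiting for the random prober to land on it.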

Additional context
#61074
https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvprober/kvprober.go

Epic CC-4054

Labels

  • A-kv-observability
  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
  • O-sre: For issues SRE opened or otherwise cares about tracking.
  • T-kv: KV Team
