Is your feature request related to a problem? Please describe.
We have a kvprober that sends point read requests to "random" ranges. We should extend that prober to test the availability of a range at the write level. We can call this a "shadow write".
Describe the solution you'd like
Strawman proposal:
- Implement a raft command called Probe / ShadowWrite and make it available via the kvclient public API.
- The MVP implementation of the command does nothing.
- Extend kvprober to make Probe / ShadowWrite requests to "random" ranges.
The test of kv is decent. The Probe / ShadowWrite command needs to get proposed, agreed upon, applied, etc. (Am I using these words correctly?) A write to the raft log will happen, so availability of the disk is checked.
The test of pebble is minimal, as no actual write happens at Probe command apply time. Note though that we could change this in future CRDB versions. One can imagine writing to pebble in a way that doesn't lead to user-visible side effects, in order to improve the realism of the probe (that is, to match the actual CRDB write codepath more closely).
CC @tbg @andreimatei @knz @bdarnell @jreut @logston for review of the strawman proposal. I hope for a naming bikeshed.
Also, KV folks: How hard of a time do you think I will have implementing this? It's hard for me to scope the "add a Probe / ShadowWrite command" part. My sense from talking with Ben a while back is that it's not technically hard, but it involves lots of boilerplate, and a new command hasn't been added in a while, so it may be tricky to figure out all the places to make changes.
Describe alternatives you've considered
- We should also implement the stuck applied index + failing probe alert, which has a faster mean time to detect, so long as the symptom experienced is a stuck applied index. I can't find a link to an issue for that, but it has been discussed.
- We should consider other approaches similar to the above, where some suspect internal detail (such as a stuck applied index) leads us to probe a specific range, giving a faster mean time to detect.
These aren't really alternatives, though. Blackbox approaches like this one are complemented by whitebox approaches.
Additional context
#61074
https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvprober/kvprober.go
Epic CC-4054