kvprober: introduce a package for probing the KV layer that does reads only #58124
craig[bot] merged 1 commit into cockroachdb:master
Conversation
jreut
left a comment
I think this PoC is a great start toward the kind of fine-grained metrics we operators want in order to create more actionable alerts. Thanks for writing this up, @joshimhoff!
Reviewed 3 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @knz, and @tbg)
pkg/server/prober/kvprober.go, line 28 at r1 (raw file):
It feels wrong to me for this to live in the SQL pods, as there is one of them per tenant and the code under test is the KV layer. OTOH, do the KV pods have access to the SQL internal executor?
It seems we use the SQL layer only to randomly sample the set of ranges. From what I understand of CRDB architecture, we'd need to scan and sample the well-known meta ranges ourselves if we wanted to omit our dependency on SQL. I think we probably don't need to be in the business of reimplementing our SQL executor :)
In that case, could we simply run a single SQL pod for the sole purpose of this monitoring?
pkg/server/prober/kvprober.go, line 197 at r1 (raw file):
nil, /* txn */
What happens when a range disappears between selecting it and probing it? I guess we can just tune a background failure rate?
force-pushed from eac646e to a245136
tbg
left a comment
This looks pretty good. There are two dimensions:
The first is very straightforwardly that the explicit dependencies can be trimmed down through use of abstractions for the range selection and the probing.
The second is more about the design - is this only ever going to probe KV (I think so)? How should the range selection work (random vs directed) and at which pace (fixed probes/interval or adaptive on cluster size?). I think only the first really needs an answer right now and if the answer is "only kv", then I like ./pkg/kv/kvprober
Reviewed 1 of 3 files at r1, 3 of 3 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @jreut, and @knz)
pkg/server/prober/kvprober.go, line 28 at r1 (raw file):
Previously, jreut (jordan ryan reuter) wrote…
It feels wrong to me for this to live in the SQL pods, as there is one of them per tenant and the code under test is the KV layer. OTOH, do the KV pods have access to the SQL internal executor?
It seems we use the SQL layer only to randomly sample the set of ranges. From what I understand of CRDB architecture, we'd need to scan and sample the well-known meta ranges ourselves if we wanted to omit our dependency on SQL. I think we probably don't need to be in the business of reimplementing our SQL executor :)
In that case, could we simply run a single SQL pod for the sole purpose of this monitoring?
As written, we don't really need a full dependency on the executor here. We can pass in an abstraction. As an overly simplistic straw man, just something like
type Cursor interface{} // for meta2 impl: EndKey of last span seen before, nil initially
func(Cursor) []roachpb.Span
which you call in a loop and then go probe the spans it gives you back. These would then either come from crdb_internal or meta2, but honestly if we're serious about wanting to probe kv here (and not some mish-mash of sql and kv) then the source should be a scan of meta2, just like it's done in various places such as
cockroach/pkg/migration/helper.go
Lines 183 to 219 in 89781b1
pkg/server/prober/kvprober.go, line 47 at r2 (raw file):
}
// TODO(josh): Public to set from test. Better way?
Testing knobs are the standard way for these. But since this is an experimental feature (if this merges any time soon) then it's good to leave it turned off by default so I'm not opposed to a setting.
pkg/server/prober/kvprober.go, line 84 at r2 (raw file):
// heat map. I haven't implemented this; I wonder what folks think about the
// idea. Is location in keyspace the right thing to measure or would location
// in the "rangespace" be better?
I think it will be difficult to make these useful in practice, simply because the resulting number doesn't mean anything as the data is laid out very nonuniformly around the keyspace and various probes failing will smear the result further. In particular, you don't get a heat map from just one value - you need a density function, i.e. at least a discrete set of values (ok, in a sense one value is the coarsest heatmap there is, but I think you'll agree that it doesn't tell you anything except the temperature).
For a real heatmap, you need to keep a reservoir of key spans that you weren't able to probe, but I don't think that at the KV level, a heat map will add much on top of just printing the list of samples. However, as you think about heat maps, make sure to check in with @asubiotto who is thinking about (not sure that it made the cut though) adding a heatmap for KV contention in the context of #57736. Once we have the ability to have any kind of heatmap in the UI, conceivably it will be much easier to have a second one.
pkg/server/prober/kvprober.go, line 138 at r2 (raw file):
// I don't understand the situation exactly. Is this an okay way to avoid
// the problem?
indexIntoMeta2: keys.Meta2Prefix.Next(),
Yes, though I'd crib the approach from
cockroach/pkg/migration/migrations.go
Lines 108 to 115 in 89781b1
pkg/server/prober/kvprober.go, line 142 at r2 (raw file):
}
// TODO(josh): Not sure what the standard way to recover from panics with
If you want to recover, you have to recover() yourself. This is not something we typically do btw.
pkg/server/prober/plan.go, line 42 at r2 (raw file):
// Probing based on these plans will make quiescent ranges active.
queryMeta2
// Make a plan by querying the rangecache. Raphael suggested this but I
I don't think this is worth pursuing for now, unless we think it is valuable to probe based on the cached entries. Probing should never be high enough frequency so that the unquiescing would constitute a problem (and in particular, unquiescing meta ranges shouldn't ever matter as these are expected to be woken up frequently anyway).
pkg/server/prober/plan.go, line 65 at r2 (raw file):
// given the test utilities being depended on.
func (p *KVProber) MakePlan(ctx context.Context) (Plan, error) {
You can put your test in the prober_test package and add a file helpers_test.go which exports the method. So basically you rename this one to makePlanInternal and then in helpers_test.go you add (*KVProber).MakePlan which just calls into makePlanInternal
That said, you should consider using proper abstractions for both the supply of what to probe as well as for the method to invoke as the probe (i.e. the p.db.Scan). That way you can write most tests as true unit tests and throw in one integration test that uses the real things just for good measure, but without trying to exercise edge cases through it.
pkg/server/prober/plan.go, line 89 at r2 (raw file):
return nil, err
}
sk := tree.DBytes(rangeDesc.StartKey)
Revisiting this, I think it makes more sense to just double down on meta2 as a source of ranges to probe (but still introduce the proper abstraction which could in principle support crdb_internal).
pkg/server/prober/plan.go, line 116 at r2 (raw file):
ctx, "kvprober",
nil, /* txn */
sessiondata.InternalExecutorOverride{User: security.NodeUserName()},
// TODO(josh): Probing for range ID 1 leads to following error:
See my other comment about p.indexIntoMeta2.
knz
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @jreut, and @tbg)
pkg/server/prober/plan.go, line 89 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Revisiting this, I think it makes more sense to just double down on meta2 as a source of ranges to probe (but still introduce the proper abstraction which could in principle support crdb_internal).
FWIW, Rohan moved the range scan logic to kvclient in this PR: #58314
I'm not 100% sure this interferes with your code here, but it feels like both facilities should live in the same place in the end.
joshimhoff
left a comment
Thanks for the look. I have NOT addressed all comments yet but they all make good sense to me. In particular I like the cursor suggestion.
is this only ever going to probe KV (I think so)?
Yes I think so.
I think only the first really needs an answer right now and if the answer is "only kv", then I like ./pkg/kv/kvprober
Ack I will move it. I like that package structure also.
I might spend a Friday on a doc / RFC. Working on this POC was really productive (& FUN); I feel much clearer about the details now. There are some design Qs I have in mind that are best discussed in a doc / RFC IMO.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @joshimhoff, @jreut, and @tbg)
pkg/server/prober/kvprober.go, line 197 at r1 (raw file):
Previously, jreut (jordan ryan reuter) wrote…
nil, /* txn */
What happens when a range disappears between selecting it and probing it? I guess we can just tune a background failure rate?
At the end of the day a scan is done from some start to end key. That is always ok to do. It is possible that we plan a scan based on an outdated idea of which range the start and/or end key are contained within. But that's fine; we just intend to cover the whole "rangespace" over time; an occasional scan that goes to the "wrong" range doesn't threaten that goal really, I don't think.
If there is no data between the start & end key, the scan will still return without error; it will just return no data. Perhaps we could alert on this happening more than is expected.
pkg/server/prober/kvprober.go, line 84 at r2 (raw file):
simply because the resulting number doesn't mean anything as the data is laid out very nonuniformly around the keyspace and various probes failing will smear the result further.
True but what if the distribution was over the normalized "rangespace"? Glancing at a view like [1] in grafana, I expect we'd be able to see whether errors are coming from just one range or a few ranges or many ranges. Wouldn't be useful for a probe count, only errors. For a probe count, we'd expect to always see the uniform distribution, given that this is a design goal of the prober.
https://grafana.com/docs/grafana/latest/panels/visualizations/heatmap/
In particular, you don't get a heat map from just one value - you need a density function, i.e. at least a discrete set of values (ok, in a sense one value is the coarsest heatmap there is, but I think you'll agree that it doesn't tell you anything except the temperature).
I'm not sure I'm following here. Ack re: checking in with @asubiotto!
pkg/server/prober/kvprober.go, line 142 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
If you want to recover, you have to recover() yourself. This is not something we typically do btw.
In SQL they recover as much as they can, no? Ack. I'll leave a TODO for now.
pkg/server/prober/plan.go, line 42 at r2 (raw file):
Probing should never be high enough frequency so that the unquiescing would constitute a problem
Let's say we probed once a second. Is that a reasonable frequency in your mind?
in particular, unquiescing meta ranges shouldn't ever matter as these are expected to be woken up frequently anyway
My worry was not about unquiescing meta ranges but about unquiescing other ranges, such as ones that store user data. If we probe once a second, that's 3600 probes per hour. Won't that mean that no ranges will ever stay quiescent?
tbg
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @joshimhoff, @jreut, and @tbg)
pkg/server/prober/kvprober.go, line 197 at r1 (raw file):
Perhaps we could alert on this happening more than is expected.
No, this is pretty much innocuous. If it happens a lot it means the cluster is aggressively redrawing range boundaries. Which is fine! It needs to do that sometimes. No need for the black box prober to get excited.
pkg/server/prober/kvprober.go, line 84 at r2 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
simply because the resulting number doesn't mean anything as the data is laid out very nonuniformly around the keyspace and various probes failing will smear the result further.
True but what if the distribution was over the normalized "rangespace"? Glancing at a view like [1] in grafana, I expect we'd be able to see whether errors are coming from just one range or a few ranges or many ranges. Wouldn't be useful for a probe count, only errors. For a probe count, we'd expect to always see the uniform distribution, given that this is a design goal of the prober.
https://grafana.com/docs/grafana/latest/panels/visualizations/heatmap/
In particular, you don't get a heat map from just one value - you need a density function, i.e. at least a discrete set of values (ok, in a sense one value is the coarsest heatmap there is, but I think you'll agree that it doesn't tell you anything except the temperature).
I'm not sure I'm following here. Ack re: checking in with @asubiotto!
I'm not anti-heatmap (ok maybe a little bit, I need to be convinced it's useful!), I just don't see how we can shoehorn this into metrics. A single metric as you have it here is just a number (that changes over time). A heatmap requires different buckets. I think what we really need is some way to map the ranges into a histogram. We could do this: we make a linear histogram with, say, 1000 buckets. The Nth range maps to the N/1000th bucket, and we record to the histogram with each failure. You are right that we can then visualize this as a heatmap with Grafana out-of-the-box (on the down side, we are entertaining a 1000 bucket metric; maybe 100 buckets will do? But really the number of buckets you want is proportional to the range count, which can go into the tens of thousands). It's a lot of finagling for a metric that may not be that useful - if the heatmap shows a cluster of red, don't you want to know which range that is? You'll have to compute that it's the 17052nd range and then find out what the corresponding rangeID is - not fun.
I think this would have a better life outside of metrics. But I also don't know if it provides the actual value that justifies the work - after we see any prober failures, wouldn't we look at the list right away to spot patterns, which we would if we just sorted it by range start key?
pkg/server/prober/kvprober.go, line 142 at r2 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
In SQL they recover as much as they can, no? Ack. I'll leave a TODO for now.
Yes, on the goroutines that are started by clients, they do try that.
pkg/server/prober/plan.go, line 42 at r2 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
Probing should never be high enough frequency so that the unquiescing would constitute a problem
Let's say we probed once a second. Is that a reasonable frequency in your mind?
in particular, unquiescing meta ranges shouldn't ever matter as these are expected to be woken up frequently anyway
My worry was not about unquiescing meta ranges but unquiescing other ranges such as ones that store user data. If we probe once a second, that's 3600 probes per hour. Won't that mean that no ranges will ever stay quiescient?
Unquiesced ranges typically quiesce again within a few seconds, so no! Most ranges will be quiescent with that probing interval. Besides, pure reads don't even unquiesce the range (though we do want to exercise Raft here ultimately).
We do however want to avoid scanning meta2 every second, but the cursoring approach would basically avoid that by moving slowly through the keyspace.
Another consideration is that what I'm proposing here is a linear scan. This may not be the ideal way to probe; ideally we'd jump around a bit. That "jumping around" could be implemented as opening ~log(#ranges) cursors and pulling from them one by one. I think when we pull from a cursor we want to pull a handful of ranges, so that we can see at a glance whether only one of the ranges failed or whether its neighbors did too (in which case we can reasonably conclude that even more ranges in the vicinity are down; many failure modes will strike whole swathes of neighboring ranges since neighbors typically use the same zone configs).
force-pushed from 30cd2eb to f2479a6
joshimhoff
left a comment
PTAL!
I have made a number of changes:
- I've removed all deps on SQL & moved the code to pkg/kv/kvprober.
- I've moved to a call to kv.Get on the selected range's start key, as per discussion with Ben Darnell about how this won't interfere with user traffic and is still a good test.
- I've written some unit tests & trimmed down the integration tests.
One thing that is fun about this is I believe we could roll out a dedicated prober instance similar to a SQL instance on free-tier to get some early signal on how this approach works in production.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @jreut, and @tbg)
pkg/server/prober/kvprober.go, line 28 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
As written, we don't really need a full dependency on the executor here. We can pass in an abstraction. As an overly simplistic straw man, just something like
type Cursor interface{} // for meta2 impl: EndKey of last span seen before, nil initially
func(Cursor) []roachpb.Span
which you call in a loop and then go probe the spans it gives you back. These would then either come from crdb_internal or meta2, but honestly if we're serious about wanting to probe kv here (and not some mish-mash of sql and kv) then the source should be a scan of meta2, just like it's done in various places such as
cockroach/pkg/migration/helper.go
Lines 183 to 219 in 89781b1
I've cleaned up the planner interface. I've removed the dep on the internal executor entirely since we are now planning by scanning meta2. I think it looks nicer with the new interface. Thanks for the ideas.
pkg/server/prober/kvprober.go, line 47 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Testing knobs are the standard way for these. But since this is an experimental feature (if this merges any time soon) then it's good to leave it turned off by default so I'm not opposed to a setting.
Ack.
pkg/server/prober/kvprober.go, line 84 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'm not anti-heatmap (ok maybe a little bit, I need to be convinced it's useful!), I just don't see how we can shoehorn this into metrics. A single metric as you have it here is just a number (that changes over time). A heatmap requires different buckets. I think what we really need is some way to map the ranges into a histogram. We could do this: we make a linear histogram with, say, 1000 buckets. The Nth range maps to the N/1000th bucket, and we record to the histogram with each failure. You are right that we can then visualize this as a heatmap with Grafana out-of-the-box (on the down side, we are entertaining a 1000 bucket metric; maybe 100 buckets will do? But really the number of buckets you want is proportional to the range count, which can go into the tens of thousands). It's a lot of finagling for a metric that may not be that useful - if the heatmap shows a cluster of red, don't you want to know which range that is? You'll have to compute that it's the 17052nd range and then find out what the corresponding rangeID is - not fun.
I think this would have a better life outside of metrics. But I also don't know if it provides the actual value that justifies the work - after we see any prober failures, wouldn't we look at the list right away to spot patterns, which we would if we just sorted it by range start key?
This might be a bit of an SRE / cloud vs. dev / on-prem thing? I think the reason for a heatmap is that even though approximate, you quickly get a sense of the basics of an outage happening in the present moment. Also you can see the dynamics of the outage. That is, you can see the problem evolve over time, which is quite powerful. I agree you will need to turn to logs or some other more fine-grained problem statement to move forward with root cause analysis & often mitigations. Certainly doing the math to go from histogram bucket to range ID is not a good idea.
Basically, approximate is still good IMHO when it comes to metrics. Metrics don't replace logs; they complement logs.
I may not be understanding all the limitations of the histogram representation! I'm used to a bucketing scheme that implies approximating the underlying distribution (of course distributions are always approximated in a sense), but I have not poked at the CRDB metrics library in detail yet!
pkg/server/prober/kvprober.go, line 138 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Yes, though I'd crib the approach from
cockroach/pkg/migration/migrations.go
Lines 108 to 115 in 89781b1
instead.
I attempted this but my integration tests found some issues. First, I seemed to get range ID=1 back twice from meta2; second, I seemed to be scanning more than meta2, which wasn't my intention.
I strongly expect I botched executing on the suggestion. Anyway, if you can take a look at the current code & either LGTM or suggest more specifically the code change to make, I'd appreciate that!
pkg/server/prober/kvprober.go, line 142 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Yes, on the goroutines that are started by clients, they do try that.
I added a recover, even though I hear you that it's not needed since no user input is processed in this flow.
pkg/server/prober/plan.go, line 42 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Unquiesced ranges typically quiesce again within a few seconds, so no! Most ranges will be quiescent with that probing interval. Besides, pure reads don't even unquiesce the range (though we do want to exercise Raft here ultimately).
We do however want to avoid scanning meta2 every second, but the cursoring approach would basically avoid that by moving slowly through the keyspace.
Another consideration is that what I'm proposing here is a linear scan. This may not be the ideal way to probe; ideally we'd jump around a bit. That "jumping around" could be implemented as opening
~log(#ranges) cursors and pulling from them one by one. I think when we pull from a cursor we want to pull a handful of ranges, so that we can see at a glance whether only one of the ranges failed or whether its neighbors did too (in which case we can reasonably conclude that even more ranges in the vicinity are down; many failure modes will strike whole swathes of neighboring ranges since neighbors typically use the same zone configs).
Ah! Makes sense re: unquiesced ranges. Thank you.
Plz take a look at the current approach. It does a linear scan with batches but it randomizes the order the prober probes ranges within a batch. Maybe good enough to start? Interested in your thoughts even if good enough to start.
pkg/server/prober/plan.go, line 65 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You can put your test in the
prober_test package and add a file helpers_test.go which exports the method. So basically you rename this one to makePlanInternal and then in helpers_test.go you add (*KVProber).MakePlan which just calls into makePlanInternal.
That said, you should consider using proper abstractions for both the supply of what to probe as well as for the method to invoke as the probe (i.e. the p.db.Scan). That way you can write most tests as true unit tests and throw in one integration test that uses the real things just for good measure, but without trying to exercise edge cases through it.
Added unit tests. Thank you for the push.
I've left in a number of integration tests, as they have caught a lot of bugs, some (at least to me) rather subtle. Two examples: 1. For a while, I was never probing the last range in meta2. 2. After a refactor, the encoding of a key pulled from a range descriptor was slightly off, in a way that meant the prober was not actually probing the right range. Except via integration tests, I didn't get much signal on that, since it's mostly not an error to call kv.Get regardless of the key passed in.
pkg/server/prober/plan.go, line 89 at r2 (raw file):
Previously, knz (kena) wrote…
FWIW, Rohan moved the range scan logic to kvclient in this PR: #58314
I'm not 100% sure this interferes with your code here, but it feels like both facilities should live in the same place in the end.
Introduced a clearer abstraction for planning! Removed all approaches except for scanning meta2!
Re: Rohan's PR, I gave it a quick look. Neither ScanMetaKVs nor GetRangeWithID look like they'd help me. I could imagine pulling some of the plan code I have now into pkg/kv/kvclient/scan_meta.go somehow but right now I'm not seeing a concrete way of doing that that I like (possibly due to my unfamiliarity with the codebase). Some relevant plan code I have is:
kvs, err := p.db.Scan(ctx, p.cursor, keys.Meta2KeyMax, planNProbesAtATime /* maxRows */)
if err != nil {
	return Plan{}, err
}
p.plans = make([]Plan, len(kvs))
var rangeDesc roachpb.RangeDescriptor
for i, kv := range kvs {
	if err := kv.ValueProto(&rangeDesc); err != nil {
		return Plan{}, err
	}
pkg/server/prober/plan.go, line 116 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
See my other comment about p.indexIntoMeta2.
Done!
knz
left a comment
I reviewed this based on your latest changes.
The new structure of the code is very good. I think I still want other KV team members to continue to chime in on the architecture. A few notes below.
Reviewed 7 of 7 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @joshimhoff, @jreut, and @tbg)
pkg/kv/kvprober/kvprober.go, line 17 at r3 (raw file):
)
type Prober struct {
here and below, this will want docstrings to explain the structs and fields.
pkg/kv/kvprober/kvprober.go, line 79 at r3 (raw file):
Help: "Distribution of KV read probe failures across rangespace",
Measurement: "Normalized location in rangespace ([0, 1])",
Unit: metric.Unit_UNSET,
what's this unit about?
pkg/kv/kvprober/kvprober.go, line 160 at r3 (raw file):
// Doesn't return an error. Instead increments error type specific metrics.
func (p *Prober) probe(ctx context.Context, db db) {
defer func() {
use defer logcrash.RecoverAndReportNonfatalPanic() here.
pkg/kv/kvprober/kvprober.go, line 174 at r3 (raw file):
pl, err := p.Planner.Plan(ctx)
if err != nil {
log.Errorf(ctx, "Can't make a plan: %v", err)
here and throughout the PR: log messages start with a lowercase letter.
pkg/kv/kvprober/kvprober.go, line 179 at r3 (raw file):
}
// TODO(josh): Is INFO a reasonable level to log at?
seems good to me. However probably log.Health.Infof
(Might consider log.VEventf - do we need this logging to occur by default?)
pkg/kv/kvprober/plan.go, line 26 at r3 (raw file):
// A Planner that scans meta2 to make plans.
type meta2Planner struct {
db *kv.DB
would be useful to explain in comments the meaning and use of the fields in the struct.
pkg/kv/kvprober/plan.go, line 29 at r3 (raw file):
cursor roachpb.Key
plans []Plan
planNProbesAtATime int
especially this one.
pkg/kv/kvprober/plan.go, line 60 at r3 (raw file):
// ranges in the plan. This is to avoid all nodes probing the same ranges at
// the same time. Jitter is also added to the sleep between probe times
// to de-synchronize different nodes' probe loops.
I'd like to discuss how this approach fares with clusters with many ranges e.g. 100,000 ranges and how it constrains RAM usage by the prober itself.
force-pushed from b29ed8d to 6116e5a
joshimhoff
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @jreut, @knz, and @tbg)
pkg/kv/kvprober/kvprober.go, line 17 at r3 (raw file):
Previously, knz (kena) wrote…
here and below, this will want docstrings to explain the structs and fields.
Done. I omitted a few where I wasn't sure what more could be said, e.g. settings. Let me know if you want more.
pkg/kv/kvprober/kvprober.go, line 79 at r3 (raw file):
Previously, knz (kena) wrote…
what's this unit about?
I haven't implemented this yet but looking over the available units, I don't see one that obviously fits. I am thinking the metric will be a distribution of values from 0 to 1, where 0 means the first range and 1 means the last range and 0.5 means the "middle" range. In other words, a normalized location in the "rangespace". metric.Unit_PERCENT is probably the best unit type other than metric.Unit_UNSET but I think calling this value a percent is misleading?
I can also just delete this metric for now since it's not yet implemented.
pkg/kv/kvprober/kvprober.go, line 160 at r3 (raw file):
Previously, knz (kena) wrote…
use defer logcrash.RecoverAndReportNonfatalPanic() here.
Ah thanks!
pkg/kv/kvprober/kvprober.go, line 174 at r3 (raw file):
Previously, knz (kena) wrote…
here and throughout the PR: log messages start with a lowercase later.
Fixed!
pkg/kv/kvprober/kvprober.go, line 179 at r3 (raw file):
Previously, knz (kena) wrote…
seems good to me. However probably log.Health.Infof
(Might consider log.VEventf - do we need this logging to occur by default?)
I'm a bit unsure if it should be info level. I vote we start with it on at this level, as the prober isn't enabled in prod anyway. We can adjust if / when we feel it makes sense to do so.
Moved to log.Health!
pkg/kv/kvprober/plan.go, line 26 at r3 (raw file):
Previously, knz (kena) wrote…
would be useful to explain in comments the meaning and use of the fields in the struct.
Done!
pkg/kv/kvprober/plan.go, line 29 at r3 (raw file):
Previously, knz (kena) wrote…
especially this one.
Done! Thx for push.
pkg/kv/kvprober/plan.go, line 60 at r3 (raw file):
Previously, knz (kena) wrote…
I'd like to discuss how this approach fares with clusters with many ranges e.g. 100,000 ranges and how it constrains RAM usage by the prober itself.
Much appreciated!
knz
left a comment
Reviewed 2 of 3 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @joshimhoff, @jreut, @knz, @logston, and @tbg)
pkg/kv/kvprober/kvprober.go, line 79 at r3 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
I haven't implemented this yet but looking over the available units, I don't see one that obviously fits. I am thinking the metric will be a distribution of values from 0 to 1, where 0 means the first range and 1 means the last range and 0.5 means the "middle" range. In other words, a normalized location in the "rangespace".
metric.Unit_PERCENT is probably the best unit type other than metric.Unit_UNSET but I think calling this value a percent is misleading?
I can also just delete this metric for now since it's not yet implemented.
Yes you could delete it since it's not yet implemented,
however if/when you want it, you would then define a new unit in the proto file with the new definition.
The name for this unit is "Spectrum".
pkg/kv/kvprober/kvprober.go, line 160 at r3 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
Ah thanks!
You're not yet done. You do not need the outer defer here, you can defer the logcrash function directly under probe.
pkg/kv/kvprober/kvprober.go, line 24 at r4 (raw file):
// kvprober increments metrics that SRE & other operators can use as alerting
// signals. It also writes to logs to help narrow down the problem (e.g. which
// range(s) are acting up.
nit: missing closing paren
pkg/kv/kvprober/kvprober.go, line 214 at r4 (raw file):
// perspective of the user.
// TODO(josh): What we set this timeout to likely needs more thought.
ctx, cancel := context.WithTimeout(ctx, 10 * time.Second)
this number needs to be configurable, e.g. via a cluster setting. (At the very minimum the constant should be in the global scope and named. We don't put magic constants inside the code.)
pkg/kv/kvprober/kvprober.go, line 223 at r4 (raw file):
_, err = db.Get(ctx, pl.StartKey)
if err != nil {
  log.Health.Errorf(ctx, "kv.Get(%v), r=%v failed with: %v", roachpb.PrettyPrintKey(nil, pl.StartKey.AsRawKey()), pl.RangeID, err)
If we mean these log messages to be traceable then let's make them structured already.
The file you'd edit is pkg/util/log/eventpb/cluster_events.proto
Then you'd add a new section with a category name "Health probes" and channel HEALTH
and under that the events for probe results.
Then use the structured events here with log.Structured.
pkg/kv/kvprober/plan.go, line 60 at r3 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
Much appreciated!
Well that was a request to extend the comment with an explanation of the memory space complexity of this function, and which data structures are used.
pkg/kv/kvprober/plan.go, line 25 at r4 (raw file):
}

// A Planner that scans meta2 to make plans.
please extend the comment to explain what kind of plans we are talking about.
6116e5a to
72c490c
Compare
joshimhoff
left a comment
PTAL!
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @jreut, @knz, @logston, and @tbg)
pkg/kv/kvprober/kvprober.go, line 79 at r3 (raw file):
Previously, knz (kena) wrote…
Yes you could delete it since it's not yet implemented,
however if/when you want it, you would then define a new unit in the proto file with the new definition.
The name for this unit is "Spectrum".
Deleted! Thanks for the note.
pkg/kv/kvprober/kvprober.go, line 160 at r3 (raw file):
Previously, knz (kena) wrote…
You're not yet done. You do not need the outer
defer here, you can defer the logcrash function directly under probe.
Er ya. My bad. Fixed.
I could also add a unit test, though I think I'd need to tweak RecoverAndReportNonfatalPanic to not panic (to treat test as release build) in certain cases. For example, we could add a public var to logcrash called DontRePanicEvenThoTest or similar that defaults to false. If set to true, which we can do from unit tests when needed, then RecoverAndReportNonfatalPanic won't panic even if test. Then I can add a unit test like this:
t.Run("panic caught", func(t *testing.T) {
logcrash.DontRePanicEvenThoTest = true
defer func() { logcrash.DontRePanicEvenThoTest = false }()
ReadEnabled.Override(&p.settings.SV, true)
m := &mock{}
clearMetrics(p)
p.Planner = nil // want panic
p.probe(ctx, m)
require.Equal(t, int64(1), p.Metrics.ProbePlanAttempts.Count())
require.Zero(t, p.Metrics.ReadProbeAttempts.Count())
require.Zero(t, p.Metrics.ProbePlanFailures.Count())
require.Zero(t, p.Metrics.ReadProbeFailures.Count())
})
Let me know what you think about this. If you are into it & don't mind a follow up PR, I can also do it that way.
pkg/kv/kvprober/kvprober.go, line 24 at r4 (raw file):
Previously, knz (kena) wrote…
nit: missing closing paren
Fixed.
pkg/kv/kvprober/kvprober.go, line 214 at r4 (raw file):
Previously, knz (kena) wrote…
this number needs to be configurable, e.g. via a cluster setting. (At the very minimum the constant should be in the global scope and named. We don't put magic constants inside the code.)
Ya good call. Cluster setting added. Also added a cluster setting for planNProbesAtATime. I think there are no more magic constants!
pkg/kv/kvprober/kvprober.go, line 223 at r4 (raw file):
Previously, knz (kena) wrote…
If we mean these log messages to be traceable then let's make them structured already.
The file you'd edit is pkg/util/log/eventpb/cluster_events.proto. Then you'd add a new section with a category name "Health probes" and channel HEALTH
and under that the events for probe results.
Then use the structured events here with log.Structured.
This looks really useful (can tally up the per-range results in splunk!) but if you don't object I'd prefer to address in a follow up PR.
pkg/kv/kvprober/plan.go, line 60 at r3 (raw file):
Previously, knz (kena) wrote…
Well that was a request to extend the comment with an explanation of the memory space complexity of this function, and which data structures are used.
Ah. I see. Added!
pkg/kv/kvprober/plan.go, line 25 at r4 (raw file):
Previously, knz (kena) wrote…
please extend the comment to explain what kind of plans we are talking about.
Done.
72c490c to
63aaf58
Compare
Also, I see the lint & test failures & will fix soon.
1f9af34 to
1eb60a2
Compare
tbg
left a comment
I left a ton of comments, but none of them are substantial. I think we should ship this? It's an auxiliary system that is off by default so if you have the cycles in the coming week or two it could make 21.1. It's fine to aggressively put off work for later (since nothing here was substantial and the benefit of getting what you have in outweighs a desire for perfection).
Reviewed 1 of 5 files at r5, 6 of 6 files at r6.
Dismissed @joshimhoff and @knz from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @joshimhoff, @knz, and @logston)
pkg/kv/kvprober/kvprober.go, line 223 at r4 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
This looks really useful (can tally up the per-range results in splunk!) but if you don't object I'd prefer to address in a follow up PR.
I'm fine doing this in a follow-up, but file an issue against KV (KV 21.1 proj and KV backlog).
I don't think you need to explicitly invoke PrettyPrintKey here. Printing pl.StartKey via %s should do the trick. If it doesn't, it's probably just not happening in your tests. Import the pkg/keys package to make it work (there is some dep injection going on). In a full crdb binary it would definitely just work.
pkg/kv/kvprober/kvprober.go, line 1 at r6 (raw file):
// Copyright 2017 The Cockroach Authors.
21
btw you can set up your editor to auto-generate these (#goland if that's what you're using)
pkg/kv/kvprober/kvprober.go, line 11 at r6 (raw file):
// licenses/APL.txt.

package kvprober
Can you add a doc comment here with a high-level overview of how the prober works? It needs to start with "Package kvprober ...."
pkg/kv/kvprober/kvprober.go, line 50 at r6 (raw file):
// ProberOpts provides knobs to control kvprober.Prober.
type ProberOpts struct {
nit: you could just call this Options since you're in package kvprober. I think that would be "the Go way"
pkg/kv/kvprober/kvprober.go, line 63 at r6 (raw file):
"kv.prober.read.enabled", "whether the KV read prober is enabled", false)
BTW you could special-case a ReadInterval of 0 as disabling it. This would also avoid the uncomfortable question of what would happen if I enabled it and set the read interval to zero now.
pkg/kv/kvprober/kvprober.go, line 65 at r6 (raw file):
false)

// ReadInterval controls how often each node sends a read probe.
Maybe put these settings into a file settings.go? Just a suggestion based on my personal preference, it's nice to know where to go look.
pkg/kv/kvprober/kvprober.go, line 70 at r6 (raw file):
var ReadInterval = settings.RegisterDurationSetting(
  "kv.prober.read.interval",
  "how often each node sends a read probe to the KV layer",
Explain what this is in details. Are we sleeping 2s between probes? Or is there a ticker that fires every 2s and we'll send a new probe even if another one is still pending. (i.e. with default settings up to five probes in flight at any given time, assuming the 10s timeout tears down the probing attempt quickly).
I think as is, since you're using a ticker, you will initiate a probe every 2s, but only if probes take less than 2s. Once they take 2s or longer, the ongoing probe will delay the next probe. (I think that's fine, but I want to get us into the habit of making these metrics truly well documented). Also mention that the duration here is jittered for each attempt.
pkg/kv/kvprober/kvprober.go, line 80 at r6 (raw file):
// perspective of the user.
"if this much time elapses without success, a KV read probe will be treated as an error",
10*time.Second)
Probably want to validate this to be > 0
pkg/kv/kvprober/kvprober.go, line 85 at r6 (raw file):
metaReadProbeAttempts = metric.Metadata{
  Name: "kv.prober.read.attempts",
  Help: "Number of KV read probes made",
Try to be more specific:
Number of attempts made to probe a KV range, regardless of outcome.
pkg/kv/kvprober/kvprober.go, line 86 at r6 (raw file):
Name: "kv.prober.read.attempts",
Help: "Number of KV read probes made",
Measurement: "Queries",
Is that right?
pkg/kv/kvprober/kvprober.go, line 91 at r6 (raw file):
metaReadProbeFailures = metric.Metadata{
  Name: "kv.prober.read.failures",
  Help: "Number of KV read probes that failed",
Number of probes to KV ranges made that failed (i.e. returned errors or timeouts).
pkg/kv/kvprober/kvprober.go, line 92 at r6 (raw file):
Name: "kv.prober.read.failures",
Help: "Number of KV read probes that failed",
Measurement: "Queries",
Is that right?
pkg/kv/kvprober/kvprober.go, line 97 at r6 (raw file):
metaReadProbeLatency = metric.Metadata{
  Name: "kv.prober.read.latency",
  Help: "KV read probe latency (latency of failures not tracked)",
"Latency of successful KV read probes"?
pkg/kv/kvprober/kvprober.go, line 103 at r6 (raw file):
metaProbePlanAttempts = metric.Metadata{
  Name: "kv.prober.planning_attempts",
  Help: "Number of attempts at planning out probes made",
Can you explain more here? I feel like if I looked at this metric in three months time I'd be scratching my head.
pkg/kv/kvprober/kvprober.go, line 109 at r6 (raw file):
metaProbePlanFailures = metric.Metadata{
  Name: "kv.prober.planning_failures",
  Help: "Number of attempts at planning out probes that failed",
There's something odd here in that planning out probes is a kind of probe itself? How would you alert on this? Reading the meta ranges can block on intents, so applying the 10s timeout from above may not be entirely fair (though 10s ought to be enough...)
Possibly any meta read should also count as a read probe, so that you can alert on probe failures only? But maybe that convolutes things too much / leads to unclear semantics.
pkg/kv/kvprober/kvprober.go, line 146 at r6 (raw file):
// Start causes kvprober to start probing KV. Start returns immediately. Start
// returns an error only if stopper.RunAsyncTask returns an error.
I think you can avoid returning the error here. Acting on the error from RunAsyncTask is only useful in cases where you have some cleanup to do that results from the task not starting. This is not the case here.
pkg/kv/kvprober/kvprober.go, line 214 at r6 (raw file):
// but that is okay. Even if there is no data at the key, the prober still
// executes a basic read operation on the range.
// TODO(josh): Trace the probes.
+1 would like an issue for that one as well.
pkg/kv/kvprober/kvprober.go, line 215 at r6 (raw file):
// executes a basic read operation on the range.
// TODO(josh): Trace the probes.
_, err = db.Get(ctx, pl.StartKey)
But there may be data, and there may be contention on that data - I think we should be reading a key that's guaranteed to not be populated or latched over. Consider making a version of
Lines 392 to 396 in d1c91e0
that instead uses a suffix of "prbe" with a description that it is a key guaranteed to be empty and not usually latched over by anything (which isn't true - splits will still latch it but can't have it all! We can make a dedicated probe KV command at some point but for now the Get will do just fine)
pkg/kv/kvprober/kvprober.go, line 225 at r6 (raw file):
}
runtime := timeutil.Since(start)
nit: runtime is also a prominent Go pkg, I'd call this duration or dur or d or something like that.
pkg/kv/kvprober/kvprober.go, line 226 at r6 (raw file):
runtime := timeutil.Since(start)
log.Health.Infof(ctx, "kv.Get(%v), r=%v returned success in %v", roachpb.PrettyPrintKey(nil, pl.StartKey.AsRawKey()), pl.RangeID, runtime)
Isn't this too verbose? Seems like the kind of thing we'd put behind a if log.V(1)
pkg/kv/kvprober/kvprober_test.go, line 26 at r6 (raw file):
)

// TODO(josh): If I have some extra time, would folks want me to look into
I'd be interested in seeing it. I've never used a mocking framework, but if nothing else it encourages the use of clean interfaces, which in itself might already be worth it. Most of our existing KV code is probably too legacy to adapt it easily, but one can dream! Having you champion it in this little kvprober sandbox might be a nice way for us to lick blood. That said, probably a better follow-up and unclear if you will actually have the extra time so no pressure :-)
pkg/kv/kvprober/kvprober_test.go, line 41 at r6 (raw file):
  Settings: cluster.MakeTestingClusterSettings(),
})
p.Planner = nil
why?
pkg/kv/kvprober/kvprober_test.go, line 44 at r6 (raw file):
t.Run("want default disabled", func(t *testing.T) {
  m := &mock{
Yeah I can see how gomock would help you here.
pkg/kv/kvprober/kvprober_test.go, line 49 at r6 (raw file):
  noGet: true,
}
p.Planner = m
Could you make a new p for each subtest? Reuse makes these tests hard to follow & more error-prone. Just make a little helper that you can call in each of the subtests to give you a *p.
pkg/kv/kvprober/kvprober_test.go, line 89 at r6 (raw file):
require.Zero(t, p.Metrics.ReadProbeAttempts.Count())
require.Equal(t, int64(1), p.Metrics.ProbePlanFailures.Count())
require.Zero(t, p.Metrics.ReadProbeFailures.Count())
Extend this test with another failed plan attempt and then a third one that succeeds?
pkg/kv/kvprober/kvprober_test.go, line 106 at r6 (raw file):
require.Equal(t, int64(1), p.Metrics.ReadProbeAttempts.Count())
require.Zero(t, p.Metrics.ProbePlanFailures.Count())
require.Equal(t, int64(1), p.Metrics.ReadProbeFailures.Count())
Ditto about extending a bit
pkg/kv/kvprober/kvprober_test.go, line 145 at r6 (raw file):
desc string
in   time.Duration
intn func(n int) int
time.Duration is an int64, so shouldn't this (and the prod impl) be a func(n int64) int64?
pkg/kv/kvprober/kvprober_test.go, line 176 at r6 (raw file):
  },
}
for _, tc := range cases {
🤔 almost seems like we could use quick.Check here but it doesn't seem to buy much since the code snippet is so small.
pkg/kv/kvprober/plan.go, line 25 at r4 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
Done.
nit: since you have a type Plan that you're referring to here, capitalize the word to make it clear that that's what you're doing.
pkg/kv/kvprober/plan.go, line 1 at r6 (raw file):
// Copyright 2017 The Cockroach Authors.
https://www.youtube.com/watch?v=2mkeaDjlz1o
pkg/kv/kvprober/plan.go, line 27 at r6 (raw file):
// Plan is a decision on what range to probe, including info needed by kvprober
// to execute on the plan, such as the range's start key.
type Plan struct {
Intuitively my mind wants to think of a plan as more than a decision for a single range to probe. Probing a single range just seems like a step in the plan? Don't do anything if that doesn't sound right to you. Otherwise, consider a round of rename bikeshedding, e.g.
type Planner interface {
  Next(ctx context.Context) (Step, error)
}
pkg/kv/kvprober/plan.go, line 37 at r6 (raw file):
// Plan returns a plan for the prober to execute on. Executing on the
// plan repeatedly should lead to an even distribution of probes over ranges,
// at least in the limit.
Mention what to do if there's an error (errors are all to be considered temporary)
pkg/kv/kvprober/plan.go, line 65 at r6 (raw file):
var PlanNProbesAtATime = settings.RegisterIntSetting(
  "kv.prober.planner.n_probes_at_a_time",
  "the number of probes to plan out at once",
I don't know, this might be confusing for people who are not familiar with the code. Ultimately this is about prefetching? NumPrefetchPlans
Doc suggestion:
The number of ranges for which plans are created in bulk to be probed. Plans are randomized within each prefetch cycle, so setting this to a small number will lead to close-to-lexical probing. Prefetched plans are held in memory, so large values are advised against.
Also can't this be unexported?
pkg/kv/kvprober/plan.go, line 114 at r6 (raw file):
// CPU:
// - Again scales with the the kv.prober.planner.n_probes_at_a_time cluster
//   setting. Note that we sort a slice of size
We don't sort, right? We shuffle (and I assume it's Fisher-Yates). The cost is negligible compared to the cost of actually populating the slice, where we do unmarshalling!
pkg/kv/kvprober/plan.go, line 123 at r6 (raw file):
  ctx, p.db, PlanNProbesAtATime.Get(&p.settings.SV), p.cursor)
if err != nil {
  return Plan{}, errors.Wrapf(err, "failed to get meta2 rows")
It's interesting to note that the planner will get stuck on a broken meta2 range. I don't see a way around that thanks to the cursor, but it's interesting to note.
pkg/kv/kvprober/plan.go, line 129 at r6 (raw file):
plans, err := meta2KVsToPlans(kvs)
if err != nil {
  return Plan{}, errors.Wrapf(err, "failed to make plans from meta2 rows")
Let's assume we hit this, we'd probably hit this every time we see the failing descriptor, what happens to the prober? This message will be logged somewhere at the caller, right?
pkg/kv/kvprober/plan.go, line 155 at r6 (raw file):
var kvs []kv.KeyValue
for n > 0 {
  newkvs, err := db.Scan(ctx, cursor, keys.Meta2KeyMax.Next(), n /* maxRows */)
Add:
// NB: keys.Meta2KeyMax stores a descriptor, so we want to include it.
pkg/kv/kvprober/plan.go, line 161 at r6 (raw file):
n = n - int64(len(newkvs))
kvs = append(kvs, newkvs...)
I know it shouldn't happen, but be resilient to newkvs being empty.
pkg/kv/kvprober/plan.go, line 186 at r6 (raw file):
// back this error if it's attempted: "attempted access to empty key"
if rangeDesc.RangeID == 1 {
  plans[i].StartKey = plans[i].StartKey.Next()
More idiomatic:
cockroach/pkg/migration/migrations/truncated_state.go
Lines 46 to 52 in 126ea38
Heh, I saw that we had this discussion above. I still think referencing LocalMax explicitly is the way to go here.
pkg/kv/kvprober/kvprober_integration_test.go, line 1 at r6 (raw file):
// Copyright 2017 The Cockroach Authors.
21
pkg/kv/kvprober/kvprober_integration_test.go, line 42 at r6 (raw file):
)

func TestMain(m *testing.M) {
Mind putting this into main_test.go like the other pkgs do?
pkg/kv/kvprober/kvprober_integration_test.go, line 65 at r6 (raw file):
require.NoError(t, p.Start(ctx, s.Stopper()))
time.Sleep(1 * time.Second)
I'd avoid arbitrary sleeps (test builds can get quite slow) and do something like
testutils.SucceedsSoon(t, func() error {
// Check that you've seen at least 10 probes, or return error
// Check that you've seen no failures, or `t.Fatal()`
})

pkg/kv/kvprober/kvprober_integration_test.go, line 98 at r6 (raw file):
// This is enough time to hit the time-series range multiple times, thus
// triggering multiple probe failures.
time.Sleep(1 * time.Second)
Ditto SucceedsSoon
pkg/kv/kvprober/kvprober_integration_test.go, line 136 at r6 (raw file):
// Wait 500 milliseconds then make DB (mostly) unavailable.
time.Sleep(500 * time.Millisecond)
Ditto
pkg/kv/kvprober/kvprober_integration_test.go, line 142 at r6 (raw file):
mu.Unlock()
time.Sleep(500 * time.Millisecond)
Ditto
pkg/kv/kvprober/kvprober_integration_test.go, line 144 at r6 (raw file):
time.Sleep(500 * time.Millisecond)

// Expect 50% error rate but require >25% errors to reduce chance of test flake.
I know it's tempting to do this thresholding, but if you want to do that I would suggest writing tests that explicitly plan probes and verify that the metrics get updated accordingly, i.e. take all of the time dependency out of the test itself. You could do this by not calling .Start but invoking .probe manually, which I assume you do outside of the integration tests.
Btw there's this nifty trick: you can put a public method into helpers_test.go (pkg kvprober_test) and that will allow you to call that method from this test (since they're both kvprober_test) without adding an ugly exported method to kvprober.
pkg/kv/kvprober/kvprober_integration_test.go, line 163 at r6 (raw file):
var numRanges int64
require.NoError(t, sqlDB.QueryRow(
Again use SucceedsSoon here, and just on the off chance that ranges split or merge during the test, compute the expected count in each iteration.
pkg/kv/kvprober/kvprober_integration_test.go, line 218 at r6 (raw file):
kvprober.PlanNProbesAtATime.Override(&s.ClusterSettings().SV, 10)
p.Metrics.ReadProbeAttempts.Clear()
Why do you need to clear all these?
knz
left a comment
Reviewed 1 of 5 files at r5, 6 of 6 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @joshimhoff, @knz, and @logston)
pkg/kv/kvprober/kvprober.go, line 160 at r3 (raw file):
Previously, joshimhoff (Josh Imhoff) wrote…
Er ya. My bad. Fixed.
I could also add a unit test, though I think I'd need to tweak
RecoverAndReportNonfatalPanic to not panic (to treat test as release build) in certain cases. For example, we could add a public var to logcrash called DontRePanicEvenThoTest or similar that defaults to false. If set to true, which we can do from unit tests when needed, then RecoverAndReportNonfatalPanic won't panic even if test. Then I can add a unit test like this:
t.Run("panic caught", func(t *testing.T) {
  logcrash.DontRePanicEvenThoTest = true
  defer func() { logcrash.DontRePanicEvenThoTest = false }()
  ReadEnabled.Override(&p.settings.SV, true)
  m := &mock{}
  clearMetrics(p)
  p.Planner = nil // want panic
  p.probe(ctx, m)
  require.Equal(t, int64(1), p.Metrics.ProbePlanAttempts.Count())
  require.Zero(t, p.Metrics.ReadProbeAttempts.Count())
  require.Zero(t, p.Metrics.ProbePlanFailures.Count())
  require.Zero(t, p.Metrics.ReadProbeFailures.Count())
})
Let me know what you think about this. If you are into it & don't mind a follow up PR, I can also do it that way.
What's wrong with a panic in the test? Inside the test you can do defer ...{ if recover() != nil ... } and check for the panic object.
pkg/kv/kvprober/kvprober.go, line 210 at r6 (raw file):
// Slow enough response times are not different than errors from the
// perspective of the user.
err = contextutil.RunWithTimeout(ctx, "db.Get", ReadTimeout.Get(&p.settings.SV), func(ctx context.Context) error {
nit: would be slightly easier to read with readTimeout := ReadTimeout.Get() first, then the RunWithTimeout call on the next line.
pkg/kv/kvprober/kvprober_test.go, line 1 at r6 (raw file):
// Copyright 2017 The Cockroach Authors.
2021
4ca7efa to
f9f15ac
Compare
joshimhoff
left a comment
TTFR! Learned a lot. GH issue here with follow up tasks: #61074
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @asubiotto, @joshimhoff, @jreut, @knz, @logston, and @tbg)
pkg/kv/kvprober/kvprober.go, line 160 at r3 (raw file):
Previously, knz (kena) wrote…
What's wrong with a panic in the test? Inside the test you can do
defer ...{ if recover() != nil ... } and check for the panic object.
The goal of the test would be to make sure that panics are caught by kvprober. So want no panic that bubbles up to the test. Not sure if time will permit but if it does I'll put up a tiny follow up PR and we can discuss more.
pkg/kv/kvprober/kvprober.go, line 223 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'm fine doing this in a follow-up, but file an issue against KV (KV 21.1 proj and KV backlog).
I don't think you need to explicitly invoke
PrettyPrintKey here. Printing pl.StartKey via %s should do the trick. If it doesn't, it's probably just not happening in your tests. Import the pkg/keys package to make it work (there is some dep injection going on). In a full crdb binary it would definitely just work.
Noted here: #61074. Opened up a single issue with a checklist. I can open up sub-issues if desired.
Not calling PrettyPrintKey anymore. Thanks!
pkg/kv/kvprober/kvprober.go, line 1 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
21
btw you can set up your editor to auto-generate these (#goland if that's what you're using)
Done!
pkg/kv/kvprober/kvprober.go, line 11 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can you add a doc comment here with a high-level overview of how the prober works? It needs to start with "Package kvprober ...."
Done!
pkg/kv/kvprober/kvprober.go, line 50 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: you could just call this
Options since you're in package kvprober. I think that would be "the Go way"
Done.
pkg/kv/kvprober/kvprober.go, line 63 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
BTW you could special-case a ReadInterval of
0 as disabling it. This would also avoid the uncomfortable question of what would happen if I enabled it and set the read interval to zero now.
I've made that not allowed via a validateFn. I explored your suggestion; it complicates the code a bit since need a special sleep time when waiting for enablement. I think I also slightly prefer the clarity of a boolean off switch. No particularly strong feelings tho.
pkg/kv/kvprober/kvprober.go, line 65 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Maybe put these settings into a file
settings.go? Just a suggestion based on my personal preference, it's nice to know where to go look.
Done!
pkg/kv/kvprober/kvprober.go, line 70 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Explain what this is in details. Are we sleeping 2s between probes? Or is there a ticker that fires every 2s and we'll send a new probe even if another one is still pending. (i.e. with default settings up to five probes in flight at any given time, assuming the 10s timeout tears down the probing attempt quickly).
I think as is, since you're using a ticker, you will initiate a probe every 2s, but only if probes take less than 2s. Once they take 2s or longer, the ongoing probe will delay the next probe. (I think that's fine, but I want to get us into the habit of making these metrics truly well documented). Also mention that the duration here is jittered for each attempt.
Added more detail. Good points. Didn't consider this when setting the kv.prober.read.timeout default. Agreed that max one read probe will be running at a time, implying max delay equal to kv.prober.read.timeout.
pkg/kv/kvprober/kvprober.go, line 80 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Probably want to validate this to be > 0
Added validations for all settings that need it. What a nice thing to be able to do.
Does this get run at cluster setting set time? I figure it does but checking.
pkg/kv/kvprober/kvprober.go, line 85 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Try to be more specific:
Number of attempts made to probe a KV range, regardless of outcome.
Done.
pkg/kv/kvprober/kvprober.go, line 86 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is that right?
Think so? Seems lined up with the uneven way this field is used in the codebase today. For example, I see metric code in SQL that sets the field to "SQL statements". It's not clear to me exactly what this field is about, or why it is useful, given the help text in particular.
pkg/kv/kvprober/kvprober.go, line 91 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Number of probes to KV ranges made that failed (i.e. returned errors or timeouts).
Done.
pkg/kv/kvprober/kvprober.go, line 92 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is that right?
See other comment.
pkg/kv/kvprober/kvprober.go, line 97 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
"Latency of successful KV read probes"?
Done.
pkg/kv/kvprober/kvprober.go, line 103 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Can you explain more here? I feel like if I looked at this metric in three months time I'd be scratching my head.
Done. Also added more to the failure metric.
pkg/kv/kvprober/kvprober.go, line 109 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
There's something odd here in that planning out probes is a kind of probe itself? How would you alert on this? Reading the meta ranges can block on intents, so applying the 10s timeout from above may not be entirely fair (though 10s ought to be enough...)
Possibly any meta read should also count as a read probe, so that you can alert on probe failures only? But maybe that convolutes things too much / leads to unclear semantics.
Ya I feel separate metrics is better as "can't plan probes possibly due to actual issues with meta2" is a different thing than "reading the start key of randomly chosen ranges is registering a lot of errors". The latter is def something we want to page on; that is the whole point of this work. The former might be too, but I want to see what the error rate is like in prod, and causes errors also. To start, I suggest we page on the former but ticket on the latter. With experience we will figure out what is best.
pkg/kv/kvprober/kvprober.go, line 146 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I think you can avoid returning the error here. Acting on the error from
RunAsyncTask is only useful in cases where you have some cleanup to do that results from the task not starting. This is not the case here.
Do we want the server to start if the RunAsyncTask call fails here? Since it seems not expected to ever happen [1], I figured we'd want to return an error here, and I also figured given the code in pkg/server CRDB would fail to start if an error was returned. Basically, the intended current state is that starting kvprober is a hard dep at server startup time, with "starting kvprober" meaning just that RunAsyncTask returns nil.
Thoughts? Soft deps are nice sometimes but I figure only introduce them when needed (when errors are actually expected), since that choice introduces a silent failure mode (silent by design) that requires separate alerting (which of course we could add).
[1] The only error I see that it can return is:
if !s.runPrelude() {
return ErrUnavailable
}
pkg/kv/kvprober/kvprober.go, line 210 at r6 (raw file):
Previously, knz (kena) wrote…
nit: would be slightly easier to read with
readTimeout := ReadTimeout.Get() first, then the RunWithTimeout call on the next line.
Done!
pkg/kv/kvprober/kvprober.go, line 214 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
+1 would like an issue for that one as well.
Will be fun to learn more about the stuff you are doing (always on probing, etc.).
I opened a master ticket here for kvprober: #61074. It has a checklist of work to do, and tracing is included. If you want individual issues or have other thoughts about this, let me know.
pkg/kv/kvprober/kvprober.go, line 215 at r6 (raw file):
I shall leave this one to a follow up PR. Added to the GH issue: #61074
splits will still latch it
Splits as in range splits? I think that's a feature, not a bug. If range splits take long enough that a bunch of prober errors happen, that would be good to know about IMHO, as that would also affect user traffic and isn't on the user to manage (unlike contention on a user table).
pkg/kv/kvprober/kvprober.go, line 225 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit:
runtime is also a prominent Go pkg, I'd call this duration or dur or d or something like that.
Done!
pkg/kv/kvprober/kvprober.go, line 226 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Isn't this too verbose? Seems like the kind of thing we'd put behind a
if log.V(1)
Er. I'm not sure. I think we want this log line to end up in CC splunk and I don't have a good enough mental map of logging yet (on the CC side or the CRDB side) to know how that ties into your suggestion here. Since kvprober is disabled right now, and since the cluster setting is private, I vote we leave as is. I'm gonna experiment in CC by turning the prober on, and if noisy that's okay in the CC context IMO. In doing that I think we'll figure out more about how we want to handle logging.
I also asked some Qs about log.V(...) here to better understand things: https://cockroachlabs.slack.com/archives/C0HM2DZ34/p1614185970049700
pkg/kv/kvprober/kvprober_test.go, line 1 at r6 (raw file):
Previously, knz (kena) wrote…
2021
Done.
pkg/kv/kvprober/kvprober_test.go, line 26 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'd be interested in seeing it. I've never used a mocking framework, but if nothing else it encourages the use of clean interfaces, which in itself might already be worth it. Most of our existing KV code is probably too legacy to adapt it easily, but one can dream! Having you champion it in this little kvprober sandbox might be a nice way for us to lick blood. That said, probably a better follow-up and unclear if you will actually have the extra time so no pressure :-)
Ack!
pkg/kv/kvprober/kvprober_test.go, line 41 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
why?
Not really needed. Fixed.
pkg/kv/kvprober/kvprober_test.go, line 44 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Yeah I can see how gomock would help you here.
Ack!
pkg/kv/kvprober/kvprober_test.go, line 49 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could you make a new `p` for each subtest? Reuse makes these tests hard to follow & more error-prone. Just make a little helper that you can call in each of the subtests to give you a `*p`.
Good call. Done.
pkg/kv/kvprober/kvprober_test.go, line 89 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Extend this test with another failed plan attempt and then a third one that succeeds?
I don't really see the value. This is a unit test of a single call to probe. Why call probe multiple times from the test? What are we gonna discover by doing that, especially given the integration tests?
pkg/kv/kvprober/kvprober_test.go, line 106 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ditto about extending a bit
See my above comment. Happy to do so but don't yet understand the value.
pkg/kv/kvprober/kvprober_test.go, line 145 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
`time.Duration` is an int64, so shouldn't this (and the prod impl) be a `func(n int64) int64`?
Done!
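For illustration, a randomization helper with that shape might look like the following sketch. The name and the 25% jitter factor are assumptions for the example, not kvprober's actual values; the point is that taking and returning `int64` keeps the helper trivially mockable since `time.Duration` is defined as an int64.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredInterval returns a value drawn uniformly from [d, 1.25*d).
// Using int64 rather than time.Duration makes the function easy to
// swap out in unit tests. (Hypothetical helper; the exact jitter
// policy in kvprober may differ.)
func jitteredInterval(d int64) int64 {
	return d + rand.Int63n(d/4)
}

func main() {
	d := int64(time.Second)
	j := jitteredInterval(d)
	fmt.Println(j >= d && j < d+d/4)
}
```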
pkg/kv/kvprober/kvprober_test.go, line 176 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
🤔 almost seems like we could use `quick.Check` here but it doesn't seem to buy much since the code snippet is so small.
I'm gonna leave the test as is, but I didn't know about this and it looks really cool!
pkg/kv/kvprober/kvprober_integration_test.go, line 42 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Mind putting this into `main_test.go` like the other pkgs do?
Done!
pkg/kv/kvprober/kvprober_integration_test.go, line 65 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'd avoid arbitrary sleeps (test builds can get quite slow) and do something like
testutils.SucceedsSoon(t, func() error {
    // Check that you've seen at least 10 probes, or return an error.
    // Check that you've seen no failures, or `t.Fatal()`.
})
Ah that's much nicer looking also. Done.
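The pattern being suggested can be sketched with only the stdlib. This is a simplified stand-in for CockroachDB's `testutils.SucceedsSoon` (which has its own backoff policy and deadline); the names and timings here are illustrative only.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// succeedsSoon retries fn with growing backoff until it returns nil
// or the deadline passes, returning the last error on timeout. A toy
// version of the retry helper used in place of arbitrary sleeps.
func succeedsSoon(timeout time.Duration, fn func() error) error {
	deadline := time.Now().Add(timeout)
	wait := time.Millisecond
	var err error
	for time.Now().Before(deadline) {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(wait)
		if wait < 100*time.Millisecond {
			wait *= 2
		}
	}
	return err
}

func main() {
	probes := 0
	err := succeedsSoon(5*time.Second, func() error {
		probes++ // stands in for reading the probe-count metric
		if probes < 10 {
			return errors.New("not enough probes yet")
		}
		return nil
	})
	fmt.Println(err == nil)
}
```

The advantage over a fixed sleep is that the test finishes as soon as the condition holds, and slow test builds only push out the deadline rather than cause flakes.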
pkg/kv/kvprober/kvprober_integration_test.go, line 98 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ditto SucceedsSoon
Done.
pkg/kv/kvprober/kvprober_integration_test.go, line 136 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ditto
Done!
pkg/kv/kvprober/kvprober_integration_test.go, line 142 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ditto
Done.
pkg/kv/kvprober/kvprober_integration_test.go, line 144 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I know it's tempting to do this thresholding, but if you want to do that I would suggest writing tests that explicitly plan probes and verify that the metrics get updated accordingly, i.e. take all of the time dependency out of the test itself. You could do this by not calling `.Start` but invoking `.probe` manually, which I assume you do outside of the integration tests.

Btw there's this nifty trick: you can put a public method into `helpers_test.go` (pkg `kvprober_test`) and that will allow you to call that method from this test (since they're both `kvprober_test`) without adding an ugly exported method to `kvprober`.
Wow that is a nifty trick. Thank you for showing me that. Wait but isn't helpers_test.go in pkg kvprober actually so it has access to kvprober private stuff? I followed this guide: https://medium.com/@robiplus/golang-trick-export-for-test-aa16cbd7b8cd. It seems to be working.
Done.
pkg/kv/kvprober/kvprober_integration_test.go, line 163 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Again use SucceedsSoon here, and just on the off chance that ranges split or merge during the test, compute the expected count in each iteration.
Computing expected count each iteration. require.Eventually is MUCH faster than SucceedsSoon in this case (looks like due to retry policy), so gonna stick with it.
pkg/kv/kvprober/kvprober_integration_test.go, line 218 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Why do you need to clear all these?
Er good point. We don't. Fixed.
pkg/kv/kvprober/plan.go, line 25 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: since you have a type Plan that you're referring to here, capitalize the word to make it clear that that's what you're doing.
Done!
pkg/kv/kvprober/plan.go, line 1 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Haha.
pkg/kv/kvprober/plan.go, line 27 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Intuitively my mind wants to think of a plan as more than a decision for a single range to probe. Probing a single range just seems like a step in the plan? Don't do anything if that doesn't sound right to you. Otherwise, consider a round of rename bikeshedding, e.g.
type Planner interface {
    Next(ctx context.Context) (Step, error)
}
I like that! Especially the function name Next. This was bugging me also. Thanks.
pkg/kv/kvprober/plan.go, line 37 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Mention what to do if there's an error (errors are all to be considered temporary)
Done.
pkg/kv/kvprober/plan.go, line 65 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I don't know, this might be confusing for people who are not familiar with the code. Ultimately this is about prefetching?
`NumPrefetchPlans`? Doc suggestion:
The number of ranges for which plans are created in bulk to be probed. Plans are randomized within each prefetch cycle, so setting this to a small number will lead to close-to-lexical probing. Prefetched plans are held in memory, so large values are advised against.
Also can't this be unexported?
I like the docs! I changed the name of the setting to kv.prober.planner.num_steps_to_plan_at_once.
I've un-exported also using the trick from your other comment (as I set this setting in integration tests).
pkg/kv/kvprober/plan.go, line 114 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
We don't sort, right? We shuffle (and I assume it's Fisher-Yates). The cost is negligible compared to the cost of actually populating the slice, where we do unmarshalling!
Er interesting. I didn't know that. Updated comment to clarify that a shuffle is done.
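For reference, Go's `rand.Shuffle` is exactly such a Fisher-Yates pass: a single O(n) sweep of swaps. A small sketch of shuffling the prefetched plans, using ints as stand-ins for the plan structs:

```go
package main

import (
	"fmt"
	"math/rand"
)

// shufflePlans randomizes the prefetched plans in place using
// rand.Shuffle (Fisher-Yates). The O(n) swap pass is cheap next to
// the cost of scanning meta2 and unmarshalling range descriptors.
func shufflePlans(plans []int) {
	rand.Shuffle(len(plans), func(i, j int) {
		plans[i], plans[j] = plans[j], plans[i]
	})
}

func main() {
	plans := []int{1, 2, 3, 4, 5} // stand-ins for prefetched probe plans
	shufflePlans(plans)
	fmt.Println(len(plans)) // same elements, randomized order
}
```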
pkg/kv/kvprober/plan.go, line 123 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It's interesting to note that the planner will get stuck on a broken meta2 range. I don't see a way around that thanks to the cursor, but it's interesting to note.
Ya. I think we see how often this happens in practice, what it means for cluster heath, and how actionable it is for SRE. Then iterate.
Interestingly, problems unmarshalling a single descriptor will not really block the prober, as the next planning run will scan from an updated cursor; the cursor update happens BEFORE calling meta2KVsToPlans. More an accident than a design choice I made on purpose, but maybe a good one?
I think we should think about this stuff more. Will note that in the GH issue.
pkg/kv/kvprober/plan.go, line 129 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Let's assume we hit this, we'd probably hit this every time we see the failing descriptor, what happens to the prober? This message will be logged somewhere at the caller, right?
If we hit this, the planner error rate metric will be incremented, and also we will log the error, yes. SRE will alert on that metric, though probably ticket (think biz hours response) rather than page to start, so we can get some experience with planning failures and what they mean for overall cluster health.
pkg/kv/kvprober/plan.go, line 155 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Add:
// NB: keys.Meta2KeyMax stores a descriptor, so we want to include it.
Done!
pkg/kv/kvprober/plan.go, line 161 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I know it shouldn't happen, but be resilient to `newkvs` being empty.
Ah. Good point. I am much happier with an error in this case rather than an infinite loop. Thank you. Fixed.
pkg/kv/kvprober/plan.go, line 186 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
More idiomatic:
cockroach/pkg/migration/migrations/truncated_state.go
Lines 46 to 52 in 126ea38
Heh, I saw that we had this discussion above. I still think referencing LocalMax explicitly is the way to go here.
I gave this a shot previously and seemed to cause actual test failures. Search for "I attempted this but my integration tests found some issues". I think I'll note this for follow up PR here:
67ea9d0 to
aacb92e
Compare
|
bors r+ |
|
🔒 Permission denied Existing reviewers: click here to make joshimhoff a reviewer |
|
bors r+ |
|
Build failed: |
|
You need to get a release justification going i think
…On Sat, Feb 27, 2021, 09:45 craig[bot] ***@***.***> wrote:
Build failed:
- GitHub CI (Cockroach)
<https://teamcity.cockroachdb.com/viewLog.html?buildId=2726412&buildTypeId=Cockroach_UnitTests>
|
Release justification: Auxiliary system that is off by default.

This commit introduces a package for probing the KV layer. The goal of sending such probes is greater observability into the reliability of the KV layer & below. In particular, probes may make for a good SLI metric. The control we have over the prober workload will generate data with a high signal-to-noise ratio. This lets us alert operators with confidence.

How does the prober work? Right now, it just does reads. In a loop, it randomly selects a range to probe by scanning meta2. It then does a single-row read of the start key of that range via *kv.DB.

Note that the prober is disabled by default. This commit introduces a private cluster setting that defaults to off.

Release note: None.
aacb92e to
531cdae
Compare
|
bors r+ |
|
🔒 Permission denied Existing reviewers: click here to make joshimhoff a reviewer |
|
Huh I thought I had that already, and I am surprised that the lint failure happens at bors time but not before. I've rewritten the commit message to format the release justification slightly differently. Maybe the linter is picky about it all being on one line or something! |
|
The release justification note must also be present in the PR description at the top. |
|
bors r+ |
|
Build succeeded: |
https://docs.google.com/document/d/1NqsIgizseyMxUBimpE10m6sSnbebQ25VyJRUwxdmJyM/edit
kv: introduce a package for probing the KV layer
This commit introduces a package for probing the KV layer. The goal
of sending such probes is greater observability into the reliability of
the KV layer & below. In particular, probes may make for a good SLI
metric. The control we have over the prober workload will generate data
with a high signal-to-noise ratio. This lets us alert operators with
confidence.
How does the prober work? Right now, it just does reads. In a
loop, it randomly selects a range to probe by scanning meta2.
It then does a single row read of the start key of that range via
*kv.DB.
Note that the prober is disabled by default. This commit introduces a
private cluster setting that defaults to off.
Release justification: auxiliary system that is off by default
Release note: None.