kv: update kvprober with quarantine pool#87436

Merged
craig[bot] merged 4 commits into cockroachdb:master from Santamaura:kvprober-quarantine-pool
Sep 23, 2022

Conversation

@Santamaura
Contributor

These changes update the kvprober to add ranges that
fail probing into a quarantine pool where they are
continuously probed. A metric which indicates the
duration of the longest tenured range has also been
added.

Resolves #74407

Release justification: low risk, high benefit changes to
existing functionality.

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@Santamaura Santamaura force-pushed the kvprober-quarantine-pool branch 2 times, most recently from 5a3c55a to 7b8a1d7 on September 6, 2022 17:49
@Santamaura
Contributor Author

Santamaura commented Sep 6, 2022

Spin-off of #76081. Adding you @joshimhoff for early feedback on this draft, since I'm not too sure if this is in the right direction of what we want.

Collaborator

@joshimhoff joshimhoff left a comment


Looks really good!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff and @Santamaura)


pkg/kv/kvprober/kvprober.go line 60 at r1 (raw file):

	// quarantine pools keep track of ranges that timed out/errored when probing and
	// continually probe those ranges until killed/probe is successful
	quarantineReadPool  *quarantinePool

I think we only need a quarantine pool for writes. For single range outages, I've never seen one that affected reads but not writes.


pkg/kv/kvprober/kvprober.go line 64 at r1 (raw file):

}

type quarantinePool struct {

Can we pull this new abstraction into a new file?


pkg/kv/kvprober/kvprober.go line 67 at r1 (raw file):

	steps        []Step
	size         int64
	entryTimeMap map[roachpb.RangeID]time.Time

I wonder if we should worry about CRDB restarts. As is, since the quarantine pool is an in-memory structure, restarts will both clear the pool & the gauges that track time since last healthy. This could lead to false negative pages, e.g. if some maintenance op like a k8s upgrade runs at the "wrong" time. In particular, SREs often restart nodes IN ORDER to resolve KV outages. As is, doing this might make an outage look resolved even if it isn't, which is unfortunate. Of course, we could teach SREs to use crdb_internal.probe_ranges in this situation. Still, I think the overall point stands as something to consider.

To handle this, we could write a serialization of the state of the quota pool to pebble directly from kvprober. Similar to what is done at https://github.com/cockroachdb/cockroach/pull/78092/files#diff-3078d8ca3bbb5b411fa96cdeaf05f1c16e59e872498ddb049e3265d1f65230e4R559.

This might be over-engineering. CC @tbg to weigh in too.


pkg/kv/kvprober/kvprober.go line 96 at r1 (raw file):

		log.Health.Errorf(ctx, "cannot remove range %s from quarantine pool, not found", step.RangeID.String())
	}
	// expensive op if pool size is very large.

I don't think there is a benefit to making the pool size very large in prod, since the main idea here is to catch outages with very small blast radius, e.g. one range, and since we want to page SRE even in case of a one range outage.


pkg/kv/kvprober/kvprober.go line 102 at r1 (raw file):

func (qp *quarantinePool) next() (Step, error) {
	if len(qp.steps) > 0 {
		return qp.steps[rand.Intn(len(qp.steps))], nil

I'd vote we just iterate thru the list of steps. Do you see a benefit to randomizing?
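
The iteration being suggested could look like this round-robin sketch (the `idx` field is a hypothetical addition for illustration; the PR's real `next` signature later also takes a context):

```go
package main

import "errors"

type Step struct{ RangeID int64 }

type quarantinePool struct {
	steps []Step
	idx   int // next position to probe; wraps around
}

// next returns quarantined steps in order rather than at random, so
// every broken range is revisited at a predictable cadence.
func (qp *quarantinePool) next() (Step, error) {
	if len(qp.steps) == 0 {
		return Step{}, errors.New("quarantine pool is empty")
	}
	s := qp.steps[qp.idx%len(qp.steps)]
	qp.idx++
	return s, nil
}
```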


pkg/kv/kvprober/kvprober.go line 471 at r1 (raw file):

}

func (p *Prober) quarantineProbeImpl(ctx context.Context, ops proberOpsI, txns proberTxn, isWrite bool) {

I think we should consider having quarantine pool implement planner. If we do that, I think we could reuse p.writeProbe rather than introducing a new function called p.quarantineProbe, which is very similar to p.writeProbe.


pkg/kv/kvprober/settings.go line 185 at r1 (raw file):

	false)

var quarantineWriteInterval = settings.RegisterDurationSetting(

I wonder if we worry about over-probing a single broken range. Let's say we have a 200 node cluster, and a single range is borked. We'd expect that all nodes eventually probe that one range. Let's say this setting is configured to probe the quarantine pool at 1 QPS. Then we'd have 200 QPS on the single borked range.

I'm not sure how big of a worry this is. CC @tbg to weigh in. One "solution" is to set this quarantine pool write interval conservatively.

If we want a truer solution, we could make this a QPS target for the cluster as a whole. We'd then divide the target by the number of ranges to get an interval that we sleep between probe attempts. Then we'd have confidence we wouldn't over-probe borked ranges, regardless of changes in number of nodes.

@joshimhoff joshimhoff self-requested a review September 7, 2022 13:35
@tbg tbg self-requested a review September 7, 2022 13:37
Member

@tbg tbg left a comment


Haven't reviewed in detail yet, but the approach looks good!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff and @Santamaura)


pkg/kv/kvprober/kvprober.go line 64 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

Can we pull this new abstraction into a new file?

+1 let's move this to quarantine_pool.go


pkg/kv/kvprober/kvprober.go line 67 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

I wonder if we should worry about CRDB restarts. As is, since the quarantine pool is an in-memory structure, restarts will both clear the pool & the gauges that track time since last healthy. This could lead to false negative pages, e.g. if some maintenance op like a k8s upgrade runs at the "wrong" time. In particular, SREs often restart nodes IN ORDER to resolve KV outages. As is, doing this might make an outage look resolved even if it isn't, which is unfortunate. Of course, we could teach SREs to use crdb_internal.probe_ranges in this situation. Still, I think the overall point stands as something to consider.

To handle this, we could write a serialization of the state of the quota pool to pebble directly from kvprober. Similar to what is done at https://github.com/cockroachdb/cockroach/pull/78092/files#diff-3078d8ca3bbb5b411fa96cdeaf05f1c16e59e872498ddb049e3265d1f65230e4R559.

This might be over-engineering. CC @tbg to weigh in too.

I would avoid persistence for now, as this also opens up new failure modes for the prober. But I see the problem and it might make sense as a follow-up.

But in a sense, don't SREs almost want to really capitalize on kvprober by probing all ranges liberally when there's any doubt about the cluster being fully functional? This should automatically be tacked on to any rolling restart or emergency node restart. That would reduce the role of the quarantine pool to pinning broken ranges within the lifetime of a single process.

Btw, is "max time in pool > 30min" the signal SREs should probe on? Should it be "size of pool > 0 for >30 minutes" instead? Curious to hear the argument, but it doesn't materially affect the earlier discussion since a node restart could lose the one unavailable range and not find it again for a while, unless we do a proactive scan.


pkg/kv/kvprober/kvprober.go line 84 at r1 (raw file):

func (qp *quarantinePool) remove(ctx context.Context, step Step) {
	if len(qp.steps) < 1 {
		log.Health.Errorf(ctx, "cannot remove range %s from quarantine pool, pool is empty", step.RangeID.String())

do you want to return here?
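
The missing early return being pointed out would look like this (reduced sketch; plain printf logging stands in for log.Health, and RangeID is simplified to int64):

```go
package main

import "fmt"

type Step struct{ RangeID int64 }

type quarantinePool struct{ steps []Step }

func (qp *quarantinePool) remove(step Step) {
	if len(qp.steps) < 1 {
		fmt.Printf("cannot remove range %d from quarantine pool, pool is empty\n", step.RangeID)
		return // without this, the loop below runs over nothing and logs a second, misleading "not found" error
	}
	for i, s := range qp.steps {
		if s.RangeID == step.RangeID {
			qp.steps = append(qp.steps[:i], qp.steps[i+1:]...)
			return
		}
	}
	fmt.Printf("cannot remove range %d from quarantine pool, not found\n", step.RangeID)
}
```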


pkg/kv/kvprober/kvprober.go line 313 at r1 (raw file):

func (p *Prober) Start(ctx context.Context, stopper *stop.Stopper) error {
	ctx = logtags.AddTag(ctx, "kvprober", nil /* value */)
	startLoop := func(ctx context.Context, opName string, probe func(context.Context, *kv.DB, planner), qProbe func(context.Context, *kv.DB, bool), pl planner, interval *settings.DurationSetting, isQuarantine bool, isWrite bool) error {

With the distinction between read and write gone, can we unwind some of the parameters here? For example, would be nice only to keep the old signature, i.e. not letting the prober instance know whether it's quarantine probing or regular probing.


pkg/kv/kvprober/kvprober.go line 471 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

I think we should consider having quarantine pool implement planner. If we do that, I think we could reuse p.writeProbe rather than introducing a new function called p.quarantineProbe, which is very similar to p.writeProbe.

+1 I haven't reviewed in detail yet but I also think there is some way to reuse the existing interfaces so that the quarantine prober is just a prober pulling off the quarantine pool with some wrapping and less code duplication.


pkg/kv/kvprober/settings.go line 185 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

I wonder if we worry about over-probing a single broken range. Let's say we have a 200 node cluster, and a single range is borked. We'd expect that all nodes eventually probe that one range. Let's say this setting is configured to probe the quarantine pool at 1 QPS. Then we'd have 200 QPS on the single borked range.

I'm not sure how big of a worry this is. CC @tbg to weigh in. One "solution" is to set this quarantine pool write interval conservatively.

If we want a truer solution, we could make this a QPS target for the cluster as a whole. We'd then divide the target by the number of ranges to get an interval that we sleep between probe attempts. Then we'd have confidence we wouldn't over-probe borked ranges, regardless of changes in number of nodes.

Before getting into communication between different kvprobers, I would rather add a per-range rate limiter that fast-fails probes when above a qps threshold assuming that the recent probes also failed. That way, the probers get what they want (information that the range fails) and the range isn't going to be put over the top by the many probers looking at it. We could go as far as removing the range from the q-pool if it got fail-fasted since presumably many other nodes are looking at it as well.
But this is a can we can kick down the road, I think. The quarantine probe interval should certainly be conservative, though.

@Santamaura Santamaura force-pushed the kvprober-quarantine-pool branch from 7b8a1d7 to 74cdeff on September 8, 2022 14:16
@joshimhoff joshimhoff requested review from joshimhoff and tbg September 8, 2022 14:35
Collaborator

@joshimhoff joshimhoff left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @Santamaura, and @tbg)


pkg/kv/kvprober/kvprober.go line 67 at r1 (raw file):
Ya, I think there are some tradeoffs, and I don't have a strong view. Def agree to not touch persistence right now.

size of pool > 0 for >30 minutes

Hmm. I think (?) I agree re: this too.


pkg/kv/kvprober/settings.go line 185 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Before getting into communication between different kvprobers, I would rather add a per-range rate limiter that fast-fails probes when above a qps threshold assuming that the recent probes also failed. That way, the probers get what they want (information that the range fails) and the range isn't going to be put over the top by the many probers looking at it. We could go as far as removing the range from the q-pool if it got fail-fasted since presumably many other nodes are looking at it as well.
But this is a can we can kick down the road, I think. The quarantine probe interval should certainly be conservative, though.

Somewhere in memory is there a count of the number of nodes in the cluster? I feel that we shouldn't need communication between different kvprobers to limit overall QPS on problem ranges. If we had the count of the number of nodes in the cluster, we would just normalize our sleep interval by that. Obvi the count of the number of nodes in the cluster requires communication between CRDB nodes, but if we already have it, we already have it.

@Santamaura Santamaura force-pushed the kvprober-quarantine-pool branch from 74cdeff to 63a026f on September 8, 2022 20:43
Contributor Author

@Santamaura Santamaura left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff and @tbg)


pkg/kv/kvprober/kvprober.go line 60 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

I think we only need a quarantine pool for writes. For single range outages, I've never seen one that affected reads but not writes.

Ack, removed the quarantine read pool.


pkg/kv/kvprober/kvprober.go line 64 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

+1 let's move this to quarantine_pool.go

Done.


pkg/kv/kvprober/kvprober.go line 84 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

do you want to return here?

Yes, thanks for the catch.


pkg/kv/kvprober/kvprober.go line 96 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

I don't think there is a benefit to making the pool size very large in prod, since the main idea here is to catch outages with very small blast radius, e.g. one range, and since we want to page SRE even in case of a one range outage.

Ack, if you want I can remove the comment.


pkg/kv/kvprober/kvprober.go line 102 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

I'd vote we just iterate thru the list of steps. Do you see a benefit to randomizing?

Randomizing was based on that old draft, I agree it makes more sense to just iterate.


pkg/kv/kvprober/kvprober.go line 313 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

With the distinction between read and write gone, can we unwind some of the parameters here? For example, would be nice only to keep the old signature, i.e. not letting the prober instance know whether it's quarantine probing or regular probing.

Done, signature is back to original state.


pkg/kv/kvprober/kvprober.go line 471 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

+1 I haven't reviewed in detail yet but I also think there is some way to reuse the existing interfaces so that the quarantine prober is just a prober pulling off the quarantine pool with some wrapping and less code duplication.

Done, this func is essentially just a wrapper for writeProbe now. We just use it to verify that quarantine pool is enabled.

@Santamaura Santamaura force-pushed the kvprober-quarantine-pool branch from 63a026f to ed1f52b on September 8, 2022 21:02
Member

@tbg tbg left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff and @tbg)


pkg/kv/kvprober/settings.go line 185 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

Somewhere in memory is there a count of the number of nodes in the cluster? I feel that we shouldn't need communication between different kvprobers to limit overall QPS on problem ranges. If we had the count of the number of nodes in the cluster, we would just normalize our sleep interval by that. Obvi the count of the number of nodes in the cluster requires communication between CRDB nodes, but if we already have it, we already have it.

But then if you have 500 nodes in the cluster, you're only going to reprobe the (maybe single) range you have in the quarantine pool every 500 minutes as opposed to every minute, so now your alerts don't make sense, etc.
Not saying there isn't a problem to be solved here, but if we make what actually happens too dynamic, it is a little hard to reason about the numbers it produces.

@tbg tbg self-requested a review September 9, 2022 13:59
@Santamaura Santamaura force-pushed the kvprober-quarantine-pool branch 3 times, most recently from 80f6ace to 5323aa0 on September 12, 2022 16:38
Member

@tbg tbg left a comment


Looks good! Heads up I started reviewing but now the buses to Camp Roach are leaving. I'll post a commit upon my return with some improvements that would be less efficient to communicate through comments.

Dismissed @joshimhoff from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff and @Santamaura)


pkg/kv/kvprober/kvprober.go line 60 at r1 (raw file):

Previously, Santamaura (Alex Santamaura) wrote…

Ack, removed the quarantine read pool.

I re-worded the comment slightly since there's only one pool now.


pkg/kv/kvprober/kvprober.go line 126 at r4 (raw file):

		Unit:        metric.Unit_COUNT,
	}
	metaWriteProbeQuarantineOldestDuration = metric.Metadata{

I moved this up a bit, next to the other write metrics.

@joshimhoff joshimhoff requested a review from tbg September 14, 2022 21:05
Collaborator

@joshimhoff joshimhoff left a comment


Looking nice!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @Santamaura, and @tbg)


pkg/kv/kvprober/kvprober.go line 42 at r4 (raw file):

// Prober sends queries to KV in a loop. See package docstring for more.
type Prober struct {

Time to add some tests? I see this is in draft mode still.


pkg/kv/kvprober/kvprober.go line 294 at r4 (raw file):

		return err
	}
	return startLoop(ctx, "quarantine write probe loop", p.quarantineProbe, p.quarantineWritePool, quarantineWriteInterval)

I suggest a very short blurb about what the purpose of the quarantine pool is here. Something about detecting outages affecting a very small number of ranges at high confidence.


pkg/kv/kvprober/kvprober.go line 391 at r4 (raw file):

		log.Health.Errorf(ctx, "kv.Txn(Put(%s); Del(-)), r=%v failed with: %v", step.Key, step.RangeID, err)
		p.metrics.WriteProbeFailures.Inc(1)
		// Add to quarantine pool if not in the entry map.

I think all this stuff can go inside add. That is, I think the quarantine pool's internal structure need not be exposed in kvprober.go. Also, I think the quarantine pool can be responsible for setting the WriteProbeQuarantineOldestDuration metric. Wdyt about those ideas?


pkg/kv/kvprober/quarantine_pool.go line 11 at r4 (raw file):

// licenses/APL.txt.

// Package kvprober sends queries to KV in a loop, with configurable sleep

Can you delete these copy & pasted comments?


pkg/kv/kvprober/quarantine_pool.go line 31 at r4 (raw file):

)

type quarantinePool struct {

Add some docs on goal & implementation?


pkg/kv/kvprober/quarantine_pool.go line 38 at r4 (raw file):

func newQuarantinePool(settings *cluster.Settings) *quarantinePool {
	poolSize := quarantinePoolSize.Get(&settings.SV)

In my mental model, it is not common for a change to a CRDB cluster setting to require a process restart to take effect. Looks like that is how we have it here?
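
The pattern being flagged, reduced to a toy: caching a setting at construction freezes it, while reading it on every use picks up live changes. (The `atomicSetting` stand-in replaces CRDB's real `settings.IntSetting`, which already supports live updates via SET CLUSTER SETTING; all names here are illustrative.)

```go
package main

import "sync/atomic"

// atomicSetting stands in for a live-updatable cluster setting.
type atomicSetting struct{ v atomic.Int64 }

func (s *atomicSetting) Get() int64  { return s.v.Load() }
func (s *atomicSetting) Set(n int64) { s.v.Store(n) }

type quarantinePool struct {
	sizeSetting *atomicSetting
	frozenSize  int64 // anti-pattern: snapshot taken once at construction
}

func newQuarantinePool(s *atomicSetting) *quarantinePool {
	return &quarantinePool{sizeSetting: s, frozenSize: s.Get()}
}

// maxSize reads the setting on every call, so a changed cluster
// setting takes effect without a process restart.
func (qp *quarantinePool) maxSize() int64 { return qp.sizeSetting.Get() }
```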


pkg/kv/kvprober/quarantine_pool.go line 56 at r4 (raw file):

func (qp *quarantinePool) remove(ctx context.Context, step Step) {
	if len(qp.steps) < 1 {

Is this if statement needed? Without it, I think the zero steps case would just hit the cannot remove range %s from quarantine pool line down below.


pkg/kv/kvprober/quarantine_pool.go line 68 at r4 (raw file):

	}
	if idx == -1 {
		log.Health.Errorf(ctx, "cannot remove range %s from quarantine pool, not found", step.RangeID.String())

The presence of this log line indicates there is a bug in kvprober. Not expected ever. Right?


pkg/kv/kvprober/quarantine_pool.go line 81 at r4 (raw file):

		return step, nil
	}
	return Step{}, errors.New("there are no keys in quarantine")

As written, this will lead to planning failures, which SRE alerts on. See https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvprober/kvprober.go#L352. To solve, I'd consider returning an error of a known kind via errors.Mark or similar and then checking for it at https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvprober/kvprober.go#L352. Could also update the planner interface e.g. to return a nullable *Step. Ignoring implementation, IMO, we must not execute the following lines if error is that there are no ranges in quarantine, since that is expected to happen quite often:

		log.Health.Errorf(ctx, "can't make a plan: %v", err)
		p.metrics.ProbePlanFailures.Inc(1)
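
The `errors.Mark` suggestion, approximated here with a stdlib sentinel error (all names are illustrative; CRDB's own errors library provides `errors.Mark`/`errors.Is` with richer semantics, and the PR ultimately took the zero-value-Step route instead):

```go
package main

import "errors"

type Step struct{ RangeID int64 }

// errEmptyPool marks the expected "nothing quarantined" case so the
// caller can distinguish it from a genuine planning failure.
var errEmptyPool = errors.New("no ranges in quarantine")

type quarantinePool struct{ steps []Step }

func (qp *quarantinePool) next() (Step, error) {
	if len(qp.steps) == 0 {
		return Step{}, errEmptyPool
	}
	return qp.steps[0], nil
}

// plan returns (step, probe it?, hard planning error?).
func plan(qp *quarantinePool) (Step, bool, error) {
	step, err := qp.next()
	if errors.Is(err, errEmptyPool) {
		return Step{}, false, nil // expected often: don't log or bump ProbePlanFailures
	}
	if err != nil {
		return Step{}, false, err
	}
	return step, true, nil
}
```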

pkg/kv/kvprober/settings.go line 185 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

But then if you have 500 nodes in the cluster, you're only going to reprobe the (maybe single) range you have in the quarantine pool every 500 minutes as opposed to every minute, so now your alerts don't make sense, etc.
Not saying there isn't a problem to be solved here, but if we make what actually happens too dynamic, it is a little hard to reason about the numbers it produces.

The assumption I was making is that if a range is unavailable from one node's kvclient then it is unavailable from all nodes. In this case, eventually all nodes will probe the broken range, leading to a known & configurable QPS on the broken range.

Clearly, the assumption I was making is not a good one to make in general. We've seen many network-y outages involving unavailability from one node's kvclient but not another. With that said, I feel most outages of this kind were node failures (e.g. network partitions affecting a subset of nodes) rather than single range failures, in which case quarantine probing is not needed to catch the outage anyway, since the kvprober overall error rate is high enough to page SRE.

Anyway, +1 to kicking this can down the road a bit.


pkg/kv/kvprober/settings.go line 147 at r4 (raw file):

	settings.TenantWritable,
	"kv.prober.quarantine.write.enabled",
	"whether the KV write prober is enabled for the quaranatine pool",

Can we add a bit more commentary in the enable cluster setting doc string about what quarantine probing is?

@joshimhoff joshimhoff self-requested a review September 14, 2022 21:05
Member

@tbg tbg left a comment


@Santamaura I just pushed two fixup commits to this branch, see if you like them. Commit message has descriptions.

Reviewed 1 of 4 files at r4, 2 of 3 files at r5, 2 of 2 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @Santamaura, and @tbg)


pkg/kv/kvprober/quarantine_pool.go line 38 at r4 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

In my mental model, it is not common for a change to a CRDB cluster setting to require a process restart to take effect. Looks like that is how we have it here?

This is fixed in the commit I added.

@tbg
Copy link
Copy Markdown
Member

tbg commented Sep 15, 2022

PS the quarantine prober still needs some tests I think (I realize this is a draft PR).

@Santamaura Santamaura marked this pull request as ready for review September 19, 2022 21:31
@Santamaura Santamaura requested a review from a team September 19, 2022 21:31
@Santamaura Santamaura requested a review from a team as a code owner September 19, 2022 21:31
@Santamaura Santamaura force-pushed the kvprober-quarantine-pool branch from 2c8d4e0 to 7ae9db8 on September 19, 2022 21:45
Member

@tbg tbg left a comment


Reviewed 6 of 7 files at r7, 1 of 1 files at r8, all commit messages.
Dismissed @joshimhoff from 16 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @Santamaura, and @tbg)


pkg/kv/kvprober/quarantine_pool.go line 81 at r4 (raw file):

Previously, Santamaura (Alex Santamaura) wrote…

Should be ok now from Tobi's changes.

Are you sure? We still `return Step{}, errors.New("quarantine pool is empty")` above. We should allow a (nil, nil) result that we handle accordingly here:

https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvprober/kvprober.go#L351

Also make sure it's being unit tested.


pkg/kv/kvprober/settings.go line 147 at r8 (raw file):

	settings.TenantWritable,
	"kv.prober.quarantine.write.enabled",
	"whether the KV write prober is enabled for the quaranatine pool; The "+

typo: quaranatine

@Santamaura Santamaura force-pushed the kvprober-quarantine-pool branch 2 times, most recently from 7272907 to cd4e059 on September 20, 2022 15:09
Contributor Author

@Santamaura Santamaura left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @Santamaura, and @tbg)


pkg/kv/kvprober/quarantine_pool.go line 81 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Are you sure? We still return return Step{}, errors.New("quarantine pool is empty") above. We should allow a (nil, nil) result that we handle accordingly here:

https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvprober/kvprober.go#L351

Also make sure it's being unit tested.

Ah you are correct, good catch. I've modified it and added a check in writeProbeImpl.


pkg/kv/kvprober/settings.go line 147 at r8 (raw file):

Previously, tbg (Tobias Grieger) wrote…

typo: quaranatine

Fixed, thanks.

Member

@tbg tbg left a comment


Nice! Last round of comments but LGTM already assuming those are getting addressed.

Reviewed 5 of 5 files at r9, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff and @Santamaura)


pkg/kv/kvprober/kvprober.go line 378 at r9 (raw file):

	p.metrics.ProbePlanAttempts.Inc(1)

	step, err := pl.next(ctx)

Please also handle the step.RangeID == 0 case in all other callers to pl.next(). I understand they only show up in this path currently, but it's better to be consistent in light of possible future changes that may be carried out by someone else.


pkg/kv/kvprober/kvprober.go line 380 at r9 (raw file):

	step, err := pl.next(ctx)
	// In the case where the quarantine pool is empty don't record a planning failure since
	// it isn't an actua; plan failure.

typo: actua;


pkg/kv/kvprober/kvprober.go line 381 at r9 (raw file):

	// In the case where the quarantine pool is empty don't record a planning failure since
	// it isn't an actua; plan failure.
	if reflect.DeepEqual(step, Step{}) && err == nil {

Just check if step.RangeID == 0. I understand that you can't use step == Step{} due to the slice contained in Step but let's try to avoid DeepEqual.

nit: I'd say err == nil && step.RangeID == 0 (i.e. reverse the order) because it's a little cleaner not to even look at step unless the error is nil.
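
The suggested check, in isolation (sketch; in the PR this guard sits inline in `writeProbeImpl` rather than in a helper like the hypothetical `shouldSkip` below):

```go
package main

type Step struct{ RangeID int64 }

// shouldSkip reports whether planning yielded "nothing to probe": a
// nil error together with a zero-valued Step. Checking err first
// means we never inspect step on a failed plan, which is the point
// of tbg's ordering nit.
func shouldSkip(step Step, err error) bool {
	return err == nil && step.RangeID == 0
}
```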


pkg/kv/kvprober/kvprober_test.go line 234 at r9 (raw file):

	if !m.emptyQPool {
		step = Step{
			RangeID: 0,

Need to use nonzero here after following my suggestion in other comment.

@Santamaura Santamaura force-pushed the kvprober-quarantine-pool branch from b7dd779 to 835f51b on September 21, 2022 14:22
Contributor Author

@Santamaura Santamaura left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff and @tbg)


pkg/kv/kvprober/kvprober.go line 378 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Please also handle the step.RangeID == 0 case in all other callers to pl.next(). I understand they only show up in this path currently, but it's better to be consistent in light of possible future changes that may be carried out by someone else.

Can you clarify what case we would want to handle where just step.RangeID == 0?


pkg/kv/kvprober/kvprober.go line 380 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

typo: actua;

Fixed


pkg/kv/kvprober/kvprober.go line 381 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Just check if step.RangeID == 0. I understand that you can't use step == Step{} due to the slice contained in Step but let's try to avoid DeepEqual.

nit: I'd say err == nil && step.RangeID == 0 (i.e. reverse the order) because it's a little cleaner not to even look at step unless the error is nil.

Makes sense, changed to err == nil && step.RangeID == 0


pkg/kv/kvprober/kvprober_test.go line 234 at r9 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Need to use nonzero here after following my suggestion in other comment.

Done

@joshimhoff joshimhoff requested a review from tbg September 21, 2022 18:12
Collaborator

@joshimhoff joshimhoff left a comment


:lgtm:

Super excited to run this in prod. I vote we backport it. Since it is off by default, and since all kvprober cluster settings are private, I think this is a reasonable practice.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @Santamaura and @tbg)


pkg/kv/kvprober/kvprober.go line 378 at r9 (raw file):

Previously, Santamaura (Alex Santamaura) wrote…

Can you clarify what case we would want to handle where just step.RangeID == 0?

I think Tobias means in readProbeImpl?


pkg/kv/kvprober/quarantine_pool_test.go line 44 at r10 (raw file):

		require.Equal(t, 1, len(p.quarantineWritePool.mu.steps))
		require.Equal(t, 1, len(p.quarantineWritePool.mu.entryTimeMap))

I'd suggest calling p.quarantineProbe here and confirm metrics look as expected. Should see a failure. This is a more a blackbox testing approach that confirms that the quarantine pool is working E2E in the expected way. And do same post removing from the quarantine pool at bottom of this test case.

@Santamaura Santamaura force-pushed the kvprober-quarantine-pool branch from b7dd779 to 835f51b on September 21, 2022 19:55
Contributor Author

@Santamaura Santamaura left a comment


@joshimhoff how far do we want to backport this to (obviously 22.2, 22.1 but any further)?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @joshimhoff, @Santamaura, and @tbg)


pkg/kv/kvprober/quarantine_pool_test.go line 44 at r10 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

I'd suggest calling p.quarantineProbe here and confirm metrics look as expected. Should see a failure. This is a more a blackbox testing approach that confirms that the quarantine pool is working E2E in the expected way. And do same post removing from the quarantine pool at bottom of this test case.

Calling p.quarantineProbe in the test causes an invalid memory address or nil pointer dereference panic (I think related to the txns calls) which may be why other tests explicitly call writeProbeImpl. I added the check for the WriteProbeFailures metric but off the top of my head don't know of an easy way to call p.quarantineProbe.

Member

@tbg tbg left a comment


LGTM mod open comments.

(obviously 22.2, 22.1 but any further)?

22.2 would've been my expectation.

Reviewed 2 of 2 files at r10, 1 of 1 files at r11, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @joshimhoff and @Santamaura)


pkg/kv/kvprober/kvprober.go line 378 at r9 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

I think Tobias means in readProbeImpl?

step.RangeID==0 means no step was returned (i.e. the new path that was added). We should handle it in all callers to pl.next (so probably just the readProbeImpl).

@joshimhoff
Collaborator

joshimhoff commented Sep 23, 2022

I vote 22.2 & 22.1. Those are the two major versions we will be supporting on cloud for the next six months or so. Since this functionality is off by default & guarded by a private cluster setting, I think it is okay to backport into 22.1.

Santamaura and others added 4 commits September 23, 2022 10:33
These changes update the kvprober to add ranges that
fail probing into a quarantine pool where they are
continuously probed. A metric which indicates the
duration of the longest tenured range has also been
added.

Resolves cockroachdb#74407

Release justification: low risk, high benefit changes to
existing functionality.

Release note: None
- use FunctionalGauge so that oldest time is reflected accurately
- make quarantinePool actually pay attention to changes to the size cluster setting
- fix up some logging that would otherwise create noise at Error level
addressing comments

Release note: None
@Santamaura Santamaura force-pushed the kvprober-quarantine-pool branch from 066f0a2 to 165423b on September 23, 2022 14:33
@Santamaura
Contributor Author

TFTRs! Unless more comments come up I'll merge EoD

@Santamaura
Contributor Author

bors r+

@craig
Contributor

craig bot commented Sep 23, 2022

Build succeeded:

@craig craig bot merged commit b6f8265 into cockroachdb:master Sep 23, 2022
@blathers-crl

blathers-crl bot commented Sep 23, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 01034a4 to blathers/backport-release-22.1-87436: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@Santamaura Santamaura deleted the kvprober-quarantine-pool branch October 3, 2022 15:13


Successfully merging this pull request may close these issues.

kvprober: find single range issues by repeatedly probing problem ranges with issues after randomly finding a candidate problem range
