Skip to content

kvprober: rate limit the planner#61275

Merged
craig[bot] merged 3 commits intocockroachdb:masterfrom
joshimhoff:kv_prober_meta2_protect
Mar 4, 2021
Merged

kvprober: rate limit the planner#61275
craig[bot] merged 3 commits intocockroachdb:masterfrom
joshimhoff:kv_prober_meta2_protect

Conversation

@joshimhoff
Copy link
Copy Markdown
Collaborator

@joshimhoff joshimhoff commented Mar 1, 2021

#61255

kvprober: rate limit the planner

This commit introduces a planning rate limit. This protects CRDB from
planning executing too often, due to either issues with CRDB (e.g.
meta2 unavailability) or bugs in kvprober. When planning does execute,
kvprober scans kv.prober.planner.num_steps_to_plan_at_once rows worth
of meta2 and unmarshalls the resulting range descriptors.

Release justification: Auxiliary system that is off by default.
Release note: None.

@joshimhoff joshimhoff requested review from jreut, knz and tbg March 1, 2021 18:34
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@joshimhoff joshimhoff force-pushed the kv_prober_meta2_protect branch 5 times, most recently from 73d136e to c6430b4 Compare March 2, 2021 00:51
@tbg tbg changed the title [small] kvprober: rate limit the planner via a cluster setting kvprober: rate limit the planner via a cluster setting Mar 2, 2021
Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random note: I missed earlier that we are using time.Ticker here, I think you want to use timeutil.NewTimer and after reading from t.C you want to do t.Read = true. Shouldn't change anything, but it's cleaner.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @jreut, and @knz)


pkg/kv/kvprober/planner.go, line 91 at r1 (raw file):

.probe

pkg/kv/kvprober/settings.go, line 78 at r1 (raw file):

	})

var plannerRateLimit = settings.RegisterDurationSetting(

Hmm. I know I was on board with this approach, but now that I see it it seems less simple than I anticipated. For one, with the cluster setting, we now have to wonder what the behavior is when you set it to a value smaller than the probing interval (I think it will just bounce a few probes every time it needs to do planning and pollute the metrics, so not too bad?)

I think the right thing here could be to avoid the cluster setting and to "hard"-code it as 0.5 times the probing interval times the prefetch size.
So if you're fetching up to 100 descs during planning, and you want to run a step every 2s, if you ever hit an error while fetching descs you would back off 0.5 * 100 * 2s = 100s. Or, in other words, the load on meta2 that results from a failure is at most 2x the load it would get if probing were to occur normally. As a result the meta ranges are protected and we don't have to introduce another knob that could be tuned incorrectly. We could even just drop the random factor 0.5 that I threw in there.

@joshimhoff joshimhoff requested a review from tbg March 2, 2021 14:15
Copy link
Copy Markdown
Collaborator Author

@joshimhoff joshimhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @joshimhoff, @jreut, @knz, and @tbg)


pkg/kv/kvprober/settings.go, line 78 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Hmm. I know I was on board with this approach, but now that I see it it seems less simple than I anticipated. For one, with the cluster setting, we now have to wonder what the behavior is when you set it to a value smaller than the probing interval (I think it will just bounce a few probes every time it needs to do planning and pollute the metrics, so not too bad?)

I think the right thing here could be to avoid the cluster setting and to "hard"-code it as 0.5 times the probing interval times the prefetch size.
So if you're fetching up to 100 descs during planning, and you want to run a step every 2s, if you ever hit an error while fetching descs you would back off 0.5 * 100 * 2s = 100s. Or, in other words, the load on meta2 that results from a failure is at most 2x the load it would get if probing were to occur normally. As a result the meta ranges are protected and we don't have to introduce another knob that could be tuned incorrectly. We could even just drop the random factor 0.5 that I threw in there.

I think I like that better too. Thanks for the idea. Gonna try it out.

@joshimhoff joshimhoff changed the title kvprober: rate limit the planner via a cluster setting kvprober: rate limit the planner Mar 2, 2021
@joshimhoff joshimhoff force-pushed the kv_prober_meta2_protect branch from c6430b4 to c293123 Compare March 2, 2021 15:12
Copy link
Copy Markdown
Collaborator Author

@joshimhoff joshimhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL!

I'm gonna add moving to timeutil.NewTicket to my list of maintenance work to get done post 21.1 branch being cut. It's not totally clear to me what you mean re: t.Read and I want to get one other change into 21.1 ideally (add structured logging) and also do some e2e manual testing.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jreut, @knz, and @tbg)


pkg/kv/kvprober/planner.go, line 91 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…
.probe

Done.


pkg/kv/kvprober/settings.go, line 78 at r1 (raw file):

Previously, joshimhoff (Josh Imhoff) wrote…

I think I like that better too. Thanks for the idea. Gonna try it out.

Done. I like it.

Copy link
Copy Markdown
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I added two commits to your branch. You should be able to merge at will.

bors d+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jreut and @knz)

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 3, 2021

✌️ joshimhoff can now approve this pull request. To approve and merge a pull request, simply reply with bors r+. More detailed instructions are available here.

@joshimhoff
Copy link
Copy Markdown
Collaborator Author

TTTF and two commits, Tobias!

@joshimhoff joshimhoff force-pushed the kv_prober_meta2_protect branch 3 times, most recently from 2140436 to 8c35466 Compare March 4, 2021 15:54
joshimhoff and others added 3 commits March 4, 2021 12:01
This commit introduces a planning rate limit. This protects CRDB from
planning executing too often, due to either issues with CRDB (e.g.
meta2 unavailability) or bugs in kvprober. When planning does execute,
kvprober scans kv.prober.planner.num_steps_to_plan_at_once rows worth
of meta2 and unmarshalls the resulting range descriptors.

Release justification: Auxiliary system that is off by default.
Release note: None.
Release justification: bug fixes and low-risk updates to new functionality
Release note: None
A nonzero limit has the potential to cause issues, albeit theoretical
ones, if the clock doesn't advance even 1ns between two operations.
This is mostly hypothetical and with zero we still have issues if the
clock jumps backwards, but still.

Release justification: non-production code changes
Release note: None
@joshimhoff joshimhoff force-pushed the kv_prober_meta2_protect branch from 8c35466 to 56c54a4 Compare March 4, 2021 17:01
@joshimhoff
Copy link
Copy Markdown
Collaborator Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Mar 4, 2021

Build succeeded:

@craig craig bot merged commit c0d8340 into cockroachdb:master Mar 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants