Skip to content

kvserver: add kv.closed_timestamp.policy_switch_latency_bucket_exceed_threshold#144115

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
wenyihu6:dampen
Apr 22, 2025
Merged

kvserver: add kv.closed_timestamp.policy_switch_latency_bucket_exceed_threshold#144115
craig[bot] merged 1 commit intocockroachdb:masterfrom
wenyihu6:dampen

Conversation

@wenyihu6
Copy link
Copy Markdown
Contributor

@wenyihu6 wenyihu6 commented Apr 8, 2025

This commit introduces a new cluster setting:
kv.closed_timestamp.policy_switch_latency_bucket_crossing_threshold. It defines
the fraction of the closed timestamp policy bucket width that must be exceeded
before a policy switch is triggered.

This helps prevent aggressive policy changes for ranges near bucket boundaries,
reducing excessive updates sent via side transport.

Part of: #143890
Release note: none
Epic: none

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 8, 2025

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 9, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@wenyihu6 wenyihu6 force-pushed the dampen branch 3 times, most recently from d4bbe76 to b4e5703 Compare April 10, 2025 18:36
@wenyihu6 wenyihu6 changed the title kvserver: add dampening kvserver: add kv.closed_timestamp.policy_switch_latency_bucket_crossing_threshold Apr 10, 2025
@wenyihu6 wenyihu6 marked this pull request as ready for review April 10, 2025 18:37
@wenyihu6 wenyihu6 requested a review from a team as a code owner April 10, 2025 18:37
@wenyihu6 wenyihu6 requested a review from arulajmani April 10, 2025 18:37
@wenyihu6 wenyihu6 force-pushed the dampen branch 8 times, most recently from c45ece2 to 7602164 Compare April 14, 2025 19:48
@wenyihu6 wenyihu6 changed the title kvserver: add kv.closed_timestamp.policy_switch_latency_bucket_crossing_threshold kvserver: add kv.closed_timestamp.policy_switch_latency_bucket_exceed_threshold Apr 14, 2025
@wenyihu6 wenyihu6 force-pushed the dampen branch 6 times, most recently from 2df8bda to c7316c3 Compare April 15, 2025 18:21
@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Apr 15, 2025

@arulajmani friendly ping on this pr in case it slipped through the cracks

Copy link
Copy Markdown
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: modulo comments. This did fall through the cracks, thanks for pinging!

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)


pkg/kv/kvserver/closedts/policy_calculation.go line 31 at r1 (raw file):

// the new policy is returned.
//
// 2. If there's an adjacent bucket jump, the new policy is returned.

The wording here is a bit confusing. Could you say more about what this means?


EDIT: did you mean "jump to a non-adjacent bucket"?


pkg/kv/kvserver/closedts/policy_calculation.go line 87 at r1 (raw file):

			return newPolicy
		}
		return oldPolicy

Should we add metrics for this case (in a followup should be fine)?


pkg/kv/kvserver/closedts/setting.go line 103 at r1 (raw file):

	"the fraction of the closed timestamp policy bucket width which need be "+
		"exceeded before the closed timestamp policy will be changed",
	0,

Should we enable this by default? Maybe 20%?


pkg/kv/kvserver/closedts/policy_calculation_test.go line 496 at r1 (raw file):

		},
		{
			name:              "full dampening - multi-bucket jump",

Should we add a test case for multi-bucket jump with a lower dampening fraction as well?

@wenyihu6 wenyihu6 force-pushed the dampen branch 2 times, most recently from 4274d22 to 34abf04 Compare April 17, 2025 13:49
Copy link
Copy Markdown
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)


pkg/kv/kvserver/closedts/policy_calculation.go line 31 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

The wording here is a bit confusing. Could you say more about what this means?


EDIT: did you mean "jump to a non-adjacent bucket"?

yes, the blame is on cursor : )


pkg/kv/kvserver/closedts/policy_calculation.go line 87 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we add metrics for this case (in a followup should be fine)?

Yeah, this will be addressed in a follow-up PR: #144518. I've been considering whether it makes more sense to add the metrics to the side transport layer instead, to track the number of range updates there. Or we could track at both places (at policy refresher + side transport). Wdyt?


pkg/kv/kvserver/closedts/setting.go line 103 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we enable this by default? Maybe 20%?

I changed the default to 20%. I plan to put up a separate pr to override it to 0 in the follower reads roachtests. I want to first stabilize and gain confidence in the current behavior using the roachtests. Adding dampening could introduce inaccuracies, making it hard to tell if test failures are due to the inadequacy of policy refresher itself or dampening. Does that sound good to you?


pkg/kv/kvserver/closedts/policy_calculation_test.go line 496 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Should we add a test case for multi-bucket jump with a lower dampening fraction as well?

Done.

@wenyihu6 wenyihu6 requested a review from arulajmani April 17, 2025 13:51
Copy link
Copy Markdown
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)


pkg/kv/kvserver/closedts/policy_calculation.go line 31 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

yes, the blame is on cursor : )

😞


pkg/kv/kvserver/closedts/policy_calculation.go line 87 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

Yeah, this will be addressed in a follow-up PR: #144518. I've been considering whether it makes more sense to add the metrics to the side transport layer instead, to track the number of range updates there. Or we could track at both places (at policy refresher + side transport). Wdyt?

Let's just do both. These are counters, so they're cheap. It's nice to be able to say "we didn't move X number of ranges because of the threshold"; contrast that to the side transport, where we may or may not move ranges for wider range of reasons.


pkg/kv/kvserver/closedts/setting.go line 103 at r1 (raw file):

Previously, wenyihu6 (Wenyi Hu) wrote…

I changed the default to 20%. I plan to put up a separate pr to override it to 0 in the follower reads roachtests. I want to first stabilize and gain confidence in the current behavior using the roachtests. Adding dampening could introduce inaccuracies, making it hard to tell if test failures are due to the inadequacy of policy refresher itself or dampening. Does that sound good to you?

Sounds reasonable to me!

@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Apr 21, 2025

Last push was an empty commit to manually trigger the stuck CI - reported in https://cockroachlabs.slack.com/archives/CJ0H8Q97C/p1745243652330699.

…_threshold

This commit introduces a new cluster setting:
kv.closed_timestamp.policy_switch_latency_bucket_exceed_threshold. It defines
the fraction of the closed timestamp policy bucket width that must be exceeded
before a policy switch is triggered.

This helps prevent aggressive policy changes for ranges near bucket boundaries,
reducing excessive updates sent via side transport.

Part of: cockroachdb#143890
Release note: none
Epic: none
@wenyihu6
Copy link
Copy Markdown
Contributor Author

Last push was to rebase on master to pick up 8cb77d6 (more in https://cockroachlabs.slack.com/archives/C05B1BPMJLE/p1745253521900019).

@wenyihu6
Copy link
Copy Markdown
Contributor Author

TFTR!

bors r=arulajmani

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Apr 22, 2025

This PR was included in a batch that successfully built, but then failed to merge into master (it was a non-fast-forward update). It will be automatically retried.

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Apr 22, 2025

@craig craig bot merged commit 692b93c into cockroachdb:master Apr 22, 2025
23 checks passed
@wenyihu6 wenyihu6 deleted the dampen branch April 22, 2025 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants