Skip to content

kvserver/closedts: add metrics for policy refresher#144518

Merged
craig[bot] merged 3 commits intocockroachdb:masterfrom
wenyihu6:dampeningmetrics
Apr 24, 2025
Merged

kvserver/closedts: add metrics for policy refresher#144518
craig[bot] merged 3 commits intocockroachdb:masterfrom
wenyihu6:dampeningmetrics

Conversation

@wenyihu6
Copy link
Copy Markdown
Contributor

@wenyihu6 wenyihu6 commented Apr 15, 2025

kvserver: add kv.closed_timestamp.policy_change

Previously, it was difficult to measure how often policies changed for ranges,
which is important because such changes can trigger additional range updates
sent in side transport.

This commit adds a metric to track the number of policy changes on replicas.

Part of: #143890
Release note: none


kvserver: add more metrics for policies

Previously, it was difficult to determine how many ranges fell into each latency
bucket policy. This commit adds 18 new metrics to StoreMetrics to track the
number of ranges per policy bucket for every store.

Part of: #143890
Release note: none


kvserver: add kv.closed_timestamp.policy_latency_info_missing

When a replica refreshes its policies, it looks up its peer replicas latency
info via a map passed by PolicyRefresher, which in turn periodically pulls node
latency info from RPCContext. If latency data for a node is missing, a default
hardcoded max RTT of 150ms is used.

Previously, it was hard to tell when this is happening. This commit adds metrics
to track how often the closed timestamp policy refresh falls back to the default
RTT due to missing node latency info. A high count might indicate the latency
cache isn’t refreshed frequently enough, suggesting we should consider lowering
kv.closed_timestamp.policy_latency_refresh_interval.

Resolves: #143890
Release note: none

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@wenyihu6 wenyihu6 force-pushed the dampeningmetrics branch 10 times, most recently from 6f7167f to 009dccf Compare April 15, 2025 22:52
@wenyihu6 wenyihu6 force-pushed the dampeningmetrics branch 2 times, most recently from 29955aa to d5c445e Compare April 17, 2025 13:36
@wenyihu6 wenyihu6 force-pushed the dampeningmetrics branch 3 times, most recently from dd36f87 to d614235 Compare April 23, 2025 19:24
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Apr 23, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@wenyihu6 wenyihu6 force-pushed the dampeningmetrics branch 2 times, most recently from 62012ba to 6e1f757 Compare April 23, 2025 19:53
@wenyihu6 wenyihu6 marked this pull request as ready for review April 23, 2025 19:53
@wenyihu6 wenyihu6 requested a review from a team as a code owner April 23, 2025 19:53
@wenyihu6 wenyihu6 requested a review from arulajmani April 23, 2025 19:53
Copy link
Copy Markdown
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 3 files at r1, 4 of 4 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)


pkg/kv/kvserver/replica.go line 1385 at r1 (raw file):

	newPolicy := computeNewPolicy(oldPolicy)
	r.cachedClosedTimestampPolicy.Store(int32(newPolicy))
	if oldPolicy != newPolicy {

nit:

ctpb.RangeClosedTimestampPolicy(r.cachedClosedTimestampPolicy.Load()) != newPolic {
...
// update metric
// store new policy
}

Previously, it was difficult to measure how often policies changed for ranges,
which is important because such changes can trigger additional range updates
sent in side transport.

This commit adds a metric to track the number of policy changes on replicas.

Part of: cockroachdb#143890
Release note: none
Previously, it was difficult to determine how many ranges fell into each latency
bucket policy. This commit adds 18 new metrics to StoreMetrics to track the
number of ranges per policy bucket for every store.

Part of: cockroachdb#143890
Release note: none
When a replica refreshes its policies, it looks up its peer replicas latency
info via a map passed by PolicyRefresher, which in turn periodically pulls node
latency info from RPCContext. If latency data for a node is missing, a default
hardcoded max RTT of 150ms is used.

Previously, it was hard to tell when this is happening. This commit adds metrics
to track how often the closed timestamp policy refresh falls back to the default
RTT due to missing node latency info. A high count might indicate the latency
cache isn’t refreshed frequently enough, suggesting we should consider lowering
kv.closed_timestamp.policy_latency_refresh_interval.

Resolves: cockroachdb#143890
Release note: none
Copy link
Copy Markdown
Contributor Author

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)


pkg/kv/kvserver/replica.go line 1385 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit:

ctpb.RangeClosedTimestampPolicy(r.cachedClosedTimestampPolicy.Load()) != newPolic {
...
// update metric
// store new policy
}

Moved the policy update to be inside the if statement. I think this is what you had in mind, but lmk if I'm off. We need to instantiate oldPolicy as a separate variable because it's required by computeNewPolicy for dampening and also needed for policy comparison metrics update.

@wenyihu6
Copy link
Copy Markdown
Contributor Author

TFTR!

bors r=arulajmani

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Apr 24, 2025

@craig craig bot merged commit fc4a855 into cockroachdb:master Apr 24, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

kvserver: follow up work for auto-tuning closed ts

3 participants