kvserver/closedts: add metrics for policy refresher #144518
craig[bot] merged 3 commits into cockroachdb:master
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
arulajmani
left a comment
Reviewed 3 of 3 files at r1, 4 of 4 files at r2, 3 of 3 files at r3, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/replica.go line 1385 at r1 (raw file):
newPolicy := computeNewPolicy(oldPolicy)
r.cachedClosedTimestampPolicy.Store(int32(newPolicy))
if oldPolicy != newPolicy {
nit:
if ctpb.RangeClosedTimestampPolicy(r.cachedClosedTimestampPolicy.Load()) != newPolicy {
    ...
    // update metric
    // store new policy
}
wenyihu6
left a comment
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @arulajmani)
pkg/kv/kvserver/replica.go line 1385 at r1 (raw file):
Previously, arulajmani (Arul Ajmani) wrote…
nit:
if ctpb.RangeClosedTimestampPolicy(r.cachedClosedTimestampPolicy.Load()) != newPolicy {
    ...
    // update metric
    // store new policy
}
Moved the policy update to be inside the if statement. I think this is what you had in mind, but lmk if I'm off. We need to instantiate oldPolicy as a separate variable because it's required by computeNewPolicy for dampening and is also needed for the policy comparison that drives the metric update.
TFTR! bors r=arulajmani
kvserver: add kv.closed_timestamp.policy_change
Previously, it was difficult to measure how often policies changed for ranges,
which is important because such changes can trigger additional range updates
sent in side transport.
This commit adds a metric to track the number of policy changes on replicas.
Part of: #143890
Release note: none
kvserver: add more metrics for policies
Previously, it was difficult to determine how many ranges fell into each latency
bucket policy. This commit adds 18 new metrics to StoreMetrics to track the
number of ranges per policy bucket for every store.
Part of: #143890
Release note: none
kvserver: add kv.closed_timestamp.policy_latency_info_missing
When a replica refreshes its policies, it looks up its peer replicas' latency
info via a map passed by PolicyRefresher, which in turn periodically pulls node
latency info from RPCContext. If latency data for a node is missing, a default
hardcoded max RTT of 150ms is used.
Previously, it was hard to tell when this was happening. This commit adds metrics
to track how often the closed timestamp policy refresh falls back to the default
RTT due to missing node latency info. A high count might indicate the latency
cache isn’t refreshed frequently enough, suggesting we should consider lowering
kv.closed_timestamp.policy_latency_refresh_interval.
Resolves: #143890
Release note: none