kvserver: add cross-region snapshot byte metrics to StoreMetrics#104111
kvserver: add cross-region snapshot byte metrics to StoreMetrics#104111craig[bot] merged 2 commits intocockroachdb:masterfrom
Conversation
0cb5550 to
7382441
Compare
7382441 to
6ba2566
Compare
kvoli
left a comment
There was a problem hiding this comment.
Took a brief look, really great! Lets walk though in more detail tomorrow.
Reviewed 1 of 1 files at r1, 6 of 6 files at r2, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
pkg/kv/kvserver/allocator/storepool/store_pool.go line 1392 at r2 (raw file):
func (sp *StorePool) GetNodeLocalityWithString(nodeID roachpb.NodeID) localityWithString { nodeLocality := localityWithString{ str: "",
nit: remove the extra initialization beyond localityWithString
6ba2566 to
c0e047f
Compare
|
Previously, kvoli (Austen) wrote…
Done. |
kvoli
left a comment
There was a problem hiding this comment.
Great stuff! Left some suggestions but none are significant changes. Lmk if you have any questions.
Reviewed 3 of 3 files at r3, all commit messages.
Reviewable status:complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)
-- commits line 17 at r3:
The cross-region traffic is between stores residing in different region localities within a cluster.
Perhaps reword to "cross region snapshot traffic, between stores in a cluster.
-- commits line 20 at r3:
nit: reword to "this commit adds two new store metrics - "
-- commits line 21 at r3:
Use the metric metadata name instead of the golang struct field name e.g.
range.snapshots.cross-region.sent-bytes and range.snapshots.cross-region.rcvd-bytes
-- commits line 22 at r3:
nit: reword to make it clear these are aggregates per store, it is inferred by the counter but could be explicit e.g. "track the aggregate snapshot bytes sent from and received at a store".
-- commits line 25 at r3:
Is this supposed to mention cross-region batch activities or snapshots?
-- commits line 33 at r3:
Same as above for referring to metrics in PRs/commits.
pkg/kv/kvserver/replica_command.go line 3148 at r3 (raw file):
} maybeIncrementCrossRegionBatchMetrics := func(inc int64) {
Why name this ...CrossRegionBatchMetrics as opposed to ...CrossRegionSnapshotMetrics?
pkg/kv/kvserver/replica_command.go line 3151 at r3 (raw file):
storePool := r.store.cfg.StorePool if storePool == nil { log.Eventf(ctx, "store pool is nil")
When will the storepool be nil? Does this only occur in tests?
pkg/kv/kvserver/replica_command.go line 3155 at r3 (raw file):
isCrossRegion, err := storePool.IsCrossRegion(req.CoordinatorReplica, req.RecipientReplica) if err != nil { log.Eventf(ctx, "%v", err)
This could be spammy and would be better logged as a warning if necessary.
Consider logging at a vmodule level of 2 and adding context to the error.
if err != nil {
// We are unable to determine if the snapshot was sent to a store in a different
// region or not. This can occur when ...
log.VEventf(ctx, 2, "Unable to determine if batch is cross region %v", err)
}pkg/kv/kvserver/allocator/storepool/store_pool.go line 1408 at r3 (raw file):
// GetNodeLocality returns the locality information for the given node. func (sp *StorePool) GetNodeLocality(nodeID roachpb.NodeID) roachpb.Locality {
Should these methods also be exported on the AllocatorStorePool interface? I'm not certain they necessarily need to, what do you think?
Also I only see IsCrossRegion being used outside this pkg, do the other two methods need to be exported?
pkg/kv/kvserver/replica_learner_test.go line 2211 at r1 (raw file):
} } types := [4]string{".unknown", ".recovery", ".rebalancing", ""}
nit: move types inline with the loop
for _, v := range [4]string{...} { ... }Is the reason you are using the metric names as constant strings here, as opposed to from the metadata in
cockroach/pkg/kv/kvserver/metrics.go
Line 890 in 3ff847a
kvserver_test pkg?
pkg/kv/kvserver/replica_learner_test.go line 2413 at r3 (raw file):
} require.Equal(t, receiverMapExpected, receiverMapDelta)
Nice test extension! Would it be possible to add a case where region isn't set, but other locality attributes are such as zone and dc - then assert that the cross-region metrics are unchanged? It may be helpful to get a feel for how noisy the log messages added above are as well.
fe98fc7 to
b685267
Compare
Previously, kvoli (Austen) wrote…
Done. |
Previously, kvoli (Austen) wrote…
Done. |
Previously, kvoli (Austen) wrote…
Done. |
Previously, kvoli (Austen) wrote…
Done. |
Previously, kvoli (Austen) wrote…
Oops 😅 |
Previously, kvoli (Austen) wrote…
Done. |
|
Previously, kvoli (Austen) wrote…
Done. |
|
Previously, kvoli (Austen) wrote…
Discussed offline - this was for sanity check, but it should never happen. I've removed the check. |
|
Previously, kvoli (Austen) wrote…
Done. |
|
Previously, kvoli (Austen) wrote…
Done. |
bf3d71d to
8635c72
Compare
This commit refactors `getSnapshotBytesMetrics` in `replica_learner_test` to return a `map[string]snapshotBytesMetrics` instead of `map[SnapShotRequest_Priority]snapshotBytesMetrics`. This allows us to include and compare different types of snapshot metrics, removing the constraint of being limited to `SnapShotRequest_Priority`. This commit does not change any existing functionality, and the main purpose is to make future commits cleaner. Part of: cockroachdb#104124 Release note: none
8635c72 to
3f1351e
Compare
kvoli
left a comment
There was a problem hiding this comment.
@andrewbaptist would you be able to review as well? This code was last touched heavily by you for delegated snapshots.
Reviewed 6 of 6 files at r7, all commit messages.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist and @wenyihu6)
Previously, wenyihu6 (Wenyi Hu) wrote…
Done.
I know its very much implied by the name but adding "...aggregate cross-region snapshot bytes..." would be clearer.
-- commits line 25 at r7:
nit: "Resolves" is probably a better linking term, since this wasn't a bug - just a feature.
andrewbaptist
left a comment
There was a problem hiding this comment.
- This will cause a few conflicts in my PR for #102629 - but I will fix that up after this merges. That PR does also remove the UNKNOWN metric since it is never populated, so we could simplify this by removing this here as it is always checked against 0, but its not important to do.
Reviewable status:
complete! 2 of 0 LGTMs obtained (waiting on @wenyihu6)
544f83a to
ad867c7
Compare
Previously, there were no metrics to observe cross-region snapshot traffic between stores within a cluster. To improve this issue, this commit adds two new store metrics - `range.snapshots.cross-region.sent-bytes` and `range.snapshots.cross-region.rcvd-bytes`. These metrics track the aggregate of snapshot bytes sent from and received at a store across different regions. Resolves: cockroachdb#104124 Release note (ops change): Two new store metrics - `range.snapshots.cross-region.sent-bytes` and `range.snapshots.cross-region.rcvd-bytes` - are now added to track the aggregate of snapshot bytes sent from and received at a store across different regions. Note that these metrics require nodes’ localities to include a “region” tier key. If a node lacks this key but is involved in cross-region batch activities, an error message will be logged.
ad867c7 to
aaf7049
Compare
Previously, kvoli (Austen) wrote…
Done. |
|
@andrewbaptist Would you mind taking care of removing the UNKNOWN metrics? It seems to align more closely with your PR. But please let me know whether it would be cleaner in terms of merge conflicts if I create a separate PR for this. |
|
TFTRs! bors r=kvoli, andrewbaptist |
|
Build failed (retrying...): |
|
Build succeeded: |
Previously, we [added](cockroachdb#104111) cross-region snapshot byte metrics to track the aggregate of snapshot bytes sent from and received at a store across different regions. We should add metrics to track cross-zone snapshots as well. To improve this issue, this commit adds two new store metrics - ``` range.snapshots.cross-zone.sent-bytes range.snapshots.cross-zone.rcvd-bytes ``` These metrics track the aggregate of snapshot bytes sent from and received at a store across different zones within the same region if the zone and region tier keys are properly configured across nodes. To ensure accurate metrics and consistent error logging, it is important to follow the assumption when configuring locality tiers across nodes: 1. For cross-region metrics, region tier keys must be present. 2. For cross-zone metrics, zone tier keys must be present. It is also essential to maintain consistency in the zone tier key across nodes. (e.g. using different keys, such as “az” and “zone”, can lead to unexpected behavior). 3. Within a node locality, region and zone tier keys should be unique. 4. Ensure consistent configuration of region and zone tiers across nodes. If all nodes configure both region and zone tiers: Cross-region and cross-zone metrics can be used to track following information: a. Aggregate of cross-region, cross-zone activities: `range.snapshots.cross-region.(sent|received)-bytes` b. Aggregate of same-region, cross-zone activities: `range.snapshots.cross-zone.(sent|received)-bytes` c. Aggregate of same-region, same-zone activities:`range.snapshots.(sent|received)-bytes` - a - b d. Cross-region, same zone activities will be considered as misconfigured, and an error will be logged. If all nodes have zone tiers configured, but not regions: cross-zone metrics will still be accurate. Problems arise if some nodes have region tier keys configured while others do not. If all nodes have region tiers configured: cross-region metrics will be accurate regardless of the nodes’ zone tier setup. Part of: cockroachdb#103983 Release note (ops change): Two new store metrics - `range.snapshots.cross-zone.sent-bytes` and `range.snapshots.cross-zone.rcvd-bytes` - are now added to track the aggregate of snapshot bytes sent from and received at a store across different zones. For accurate metrics, follow these assumptions: - Configure region and zone tier keys consistently across nodes. - Within a node locality, ensure unique region and zone tier keys. - Maintain consistent configuration of region and zone tiers across nodes.
Previously, we [added](cockroachdb#104111) cross-region snapshot byte metrics to track the aggregate of snapshot bytes sent from and received at a store across different regions. We should add metrics to track cross-zone snapshots as well. To improve this issue, this commit adds two new store metrics - ``` range.snapshots.cross-zone.sent-bytes range.snapshots.cross-zone.rcvd-bytes ``` These metrics track the aggregate of snapshot bytes sent from and received at a store across different zones within the same region if the zone and region tier keys are properly configured across nodes. To ensure accurate metrics and consistent error logging, it is important to follow the assumption when configuring locality tiers across nodes: 1. For cross-region metrics, region tier keys must be present. 2. For cross-zone metrics, zone tier keys must be present. It is also essential to maintain consistency in the zone tier key across nodes. (e.g. using different keys, such as “az” and “zone”, can lead to unexpected behavior). 3. Within a node locality, region and zone tier keys should be unique. 4. Ensure consistent configuration of region and zone tiers across nodes. If all nodes configure both region and zone tiers: Cross-region and cross-zone metrics can be used to track following information: a. Aggregate of cross-region, cross-zone activities: `range.snapshots.cross-region.(sent|received)-bytes` b. Aggregate of same-region, cross-zone activities: `range.snapshots.cross-zone.(sent|received)-bytes` c. Aggregate of same-region, same-zone activities: `range.snapshots.(sent|received)-bytes` - a - b d. Cross-region, same zone activities will be considered as misconfigured, and an error will be logged. If all nodes have zone tiers configured, but not regions: cross-zone metrics will still be accurate. Problems arise if some nodes have region tier keys configured while others do not. If all nodes have region tiers configured: cross-region metrics will be accurate regardless of the nodes’ zone tier setup. Part of: cockroachdb#103983 Release note (ops change): Two new store metrics - `range.snapshots.cross-zone.sent-bytes` and `range.snapshots.cross-zone.rcvd-bytes` - are now added to track the aggregate of snapshot bytes sent from and received at a store across different zones. For accurate metrics, follow these assumptions: - Configure region and zone tier keys consistently across nodes. - Within a node locality, ensure unique region and zone tier keys. - Maintain consistent configuration of region and zone tiers across nodes.
104417: kvserver: add cross-zone snapshot byte metrics to StoreMetrics r=kvoli a=wenyihu6 **kvserver: refactor getSnapshotBytesMetrics** This commit refactors `getSnapshotBytesMetrics` function in `replica_learner_test`. Instead of rigidly defining an array of metrics information to be extracted within the function, the refactoring now allows the caller to pass a metrics slice array to retrieve specific metrics of the caller’s choice. This commit does not change any existing functionality, and the main purpose is to make future commits cleaner. Part of: #103983 Release note: none --- **kvserver: add cross-zone snapshot byte metrics to StoreMetrics** Previously, we [added](#104111) cross-region snapshot byte metrics to track the aggregate of snapshot bytes sent from and received at a store across different regions. We should add metrics to track cross-zone snapshots as well. To improve this issue, this commit adds two new store metrics - ``` range.snapshots.cross-zone.sent-bytes range.snapshots.cross-zone.rcvd-bytes ``` These metrics track the aggregate of snapshot bytes sent from and received at a store across different zones within the same region if the zone and region tier keys are properly configured across nodes. To ensure accurate metrics and consistent error logging, it is important to follow the assumption when configuring locality tiers across nodes: 1. For cross-region metrics, region tier keys must be present. 2. For cross-zone metrics, zone tier keys must be present. It is also essential to maintain consistency in the zone tier key across nodes. (e.g. using different keys, such as “az” and “zone”, can lead to unexpected behavior). 3. Within a node locality, region and zone tier keys should be unique. 4. Ensure consistent configuration of region and zone tiers across nodes. If all nodes configure both region and zone tiers: Cross-region and cross-zone metrics can be used to track following information: a. Aggregate of cross-region, cross-zone activities: `range.snapshots.cross-region.(sent|received)-bytes` b. Aggregate of same-region, cross-zone activities: `range.snapshots.cross-zone.(sent|received)-bytes` c. Aggregate of same-region, same-zone activities: `range.snapshots.(sent|received)-bytes` - a - b d. Cross-region, same zone activities will be considered as misconfigured, and an error will be logged. If all nodes have zone tiers configured, but not regions: cross-zone metrics will still be accurate. Problems arise if some nodes have region tier keys configured while others do not. If all nodes have region tiers configured: cross-region metrics will be accurate regardless of the nodes’ zone tier setup. Part of: #103983 Release note (ops change): Two new store metrics - `range.snapshots.cross-zone.sent-bytes` and `range.snapshots.cross-zone.rcvd-bytes` - are now added to track the aggregate of snapshot bytes sent from and received at a store across different zones. For accurate metrics, follow these assumptions: - Configure region and zone tier keys consistently across nodes. - Within a node locality, ensure unique region and zone tier keys. - Maintain consistent configuration of region and zone tiers across nodes. Co-authored-by: Wenyi <wenyi.hu@cockroachlabs.com>
kvserver: refactor getSnapshotBytesMetrics
This commit refactors
getSnapshotBytesMetricsinreplica_learner_testto return a
map[string]snapshotBytesMetricsinstead ofmap[SnapShotRequest_Priority]snapshotBytesMetrics. This allows us to includeand compare different types of snapshot metrics, removing the constraint of
being limited to
SnapShotRequest_Priority. This commit does not change anyexisting functionality, and the main purpose is to make future commits cleaner.
Part of: #104124
Release note: none
kvserver: add cross-region snapshot byte metrics to StoreMetrics
Previously, there were no metrics to observe cross-region snapshot traffic
between stores within a cluster.
To improve this issue, this commit adds two new store metrics -
range.snapshots.cross-region.sent-bytesandrange.snapshots.cross-region.rcvd-bytes. These metrics track the aggregate ofsnapshot bytes sent from and received at a store across different regions.
Resolves: #104124
Release note (ops change): Two new store metrics -
range.snapshots.cross-region.sent-bytesandrange.snapshots.cross-region.rcvd-bytes- are now added to track the aggregateof snapshot bytes sent from and received at a store across different regions.
Note that these metrics require nodes’ localities to include a “region” tier
key. If a node lacks this key but is involved in cross-region snapshot activities,
an error message will be logged.