Skip to content

kvserver: add cross-zone snapshot byte metrics to StoreMetrics#104417

Merged
craig[bot] merged 2 commits intocockroachdb:masterfrom
wenyihu6:xzone-snapshot
Jun 10, 2023
Merged

kvserver: add cross-zone snapshot byte metrics to StoreMetrics#104417
craig[bot] merged 2 commits intocockroachdb:masterfrom
wenyihu6:xzone-snapshot

Conversation

@wenyihu6
Copy link
Copy Markdown
Contributor

@wenyihu6 wenyihu6 commented Jun 6, 2023

kvserver: refactor getSnapshotBytesMetrics

This commit refactors getSnapshotBytesMetrics function in
replica_learner_test. Instead of rigidly defining an array of metrics
information to be extracted within the function, the refactoring now allows the
caller to pass a metrics slice array to retrieve specific metrics of the
caller’s choice.

This commit does not change any existing functionality, and the main purpose is
to make future commits cleaner.

Part of: #103983
Release note: none


kvserver: add cross-zone snapshot byte metrics to StoreMetrics

Previously, we added cross-region snapshot byte metrics to track the
aggregate of snapshot bytes sent from and received at a store across different
regions. We should add metrics to track cross-zone snapshots as well.

To improve this issue, this commit adds two new store metrics -

range.snapshots.cross-zone.sent-bytes 
range.snapshots.cross-zone.rcvd-bytes 

These metrics track the aggregate of snapshot bytes sent from and received at a
store across different zones within the same region if the zone and region tier
keys are properly configured across nodes.

To ensure accurate metrics and consistent error logging, it is important to
follow the assumption when configuring locality tiers across nodes:

  1. For cross-region metrics, region tier keys must be present.
  2. For cross-zone metrics, zone tier keys must be present. It is also essential
    to maintain consistency in the zone tier key across nodes. (e.g. using different
    keys, such as “az” and “zone”, can lead to unexpected behavior).
  3. Within a node locality, region and zone tier keys should be unique.
  4. Ensure consistent configuration of region and zone tiers across nodes.
  5. Cross-region but same zone activities should be impossible.

If all nodes configure both region and zone tiers:
Cross-region and cross-zone metrics can be used to track following information:
a. Aggregate of cross-region, cross-zone activities:
range.snapshots.cross-region.(sent|received)-bytes
b. Aggregate of same-region, cross-zone activities:
range.snapshots.cross-zone.(sent|received)-bytes
c. Aggregate of same-region, same-zone activities:
range.snapshots.(sent|received)-bytes - a - b
d. Cross-region, same zone activities will be considered as misconfigured, and
an error will be logged.

If all nodes have zone tiers configured, but not regions: cross-zone metrics
will still be accurate. Problems arise if some nodes have region tier keys
configured while others do not.

If all nodes have region tiers configured: cross-region metrics will be accurate
regardless of the nodes’ zone tier setup.

Part of: #103983

Release note (ops change): Two new store metrics -
range.snapshots.cross-zone.sent-bytes and
range.snapshots.cross-zone.rcvd-bytes - are now added to track the aggregate
of snapshot bytes sent from and received at a store across different zones.

For accurate metrics, follow these assumptions:

  • Configure region and zone tier keys consistently across nodes.
  • Within a node locality, ensure unique region and zone tier keys.
  • Maintain consistent configuration of region and zone tiers across nodes.

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@wenyihu6 wenyihu6 requested a review from kvoli June 6, 2023 15:56
@wenyihu6 wenyihu6 marked this pull request as ready for review June 6, 2023 16:39
@wenyihu6 wenyihu6 requested a review from a team as a code owner June 6, 2023 16:39
@wenyihu6 wenyihu6 self-assigned this Jun 6, 2023
@wenyihu6 wenyihu6 removed the request for review from a team June 6, 2023 16:40
@wenyihu6 wenyihu6 force-pushed the xzone-snapshot branch 4 times, most recently from ebe5321 to d5d780c Compare June 7, 2023 12:23
@blathers-crl
Copy link
Copy Markdown

blathers-crl bot commented Jun 7, 2023

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try break logical changes into separate commits, but commits such as "fix typo" or "address review commits" should be squashed into one commit and pushed with --force
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the O-community Originated from the community label Jun 7, 2023
@wenyihu6 wenyihu6 force-pushed the xzone-snapshot branch 3 times, most recently from 17c4bc9 to 075eea9 Compare June 7, 2023 18:24
@wenyihu6 wenyihu6 removed the O-community Originated from the community label Jun 7, 2023
Copy link
Copy Markdown
Contributor

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1, 7 of 7 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)


-- commits line 29 at r2:
Could you mention the locality-tier keys which are recognized as a "zone" here.


-- commits line 35 at r2:
Could we shorten the release note and move some of this up above.

I think it suffices to mention the supported zone locality-tier keys and then the requirement that there be at most one and consistent across nodes. The example is good!


pkg/kv/kvserver/store_snapshot.go line 977 at r2 (raw file):

// shouldIncrementCrossLocalitySnapshotMetrics returns (bool, bool) - indicating
// if the two given replicas are cross-region and cross-zone respectively.
func (s *Store) shouldIncrementCrossLocalitySnapshotMetrics(

nit: It may be worth mentioning why this function deals with both x-region and x-zone are handled in the same method, perf - so that in the future someone doesn't increase the overhead because they were unaware.


pkg/roachpb/metadata.go line 622 at r2 (raw file):

// getFirstRegionFirstZone iterates through the locality tiers and returns
// multiple values containing:
// 1.The value of the first encountered "region" tier.

nit: add some spacing between "1.The..."


pkg/roachpb/metadata.go line 628 at r2 (raw file):

func (l Locality) getFirstRegionFirstZone() (string, bool, string, string, bool) {
	hasRegion, hasZone := false, false
	firstRegionValue, firstZoneKey, firstZoneValue := "", "", ""

nit: probably more idiomatic to do something like

var ( 
    hasRegion, hasZone bool
    firstRegionValue, firstZoneKey, firstZoneValue string
)

Or get that for free if you name the function return values like

func (l Locality) getFirstRegionFirstZone()
(firstRegion string, hasRegion bool, ...) {

pkg/roachpb/metadata.go line 654 at r2 (raw file):

// 2. Error indicating if either locality does not have a "region" tier key.
// 3. A boolean value indicating if this and the provided locality are
// cross-region.

cross-zone*


pkg/roachpb/metadata.go line 667 at r2 (raw file):

// be checked thoroughly for duplicate zone tier keys and key mismatches.
// However, due to frequent invocation of this function, we prefer to terminate
// the check after examining the first encountered zone tier key-value pairs.

Just checking my understanding. Ignore possibility of any invalid locality keys, so they're all either region or zone.

Is the intention for x-region to imply x-zone? i.e. should it ever be possible to have two localities which are x-region but not x-zone?

When you want to estimate out how many bytes (roughly) you are paying for at x-region rates (excluding cheaper x-zone), you would do sum(x-region) - sum(x-zone)?

If so, do we need to additionally assume clusters won't have a mix of locality=region=...,zone=..., locality=region=... and locality=zone=...?

For example

L1: region=us-east, zone=us-east-1
L2:                 zone=us-east-2
L3: region=us-west

L1 x L2 = x-zone
L1 x L3 = x-region
L2 x L3 = none

Presume 3 snaps/requests each with value 1 for each of the Li x Lj enumerated above. Thesum(x-region)-sum(x-zone) metric query would return 0 despite there really being 1 x-region and 1 x-zone.

I know we previously chatted about the x-zone condition requiring a region flag also be set, what are your thoughts on this?


pkg/roachpb/metadata.go line 668 at r2 (raw file):

// However, due to frequent invocation of this function, we prefer to terminate
// the check after examining the first encountered zone tier key-value pairs.
func (l Locality) IsCrossRegionCrossZone(other Locality) (bool, error, bool, error) {

nit: Consider naming the return values.


pkg/roachpb/metadata_test.go line 224 at r2 (raw file):

}

func TestLocalityIsCrossRegionCrossZone(t *testing.T) {

Great test!

@wenyihu6 wenyihu6 force-pushed the xzone-snapshot branch 2 times, most recently from 35fbfe1 to d49f7d7 Compare June 8, 2023 17:48
@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Jun 8, 2023

-- commits line 29 at r2:

Previously, kvoli (Austen) wrote…

Could you mention the locality-tier keys which are recognized as a "zone" here.

Done.

@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Jun 8, 2023

-- commits line 35 at r2:

Previously, kvoli (Austen) wrote…

Could we shorten the release note and move some of this up above.

I think it suffices to mention the supported zone locality-tier keys and then the requirement that there be at most one and consistent across nodes. The example is good!

I wasn't sure the level of details we want to include for the commit and release notes. Let me know if this is too wordy.

@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Jun 8, 2023

pkg/kv/kvserver/store_snapshot.go line 977 at r2 (raw file):

Previously, kvoli (Austen) wrote…

nit: It may be worth mentioning why this function deals with both x-region and x-zone are handled in the same method, perf - so that in the future someone doesn't increase the overhead because they were unaware.

I added a comment for the locality helper instead. I think if someone is looking to add more locality tiers comparison here, they would come across that function eventually. I'm adding shouldIncrementMetrics for other PRs as well, so I'm trying to avoid adding this comment redundantly.

@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Jun 8, 2023

pkg/roachpb/metadata.go line 654 at r2 (raw file):

Previously, kvoli (Austen) wrote…

cross-zone*

Done.

@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Jun 8, 2023

pkg/roachpb/metadata_test.go line 224 at r2 (raw file):

Previously, kvoli (Austen) wrote…

Great test!

Done.

@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Jun 8, 2023

pkg/roachpb/metadata.go line 667 at r2 (raw file):

Previously, kvoli (Austen) wrote…

Just checking my understanding. Ignore possibility of any invalid locality keys, so they're all either region or zone.

Is the intention for x-region to imply x-zone? i.e. should it ever be possible to have two localities which are x-region but not x-zone?

When you want to estimate out how many bytes (roughly) you are paying for at x-region rates (excluding cheaper x-zone), you would do sum(x-region) - sum(x-zone)?

If so, do we need to additionally assume clusters won't have a mix of locality=region=...,zone=..., locality=region=... and locality=zone=...?

For example

L1: region=us-east, zone=us-east-1
L2:                 zone=us-east-2
L3: region=us-west

L1 x L2 = x-zone
L1 x L3 = x-region
L2 x L3 = none

Presume 3 snaps/requests each with value 1 for each of the Li x Lj enumerated above. Thesum(x-region)-sum(x-zone) metric query would return 0 despite there really being 1 x-region and 1 x-zone.

I know we previously chatted about the x-zone condition requiring a region flag also be set, what are your thoughts on this?

Discussed this offline:

  1. X-region activities should imply x-zone. Failing to adhere to this assumption leads to error logging.
  2. Consistent zone tier configuration across nodes is also required. See more details in the PR notes.

In addition, the initial commit always increments cross-zone metrics.

If we want to calculate for cross zone within same region activities, we would have to use
sum(cross-zone within the same region) = sum(cross-zone) - sum(cross-region).

Details
Given the assumption that x-region implies x-zone, sum(cross-zone between different regions) == sum(cross-region). 

sum(cross-zone within the same region) = sum(cross-zone) - sum(cross-zone between different regions)
=> sum(cross-zone within the same region) = sum(cross-zone) - sum(cross-region) by the assumption

The new commit now only increments if it is cross-zone, within-region activities for properly configured region tiers. This makes the metrics more meaningful.

@wenyihu6 wenyihu6 requested a review from kvoli June 8, 2023 18:36
@wenyihu6 wenyihu6 force-pushed the xzone-snapshot branch 2 times, most recently from 1fcd7e4 to ad60de0 Compare June 8, 2023 18:47
@wenyihu6 wenyihu6 force-pushed the xzone-snapshot branch 3 times, most recently from c719251 to 6c70039 Compare June 9, 2023 13:12
Copy link
Copy Markdown
Contributor

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: with some small suggestions.

Reviewed 4 of 6 files at r3, 2 of 2 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)


-- commits line 19 at r4:
nit: github will pickup the link fine with just #104111. No need for markdown.


-- commits line 33 at r4:
nit: Are these commit messages formatted to have correct line length?


pkg/kv/kvserver/store_snapshot.go line 1004 at r4 (raw file):

	isCrossRegion, isCrossZone := s.shouldIncrementCrossLocalitySnapshotMetrics(ctx, firstReplica, secReplica)
	switch sentOrReceived {
	case "sent":

Should this instead be a bool? It would be better to be as strict as possible.

@wenyihu6
Copy link
Copy Markdown
Contributor Author

wenyihu6 commented Jun 9, 2023

pkg/kv/kvserver/store_snapshot.go line 1004 at r4 (raw file):

Previously, kvoli (Austen) wrote…

Should this instead be a bool? It would be better to be as strict as possible.

Done.

wenyihu6 added 2 commits June 9, 2023 19:17
This commit refactors `getSnapshotBytesMetrics` function in
`replica_learner_test`. Instead of rigidly defining an array of metrics
information to be extracted within the function, the refactoring now allows the
caller to pass a metrics slice array to retrieve specific metrics of the
caller’s choice.

This commit does not change any existing functionality, and the main purpose is
to make future commits cleaner.

Part of: cockroachdb#103983
Release note: none
Previously, we [added](cockroachdb#104111) cross-region snapshot byte metrics to track the
aggregate of snapshot bytes sent from and received at a store across different
regions. We should add metrics to track cross-zone snapshots as well.

To improve this issue, this commit adds two new store metrics -
```
range.snapshots.cross-zone.sent-bytes
range.snapshots.cross-zone.rcvd-bytes
```
These metrics track the aggregate of snapshot bytes sent from and received at a
store across different zones within the same region if the zone and region tier
keys are properly configured across nodes.

To ensure accurate metrics and consistent error logging, it is important to
follow the assumption when configuring locality tiers across nodes:
1. For cross-region metrics, region tier keys must be present.
2. For cross-zone metrics, zone tier keys must be present. It is also essential
to maintain consistency in the zone tier key across nodes. (e.g. using different
keys, such as “az” and “zone”, can lead to unexpected behavior).
3. Within a node locality, region and zone tier keys should be unique.
4. Ensure consistent configuration of region and zone tiers across nodes.

If all nodes configure both region and zone tiers:
Cross-region and cross-zone metrics can be used to track following information:
a. Aggregate of cross-region, cross-zone activities:
`range.snapshots.cross-region.(sent|received)-bytes`
b. Aggregate of same-region, cross-zone activities:
`range.snapshots.cross-zone.(sent|received)-bytes`
c. Aggregate of same-region, same-zone activities:
`range.snapshots.(sent|received)-bytes` - a - b
d. Cross-region, same zone activities will be considered as misconfigured, and
an error will be logged.

If all nodes have zone tiers configured, but not regions: cross-zone metrics
will still be accurate. Problems arise if some nodes have region tier keys
configured while others do not.

If all nodes have region tiers configured: cross-region metrics will be accurate
regardless of the nodes’ zone tier setup.

Part of: cockroachdb#103983

Release note (ops change): Two new store metrics -
`range.snapshots.cross-zone.sent-bytes` and
`range.snapshots.cross-zone.rcvd-bytes` - are now added to track the aggregate
of snapshot bytes sent from and received at a store across different zones.

For accurate metrics, follow these assumptions:
- Configure region and zone tier keys consistently across nodes.
- Within a node locality, ensure unique region and zone tier keys.
- Maintain consistent configuration of region and zone tiers across nodes.
@wenyihu6
Copy link
Copy Markdown
Contributor Author

TFTR!

bors r=kvoli

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Jun 10, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants