
server: add x-region, x-zone metrics to Node #104585

Merged
craig[bot] merged 1 commit into cockroachdb:master from wenyihu6:xlocality-node
Jun 14, 2023

Conversation

@wenyihu6
Contributor

@wenyihu6 wenyihu6 commented Jun 8, 2023

Previously, there were no metrics to observe cross-region, cross-zone traffic in
batch requests / responses processed at receiver nodes.

To address this, this commit adds six new node metrics:

"batch_requests.bytes"
"batch_responses.bytes"
"batch_requests.cross_region.bytes"
"batch_responses.cross_region.bytes"
"batch_requests.cross_zone.bytes"
"batch_responses.cross_zone.bytes"

The first two metrics track the total byte count of batch requests processed and
batch responses received at the node. The remaining four track the aggregate
byte counts of batches processed and received across different regions and
zones. Note that these metrics are tracked only at the receiver node, since the
node here is the destination range node, not the gateway node.

Part of: #103983

Release note (ops change): Six new metrics -
"batch_requests.bytes",
"batch_responses.bytes",
"batch_requests.cross_region.bytes",
"batch_responses.cross_region.bytes",
"batch_requests.cross_zone.bytes",
"batch_responses.cross_zone.bytes" - are now added to Node metrics.

For these metrics to be accurate, the following assumptions should hold:

  • Region and zone tier keys are configured consistently across all nodes.
  • Within each node's locality, the region and zone tier keys are unique.
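As an illustrative sketch (not the actual `node.go` implementation), the classification underlying the cross-region/cross-zone metrics can be expressed as a comparison of locality tiers between the gateway and receiver nodes. The `Tier` type and the "region"/"zone" key names below are assumptions for the sketch:

```go
package main

import "fmt"

// Tier is one key=value pair in a node locality, e.g. region=us-east1.
type Tier struct{ Key, Value string }

// crossRegionCrossZone reports whether a batch between two node localities is
// cross-region and/or cross-zone. Per the assumptions above, both nodes are
// expected to configure the same "region" and "zone" tier keys; if a key is
// missing on either side, that dimension is reported as not crossing.
func crossRegionCrossZone(gateway, receiver []Tier) (crossRegion, crossZone bool) {
	find := func(tiers []Tier, key string) (string, bool) {
		for _, t := range tiers {
			if t.Key == key {
				return t.Value, true
			}
		}
		return "", false
	}
	if gr, ok := find(gateway, "region"); ok {
		if rr, ok := find(receiver, "region"); ok {
			crossRegion = gr != rr
		}
	}
	if gz, ok := find(gateway, "zone"); ok {
		if rz, ok := find(receiver, "zone"); ok {
			crossZone = gz != rz
		}
	}
	return crossRegion, crossZone
}

func main() {
	a := []Tier{{"region", "us-east1"}, {"zone", "us-east1-b"}}
	b := []Tier{{"region", "us-west1"}, {"zone", "us-west1-a"}}
	fmt.Println(crossRegionCrossZone(a, b)) // true true
	fmt.Println(crossRegionCrossZone(a, a)) // false false
}
```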

@cockroach-teamcity
Member

This change is Reviewable

@wenyihu6 wenyihu6 force-pushed the xlocality-node branch 6 times, most recently from 5d22835 to 8181159 on June 12, 2023 04:41
@wenyihu6 wenyihu6 marked this pull request as ready for review June 12, 2023 13:40
@wenyihu6 wenyihu6 requested review from a team as code owners June 12, 2023 13:40
@wenyihu6 wenyihu6 requested a review from kvoli June 12, 2023 13:40
Contributor

@kvoli kvoli left a comment


Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @wenyihu6)


pkg/server/node.go line 148 at r1 (raw file):

	metaBatchRequestsBytes = metric.Metadata{
		Name:        "batch_requests.bytes",

Have you considered prefixing with exec.? I see that some of the other node metrics for batch requests have that prefix such as exec.error.


pkg/server/node.go line 1332 at r1 (raw file):

func (n *Node) isCrossRegionCrossZoneBatch(
	ctx context.Context, ba *kvpb.BatchRequest,
) (bool, bool) {

nit: name the response variables or use an enum. The bools are exclusive now. No need to do in this PR (as it applies to the other x-zone/x-region stuff) but it would be nice to use an enum around the place, with states for "same-region,same-zone" and "bad config" essentially.
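A minimal sketch of the suggested enum, with hypothetical names (`CrossLocalityType`, `classify`) that are not from the CockroachDB codebase:

```go
package main

import "fmt"

// CrossLocalityType enumerates the states suggested above, including
// "same-region,same-zone" and a "bad config" state.
type CrossLocalityType int

const (
	SameRegionSameZone CrossLocalityType = iota
	SameRegionCrossZone
	CrossRegion
	BadConfig // locality tiers missing or inconsistently configured
)

func (t CrossLocalityType) String() string {
	switch t {
	case SameRegionSameZone:
		return "same-region,same-zone"
	case SameRegionCrossZone:
		return "same-region,cross-zone"
	case CrossRegion:
		return "cross-region"
	default:
		return "bad-config"
	}
}

// classify folds the two exclusive bools (plus validity of the locality
// comparison) into a single enum value.
func classify(regionValid, zoneValid, crossRegion, crossZone bool) CrossLocalityType {
	switch {
	case !regionValid || !zoneValid:
		return BadConfig
	case crossRegion:
		return CrossRegion
	case crossZone:
		return SameRegionCrossZone
	default:
		return SameRegionSameZone
	}
}

func main() {
	fmt.Println(classify(true, true, true, false)) // cross-region
}
```

Returning one named value instead of two unlabeled bools makes call sites self-documenting and leaves room for the "bad config" state.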


pkg/server/node.go line 1422 at r1 (raw file):

	n.incrementBatchCounters(args)

	shouldIncrement := true

nit: a comment here would be helpful for the reader to quickly figure out what the interceptor is for and why shouldInc is true initially.


pkg/server/node_test.go line 790 at r1 (raw file):

// cross-region, cross-zone byte count metrics for batch requests sent and batch
// responses received.
func TestNodeBatchMetrics(t *testing.T) {

Nice test!

@wenyihu6
Contributor Author

pkg/server/node.go line 148 at r1 (raw file):

Previously, kvoli (Austen) wrote…

Have you considered prefixing with exec.? I see that some of the other node metrics for batch requests have that prefix such as exec.error.

Discussed this more offline:

The metrics with the exec prefix are incremented at the end of batchInternal and cover only batches that are guaranteed to have executed without an early return. These new metrics are instead meant to measure the byte count of processed (rather than executed) batch requests and to give a sense of the proportion of cross-region and cross-zone batches among them.

@wenyihu6
Contributor Author

pkg/server/node.go line 1332 at r1 (raw file):

Previously, kvoli (Austen) wrote…

nit: name the response variables or use an enum. The bools are exclusive now. No need to do in this PR (as it applies to the other x-zone/x-region stuff) but it would be nice to use an enum around the place, with states for "same-region,same-zone" and "bad config" essentially.

That sounds like a good idea. I will make another PR to refactor the logic for all relevant PRs.

@wenyihu6
Contributor Author

pkg/server/node.go line 1422 at r1 (raw file):

Previously, kvoli (Austen) wrote…

nit: a comment here would be helpful for the reader to quickly figure out what the interceptor is for and why shouldInc is true initially.

Done.

@wenyihu6 wenyihu6 requested a review from kvoli June 13, 2023 13:49
Contributor

@kvoli kvoli left a comment


:lgtm:

Reviewed 2 of 2 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @wenyihu6)

@wenyihu6
Contributor Author

bors r=kvoli

TFTRs!

@craig
Contributor

craig bot commented Jun 13, 2023

Build failed (retrying...):

@craig
Contributor

craig bot commented Jun 14, 2023

Build failed (retrying...):

@craig
Contributor

craig bot commented Jun 14, 2023

Build succeeded:

kvoli added a commit to kvoli/cockroach that referenced this pull request Sep 29, 2023
The X-locality log events were added in cockroachdb#104585 to the Node batch
receive path, to alert when localities were misconfigured. In some
clusters, especially test clusters, these events are unnecessarily
verbose in traces.

Change the log from `VEvent(5)` to `VInfo(5)` in the node batch path.

Part of: cockroachdb#110648
Epic: none
Release note: None
craig bot pushed a commit that referenced this pull request Sep 29, 2023
111140: roachtest: harmonize GCE and AWS machine types r=erikgrinaker,herkolategan,renatolabs a=srosenberg

Previously, the same (performance) roachtest executed on GCE and AWS
may have used a different memory-per-CPU multiplier and/or CPU family,
e.g., Cascade Lake vs. Ice Lake. In the best case, this resulted in
different performance baselines on an otherwise equivalent machine
type. In the worst case, it resulted in OOMs because the AWS VMs had
half the memory per CPU.

This change harmonizes GCE and AWS machine types by making them
as isomorphic as possible with respect to memory, CPU family, and price.
The following heuristics are used depending on specified `MemPerCPU`:
`Standard` yields 4GB/cpu, `High` yields 8GB/cpu,
`Auto` yields 4GB/cpu up to and including 16 vCPUs, then 2GB/cpu.
`Low` is supported _only_ in GCE.
Consequently, `n2-standard` maps to `m6i`, `n2-highmem` maps to `r6i`,
`n2-custom` maps to `c6i`, modulo local SSDs in which case `m6id` is
used, etc. Note, we also force `--gce-min-cpu-platform` to `Ice Lake`;
isomorphic AWS machine types are exclusively on `Ice Lake`.
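The `MemPerCPU` heuristics above can be sketched as follows; `memPerCPUGB` is a hypothetical helper for illustration, not the actual roachtest code (in particular, `Low` is treated as unsupported here since it only exists on GCE):

```go
package main

import "fmt"

// memPerCPUGB returns the GB-per-vCPU multiplier for a MemPerCPU
// setting, following the heuristics described above:
//   Standard -> 4 GB/cpu, High -> 8 GB/cpu,
//   Auto -> 4 GB/cpu up to and including 16 vCPUs, then 2 GB/cpu.
func memPerCPUGB(memPerCPU string, vcpus int) (int, error) {
	switch memPerCPU {
	case "Standard":
		return 4, nil
	case "High":
		return 8, nil
	case "Auto":
		if vcpus <= 16 {
			return 4, nil
		}
		return 2, nil
	default:
		// "Low" (GCE-only) and unknown values are rejected in this sketch.
		return 0, fmt.Errorf("unsupported MemPerCPU: %q", memPerCPU)
	}
}

func main() {
	gb, _ := memPerCPUGB("Auto", 32)
	fmt.Println(gb) // 2
}
```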

Roachprod is extended to show cpu family and architecture on `List`.
Cost estimation now correctly deals with _custom_ machine types.
Finally, we change the default zone allocation in GCE from exclusively
`us-east1-b` to ~25% `us-central1-b` and ~75% `us-east1-b`. This is
intended to balance the quotas for local SSDs until we eventually
switch to PD-SSDs.

Epic: none
Fixes: #106570

Release note: None

111442: server,kvcoord: change x-locality log from vevent to vinfo r=arulajmani a=kvoli

The X-locality log events were added in #104585 to the Node batch
receive path, to alert when localities were misconfigured. In some
clusters, especially test clusters, these events are unnecessarily
verbose in traces.

Change the log from `VEvent(5)` to `VInfo(5)` in the node batch path.

The X-locality log events were added in #103963 for the dist sender, to
alert when localities were misconfigured. In some clusters, especially
test clusters, these events are unnecessarily verbose in traces.

Change the log from `VEvent(5)` to `VInfo(5)` in the dist sender path.

Resolves: #110648
Epic: none
Release note: None

111475: server,settingswatcher: fix the local persisted cache r=stevendanna,aliher1911 a=knz

There's two commits here, fixing 2 separate issues.
Epic: CRDB-6671

### server,settingswatcher: properly evict entries from the local persisted cache

Fixes #70567.
Supersedes #101472.

(For context, on each node there is a local persisted cache of cluster
setting customizations. This exists to ensure that configured values
can be used even before a node has fully started up and can start
reading customizations from `system.settings`.)

Prior to this patch, entries were never evicted from the local
persisted cache: when a cluster setting was reset, any previously
saved entry in the cache would remain there.

This is a very old bug, which was long hidden and was recently
revealed when commit 2f5d717 was
merged. In a nutshell, before this recent commit the code responsible
to load the entries from the cache didn't fully work and so the stale
entries were never restored from the cache. That commit fixed the
loader code, and so the stale entries became active, which made the
old bug visible.

To fix the old bug, this present commit modifies the settings watcher
to preserve KV deletion events, and propagates them to the persisted
cache.

(There is no release note because there is no user-facing release where
the bug was visible.)

### settingswatcher: write-through to the persisted cache

Fixes #111422.
Fixes #111328.

Prior to this patch, the rangefeed watcher over `system.settings` was
updating the in-RAM value store before it propagated the updates to
the persisted local cache.

In fact, the update to the persisted local cache was lagging quite a
bit behind, because the rangefeed watcher would buffer updates and
only flush them after a while.

As a result, the following sequence was possible:

1. client updates a cluster setting.
2. server is immediately shut down. The persisted cache
   has not been updated yet.
3. server is restarted. For a short while (until the settings watcher
   has caught up), the old version of the setting remains active.

This recall of ghost values of a setting was simply a bug. This patch
fixes that, by ensuring that the persisted cache is written through
before the in-RAM value store.

By doing this, we give up on batching updates to the persisted local
store. This is deemed acceptable because cluster settings are not
updated frequently.
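A minimal sketch of the write-through ordering, using hypothetical stand-in types (plain maps) rather than the real settings watcher and persisted cache:

```go
package main

import "fmt"

// persistedCache and inRAMStore are stand-ins for the local persisted
// cache and the in-RAM value store described above.
type persistedCache map[string]string
type inRAMStore map[string]string

// applyUpdate writes through to the persisted cache before updating the
// in-RAM value store, so a shutdown between the two steps can only lose
// the in-RAM copy, which is rebuilt from the cache on restart. Deletion
// events are propagated to the cache as well, evicting stale entries.
func applyUpdate(cache persistedCache, ram inRAMStore, key, val string, deleted bool) {
	if deleted {
		delete(cache, key) // evict from the persisted cache first
		delete(ram, key)
		return
	}
	cache[key] = val // 1. persist first
	ram[key] = val   // 2. then serve from RAM
}

func main() {
	cache, ram := persistedCache{}, inRAMStore{}
	applyUpdate(cache, ram, "kv.some.setting", "on", false)
	applyUpdate(cache, ram, "kv.some.setting", "", true)
	fmt.Println(len(cache), len(ram)) // 0 0
}
```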



Co-authored-by: Stan Rosenberg <stan.rosenberg@gmail.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
THardy98 pushed a commit to THardy98/cockroach that referenced this pull request Oct 6, 2023
@wenyihu6 wenyihu6 deleted the xlocality-node branch October 30, 2023 17:34