Skip to content

kvserver: add metrics to track snapshot queueing and application#84947

Merged
craig[bot] merged 1 commit intocockroachdb:masterfrom
AlexTalks:decomm_metrics
Aug 5, 2022
Merged

kvserver: add metrics to track snapshot queueing and application#84947
craig[bot] merged 1 commit intocockroachdb:masterfrom
AlexTalks:decomm_metrics

Conversation

@AlexTalks
Copy link
Copy Markdown
Contributor

@AlexTalks AlexTalks commented Jul 22, 2022

While we previously had one metric to track "in-progress" snapshots,
replicas.reserved, this metric is used by both senders and receivers,
and does not track the number of snapshots queued on its semaphore. This
change adds the following new metrics, separated by sender and receiver:

range.snapshots.send-queue
range.snapshots.recv-queue
range.snapshots.send-in-progress
range.snapshots.recv-in-progress
range.snapshots.send-total-in-progress
range.snapshots.recv-total-in-progress

These metrics specifically track the number of queued non-empty
snapshots on a sender or receiver store, as well as the number of
snapshots currently being sent or received by a given store. The
first four metrics specifically only track non-empty snapshots, because
empty snapshots are not throttled at all, and thus do not compete for the
semaphore. The final two metrics track all snapshots, including empty
ones.

Release note (ops change): Added new metrics
range.snapshots.(send|recv)-queue and
range.snapshots.(send|recv)-in-progress to track the number of queued
and in-progress snapshots being sent or received on a store.

@AlexTalks AlexTalks requested a review from a team as a code owner July 22, 2022 23:52
@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@AlexTalks AlexTalks requested a review from a team July 23, 2022 01:03
Copy link
Copy Markdown
Contributor

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @KnightAsterial)


pkg/kv/kvserver/store_snapshot.go line 596 at r1 (raw file):

		}

		inProgressSnapshotMetric.Inc(1)

The way this is written, we won't be recording empty snapshots ingestion as "in progress". I don't have a strong opinion but curious what you think.

I think it would be better to have this increment (and the following decrement in the returned closure) outside of this if-block, what do you think? I think there might be instances where we might find it useful to observe the impact of a wave of empty snapshot ingestion.

Copy link
Copy Markdown
Contributor Author

@AlexTalks AlexTalks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @KnightAsterial)


pkg/kv/kvserver/store_snapshot.go line 596 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

The way this is written, we won't be recording empty snapshots ingestion as "in progress". I don't have a strong opinion but curious what you think.

I think it would be better to have this increment (and the following decrement in the returned closure) outside of this if-block, what do you think? I think there might be instances where we might find it useful to observe the impact of a wave of empty snapshot ingestion.

This was intentional - since the empty snapshots don't wait for the semaphore (on either sender or receiver side), I thought it wouldn't be extremely useful to see how many are in progress at a given time, especially when in the same metric as non-empty snapshots. Do you think we should add a separate metric for this? Or incorporate the non-empty snapshots as well?

Copy link
Copy Markdown
Contributor Author

@AlexTalks AlexTalks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @KnightAsterial)


pkg/kv/kvserver/store_snapshot.go line 596 at r1 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

This was intentional - since the empty snapshots don't wait for the semaphore (on either sender or receiver side), I thought it wouldn't be extremely useful to see how many are in progress at a given time, especially when in the same metric as non-empty snapshots. Do you think we should add a separate metric for this? Or incorporate the non-empty snapshots as well?

Incorporate the non- empty snapshots, I mean.

@kvoli
Copy link
Copy Markdown
Contributor

kvoli commented Jul 28, 2022

Related but not a comment on this patch.

Is it possible (straightforward) to instrument tracking for the number of current snapshot bytes queued and the remaining bytes in processing?

edit: Tracing is added for timing in @KnightAsterial's patch #83003, however not a metric for latency or bytes.

@AlexTalks

Copy link
Copy Markdown
Contributor

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @AlexTalks and @KnightAsterial)


pkg/kv/kvserver/store_snapshot.go line 596 at r1 (raw file):

Previously, AlexTalks (Alex Sarkesian) wrote…

Incorporate the non- empty snapshots, I mean.

My suggestion would be to incorporate empty snapshots in this metric and add a separate metric that tracks just the set of non-empty snapshots. So we'd have one metric corresponding to all snapshots and one corresponding to just non-empty snapshots.

Copy link
Copy Markdown
Contributor Author

@AlexTalks AlexTalks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @KnightAsterial)


pkg/kv/kvserver/store_snapshot.go line 596 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

My suggestion would be to incorporate empty snapshots in this metric and add a separate metric that tracks just the set of non-empty snapshots. So we'd have one metric corresponding to all snapshots and one corresponding to just non-empty snapshots.

Added range.snapshots.(send|recv)-total-in-progress metrics to track empty snapshots as well.

Copy link
Copy Markdown
Contributor Author

@AlexTalks AlexTalks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for @kvoli's suggestion, I assume you mean per-snapshot, tracking on the sender side for remaining, and on the receiver side for the queued bytes? I think this is definitely feasible but I'd put it in a separate PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @KnightAsterial)

@kvoli
Copy link
Copy Markdown
Contributor

kvoli commented Aug 3, 2022

As for @kvoli's suggestion, I assume you mean per-snapshot, tracking on the sender side for remaining, and on the receiver side for the queued bytes? I think this is definitely feasible but I'd put it in a separate PR.

A separate PR sounds good. I filed an issue for it: #85528.

@dhartunian dhartunian removed the request for review from a team August 3, 2022 14:25
Copy link
Copy Markdown
Contributor

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @AlexTalks, and @KnightAsterial)


pkg/kv/kvserver/store_snapshot.go line 569 at r2 (raw file):

) (_cleanup func(), _err error) {
	tBegin := timeutil.Now()
	totalInProgressSnapshotMetric.Inc(1)

This increment should be after the logic that waits to acquire the snapshot reservation, right?

Currently, we have this asymmetry where the total metric includes the snapshots waiting on the semaphore whereas the non-empty in-progress metric doesn't.

Copy link
Copy Markdown
Contributor Author

@AlexTalks AlexTalks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @KnightAsterial)


pkg/kv/kvserver/store_snapshot.go line 569 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

This increment should be after the logic that waits to acquire the snapshot reservation, right?

Currently, we have this asymmetry where the total metric includes the snapshots waiting on the semaphore whereas the non-empty in-progress metric doesn't.

Thanks for catching - should be fixed now.

Copy link
Copy Markdown
Contributor

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @KnightAsterial)

@AlexTalks
Copy link
Copy Markdown
Contributor Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 4, 2022

Merge conflict.

While we previously had one metric to track "in-progress" snapshots,
`replicas.reserved`, this metric is used by both senders and receivers,
and does not track the number of snapshots queued on its semaphore. This
change adds the following new metrics, separated by sender and receiver:
```
range.snapshots.send-queue
range.snapshots.recv-queue
range.snapshots.send-in-progress
range.snapshots.recv-in-progress
range.snapshots.send-total-in-progress
range.snapshots.recv-total-in-progress
```

These metrics specifically track the number of queued non-empty
snapshots on a sender or receiver store, as well as the number of
snapshots currently being sent or received by a given store. The
first four metrics specifically only track non-empty snapshots, because
empty snapshots are not throttled at all, and thus do not compete for the
semaphore. The final two metrics track all snapshots, including empty
ones.

Release note (ops change): Added new metrics
`range.snapshots.(send|recv)-queue` and
`range.snapshots.(send|recv)-in-progress` to track the number of queued
and in-progress snapshots being sent or received on a store.
@AlexTalks
Copy link
Copy Markdown
Contributor Author

bors r+

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 4, 2022

Build failed (retrying...):

@craig
Copy link
Copy Markdown
Contributor

craig bot commented Aug 5, 2022

Build succeeded:

@craig craig bot merged commit c651d33 into cockroachdb:master Aug 5, 2022
@AlexTalks AlexTalks deleted the decomm_metrics branch August 5, 2022 17:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants