Skip to content

Add metrics for raft message send wait & receive delay duration#17735

Merged
ti-chi-bot[bot] merged 17 commits intotikv:masterfrom
hhwyt:weiyang-metrics
Nov 22, 2024
Merged

Add metrics for raft message send wait & receive delay duration#17735
ti-chi-bot[bot] merged 17 commits intotikv:masterfrom
hhwyt:weiyang-metrics

Conversation

@hhwyt
Copy link
Contributor

@hhwyt hhwyt commented Oct 30, 2024

What is changed and how it works?

Issue Number: Close #17683

What's Changed:

Add metrics for raft message send wait & receive delay duration

## Send wait duration

This phase begins when the RaftStore thread sends the RaftMessage
to the RaftClient and ends when the RaftMessage leaves the
BatchRaftMessage buffer, just before being flushed to the gRPC client.
Since this phase occurs entirely within the same node, we measure it
using monotonic time.

## Receive delay duration

Receive delay duration can also be called send duration. The name 
'Receive delay duration' is used because this duration is reported 
by the receiver, making it more clearer.

This phase begins after the send wait ends and continues as the message
is sent over the network, ends when the target peer receives it.
Since this phase spans across nodes, we measure it using physical time.
To facilitate this, we’ve introduced a last_observed_time field in
the BatchRaftMessage to record the physical times (please also review
the related PR: https://github.com/pingcap/kvproto/pull/1276).
Although physical clock drift between nodes is a possibility, its
impact on our measurement is limited. Our primary goal is to observe
trends in duration changes rather than relying on absolute precision.

NOTE: Metrics are only added for the batch_raft RPC, as the raft
RPC is deprecated and no longer tracked.

Grafana new panel:
image

Benchbot result:
The benchmark results for sysbench oltp_insert and oltp_write_only show no impact on performance.
image
image

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

none

@ti-chi-bot ti-chi-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has signed the dco. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. contribution This PR is from a community contributor. labels Oct 30, 2024
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Oct 30, 2024

Hi @hhwyt. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ti-chi-bot ti-chi-bot bot added needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 30, 2024
@hhwyt hhwyt force-pushed the weiyang-metrics branch 2 times, most recently from fd2f408 to ae7878c Compare October 30, 2024 13:35
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 30, 2024
@hhwyt hhwyt changed the title [WIP] Add metrics for raft message transmission duration Add metrics for raft message transmission duration Nov 6, 2024
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 6, 2024
@hhwyt
Copy link
Contributor Author

hhwyt commented Nov 6, 2024

/cc @glorv @Connor1996 @hbisheng @LykxSassinator PTAL, thanks~ Please also review the related PR: pingcap/kvproto#1276.

@hhwyt hhwyt changed the title Add metrics for raft message transmission duration Add metrics for raft message send wait & send duration Nov 6, 2024
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use NANOSECONDS_PER_SECOND

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make the metric local

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The performance result didn't show any difference. Do we really need local metrics?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, we encountered global metrics to be a performance impact. As it wouldn't take much effort to make it local, better do it. Check make_auto_flush_static_metric!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK.

Copy link
Contributor Author

@hhwyt hhwyt Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Connor1996 This job is done. PTAL again, thx~

@Connor1996
Copy link
Member

Please run a benchmark for it to see if there is any regression

Copy link
Contributor

@LykxSassinator LykxSassinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM

@hhwyt
Copy link
Contributor Author

hhwyt commented Nov 8, 2024

Please run a benchmark for it to see if there is any regression

The test result has already been provided in the first comment above. @Connor1996

@hhwyt
Copy link
Contributor Author

hhwyt commented Nov 11, 2024

@Connor1996 PTAL, thx~

@LykxSassinator
Copy link
Contributor

Pls check and fix the clippy errors, thx

@hhwyt
Copy link
Contributor Author

hhwyt commented Nov 12, 2024

Pls check and fix the clippy errors, thx

These errors is due to the lack of the last_observed_time filed which is added in the ongoing PR: pingcap/kvproto#1276. Is there a way we can run tests normally in this scenario?

@hhwyt
Copy link
Contributor Author

hhwyt commented Nov 12, 2024

@LykxSassinator I added a temporary commit to switch the kvproto repo to my repo. This commit will not be merged into master.

@hhwyt
Copy link
Contributor Author

hhwyt commented Nov 12, 2024

/retest

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Nov 12, 2024

@hhwyt: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Member

@Connor1996 Connor1996 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest LGTM

@ti-chi-bot ti-chi-bot bot removed the needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. label Nov 22, 2024
hhwyt and others added 16 commits November 22, 2024 10:59
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
This commit will not be merged into master.

Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Co-authored-by: Connor <zbk602423539@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Co-authored-by: Connor <zbk602423539@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
This reverts commit b8768ca9c74f8ae909d83b9fe1c4cbdd68b71f6f.

Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Nov 22, 2024
@hhwyt
Copy link
Contributor Author

hhwyt commented Nov 22, 2024

/pull-unit-test

@hhwyt
Copy link
Contributor Author

hhwyt commented Nov 22, 2024

/test pull-unit-test

@ti-chi-bot ti-chi-bot bot merged commit ffd7e34 into tikv:master Nov 22, 2024
@ti-chi-bot ti-chi-bot bot added this to the Pool milestone Nov 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved contribution This PR is from a community contributor. dco-signoff: yes Indicates the PR's author has signed the dco. lgtm ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Should add metric for gRPC transmission duration for raft commit log

5 participants