Skip to content

server: add an option to support returning simplified metrics#12417

Closed
glorv wants to merge 9 commits intotikv:masterfrom
glorv:simplify-metrics
Closed

server: add an option to support returning simplified metrics#12417
glorv wants to merge 9 commits intotikv:masterfrom
glorv:simplify-metrics

Conversation

@glorv
Copy link
Contributor

@glorv glorv commented Apr 22, 2022

What is changed and how it works?

Issue Number: ref #12355

What's Changed:
After this change, the metrics registry is divided into 4 registries:

  • HIGH_PRIORITY_REGISTRY. The core metrics that should be always fetched, if server.simplified-metrics is set to true, we try to reduce some counter and histogram samples but still ensure it won't affact the grafana charts.
  • FULL_HISTOGRAM_REGISTRY. These metrics are also high priority, but due to the expression used in grafana or the charts type("flamegraph"), the sample data should not be simplified, so always return the original full sample data.
  • UNUSED_METRICS_REGISTRY. These metrics are not used in tikv's grafana, so do not return them in the simplified mode.
  • default. The rest are in the system default registry. In the simplified mode, these samples will be returned will a smaller frequency, by not the frequency is 1/2, that is one in 2 request includes them.

Add a new server config server.simplified-metrics to control whether the metrics API is running under the simplified-mode.
If this config is set to true(false by default), tikv will return simplified metrics. If set to false, it will still return the full metrics as before.

Sample data simplification strategy:

  • Counter. Filter out all counters that its value is 0.
  • Histogram. If the histogram's max accumulate count is 0, it will be filtered out. Else, any bucket with its value equal to the former and successive bucket's value will be filtered out.

NOTE:
After the compact of histogram, there is no information loss in theory (as we only filtered out buckets will 0 sample values), due to the expr implementation of prometheus, some type of expression or charts can not be drawn correctly, so we keep the effected metrics at the FULL_HISTOGRAM_REGISTRY. Currently, the affected charts are: expression contains delta function and charts with flamegraph.

Side Effect:
due to the reduction of some metrics, when server.simplified-metrics is set to true, there are some negative side effects to the grafana data:

  • Some expression will show no data instead of 0 data because we filter out 0 data for counter and histogram.
  • The precision of some charts will decrease by half because the sample duration is decreased from 15s to 30s.

Benchmark result:
With server.simplified-metrics set to false, the response data size is about 1.3MB, total samples count is: 17081.
With server.simplified-metrics set to true, when only high priority metrics are returned, the data size is about 125KB, the samples count is about 1494. when the default metrics is also returned, the data size is 226KB, samples count is 2609.

Thus the overall sample reduced percentage when server.simplified-metrics set to true and false is: (1494+2609)/2/17081 = 0.12.

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression
    • Consumes more CPU
    • Consumes more MEM
  • Breaking backward compatibility

Release note

None

@ti-chi-bot
Copy link
Member

ti-chi-bot commented Apr 22, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • innerr

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Details

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 22, 2022
@glorv glorv requested review from BusyJay and innerr April 22, 2022 04:09
@glorv glorv force-pushed the simplify-metrics branch 3 times, most recently from 8d912fe to 669a356 Compare April 22, 2022 05:35
@BusyJay
Copy link
Member

BusyJay commented Apr 22, 2022

Is it still worthy after enabling compression? 0 is not the same as absent.

@glorv
Copy link
Contributor Author

glorv commented Apr 22, 2022

Is it still worthy after enabling compression? 0 is not the same as absent.

We need to also reduce the metrics storage size(by 90%), so this is one of the efforts.
I think for histogram and counter, the different between 0 and absent should be small enough that it won't cause any human noticeable issue. For the charts that 0 is import, maybe we can enable the absent as 0 option.

glorv added 7 commits April 29, 2022 12:58
Signed-off-by: glorv <glorvs@163.com>
Signed-off-by: glorv <glorvs@163.com>
Signed-off-by: glorv <glorvs@163.com>
Signed-off-by: glorv <glorvs@163.com>
Signed-off-by: glorv <glorvs@163.com>
Signed-off-by: glorv <glorvs@163.com>
@glorv glorv force-pushed the simplify-metrics branch from 669a356 to 7a68302 Compare May 6, 2022 13:18
@ti-chi-bot ti-chi-bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels May 6, 2022
@glorv glorv changed the title server: add an option to support filter useless metrics server: add an option to support returning simplified metrics May 6, 2022
@glorv
Copy link
Contributor Author

glorv commented May 6, 2022

/test

@glorv
Copy link
Contributor Author

glorv commented May 9, 2022

@innerr @BusyJay @Connor1996 @tonyxuqqi @5kbpers PTAL, thanks

ti-chi-bot added a commit that referenced this pull request May 25, 2022
ref #12355, ref #12417

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 1, 2022
@ti-chi-bot
Copy link
Member

@glorv: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Contributor

@innerr innerr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Jun 6, 2022
glorv added a commit to glorv/tikv that referenced this pull request Jun 22, 2022
ref tikv#12355, ref tikv#12417

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
ti-chi-bot added a commit that referenced this pull request Jun 23, 2022
ref #12355, ref #12356, ref #12417, ref #12671

Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
@glorv glorv closed this Jun 29, 2022
ti-chi-bot added a commit that referenced this pull request Aug 22, 2022
ref #12355, ref #12417, ref #12602

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
Signed-off-by: glorv <glorvs@163.com>

Co-authored-by: glorv <glorvs@163.com>
Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
@glorv glorv deleted the simplify-metrics branch October 20, 2022 08:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. status/LGT1 Indicates that a PR has LGTM 1.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants