server: add an option to support returning simplified metrics #12417
glorv wants to merge 9 commits into tikv:master
Is it still worthwhile after enabling compression?
We also need to reduce the metrics storage size (by 90%), so this is one of the efforts toward that.
Signed-off-by: glorv <glorvs@163.com>
@innerr @BusyJay @Connor1996 @tonyxuqqi @5kbpers PTAL, thanks
ref tikv#12355, ref tikv#12417 Signed-off-by: glorv <glorvs@163.com> Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
What is changed and how it works?
Issue Number: ref #12355
What's Changed:
After this change, the metrics registry is divided into 4 registries:
- HIGH_PRIORITY_REGISTRY. The core metrics that should always be fetched. If server.simplified-metrics is set to true, we try to reduce some counter and histogram samples while still ensuring this won't affect the Grafana charts.
- FULL_HISTOGRAM_REGISTRY. These metrics are also high priority, but due to the expressions used in Grafana or the chart type ("flamegraph"), the sample data should not be simplified, so the original full sample data is always returned.
- UNUSED_METRICS_REGISTRY. These metrics are not used in TiKV's Grafana, so they are not returned in the simplified mode.
- default. The rest are in the system default registry. In the simplified mode, these samples are returned at a lower frequency. For now the frequency is 1/2, that is, one in every 2 requests includes them.

Add a new server config server.simplified-metrics to control whether the metrics API runs in the simplified mode.
If this config is set to true (false by default), TiKV will return simplified metrics. If set to false, it will still return the full metrics as before.
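The registry split above can be pictured with a small selection function. This is only a minimal sketch under stated assumptions, not the code in this PR: the `Registry` enum, `DEFAULT_INTERVAL`, `registries_for_request`, `reduces_samples`, and the idea of driving the 1/2 frequency from a request sequence number are all hypothetical names and choices made for illustration.

```rust
/// Hypothetical labels for the four registries described in the PR text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Registry {
    HighPriority,
    FullHistogram,
    Unused,
    Default,
}

/// Assumption: default-registry metrics are included in one of every
/// `DEFAULT_INTERVAL` requests while simplified mode is on (1/2 per the PR).
pub const DEFAULT_INTERVAL: u64 = 2;

/// Decide which registries a single metrics request should gather.
/// `request_seq` is an assumed per-server counter of metrics requests.
pub fn registries_for_request(simplified: bool, request_seq: u64) -> Vec<Registry> {
    if !simplified {
        // Full mode: return everything, as before the change.
        return vec![
            Registry::HighPriority,
            Registry::FullHistogram,
            Registry::Unused,
            Registry::Default,
        ];
    }
    // Simplified mode: the two high-priority registries are always returned,
    // and the unused registry is never returned.
    let mut selected = vec![Registry::HighPriority, Registry::FullHistogram];
    if request_seq % DEFAULT_INTERVAL == 0 {
        selected.push(Registry::Default);
    }
    selected
}

/// Whether counter/histogram samples from `registry` are reduced.
/// Assumption: only HIGH_PRIORITY_REGISTRY is reduced, since that is the
/// only registry the description says this about.
pub fn reduces_samples(registry: Registry, simplified: bool) -> bool {
    simplified && registry == Registry::HighPriority
}
```

With these assumptions, a simplified-mode request with an odd sequence number gathers only the two high-priority registries, while an even one also gathers the default registry.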
Sample data simplification strategy:
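For the histogram part, the NOTE that follows says only buckets with 0 sample values are filtered out. The sketch below shows one possible reading of that, assuming Prometheus-style cumulative buckets and assuming the `+Inf` bucket is always kept. `Bucket` and `compact_buckets` are hypothetical names, not taken from this PR.

```rust
/// One cumulative histogram bucket, in the style of the Prometheus
/// exposition format (`le` upper bound plus cumulative count).
#[derive(Debug, Clone, PartialEq)]
pub struct Bucket {
    pub upper_bound: f64,
    pub cumulative_count: u64,
}

/// Drop buckets that received no samples of their own.
/// Assumptions: `buckets` is sorted by `upper_bound`, counts are cumulative,
/// and the `+Inf` bucket is always kept so the total count stays visible.
pub fn compact_buckets(buckets: &[Bucket]) -> Vec<Bucket> {
    let mut kept = Vec::new();
    let mut previous = 0u64;
    for bucket in buckets {
        // Samples that fell into this bucket only, not into earlier ones.
        let own = bucket.cumulative_count.saturating_sub(previous);
        if own > 0 || bucket.upper_bound.is_infinite() {
            kept.push(bucket.clone());
        }
        previous = bucket.cumulative_count;
    }
    kept
}
```

Under this reading, a dropped bucket's cumulative count always equals that of the nearest kept bucket below it (or 0 if there is none), which is why the compaction loses no information in theory.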
NOTE:
After histogram compaction there is no information loss in theory (we only filter out buckets with 0 sample values). However, due to how Prometheus evaluates expressions, some types of expressions or charts cannot be drawn correctly, so we keep the affected metrics in the FULL_HISTOGRAM_REGISTRY. Currently, the affected charts are those whose expressions contain the delta function and those of the flamegraph type.
Side Effect:
Due to the reduction of some metrics, when server.simplified-metrics is set to true, there are some negative side effects on the Grafana data:
0 data because we filter out 0 data for counter and histogram.
Benchmark result:
With server.simplified-metrics set to false, the response data size is about 1.3MB and the total sample count is 17081.
With server.simplified-metrics set to true, when only high priority metrics are returned, the data size is about 125KB and the sample count is about 1494. When the default metrics are also returned, the data size is 226KB and the sample count is 2609.
Thus, averaged over the two kinds of response, the ratio of samples returned with server.simplified-metrics set to true versus false is (1494+2609)/2/17081 = 0.12.
Related changes
pingcap/docs / pingcap/docs-cn:
Check List
Tests
Side effects
Release note