prometheus stats: Correctly group lines of the same metric name. by ggreenway · Pull Request #10833 · envoyproxy/envoy

ggreenway · 2020-04-17T20:30:10Z

Signed-off-by: Greg Greenway ggreenway@apple.com

Description: The prometheus exposition format requires all metrics of the same name (without tags) be grouped contiguously in the output. Additionally, it specifies that it is preferred for the stats to be output in the same order every time they are produced. Fix Envoy to comply with both of these constraints by taking an extra pass to collect all the stats so that they can be sorted before the final output is generated.

Risk Level: Low
Testing: New unit test
Docs Changes: Not needed
Release Notes: Added
Fixes #10073

Fixes envoyproxy#10073 Signed-off-by: Greg Greenway <ggreenway@apple.com>

ggreenway · 2020-04-17T20:43:36Z

Question for anyone with knowledge of the exposition format (https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md#grouping-and-sorting):

The format says All lines for a given metric must be provided as one single group, with the optional HELP and TYPE lines first (in no particular order)..

How does this apply to histograms? Is a single histogram, which is composed of several _bucket lines plus the _sum and _count lines, considered a single metric in the sorting directive? For example, which of the following is correct? I'm guessing it is the first, because of how the TYPE annotation is specified, with hist as the name, not including any of the suffixes.

# TYPE hist histogram
hist_bucket{cluster="a", le="1"} 1
hist_bucket{cluster="a", le="+Inf"} 1
hist_sum{cluster="a"} 1
hist_count{cluster="a"} 1
hist_bucket{cluster="b", le="1"} 1
hist_bucket{cluster="b", le="+Inf"} 1
hist_sum{cluster="b"} 1
hist_count{cluster="b"} 1

Or

# TYPE hist histogram
hist_bucket{cluster="a", le="1"} 1
hist_bucket{cluster="a", le="+Inf"} 1
hist_bucket{cluster="b", le="1"} 1
hist_bucket{cluster="b", le="+Inf"} 1
hist_sum{cluster="a"} 1
hist_sum{cluster="b"} 1
hist_count{cluster="a"} 1
hist_count{cluster="b"} 1

Signed-off-by: Greg Greenway <ggreenway@apple.com>

jmarantz · 2020-04-23T17:59:29Z

source/server/http/stats_handler.cc

+  auto generate_counter_and_gauge_output =
+      [](const auto& metric) -> std::pair<std::string, std::string> {
+    const std::string tags = formattedTags(metric.tags());
+    const std::string metric_name = metricName(metric.tagExtractedName());


I'm worried about the explosion of string-storage during the sort. Can we populate a vector of metrics, and sort them by Metric::tagExtractedStatName()? StatNames are sortable.

Then we could stream out the formatted value?

I realize that at the moment the admin output infrastructure buffers everything, so I'm not under the delusion we can in this PR have the prometheus handler consume constant space. But I think we could make it consume half the space it consumes with this approach, and that's probably a significant memory bump. And I think it wouldn't be hugely difficult to update the admin output infrastructure to stream rather than fully buffer.

There was a talk at KubeCon by Splunk about stats cardinality blowing through hundreds of gigs of memory in Prometheus (slides) and the source of the stats explosion was Istio.

also to sort by StatName you can use:

envoy/source/common/stats/symbol_table_impl.h

Line 697 in 62777e8

struct StatNameLessThan {

I worried about this too. Let me take a pass at this and see how it comes out.

Signed-off-by: Greg Greenway <ggreenway@apple.com>

ggreenway · 2020-04-24T00:19:18Z

@jmarantz I think I've fixed the memory usage. PTAL.

jmarantz

pretty cool! just a couple of minor suggestions.

@rulex123 FYI as you are messing around in the same code though I don't see a conflict. Note that with this PR we move closer to a point where we don't need to hold the entire copy of the stringified stats output in memory during an admin request.

source/server/http/stats_handler.cc

Signed-off-by: Greg Greenway <ggreenway@apple.com>

jmarantz

Looks great generally, just nit-picking style now.

jmarantz · 2020-04-24T18:18:03Z

include/envoy/stats/refcount_ptr.h

 // API in years.
 template <class T> class RefcountPtr {
 public:
+  using element_type = T;


doc this? And also consider spelling it ElementType per Envoy style?

This definition is taken from std::shared_ptr. I'll doc it.

jmarantz · 2020-04-24T18:25:06Z

source/server/http/stats_handler.cc

+   * @param generate_output A std::function<std::string(const MetricType& metric, const std::string&
+   *        prefixedTagExtractedName)> which returns the output text for this metric.
+   */
+  auto process_type = [&response, &regex, used_only](const auto& metrics,


I was wondering whether you could use templates in some way to reduce the use of auto -- the auto for the generated function is OK but it'd be nicer to be explicit about what type metrics and generate_output were.

See https://google.github.io/styleguide/cppguide.html#Type_deduction for discussion.

I think my concrete suggestion is to break up most of this code into a private helper method processType which is templatized. Then you could have the body of statsAsPrometheus() simply be:

uint64_t metric_name_count = 0; metric_name_count += processType<Stats::Counter>(counters, generate_counter_and_gauge_output, "counter"); metric_name_count += processType<Stats::Gauge>(gauges, generate_counter_and_gauge_output, "gauge"); metric_name_count += processType<Stats::Histogram>(histograms, generate_histogram_output, "histogram");

I think that might be clearer than using the lambda with the auto...WDYT?

That would also remove the need for the element_type in the refcount, at least for this PR.

Both metrics and generate_output are templatized types because of how Counter, Gauge, and Histogram are defined. There isn't a common/inherited defintion for the functions that get the data. I had tried to make these non-auto, and it isn't possible.

I can make the helpers into template methods instead of a lambda, and I agree that may make this a little easier to follow. Some of the types will become more concrete.

I think once you make the helper into an explicit template method you'll be able to drop the auto, though I could be wrong.

Yeah, it came out better than I expected, although I had to explicitly specify template parameters in some cases that were unexpected to me. But I think it is more readable.

jmarantz · 2020-04-24T18:26:17Z

/wait

Signed-off-by: Greg Greenway <ggreenway@apple.com>

jmarantz

Looks great...thank you!

Signed-off-by: Greg Greenway <ggreenway@apple.com>

ggreenway · 2020-04-24T21:41:56Z

Haven't seen this error before:

Executed 1 out of 1 test: 1 test passes.
INFO: Build completed successfully, 6117 total actions
Merging coverage data...
error: bazel-out/k8-fastbuild/testlogs/test/coverage/coverage_tests/shard_20_of_50/coverage.dat: Malformed instrumentation profile data

/retest

repokitteh-read-only · 2020-04-24T21:42:00Z

🔨 rebuilding ci/circleci: coverage (failed build)

🐱

Caused by: a #10833 (comment) was created by @ggreenway.

see: more, trace.

Signed-off-by: Spencer Lewis <slewis@squareup.com> * master: fault injection: add support for setting gRPC status (envoyproxy#10841) tests: tag tests that fail on Windows with fails_on_windows (envoyproxy#10940) Fix typo on Postgres Proxy documentation. (envoyproxy#10930) fuzz: improve header/data stop/continue modeling in HCM fuzzer. (envoyproxy#10931) gzip filter: allow setting zlib compressor's chunk size (envoyproxy#10508) http: replace vector/reserve with InlinedVector in codec helper (envoyproxy#10941) stats: add utilities to create stats from a vector of tokens, mixing dynamic and symbolic elements. (envoyproxy#10735) hcm: avoid invoking 100-continue handling on decode filter. (envoyproxy#10929) prometheus stats: Correctly group lines of the same metric name. (envoyproxy#10833) status: Fix ASAN error in Status payload handling (envoyproxy#10906) path: Fix merge slash for paths ending with slash and present query args (envoyproxy#10922) compressor filter: add benchmark (envoyproxy#10464) xray: expected_span_name is not captured by the lambda with MSVC (envoyproxy#10934) ci: update before purge in cleanup (envoyproxy#10938) tracer: Improve test coverage for x-ray (envoyproxy#10890) Revert "init: order dynamic resource initialization to make RTDS always be first (envoyproxy#10362)" (envoyproxy#10919)

prometheus stats: Correctly group lines of the same metric name.

d790da2

Fixes envoyproxy#10073 Signed-off-by: Greg Greenway <ggreenway@apple.com>

ggreenway mentioned this pull request Apr 17, 2020

Envoy doesn't output valid prometheus metrics #10073

Closed

fix spelling

48b0f06

Signed-off-by: Greg Greenway <ggreenway@apple.com>

mattklein123 assigned jmarantz Apr 18, 2020

Merge remote-tracking branch 'upstream/master' into prom_stats_sort

10e06ac

Signed-off-by: Greg Greenway <ggreenway@apple.com>

jmarantz reviewed Apr 23, 2020

View reviewed changes

ggreenway added 3 commits April 23, 2020 12:43

Merge remote-tracking branch 'upstream/master' into prom_stats_sort

6778f66

compare StatName instead of strings to avoid doubling required memory

e4d0374

Signed-off-by: Greg Greenway <ggreenway@apple.com>

fix_format

a6ba179

Signed-off-by: Greg Greenway <ggreenway@apple.com>

jmarantz reviewed Apr 24, 2020

View reviewed changes

source/server/http/stats_handler.cc Outdated Show resolved Hide resolved

source/server/http/stats_handler.cc Outdated Show resolved Hide resolved

source/server/http/stats_handler.cc Outdated Show resolved Hide resolved

ggreenway added 2 commits April 24, 2020 10:03

fixes

1e4d334

Signed-off-by: Greg Greenway <ggreenway@apple.com>

spelling

b7f3fac

Signed-off-by: Greg Greenway <ggreenway@apple.com>

jmarantz reviewed Apr 24, 2020

View reviewed changes

repokitteh-read-only bot added the waiting label Apr 24, 2020

Convert lambdas to templates for readability

c2eb4fb

Signed-off-by: Greg Greenway <ggreenway@apple.com>

repokitteh-read-only bot removed the waiting label Apr 24, 2020

jmarantz previously approved these changes Apr 24, 2020

View reviewed changes

move prom generation into it's own file

8aee0a6

Signed-off-by: Greg Greenway <ggreenway@apple.com>

ggreenway dismissed jmarantz’s stale review via 8aee0a6 April 24, 2020 20:14

Merge remote-tracking branch 'upstream/master' into prom_stats_sort

836783a

jmarantz approved these changes Apr 24, 2020

View reviewed changes

ggreenway merged commit 6516c88 into envoyproxy:master Apr 24, 2020

ppeble mentioned this pull request Apr 27, 2020

Pull in Prometheus Stats fix from Envoy upstream emissary-ingress/emissary#2650

Closed

Conversation

ggreenway commented Apr 17, 2020

Uh oh!

ggreenway commented Apr 17, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ggreenway commented Apr 24, 2020

Uh oh!

jmarantz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jmarantz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jmarantz Apr 24, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jmarantz commented Apr 24, 2020

Uh oh!

jmarantz left a comment

Choose a reason for hiding this comment

Uh oh!

ggreenway commented Apr 24, 2020

Uh oh!

repokitteh-read-only bot commented Apr 24, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

ggreenway commented Apr 17, 2020 •

edited

Loading

jmarantz Apr 24, 2020 •

edited

Loading