prometheus stats: Correctly group lines of the same metric name.#10833
prometheus stats: Correctly group lines of the same metric name.#10833ggreenway merged 11 commits intoenvoyproxy:masterfrom
Conversation
Fixes envoyproxy#10073 Signed-off-by: Greg Greenway <ggreenway@apple.com>
|
Question for anyone with knowledge of the exposition format (https://github.com/prometheus/docs/blob/master/content/docs/instrumenting/exposition_formats.md#grouping-and-sorting): The format says How does this apply to histograms? Is a single histogram, which is composed of several Or |
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
source/server/http/stats_handler.cc
Outdated
| auto generate_counter_and_gauge_output = | ||
| [](const auto& metric) -> std::pair<std::string, std::string> { | ||
| const std::string tags = formattedTags(metric.tags()); | ||
| const std::string metric_name = metricName(metric.tagExtractedName()); |
There was a problem hiding this comment.
I'm worried about the explosion of string-storage during the sort. Can we populate a vector of metrics, and sort them by Metric::tagExtractedStatName()? StatNames are sortable.
Then we could stream out the formatted value?
I realize that at the moment the admin output infrastructure buffers everything, so I'm not under the delusion we can in this PR have the prometheus handler consume constant space. But I think we could make it consume half the space it consumes with this approach, and that's probably a significant memory bump. And I think it wouldn't be hugely difficult to update the admin output infrastructure to stream rather than fully buffer.
There was a talk at KubeCon by Splunk about stats cardinality blowing through hundreds of gigs of memory in Prometheus (slides) and the source of the stats explosion was Istio.
There was a problem hiding this comment.
also to sort by StatName you can use:
envoy/source/common/stats/symbol_table_impl.h
Line 697 in 62777e8
There was a problem hiding this comment.
I worried about this too. Let me take a pass at this and see how it comes out.
Signed-off-by: Greg Greenway <ggreenway@apple.com>
Signed-off-by: Greg Greenway <ggreenway@apple.com>
|
@jmarantz I think I've fixed the memory usage. PTAL. |
jmarantz
left a comment
There was a problem hiding this comment.
pretty cool! just a couple of minor suggestions.
@rulex123 FYI as you are messing around in the same code though I don't see a conflict. Note that with this PR we move closer to a point where we don't need to hold the entire copy of the stringified stats output in memory during an admin request.
jmarantz
left a comment
There was a problem hiding this comment.
Looks great generally, just nit-picking style now.
include/envoy/stats/refcount_ptr.h
Outdated
| // API in years. | ||
| template <class T> class RefcountPtr { | ||
| public: | ||
| using element_type = T; |
There was a problem hiding this comment.
doc this? And also consider spelling it ElementType per Envoy style?
There was a problem hiding this comment.
This definition is taken from std::shared_ptr. I'll doc it.
source/server/http/stats_handler.cc
Outdated
| * @param generate_output A std::function<std::string(const MetricType& metric, const std::string& | ||
| * prefixedTagExtractedName)> which returns the output text for this metric. | ||
| */ | ||
| auto process_type = [&response, ®ex, used_only](const auto& metrics, |
There was a problem hiding this comment.
I was wondering whether you could use templates in some way to reduce the use of auto -- the auto for the generated function is OK but it'd be nicer to be explicit about what type metrics and generate_output were.
See https://google.github.io/styleguide/cppguide.html#Type_deduction for discussion.
I think my concrete suggestion is to break up most of this code into a private helper method processType which is templatized. Then you could have the body of statsAsPrometheus() simply be:
uint64_t metric_name_count = 0;
metric_name_count += processType<Stats::Counter>(counters, generate_counter_and_gauge_output, "counter");
metric_name_count += processType<Stats::Gauge>(gauges, generate_counter_and_gauge_output, "gauge");
metric_name_count += processType<Stats::Histogram>(histograms, generate_histogram_output, "histogram");
I think that might be clearer than using the lambda with the auto...WDYT?
That would also remove the need for the element_type in the refcount, at least for this PR.
There was a problem hiding this comment.
Both metrics and generate_output are templatized types because of how Counter, Gauge, and Histogram are defined. There isn't a common/inherited defintion for the functions that get the data. I had tried to make these non-auto, and it isn't possible.
I can make the helpers into template methods instead of a lambda, and I agree that may make this a little easier to follow. Some of the types will become more concrete.
There was a problem hiding this comment.
I think once you make the helper into an explicit template method you'll be able to drop the auto, though I could be wrong.
There was a problem hiding this comment.
Yeah, it came out better than I expected, although I had to explicitly specify template parameters in some cases that were unexpected to me. But I think it is more readable.
|
/wait |
Signed-off-by: Greg Greenway <ggreenway@apple.com>
jmarantz
left a comment
There was a problem hiding this comment.
Looks great...thank you!
Signed-off-by: Greg Greenway <ggreenway@apple.com>
|
Haven't seen this error before: /retest |
|
🔨 rebuilding |
Signed-off-by: Spencer Lewis <slewis@squareup.com> * master: fault injection: add support for setting gRPC status (envoyproxy#10841) tests: tag tests that fail on Windows with fails_on_windows (envoyproxy#10940) Fix typo on Postgres Proxy documentation. (envoyproxy#10930) fuzz: improve header/data stop/continue modeling in HCM fuzzer. (envoyproxy#10931) gzip filter: allow setting zlib compressor's chunk size (envoyproxy#10508) http: replace vector/reserve with InlinedVector in codec helper (envoyproxy#10941) stats: add utilities to create stats from a vector of tokens, mixing dynamic and symbolic elements. (envoyproxy#10735) hcm: avoid invoking 100-continue handling on decode filter. (envoyproxy#10929) prometheus stats: Correctly group lines of the same metric name. (envoyproxy#10833) status: Fix ASAN error in Status payload handling (envoyproxy#10906) path: Fix merge slash for paths ending with slash and present query args (envoyproxy#10922) compressor filter: add benchmark (envoyproxy#10464) xray: expected_span_name is not captured by the lambda with MSVC (envoyproxy#10934) ci: update before purge in cleanup (envoyproxy#10938) tracer: Improve test coverage for x-ray (envoyproxy#10890) Revert "init: order dynamic resource initialization to make RTDS always be first (envoyproxy#10362)" (envoyproxy#10919)
Signed-off-by: Greg Greenway ggreenway@apple.com
Description: The prometheus exposition format requires all metrics of the same name (without tags) be grouped contiguously in the output. Additionally, it specifies that it is preferred for the stats to be output in the same order every time they are produced. Fix Envoy to comply with both of these constraints by taking an extra pass to collect all the stats so that they can be sorted before the final output is generated.
Risk Level: Low
Testing: New unit test
Docs Changes: Not needed
Release Notes: Added
Fixes #10073