
stats: add 'detailed' buckets-mode that captures all data from circlhist #27094

Merged
jmarantz merged 30 commits into envoyproxy:main from jmarantz:histogram-detail on May 18, 2023

Conversation

@jmarantz (Contributor) commented May 2, 2023

Commit Message: The current mechanisms for extracting histograms leave a lot of useful data inaccessible. This adds a new extraction mechanism to provide the full computed detail provided by the histogram library, as well as JSON and textual access to this data through the admin port.

Additional Description: This partially addresses #11259 but does not add the new detailed bucket mode to the metrics service gRPC API.

This PR sets the stage for admin visualization of histograms in #26472. The visualization could work without this change, but it would not be very rich; it can reveal much more insight if the full bucket data is rendered. Note that when aggregating histograms over multiple Envoy proxies it may be better to use the configurable fixed bucket sizes, so that the histogram data can be combined across Envoy instances by the time-series database (e.g. Prometheus).

This PR could be split into smaller ones if helpful:

  • ParentHistogram detailed bucket APIs and tests
  • New Admin histogram_buckets enum value, Text and JSON admin renderings and tests

That might be overkill in terms of PR granularity, but when reviewing it might be useful to consider these levels of API separately.

Risk Level: Low; this is new functionality.
Testing: //test/...
Docs Changes: added doc for new bucket-mode.
Release Notes: required
Platform Specific Features: n/a

jmarantz added 2 commits May 1, 2023 22:03
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@repokitteh-read-only

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #27094 was opened by jmarantz.


jmarantz added 5 commits May 2, 2023 00:01
…sted.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@moderation (Contributor)

There is a dependency change I've been wanting to make for ages that might be a good fit for this PR. It is a metadata-only change:

diff --git a/bazel/repository_locations.bzl b/bazel/repository_locations.bzl                                                                 
index 2bf42e8df8..6af423a00a 100644                                                                                                          
--- a/bazel/repository_locations.bzl                                                                                                         
+++ b/bazel/repository_locations.bzl                                                                                                         
@@ -218,8 +217,8 @@ REPOSITORY_LOCATIONS_SPEC = dict(                                                                                        
     ),
     com_github_circonus_labs_libcircllhist = dict(
         project_name = "libcircllhist",
-        project_desc = "An implementation of Circonus log-linear histograms",                                                               
-        project_url = "https://github.com/circonus-labs/libcircllhist",                                                                     
+        project_desc = "An implementation of OpenHistogram log-linear histograms",                                                          
+        project_url = "https://github.com/openhistogram/libcircllhist",                                                                     
         version = "39f9db724a81ba78f5d037f1cae79c5a07107c8e",
         sha256 = "fd2492f6cc1f8734f8f57be8c2e7f2907e94ee2a4c02445ce59c4241fece144b",
         strip_prefix = "libcircllhist-{version}",
@@ -228,7 +227,7 @@ REPOSITORY_LOCATIONS_SPEC = dict(                                                                                        
         release_date = "2019-05-21",
         cpe = "N/A",
         license = "Apache-2.0",
-        license_url = "https://github.com/circonus-labs/libcircllhist/blob/{version}/LICENSE",                                              
+        license_url = "https://github.com/openhistogram/libcircllhist/blob/{version}/LICENSE",                                              
     ),
     com_github_cyan4973_xxhash = dict(

@moderation (Contributor)

Also, the commit we are using is from May 2019. I'm not sure which commit it is, but the latest commit fails in tests. It would be great to update to the latest commit at some point.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz (Contributor, Author) commented May 3, 2023

Hi @moderation -- any reason not to just issue a PR for that dep change so it isn't blocked on this one?

@moderation (Contributor)

No, but I don't think I have the C++ chops to fully incorporate the latest libcircllhist code.

@jmarantz (Contributor, Author) commented May 3, 2023

Ah, so there's an interface change?

Then I definitely don't want to roll that into this one :) I'll look at updating the dep separately.

@jmarantz (Contributor, Author) commented May 3, 2023

Actually, can you file a bug for the upgrade? You did say it "fails in tests". I think someone may have already looked at this issue, but I forget who.

@jmarantz (Contributor, Author) commented May 3, 2023

@moderation I drafted #27159 to see what CI has to say.

jmarantz added 4 commits May 3, 2023 20:50
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz jmarantz marked this pull request as ready for review May 4, 2023 02:58
jmarantz added 3 commits May 3, 2023 23:08
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz (Contributor, Author) commented May 4, 2023

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #27094 (comment) was created by @jmarantz.


jmarantz added 2 commits May 4, 2023 11:43
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@ggreenway (Member) left a comment:

/wait

intervals=1:2, 2:3
summary=P0(1,1) P25(1.0625,1.034) P50(2.0166,1.068) P75(2.058,2.005) P90(2.083,2.06) P95(2.091,2.08) P99(2.09,2.09) P99.5(2.099,2.098) P99.9(2.099,2.099) P100(2.1,2.1)

Each bucket is shown as `value:count`. In the above example there are two buckets. `totals`
@ggreenway (Member):

It's unclear from reading the docs what value means. Given that a bucket is always a range of values, what is this single value? Upper bound? Lower bound? Midpoint?

@jmarantz (Author):

@ramaraochavali do you know? I think I tried to suss this out with a test but couldn't determine it, nor could I find precise documentation for it other than the sparse comments in the circlhist header file.


@ggreenway (Member):

I think you need to either use hist_bucket_to_string() or hist_bucket_to_double() and some other fields to figure out the bounds of the bucket and express that in the output.

@jmarantz (Author):

Oh, I hadn't seen the hist_bucket_t APIs. I think I can use those to get a bucket min and bucket max, and ultimately use those to control the width of the histogram bar when rendering.

I'll go off and do that.


// Determine how we will map source buckets to destination buckets.
// Consider 3 cases:
// 1. if we have 15 source buckets and 5 destination buckets we'll
@ggreenway (Member):

Depending on the distribution of the histogram, this may not be a good algorithm for reducing bucket count. It is often preferable to keep more buckets near the upper/lower limits, and fewer buckets in the middle.

Can we get away with not limiting the buckets at all, and just using whatever the histogram library gives us? I believe it is a finite number of buckets, and not excessively large.

@jmarantz (Author):

I'm not sure what the maximum is from the histogram library, but I couldn't get it to produce that many buckets, and based on my experiments setting 64 buckets would generally not require any interpolation. The interpolation is just defensive, protecting against a future version of circlhist, or some data pattern I haven't tried, producing a lot more buckets.

I agree that putting the remainder at the beginning is not necessarily best, but I'm not sure why another distribution would be better, so I opted for what was easiest :)


@ggreenway (Member):

I'd say for now, just add an ENVOY_BUG(buckets <= 100) or something and remove the interpolation code entirely.

@jmarantz (Author):

That's probably OK. We could also decide that at call-sites by calling getDetail(MAX_BUCKETS + 1) and then asserting at the call-site that we got <= MAX_BUCKETS.

@ggreenway (Member):

Hmm, 190 buckets is a lot, but it's probably not so many that it would be difficult to render, right? But let me express this a different way: why try to keep the native resolution of the underlying histogram if we then change that resolution immediately by combining buckets? I believe the underlying design of this histogram has dynamic bucket widths. What if you are combining buckets of drastically different widths? I think that could result in strange distortions of the histogram. I'm wary of doing this type of math on histograms without fully understanding how it changes the data.

@jmarantz (Author):

If we just want to render geometry it's probably OK, but I wanted to annotate with actual values as text. I think 30-40 is about the max we'd want to interpolate to.

My thinking is that circlhist is reasonably good at picking the buckets logarithmically needed for a dynamic range, and that incrementally combining neighbors until you get the number of buckets you want should avoid being misleading. Changing from just a bucket 'value' to a range (min, width) also makes me feel we are not distorting the data with this transform.

But in any case this isn't changing the existing behavior; this would (at least for now) only be used for graphical rendering. We could then iterate on how we combine them.

Another option is to have the C++ always yield all the buckets and sort out interpolation in JS.

@jmarantz (Author):

I ran a few more experiments and found that the number of buckets appears to go up with the log of the number of distinct values I record. With 10M entries I get to 550 buckets. So from a memory and bandwidth perspective it might be OK to just stream out all the buckets as there will not be so many that it will OOM the binary.

But from a UI perspective we are going to want to combine buckets. We could do that in JS instead if that makes you more comfortable with this C++ interface. That would also make it possible for the UI to show detailed sub-bucket info when hovering over an interpolated histogram bar.

@ggreenway (Member):

Oh, that's interesting.

This is way outside the scope of this PR, but here's a thought: do we get any value from the current histogram implementation with dynamic buckets? For prometheus, we always coerce to a fixed set of buckets. For this new use case, you need to significantly coerce the buckets (although at least you're just moving them, not changing any of the bounds). Would it make more sense to just use a really simple histogram implementation with fixed buckets?

Assuming that's a topic for another day, I think in this PR I'd prefer having the JS do the bucket-merging for now so that we're not committing to a specific bucket merging behavior in Envoy; any client can merge in any way they see fit (linear, logarithmic, fancy-math, etc).

@jmarantz (Author) commented May 15, 2023:

I think we don't get much value from the dynamic impl right now in terms of detailed buckets, but I suspect it currently helps us provide accurate p95, p99, p99.5 etc across a wide range of data values. And with this series of PRs we'd get detailed bucketing for a single proxy via its admin endpoint. I think this coercion I proposed would rarely be activated in practice; it'd only come into play with a very wide dynamic range of actual data.

We could implement the coercion in C++ but let the endpoint query-params determine whether to coerce at all or what the max bucket-count should be. I think in practice we'll get less than 30 buckets; but I don't have actual data for this yet.

However, I'm warming up to moving the coercion to JS, as that way a UI could easily be added to show the detail within a merged bucket. So I'll remove the coercion from this PR. I'll move the status to 'draft' for now and re-open soon. Thanks for working through this.

jmarantz added 3 commits May 11, 2023 10:26
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
jmarantz added 2 commits May 11, 2023 11:03
…uckets.

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz jmarantz marked this pull request as draft May 15, 2023 16:54
jmarantz added 2 commits May 15, 2023 16:09
Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz jmarantz marked this pull request as ready for review May 16, 2023 04:02
@jmarantz (Contributor, Author)

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #27094 (comment) was created by @jmarantz.


@jmarantz (Author) left a comment:

Interpolation is now fully removed from this PR; it will re-emerge later in JS.


@ggreenway (Member) left a comment:

Mostly looks good now.

/wait

Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Joshua Marantz <jmarantz@google.com>
@ggreenway (Member) left a comment:

LGTM

@jmarantz (Contributor, Author)

/retest

@repokitteh-read-only

Retrying Azure Pipelines:
Check envoy-presubmit isn't fully completed, but will still attempt retrying.
Retried failed jobs in: envoy-presubmit

🐱

Caused by: a #27094 (comment) was created by @jmarantz.


Signed-off-by: Joshua Marantz <jmarantz@google.com>
@jmarantz jmarantz merged commit f4e2a26 into envoyproxy:main May 18, 2023
@jmarantz jmarantz deleted the histogram-detail branch May 18, 2023 01:05
reskin89 pushed a commit to reskin89/envoy that referenced this pull request Jul 11, 2023
…ist (envoyproxy#27094)


Signed-off-by: Joshua Marantz <jmarantz@google.com>
Signed-off-by: Ryan Eskin <ryan.eskin89@protonmail.com>