For exponential histograms, our percentile and related algorithms currently assume that all values in a bucket have the exact same value.
This follows the algorithm suggested in the DDSketch and UDDSketch papers, which provide the scientific proofs for the exponential histogram data structure.
This algorithm optimizes for the least worst-case relative error of the percentile estimation.
We assume that all points in a histogram bucket lie on the "point of least relative error" (polre): This is the value that has the least relative error if the actual, correct value lies on either of the histogram bucket boundaries. An example:
- A histogram is queried for a percentile, and it is determined that the percentile lies in the `[1, 2]` bucket, which contains 100 values.
- Our algorithm always returns `1.33333`, no matter which of the 100 values we are looking for.
- This is done because it has the least relative error towards the histogram boundaries `1` and `2`, so we minimize the worst-case error: `|1.33333 - 1| / 1 = 33.333%` relative error in case the real percentile is `1`, and `|1.33333 - 2| / 2 = 33.333%` in case the real percentile is `2`.
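The polre can be derived by equating the relative error towards both boundaries, which yields `2 * l * u / (l + u)` (the harmonic mean of the bucket bounds). A minimal sketch of this (function names are illustrative, not from any particular codebase):

```python
def polre(lower, upper):
    """Point of least relative error for a bucket [lower, upper]:
    solving (v - lower) / lower = (upper - v) / upper for v gives
    v = 2 * lower * upper / (lower + upper), the harmonic mean of
    the bucket boundaries."""
    return 2 * lower * upper / (lower + upper)

def worst_case_relative_error(lower, upper):
    """Worst-case relative error of the polre estimate if the true
    value lies on either bucket boundary."""
    v = polre(lower, upper)
    return max((v - lower) / lower, (upper - v) / upper)

print(polre(1, 2))                      # 1.3333...
print(worst_case_relative_error(1, 2))  # 0.3333... (i.e. 33.333%)
```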
However, this is not the only possible approach. For example, the experimental Prometheus native histograms (which use the same data structure as exponential histograms) instead perform an interpolation within a bucket. In other words, they assume that the values in the bucket are distributed across the bucket instead of all falling on the single polre point.
Here is the explanation from the Prometheus docs:
> The worst case is an estimation at one end of a bucket where the actual value is at the other end of the bucket. Therefore, the maximum possible error is the whole width of a bucket. Not doing any interpolation and using some fixed midpoint within a bucket (for example the arithmetic mean or even the harmonic mean) would minimize the maximum possible error (which would then be half of the bucket width in case of the arithmetic mean), but in practice, the linear interpolation yields an error that is lower on average. Since the interpolation has worked well over many years of classic histogram usage, interpolation is also applied for native histograms.
>
> Therefore, PromQL uses exponential extrapolation for the standard schemas, which models the assumption that dividing a bucket into two when increasing the schema number by one (i.e. doubling the resolution) will on average see similar populations in both new buckets. A more detailed explanation can be found in the PR implementing the interpolation method.
(Source)
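The "exponential extrapolation" idea described in the quote can be sketched as follows (a simplified illustration of the concept, not PromQL's actual implementation): for a bucket `[l, u]` and a fraction `f` of the bucket's population below the target rank, the estimate is `l * (u/l)^f`, so splitting the bucket at `f = 0.5` lands on the geometric mean, matching the assumption that doubling the resolution splits the population evenly.

```python
def interpolate_in_bucket(lower, upper, fraction):
    """Exponential interpolation within [lower, upper]:
    fraction = 0 returns the lower bound, fraction = 1 the upper
    bound, and fraction = 0.5 the geometric mean of the bounds."""
    return lower * (upper / lower) ** fraction

print(interpolate_in_bucket(1, 2, 0.0))  # 1.0
print(interpolate_in_bucket(1, 2, 0.5))  # 1.4142... (sqrt(2))
print(interpolate_in_bucket(1, 2, 1.0))  # 2.0
```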
So let's apply that to the same example as above:
- A histogram is queried for a percentile, and it is determined that the percentile lies in the `[1, 2]` bucket, which contains 100 values.
- If we ask for a percentile which happens to be the first value in that bucket, we return `1`. If it's the last value in the bucket, we return `2`. If it's something in between, we interpolate.
- According to the claims from the Prometheus team, this minimizes the average error for real-world data distributions.
- The worst-case error (the "hard guarantees"), however, is worse: e.g. if we estimate the percentile to be `2`, but the correct value is `1`, we have a relative error of `|1 - 2| / 1 = 100%`.
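The trade-off in the two examples above can be made concrete with a small, illustrative calculation (helper names are made up for this sketch; polre is the harmonic mean `2lu / (l + u)` of the bucket bounds):

```python
def polre(lower, upper):
    # Point of least relative error: harmonic mean of the bounds.
    return 2 * lower * upper / (lower + upper)

def worst_case_error(estimate, lower, upper):
    """Worst possible relative error of `estimate` if the true value
    can be anywhere in [lower, upper]."""
    return max(abs(estimate - lower) / lower, abs(estimate - upper) / upper)

# polre approach: always answer 1.33333 for the [1, 2] bucket.
print(worst_case_error(polre(1, 2), 1, 2))  # 0.3333... (33.333%)

# Interpolation approach: we may answer 2 while the true value is 1.
print(worst_case_error(2, 1, 2))            # 1.0 (100%)
```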
So there is no real right or wrong here; we just need to decide on an approach and ideally stick with it.
## Use current approach (always return polre)
Pros:
- Minimizes the worst-case error, providing best "hard guarantees"
- Allows for "upscaling" during histogram merging to minimize the impact of low-resolution data (explanation below)
- The percentiles give a perception of the accuracy of the provided data: e.g. if my `p95` and `p99` happen to be equal, I can suspect that (a) I'm looking at little data or (b) my data is low-accuracy and I should do something about it
Cons:
- In the average case, the percentile will be further off the real value (worse average-case error), according to the Prometheus team's claims
- Data/queries can be more easily perceived as "broken": e.g. "My p95 and p99 percentiles are exactly equal, Elasticsearch doesn't work!!"
## Use interpolation approach (smoothly interpolate between bucket boundaries)
Pros:
- Better Prometheus compatibility (though Prometheus native histograms are still experimental)
- Minimizes the average-case error (according to Prometheus team claims)
- "Smoother" values: e.g. the p99 will always be at least slightly above p95
Cons:
- The worst-case error is higher than with the other approach; however, there is still an error bound, so hard guarantees remain
- Can create a false perception of accuracy when querying low-resolution data for percentiles
- "Upscaling" during histogram merging is not possible: A single low-resolution histogram will "drag down" the accuracy of the percentile estimation, even if 99% of the data is high-accuracy
## Current "upscaling" merge implementation
Let's assume we want to estimate a percentile for the combination of two histograms:
- The first histogram contains 10k values, has a base of `1.01`, and is fairly accurate: the first bucket is `[1, 1.01]`, the second is `[1.01, 1.0201]`, and so on
- The second histogram contains only a single value and has the relatively wide bucket `[1, 2]`
Before estimating a percentile, we have to merge those two histograms. To do so, they need to be brought to the same bucket sizes.
So normally, the first histogram's bucket sizes would be increased to match the bucket sizes of the second histogram.
This greatly reduces the accuracy, even though the first histogram contains 10k values and therefore likely decides the percentile instead of the second histogram containing only a single value.
However, if we follow the assumption that all values in a bucket lie on the polre, we can avoid this:
Instead of increasing the sizes of the buckets of the first (high resolution) histogram, we make the buckets of the second (low-resolution) histogram smaller:
We simply use the bucket sizes of the first histogram and assign the bucket counts of the second histogram based on the polre.
For example, the `[1, 2]` bucket has the polre `1.33333`, so we replace the `[1, 2]` bucket with the `[1.328, 1.341]` bucket.
So our merged histogram maintains the higher resolution, and therefore likely gives better percentile estimates.
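A minimal sketch of this remapping under the polre assumption (names and the histogram layout are simplified for illustration; real exponential-histogram implementations work with scales and bucket indices rather than explicit boundary tuples, and the exact target-bucket boundaries depend on the indexing convention):

```python
import math

def polre(lower, upper):
    # Point of least relative error: harmonic mean of the bounds.
    return 2 * lower * upper / (lower + upper)

def remap_to_high_resolution(low_res_buckets, base):
    """Move each (lower, upper, count) bucket of a low-resolution
    histogram into the high-resolution bucket [base^i, base^(i+1))
    that contains its polre. Each count stays in exactly one target
    bucket; nothing is spread across multiple buckets."""
    merged = {}
    for lower, upper, count in low_res_buckets:
        i = math.floor(math.log(polre(lower, upper), base))
        merged[i] = merged.get(i, 0) + count
    return merged

# The [1, 2] bucket (1 value) lands in the base-1.01 bucket containing 1.33333:
merged = remap_to_high_resolution([(1.0, 2.0, 1)], base=1.01)
for i, count in merged.items():
    print(f"bucket [{1.01**i:.4f}, {1.01**(i+1):.4f}]: {count} value(s)")
```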
The same does not work if we interpolate between histogram boundaries to estimate percentiles:
In that case, we would have to distribute the counts of the `[1, 2]` bucket across all higher-resolution buckets falling in that range, which would lead to an explosion of the number of populated buckets.
## How can this happen in practice?
In OpenTelemetry, the exponential histogram resolution is configured per SDK instance. So if an SDK is misconfigured (e.g. a maximum of 10 buckets for histograms), all of its histogram metrics will have a low accuracy. This means that the accuracy drag-down would affect (a) dashboards for this SDK/service instance and (b) dashboards aggregating across SDKs/service instances. Dashboards of other SDKs/services would not be affected by the accuracy drag-down, as their queries would not include data points from the misconfigured SDK.
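To see why a low bucket limit hurts accuracy: with at most `N` buckets covering a value range whose max/min ratio is `R`, the bucket base must grow to at least `R^(1/N)`, and with it the per-bucket worst-case relative error of the polre estimate, which is `(base - 1) / (base + 1)` (following the polre derivation above). An illustrative calculation with assumed, hypothetical numbers:

```python
def worst_case_error_bound(value_range_ratio, max_buckets):
    """Worst-case relative error of the polre estimate when at most
    `max_buckets` exponential buckets must cover a value range whose
    max/min ratio is `value_range_ratio`."""
    base = value_range_ratio ** (1 / max_buckets)
    return (base - 1) / (base + 1)

# E.g. latencies spanning 1 ms to 10 s (ratio 10_000):
print(worst_case_error_bound(10_000, 10))   # ~0.43 -> ~43% worst-case error
print(worst_case_error_bound(10_000, 160))  # ~0.029 -> ~2.9% worst-case error
```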