Added new API hdr_value_at_percentiles() as a means for redudant computation removal#90
Merged
mikeb01 merged 2 commits intoHdrHistogram:masterfrom Feb 12, 2021
Merged
Conversation
Contributor
|
I'm happy to include this, but I think the |
Contributor
Author
@mikeb01 followed the PR review and removed allocations within hdr_value_at_percentiles(). wdyt? |
Contributor
|
Thanks for the PR, merged now. |
This was referenced Dec 30, 2021
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem definition
Thinking on the way we do latency percentile analysis, we normally require more than one percentile to be calculated for the given histogram at any given moment (example of p50, p95, p99, p999).
If you look at the C function that provides the percentile calculation implementation you'll notice that we iterate over the counts array starting from 0 up until the we've reached a cumulative count of samples that represents or surpasses the requested percentile.
Sampling stack traces using perf
By profiling

hdr_value_at_percentilesymbol and annotate it we'll notice that 99% of the CPU cycles ofhdr_value_at_percentileare spent on the iterator calculating the cumulative count up until we reach or surpass the count that represents the percentile.This means that if we want to compute p50 and p99, we could re-use the already precomputed comulative count that gives us the p50 and start from that value ( instead of starting again at 0 ) for calculating the p99, thus eliminating the redundant computation.
Adding the value_at_percentiles API
As seen above, when we want to compute multiple percentiles for a given histogram, we've identified a reusable partial results in
hdr_value_at_percentilethat is also where threads are spending most CPU cycles while running on-CPU.With that in mind, by adding a new API with the following signature and implementation, and assuming the user will follow the API pre-requesites (sorted percentiles array input) we should be able to deeply reduce redundant work and improve the overall function performance.
new api function signature
benchmarking the new API
Now that we've defined the new API, using the same setup code as the above microbenchmark, we should be able to deterministically and properly compare the optimization provided by eliminating the redundant computation.
New microbenchmark:
Benefits of hdr_value_at_percentiles() optimization as a means for redudant computation removal
As expected, by simply reusing intermediate computation and consequently reducing the redundant computation we've moved from taking 342 microseconds of CPU time to compute the 4 percentiles to 107 microseconds, and from a max of approximately 11.6K percentile calculations per second to 37.36K calculations per second, representing an 3.4X speedup ( dependent on both the number of percentiles to calculate and also the percentile and data distribution ).
The following chart visualize demonstrates the above optimization.
Used platform details
The performance results provided in this PR were collected using: