
Added new API hdr_value_at_percentiles() as a means for redundant computation removal #90

Merged
mikeb01 merged 2 commits into HdrHistogram:master from RedisBloom:value_at_percentiles
Feb 12, 2021

Conversation

@filipecosta90
Contributor

Problem definition

Thinking about the way we do latency percentile analysis: we normally need more than one percentile computed for a given histogram at any given moment (for example p50, p95, p99, and p999).

If you look at the C function that implements the percentile calculation, you'll notice that we iterate over the counts array starting from 0 until we've reached a cumulative count of samples that represents or surpasses the requested percentile.

Sampling stack traces using perf

By profiling the hdr_value_at_percentile symbol and annotating it, we'll notice that 99% of the CPU cycles of hdr_value_at_percentile are spent on the loop that accumulates the cumulative count until it reaches or surpasses the count that represents the percentile.
[perf screenshot: perf-hdr_value_at_percentile]

[perf screenshot: perf-hdr_value_at_percentile-annotated]

This means that if we want to compute p50 and p99, we could reuse the already computed cumulative count that gives us the p50, and start from that value (instead of starting again at 0) when calculating the p99, thus eliminating the redundant computation.

Adding the value_at_percentiles API

As seen above, when computing multiple percentiles for a given histogram, we've identified a reusable partial result in hdr_value_at_percentile, precisely where threads spend most of their CPU cycles.

With that in mind, by adding a new API with the following signature and implementation, and assuming the caller follows the API prerequisites (a sorted percentiles array as input), we should be able to greatly reduce the redundant work and improve overall performance.

New API function signature

/**
 * Get the values at the given percentiles.
 *
 * @param h "This" pointer.
 * @param percentiles The ordered percentiles array to get the values for.
 * @param N Number of elements in the arrays.
 * @param values Destination array containing the values at the given percentiles.
 * @return 0 on success, ENOMEM if malloc failed.
 */
int hdr_value_at_percentiles(const struct hdr_histogram* h, const double* percentiles, const size_t N, int64_t** values);

Benchmarking the new API

Now that we've defined the new API, using the same setup code as the above microbenchmark, we can deterministically compare performance with and without the redundant computation eliminated.

New microbenchmark:

static void BM_hdr_value_at_percentile_given_array(benchmark::State &state) {
  /////////////////////
  // benchmark setup //
  /////////////////////
  srand(12345);
  const int64_t precision = state.range(0);
  const int64_t max_value = state.range(1);
  const int64_t min_value = 1;                  // lowest trackable value (assumed; not shown in the original snippet)
  const int64_t generated_datapoints = 1000000; // sample count (assumed; not shown in the original snippet)
  const double percentile_list[4] = {50.0, 95.0, 99.0, 99.9};
  std::default_random_engine generator;
  // gamma distribution, shape 1, scale 100000
  std::gamma_distribution<double> latency_gamma_dist(1.0, 100000);
  struct hdr_histogram *histogram;
  hdr_init(min_value, max_value, precision, &histogram);
  for (int64_t i = 1; i < generated_datapoints; i++) {
    int64_t number = int64_t(latency_gamma_dist(generator)) + 1;
    number = number > max_value ? max_value : number;
    hdr_record_value(histogram, number);
  }
  benchmark::DoNotOptimize(histogram->counts);
  int64_t items_processed = 0;
  ///////////////////
  // benchmark run //
  ///////////////////
  for (auto _ : state) {
    for (auto percentile : percentile_list) {
      benchmark::DoNotOptimize(hdr_value_at_percentile(histogram, percentile));
      // read/write barrier
      benchmark::ClobberMemory();
    }
    items_processed += 4;
  }
  state.SetItemsProcessed(items_processed);
  hdr_close(histogram); // release the histogram allocated in setup
}
~/HdrHistogram_c/build$ ./test/hdr_histogram_benchmark --benchmark_filter=BM_hdr_value_at_percentile --benchmark_min_time=60
2021-02-07 16:55:23
Running ./test/hdr_histogram_benchmark
Run on (40 X 1227.32 MHz CPU s)
CPU Caches:
  L1 Data 32K (x20)
  L1 Instruction 32K (x20)
  L2 Unified 1024K (x20)
  L3 Unified 28160K (x1)
Load Average: 0.83, 0.37, 0.14
---------------------------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------------------
BM_hdr_value_at_percentile_given_array/3/86400000        342337 ns       342335 ns       245263 items_per_second=11.6845k/s
BM_hdr_value_at_percentile_given_array/3/86400000000     343298 ns       343295 ns       245245 items_per_second=11.6518k/s
BM_hdr_value_at_percentile_given_array/4/86400000       3064091 ns      3064067 ns        27417 items_per_second=1.30545k/s
BM_hdr_value_at_percentile_given_array/4/86400000000    3062938 ns      3062914 ns        27379 items_per_second=1.30595k/s
BM_hdr_value_at_percentiles_given_array/3/86400000        107060 ns       107059 ns       786567 items_per_second=37.3625k/s
BM_hdr_value_at_percentiles_given_array/3/86400000000     107040 ns       107039 ns       783051 items_per_second=37.3694k/s
BM_hdr_value_at_percentiles_given_array/4/86400000       1043602 ns      1043592 ns        80462 items_per_second=3.83292k/s
BM_hdr_value_at_percentiles_given_array/4/86400000000    1043322 ns      1043312 ns        80458 items_per_second=3.83394k/s

Benefits of the hdr_value_at_percentiles() optimization as a means for redundant computation removal

As expected, by reusing the intermediate cumulative count and thereby eliminating the redundant computation, we've gone from 342 microseconds of CPU time to compute the 4 percentiles down to 107 microseconds, and from roughly 11.68K percentile calculations per second to 37.36K, a speedup of roughly 3.2X (dependent on the number of percentiles requested, the percentiles themselves, and the data distribution).
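As a quick sanity check, the speedup can be recomputed directly from the CPU-time columns reported above:

```python
# Reported CPU times (ns) for the 4-percentile workload at precision 3 and 4,
# taken from the benchmark table above.
old = {3: 342335, 4: 3064067}   # BM_hdr_value_at_percentile_given_array
new = {3: 107059, 4: 1043592}   # BM_hdr_value_at_percentiles_given_array

for precision in (3, 4):
    speedup = old[precision] / new[precision]
    print(f"precision {precision}: {speedup:.2f}x")
```

This yields about 3.2x at precision 3 and about 2.9x at precision 4, consistent with the items-per-second ratios in the table.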

The following chart visualizes the above optimization.

[chart: perf-hdr_value_at_percentile_chart]

Used platform details

The performance results provided in this PR were collected using:

  • Hardware platform: a physical HPE ProLiant DL380 Gen10 server with one Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz and a maximum memory capacity of 3 TB, with CPU frequency scaling disabled and all configurable BIOS and CPU settings set to performance.
  • OS: Ubuntu 18.04 Linux release 4.15.0-123.
  • Installed Memory: 64GB DDR4-SDRAM @ 2933 MHz
  • Compiler: gcc-10 (Ubuntu 10.1.0-2ubuntu1~18.04) 10.1.0

@mikeb01
Contributor

mikeb01 commented Feb 8, 2021

I'm happy to include this, but I think the values array should be allocated by the caller and not inside the function.

@filipecosta90
Contributor Author

array should be allocated by the caller

@mikeb01 followed the PR review and removed allocations within hdr_value_at_percentiles(). wdyt?

@mikeb01 mikeb01 merged commit c73b164 into HdrHistogram:master Feb 12, 2021
@mikeb01
Contributor

mikeb01 commented Feb 12, 2021

Thanks for the PR, merged now.
