Skip to content

hdr_value_at_percentile 30% on cpu time reduction: reuse common computed values within highest_equivalent_value/lowest_equivalent_value/hdr_median_equivalent_value calls on hdr_next_non_equivalent_value#94

Merged
mikeb01 merged 2 commits intoHdrHistogram:masterfrom
RedisBloom:value_at_percentile_opt
Aug 13, 2021

Conversation

@filipecosta90
Copy link
Contributor

The performance problem

Using the current master commit 706a9e0.
Given the following on cpu analysis of hdr_value_at_percentile() we notice that the top hotspots all relate to move_next and within it highest_equivalent_value/lowest_equivalent_value/hdr_median_equivalent_value calls on hdr_next_non_equivalent_value() as shown on the follow top on-cpu consumers:

Function / Call Stack CPU Time (secs) Relative CPU Time Module Function (Full) Source File
hdr_size_of_equivalent_value_range 2.184 10.56% libhdr_histogram.so.6 hdr_size_of_equivalent_value_range hdr_histogram.c
move_next 1.296 6.27% libhdr_histogram.so.6 move_next hdr_histogram.c
hdr_value_at_index 1.176 5.69% libhdr_histogram.so.6 hdr_value_at_index hdr_histogram.c
hdr_median_equivalent_value 0.908 4.39% libhdr_histogram.so.6 hdr_median_equivalent_value hdr_histogram.c
get_bucket_index 0.908 4.39% libhdr_histogram.so.6 get_bucket_index hdr_histogram.c
value_from_index 0.884 4.27% libhdr_histogram.so.6 value_from_index hdr_histogram.c
get_bucket_index 0.876 4.24% libhdr_histogram.so.6 get_bucket_index hdr_histogram.c
hdr_next_non_equivalent_value 0.816 3.95% libhdr_histogram.so.6 hdr_next_non_equivalent_value hdr_histogram.c
all_values_iter_next 0.714 3.45% libhdr_histogram.so.6 all_values_iter_next hdr_histogram.c
count_leading_zeros_64 0.624 3.02% libhdr_histogram.so.6 count_leading_zeros_64 hdr_histogram.c
get_bucket_index 0.616 2.98% libhdr_histogram.so.6 get_bucket_index hdr_histogram.c
hdr_iter_next 0.56 2.71% libhdr_histogram.so.6 hdr_iter_next hdr_histogram.c

If we look closer at the following 5 methods we see there a lot of repeated computation:

int64_t hdr_size_of_equivalent_value_range(const struct hdr_histogram* h, int64_t value)
{
    int32_t bucket_index     = get_bucket_index(h, value);
    int32_t sub_bucket_index = get_sub_bucket_index(value, bucket_index, h->unit_magnitude);
    int32_t adjusted_bucket  = (sub_bucket_index >= h->sub_bucket_count) ? (bucket_index + 1) : bucket_index;
    return INT64_C(1) << (h->unit_magnitude + adjusted_bucket);
}

static int64_t lowest_equivalent_value(const struct hdr_histogram* h, int64_t value)
{
    int32_t bucket_index     = get_bucket_index(h, value);
    int32_t sub_bucket_index = get_sub_bucket_index(value, bucket_index, h->unit_magnitude);
    return value_from_index(bucket_index, sub_bucket_index, h->unit_magnitude);
}

int64_t hdr_next_non_equivalent_value(const struct hdr_histogram *h, int64_t value)
{
    return lowest_equivalent_value(h, value) + hdr_size_of_equivalent_value_range(h, value);
}

static int64_t highest_equivalent_value(const struct hdr_histogram* h, int64_t value)
{
    return hdr_next_non_equivalent_value(h, value) - 1;
}

int64_t hdr_median_equivalent_value(const struct hdr_histogram *h, int64_t value)
{
    return lowest_equivalent_value(h, value) + (hdr_size_of_equivalent_value_range(h, value) >> 1);
}

namely:

  • hdr_size_of_equivalent_value_range and lowest_equivalent_value both calculate the bucket/subbucket. Proposed optimization is to add two internal functions that take the precomputed bucket/subbucket indices.
  • within hdr_next_non_equivalent_value function, the functions highest_equivalent_value and hdr_median_equivalent_value both calculate the lowest_equivalent_value and hdr_size_of_equivalent_value_range. Proposed solution is to avoid at all duplicate computation and simplify hdr_next_non_equivalent_value.

Measured improvement

As expected, by simply reusing intermediate computation and consequently reducing the redundant computation within the internally called functions of hdr_value_at_percentile(), we've measured a reduction in the overall CPU time of ~= 30%, as show on the following chart:

image

Used platform details

The performance results provided in this PR were collected using:

  • Hardware platform: A physical HPE ProLiant DL380 Gen10 Server, with one Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz, Maximum memory capacity of 3 TB, disabling CPU Frequency Scaling and with all configurable BIOS and CPU system settings set to performance.
  • OS: Ubuntu 18.04 Linux release 4.15.0-123.
  • Installed Memory: 64GB DDR4-SDRAM @ 2933 MHz
  • Compiler: gcc-10 (Ubuntu 10.1.0-2ubuntu1~18.04) 10.1.0

…owest_equivalent_value/hdr_median_equivalent_value calls on hdr_next_non_equivalent_value()
@filipecosta90 filipecosta90 changed the title hdr_value_at_percentile 30% on cpu time reduction: reuse common computed values within highest_equivalent_value/lowest_equivalent_value/hdr_median_equivalent_value calls on hdr_next_non_equivalent_value() hdr_value_at_percentile 30% on cpu time reduction: reuse common computed values within highest_equivalent_value/lowest_equivalent_value/hdr_median_equivalent_value calls on hdr_next_non_equivalent_value() Aug 9, 2021
@filipecosta90 filipecosta90 changed the title hdr_value_at_percentile 30% on cpu time reduction: reuse common computed values within highest_equivalent_value/lowest_equivalent_value/hdr_median_equivalent_value calls on hdr_next_non_equivalent_value() hdr_value_at_percentile 30% on cpu time reduction: reuse common computed values within highest_equivalent_value/lowest_equivalent_value/hdr_median_equivalent_value calls on hdr_next_non_equivalent_value Aug 9, 2021
@mikeb01
Copy link
Contributor

mikeb01 commented Aug 11, 2021

Is ready to merge or was there some other changes that you wanted to include?

@filipecosta90
Copy link
Contributor Author

Is ready to merge or was there some other changes that you wanted to include?

@mikeb01 IMO we can merge and re-iterate over further changes. The smaller changes per review the best :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants