Float histograms: implement methods for Add/Sub operations using Kahan summation #15687
Conversation
@KofClubs is also working on this (or maybe the work has stalled?). Just FYI.

Yes, I know. (See the TODO here.) I'm not sure
Force-pushed from 7790b58 to 7ada325
I know, I guess the work has stalled, as there hasn't been activity for a long time. If that's not the case, then I'm OK to step aside and not interfere. Otherwise I'm ready to keep on.
Force-pushed from 786179b to 8cfa0d5
Thank you very much, I'll have a detailed look ASAP (which is probably not very soon, as I'm about to disappear into my holiday break). If anyone wants to review this in the meantime, be my guest. I haven't heard anything from @KofClubs, so let's hope we don't have redundant work happening here.

Back from my holiday break, but swamped by the backlog. I hope I'll get to this next week.

I just received an email about @crush-on-anechka's good work. I'm sorry that I put this work on hold due to the long office hours every day :( I shall unassign the corresponding issue.

Sorry, I caught a cold and got less done than planned. I need to ask for more patience. (And of course, if somebody else beats me to the review, be my guest.)

That's because I still haven't made up my mind how to treat counter resets for subtraction. I'll add that once I have an idea.

Let's not do it for now. I don't expect the counts in the merged buckets to be hugely different, and usually we'll only add a few buckets anyway.
Thank you very much. Generally looks good.
Special thanks for all the careful thought you put into this.
It's a bit sad that we need to duplicate so much logic, but I guess it cannot be prevented most of the time. (For the few occasions where we can, I have left a comment.)
It would be good to already wire the new methods into PromQL (for addition, subtraction, the sum aggregation, etc.; see more in a later comment). This will give us a lot more test coverage (via the PromQL tests), and we could even easily try out a few things in the PromQL tests to prove that the Kahan summation actually helps.
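To make the suggestion concrete, here is a sketch of what such a PromQL test could look like in the promqltest format. The metric name, bucket layouts, and expected output are made up for illustration and are not taken from the PR; the exact native-histogram literal syntax should be checked against the testdata files referenced later in this conversation.

```
load 5m
  metric{group="a"} {{schema:0 sum:10 count:5 buckets:[1 2 2]}}
  metric{group="b"} {{schema:0 sum:5 count:3 buckets:[1 1 1]}}

eval instant at 0m sum(metric)
  {} {{schema:0 sum:15 count:8 buckets:[2 3 3]}}
```

A test exercising the Kahan path would load histograms whose bucket counts differ by many orders of magnitude, analogous to the float test cases mentioned below.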
A few more thoughts: We currently don't do Kahan summation for the float binary operators either. Kahan summation is currently only used in certain aggregations, the corresponding "over time" functions, and some specific functions. Many of those only act on float samples. The only aggregations and functions we will need Kahan summation for histograms for will be:

As far as I can see, we do not need
@beorn7 thank you for the comments! I will be able to proceed in a week, if that's fine.
Force-pushed from 8cfa0d5 to ca0e4a2
@beorn7 I fixed the issues. I also replaced Add with KahanAdd in engine.go/aggregation and the corresponding functions as you mentioned, but I'm a little bit confused about how to handle errors when adding up the accumulated receiving histogram and the compensation histograms (lines 3129 and 3161), since handleAggregationError demands metricName. One solution could be to store the metric name in groupedAggregation so we would have access to it, but I'm not sure.
Thanks for all the updates. I'll have a look ASAP. |
Still caught up in other things. Hopefully I'll get to this before this week ends.
More apologies. I'm still drawn away from more substantial code reviews by other things. This is still on my list to do ASAP. |
beorn7
left a comment
Thank you very much for your patience and all the epic work.
This looks very good in general. Among all the comments, there are two substantial topics:
- Handling of the zero bucket (I assume that's not yet correct, see comment for details).
- Like for floats, we should first try the simple average calculation and only fall back to the incremental average calculation if we run into overflow problems.
(About the 2nd point: Without Kahan summation, the incremental average calculation also helps to avoid the problem of adding a huge number (the running sum) to a small number (a new sample). However, Kahan summation solves that in a different way, so now the incremental calculation is only needed if the running sum would overflow float64. If that's not the case, it's just additional work and additional inaccuracies. See #14413 for some historical context.)
We need to address (1) in this PR, but you could leave TODOs about (2) if you feel it's too much for this PR.
Thanks to wiring the Kahan summation into PromQL already, we get some test coverage for free. However, let's add some more tests for histograms to make sure everything works as expected. All of these tests should be done via the PromQL testing framework, so they are easy to write (just using PromQL, no Go code required). You could take inspiration from the float tests, e.g. https://github.com/prometheus/prometheus/blob/61aa82865d9c8474393bbbdcd539c58f63ba514f/promql/promqltest/testdata/aggregators.test#L612C1-L617C1 would not work without Kahan summation, and https://github.com/prometheus/prometheus/blob/61aa82865d9c8474393bbbdcd539c58f63ba514f/promql/promqltest/testdata/aggregators.test#L576C1-L585C1 exercises the incremental avg calculation.
It would also be good to construct a test that exposes the (suspected) bug with the zero bucket, see point (1) above.
@beorn7 Thanks for the feedback! I'll analyze the information and update the PR ASAP.
Force-pushed from 7655f25 to 9d82d2c
beorn7
left a comment
Here are my thoughts. The zero bucket handling is really tough… WDYT?

To go even further down the rabbit hole: When reconciling zero buckets, we might be merging buckets into the zero bucket. Even when ignoring zero buckets, we will still merge buckets if we sum up histograms of different resolution. If we merge buckets, we add numbers. And we might need to add a very small number to a very large one. So we should also use Kahan summation in this case, given that we have the compensation value available anyway. However, now we don't add one number to another one with its compensation value; we add two numbers, each with its own compensation term. Let's say bucket b1 with c1, and bucket b2 with c2. We need to check the math, but I guess it will be something like first doing […]. For the merge between regular buckets, I guess this can easily be added to […]
@beorn7 regarding the rabbit hole - I see where you're coming from, I'll take care of it |
Force-pushed from 08ad456 to a54b370
Force-pushed from 572fe99 to f62b9b6
histogram_avg_incremental_2:

My spontaneous thought about the rounding: This might work out in our (artificial) test cases, where we do expect round numbers, but in general, the rounding can only reduce accuracy rather than increase it. It's inevitable that the accuracy of the float operations depends on their order, and we have decided so far not to go down the rabbit hole and start reordering operations for improved accuracy (see comments), so we have to accept that the accuracy "randomly" depends on whatever order happens to be used.
About the counter reset hint: Yes, we only recently improved the testing framework to actually check for the counter reset hint. This explains the test failures after the rebase. |
beorn7
left a comment
All looks good now, except the rounding. As already mentioned above, I have a really bad feeling about this. While I cannot provide an explicit proof, I'm pretty sure that we don't really improve accuracy by rounding. We are merely making it more likely to get "round" results, which might be the reason why our test cases (which expect round results) pass with the rounding.
The motivation for such a trick should be the other way around: We should only do it if we understand mathematically why we are doing it.
And since we don't, I propose to remove the rounding again and comment out the test cases that fail because of that, while adding an explanation that the precise result depends on the order of operations, which is unpredictable.
Force-pushed from f62b9b6 to 831e5db
You’re right, I overlooked that the main goal here is to improve accuracy. Rounding intermediate results only to make an artificial test pass is not an effective approach.
Great. I'll give this a final look tomorrow, and then we are hopefully and finally done. My apologies that this took so long, but I think we have learned a lot on the journey. One final question: I appreciate your effort to keep the commits meaningful for reviewing. However, I would not like to merge the now 40 commits into main as-is. My idea is to just squash them all into one big "add Kahan summation for native histograms" commit. For example, we don't need to have the journey of adding a KahanSub and then removing it again in main. However, if you feel strongly about it, let me know and then we can check if there is a meaningful grouping of commits to do.

I’m totally fine with squashing commits into one on merge. The only reason I tried to keep them clean was to make the review easier.
beorn7
left a comment
All good (as far as I can see). Let's just be a bit more generous with the comments on deactivated tests.
beorn7
left a comment
🎉
Thanks again.
I'm currently re-running the tests. The failure seemed to have been a flaky test. Will merge on green.
I received an email saying that the CI had failed, but I decided to leave it until the morning. Happy to have it all done, thank you for all your guidance!
* promql: use Kahan summation for Native Histograms (#15687) As for float samples, Kahan summation is used for the `sum` and `avg` aggregation and for the respective `_over_time` functions. Kahan summation is not perfect. This commit also adds tests that even Kahan summation cannot reliably pass. These tests are commented out. Note that the behavior might be different on other hardware platforms. We have to keep an eye on tests failing on other hardware platforms and adjust them accordingly. Signed-off-by: Aleksandr Smirnov <5targazer@mail.ru>
This PR addresses issue #14105 and PR #14325 maintained by @beorn7. While the issue has an assignee, I decided to contribute out of personal interest as there hasn’t been recent activity.
Overview:
Two new methods on FloatHistogram, kahanAdd and kahanSub, replicate the logic of the existing Add and Sub methods but incorporate the Kahan summation algorithm. The methods take an initialized compensation histogram as input and return both the modified receiving histogram and the updated compensation histogram. The resulting histograms must then be combined using the original Add/Sub methods. Usage examples can be seen in the modified TestFloatHistogramAdd and TestFloatHistogramSub test cases.
Details:
Food for thought: