Calculate sum in Kahan summation algorithm in aggregations (#27807)#27848
Calculate sum in Kahan summation algorithm in aggregations (#27807)#27848jpountz merged 8 commits intoelastic:masterfrom
Conversation
|
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
jpountz
left a comment
There was a problem hiding this comment.
I left some comments to the avg aggregation that applies to other aggregations as well. It looks good to me. I think we should also modify the InternalSum, InternalAvg, ... aggregations to store the compensation as well, so that merging sums that come from multiple shards retains more accuracy. Finally I think we need to modify the extended stats aggregation as well?
| DoubleArray sums; | ||
| DocValueFormat format; | ||
|
|
||
| private DoubleArray compensations; |
There was a problem hiding this comment.
please declare it next to sums and using the same modifiers
| double corrected = values.nextValue() - compensation; | ||
| double newSum = sum + corrected; | ||
| compensation = (newSum - sum) - corrected; | ||
| sum = newSum; |
There was a problem hiding this comment.
can you add a comment saying this is Kahan summation?
| } | ||
| assertEquals(expectedSum, reduced.getValue(), 0.000d); | ||
| } | ||
|
|
There was a problem hiding this comment.
Why I have to modify this test case is because if there is a Double.NEGATIVE_INFINITY in the values to sum, Kahan summation with get Double.NaN, and naive summation will get Double.NEGATIVE_INFINITY, I'm not sure if this use case should be supported.
There was a problem hiding this comment.
We started rejecting this on new indices recently, but old indices might still have infinities. I think we should have special handling for these values in the aggregators?
|
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
jpountz
left a comment
There was a problem hiding this comment.
Thank you, this looks almost good to me. I left some minor comments on the avg aggregation that apply to other aggs as well. Could you also add some tests for the Infinity/NaN corner-cases, including things like summing up a few finite doubles that are so large that the sum is infinite?
| for (int i = 0; i < valueCount; i++) { | ||
| sum += values.nextValue(); | ||
| double value = values.nextValue(); | ||
| if (Double.isNaN(value) || Double.isInfinite(value)) { |
There was a problem hiding this comment.
We could test both at once by doing Double.isFinite(value) == false?
| sum += value; | ||
| if (Double.isNaN(sum)) | ||
| break; | ||
| } else if (Double.isFinite(sum)) { |
There was a problem hiding this comment.
this is always true I think? So we could make it an else block?
There was a problem hiding this comment.
The check statement is needed. For example when summing up [Double.POSITIVE_INFINITY, 1, 2, 3], if no this check, we'll got NaN, and we expect Double.POSITIVE_INFINITY here.
| double value = values.nextValue(); | ||
| if (Double.isNaN(value) || Double.isInfinite(value)) { | ||
| sum += value; | ||
| if (Double.isNaN(sum)) |
There was a problem hiding this comment.
I would not try to break once the sum is NaN, this doesn't bring much imo
| sum += ((InternalAvg) aggregation).sum; | ||
| InternalAvg avg = (InternalAvg) aggregation; | ||
| count += avg.count; | ||
| if (Double.isNaN(sum) == false) { |
There was a problem hiding this comment.
I would remove this if statement to keep things simple?
af772c4 to
1f07cca
Compare
|
Hi @jpountz Thanks for your comments. They're really useful to me. 👍 When I trying to add test case for some corner cases for The result of |
|
Sorry for the long time with no response. I just fetched your code, and this is due to the fact that you use |
b6f5984 to
176be1f
Compare
jpountz
left a comment
There was a problem hiding this comment.
I think I found one corner case, but other than that, it looks good to me!
| double newSum = sum + corrected; | ||
| compensation = (newSum - sum) - corrected; | ||
| sum = newSum; | ||
| } |
There was a problem hiding this comment.
I think one corner-case is not covered with this logic. Imagine that all values are finite but 2: -Inf and +Inf, for instance:
[+Inf, 4, -Inf]. The expected result is NaN but your logic will make it return +Inf if I'm not mistaken. Maybe this should be something like that instead:
if (Double.isFinite(value) == false || Double.isFinite(sum) == false) {
sum += value;
} else {
// kahan summation
}There was a problem hiding this comment.
Hi @jpountz Thanks a lot! I made a test for both of them:
double sum = 0;
double compensation = 0;
double[] values = new double[]{Double.POSITIVE_INFINITY, 4, Double.NEGATIVE_INFINITY};
for (int i = 0; i < values.length; i++) {
double value = values[i];
if (Double.isFinite(value) == false) {
sum += value;
} else if (Double.isFinite(sum)) {
double corrected = value - compensation;
double newSum = sum + corrected;
compensation = (newSum - sum) - corrected;
sum = newSum;
}
}
System.out.println(sum);
sum = 0;
compensation = 0;
for (int i = 0; i < values.length; i++) {
double value = values[i];
if (Double.isFinite(value) == false || Double.isFinite(sum) == false) {
sum += value;
} else {
double corrected = value - compensation;
double newSum = sum + corrected;
compensation = (newSum - sum) - corrected;
sum = newSum;
}
}
System.out.println(sum);
Both of their result are NaN. Because in my code, for each value, if it's not finite, it will be summed up to sum, no matter what sum is. I can also made this change because I also think your way is more intuitive. I also have a random test case for this kind corner-case:
int n = randomIntBetween(5, 10);
values = new double[n];
double sum = 0;
for (int i = 0; i < n; i++) {
values[i] = frequently()
? randomFrom(Double.NaN, Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY)
: randomDoubleBetween(Double.MIN_VALUE, Double.MAX_VALUE, true);
sum += values[i];
}
verifyAvgOfDoubles(values, sum / n, 1e-10);
I can make it be regular for example:
double[] values = new double[]{Double.POSITIVE_INFINITY, 4, Double.NEGATIVE_INFINITY};
verifyAvgOfDoubles(values, NaN, 0d);
WDYT?
There was a problem hiding this comment.
I had misread your code, sorry! I am fine either way.
There was a problem hiding this comment.
Let's keep things the way they are.
|
@elasticmachine please test it |
* master: Trim down usages of `ShardOperationFailedException` interface (#28312) Do not return all indices if a specific alias is requested via get aliases api. [Test] Lower bwc version for rank-eval rest tests CountedBitSet doesn't need to extend BitSet. (#28239) Calculate sum in Kahan summation algorithm in aggregations (#27807) (#27848) Remove the `update_all_types` option. (#28288) Add information when master node left to DiscoveryNodes' shortSummary() (#28197) Provide explanation of dangling indices, fixes #26008 (#26999)
* 6.x: Trim down usages of `ShardOperationFailedException` interface (#28312) Clean up commits when global checkpoint advanced (#28140) Do not return all indices if a specific alias is requested via get aliases api. CountedBitSet doesn't need to extend BitSet. (#28239) Calculate sum in Kahan summation algorithm in aggregations (#27807) (#27848)
* master: (94 commits) Completely remove Painless Type from AnalyzerCaster in favor of Java Class. (elastic#28329) Fix spelling error Reindex: Wait for deletion in test Reindex: log more on rare test failure Ensure we protect Collections obtained from scripts from self-referencing (elastic#28335) [Docs] Fix asciidoc style in composite agg docs Adds the ability to specify a format on composite date_histogram source (elastic#28310) Provide a better error message for the case when all shards failed (elastic#28333) [Test] Re-Add integer_range and date_range field types for query builder tests (elastic#28171) Added Put Mapping API to high-level Rest client (elastic#27869) Revert change that does not return all indices if a specific alias is requested via get alias api. (elastic#28294) Painless: Replace Painless Type with Java Class during Casts (elastic#27847) Notify affixMap settings when any under the registered prefix matches (elastic#28317) Trim down usages of `ShardOperationFailedException` interface (elastic#28312) Do not return all indices if a specific alias is requested via get aliases api. [Test] Lower bwc version for rank-eval rest tests CountedBitSet doesn't need to extend BitSet. (elastic#28239) Calculate sum in Kahan summation algorithm in aggregations (elastic#27807) (elastic#27848) Remove the `update_all_types` option. (elastic#28288) Add information when master node left to DiscoveryNodes' shortSummary() (elastic#28197) ...
Currently, when computing sum of double values in aggregators, all values are summed up in naive summation. Kahan summation is a compensated summation algorithm which is more accurate for double values. However for NaN and infinities, Kahan summation will not get the same result as naive summation.
Closes #27807