[ES|QL] Add a standard deviation function#116531
Conversation
Uses Welford's online algorithm, as well as the parallel version, to calculate standard deviation.
|
Documentation preview: |
| final long count = state.count(); | ||
| final double m2 = state.m2(); | ||
| if (count == 0 || Double.isFinite(m2) == false) { | ||
| return driverContext.blockFactory().newConstantNullBlock(1); |
There was a problem hiding this comment.
If the result is infinity or NaN I set it to return null, but I'm not sure if there should be a warning or something similar printed (or where that would best be done)?
|
|
||
| import static java.util.Collections.emptyList; | ||
|
|
||
| public class StdDeviation extends AggregateFunction implements ToAggregator { |
There was a problem hiding this comment.
I wasn't sure what best to name this. There were a few options: StdDeviation, StandardDeviation, or Stdev (or maybe even Stddev)
There was a problem hiding this comment.
I think StandardDeviation is better, although it's an internal name, and we can change it anytime.
There was a problem hiding this comment.
What about for the name of the ES|QL function? Right now it's std_deviation, I feel like std_dev maybe works better? I worry that standard_deviation is kind of long
There was a problem hiding this comment.
Should the internal name be changed to StdDev to match?
There was a problem hiding this comment.
I prefer StandardDeviation for the class name, but feel free to choose whichever name you prefer.
There was a problem hiding this comment.
I think it might be easier to change both; I tried changing only the function name but it looks like the name of the tests class is used for generating the docs so I think it might be simpler to use StdDev for the class name as well
There was a problem hiding this comment.
You can annotate the test class with the function name if it does not match cleanly. For example, see this class SpatialIntersectsTest which tests ST_INTERSECTS: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/src/test/java/org/elasticsearch/xpack/esql/expression/function/scalar/spatial/SpatialIntersectsTests.java#L23
...rc/main/generated-src/org/elasticsearch/compute/aggregation/StdDeviationFloatAggregator.java
Outdated
Show resolved
Hide resolved
|
@limotova I extracted values from the serverless test and combined them to reproduce the test failure. Using three batches, the final value is 0.23282704603226836, while with a single batch, the value is 0.22797190865484734. Could you check if this discrepancy is acceptable? public void testBasic() {
double[] v1 = {1.97, 2.0, 1.57, 1.48, 1.77};
double[] v2 = {2.1, 1.74, 1.96, 1.42, 1.59, 2.07, 1.81, 1.59, 1.44, 2.03, 1.81};
double[] v3 = {2.03, 1.54, 1.55};
WelfordAlgorithm a1 = new WelfordAlgorithm();
for (double v : v1) {
a1.add(v);
}
WelfordAlgorithm a2 = new WelfordAlgorithm();
for (double v : v2) {
a2.add(v);
}
WelfordAlgorithm a3 = new WelfordAlgorithm();
for (double v : v3) {
a3.add(v);
}
WelfordAlgorithm merged = new WelfordAlgorithm();
for (WelfordAlgorithm a : List.of(a3, a2, a1)) {
merged.add(a.mean(), a.m2(), a.count());
}
System.err.println("--> merged = " + merged.evaluate());
WelfordAlgorithm single = new WelfordAlgorithm();
for (double v : v1) {
single.add(v);
}
for (double v : v2) {
single.add(v);
}
for (double v : v3) {
single.add(v);
}
System.err.println("--> single = " + single.evaluate());
} |
|
|
||
| import static java.util.Collections.emptyList; | ||
|
|
||
| public class StdDeviation extends AggregateFunction implements ToAggregator { |
There was a problem hiding this comment.
I think StandardDeviation is better, although it's an internal name, and we can change it anytime.
...gin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/StdDeviationStates.java
Outdated
Show resolved
Hide resolved
| } | ||
|
|
||
| public double evaluate() { | ||
| return count < 2 ? 0 : Math.sqrt(m2 / count); |
There was a problem hiding this comment.
I went with count because I believe we use the population standard deviation elsewhere but I can change it if we'd prefer sample standard deviation?
| * <a href="https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm"> | ||
| * Parallel algorithm</a> | ||
| */ | ||
| public final class WelfordAlgorithm { |
There was a problem hiding this comment.
Should we fold this class into the StdDeviationStates#SingleState?
There was a problem hiding this comment.
I wasn't sure if we wanted to use it elsewhere (like if we wanted to support variance or have both sample and population standard deviation)?
There was a problem hiding this comment.
Feels like the kind of thing that could move moved to a common place like libs/, but while there is only one usage, it might as well stay here for now.
...lugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/WelfordAlgorithm.java
Outdated
Show resolved
Hide resolved
dnhatn
left a comment
There was a problem hiding this comment.
I have two optional comments, but overall, I think the PR is ready. LGTM! However, I’d love to have another review from the ES|QL team. Great work, thanks Larisa!
...ck/plugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/StdDevStates.java
Outdated
Show resolved
Hide resolved
.../esql/compute/src/main/java/org/elasticsearch/compute/aggregation/X-StdDevAggregator.java.st
Outdated
Show resolved
Hide resolved
ivancea
left a comment
There was a problem hiding this comment.
Looks good! Added some suggestions around tests mostly
...ql/src/test/java/org/elasticsearch/xpack/esql/expression/function/aggregate/StdDevTests.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/esql/qa/testFixtures/src/main/resources/stats.csv-spec
Outdated
Show resolved
Hide resolved
...ck/plugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/StdDevStates.java
Outdated
Show resolved
Hide resolved
...ck/plugin/esql/compute/src/main/java/org/elasticsearch/compute/aggregation/StdDevStates.java
Outdated
Show resolved
Hide resolved
|
Flyby feedback:
|
ivancea
left a comment
There was a problem hiding this comment.
After adding Andrei's suggestions, LGTM!
docs/changelog/116531.yaml
Outdated
| @@ -0,0 +1,5 @@ | |||
| pr: 116531 | |||
| summary: "[ES|QL] Add a standard deviation function" | |||
There was a problem hiding this comment.
Remove the prefix [ES|QL]. The changelog is organized by area, which already says it is ES|QL.
| * <a href="https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm"> | ||
| * Parallel algorithm</a> | ||
| */ | ||
| public final class WelfordAlgorithm { |
There was a problem hiding this comment.
Feels like the kind of thing that could move moved to a common place like libs/, but while there is only one usage, it might as well stay here for now.
| tag = "docsStatsStdDevNestedExpression" | ||
| ) } | ||
| ) | ||
| public StdDev(Source source, @Param(name = "number", type = { "double", "integer", "long" }) Expression field) { |
There was a problem hiding this comment.
The params claims to only support the numeric types, but there is no resolveType function to enforce this. It should be similar to the Avg function: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/esql/src/main/java/org/elasticsearch/xpack/esql/expression/function/aggregate/Avg.java#L57
I tried to add in tests with
It seems to work with row as far as I can tell, but I'm not sure I understand what I should be looking for when I add |
| */ | ||
| @Aggregator( | ||
| { | ||
| @IntermediateState(name = "mean", type = "DOUBLE"), |
There was a problem hiding this comment.
We need to specify that these are blocks or we need to always emit them as vectors (with count=0)?
There was a problem hiding this comment.
Thank you! I changed it to returning 0 in the intermediate stages
💚 Backport successful
|
Uses Welford's online algorithm, as well as the parallel version, to calculate standard deviation.
Uses Welford's online algorithm, as well as the parallel version, to calculate standard deviation.
Uses Welford's online algorithm, as well as the parallel version, to
calculate standard deviation.