Conversation
Adds a `boxplot` aggregation that calculates min, max, medium and the first and the third quartiles of the given data set. Closes elastic#33112
|
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
polyfractal
left a comment
There was a problem hiding this comment.
Left some comments! Mainly around the docs, and a few bits and pieces elsewhere. This looks good!
| [source,js] | ||
| -------------------------------------------------- | ||
| { | ||
| "boxplot": { |
There was a problem hiding this comment.
I wonder, should we snake_case this to box_plot? I personally don't really like snake casing but it would be more consistent (date_histogram, significant_terms, etc).
There was a problem hiding this comment.
Wikipedia states both spellings and boxplot was shorter, but I don't have strong feelings about it.
There was a problem hiding this comment.
In that case, boxplot it is. I like shorter too
| -------------------------------------------------- | ||
| { | ||
| "boxplot": { | ||
| "buckets_path": "my_cardinality_agg" |
There was a problem hiding this comment.
Typo re: "buckets_path" instead of "field"?
| -------------------------------------------------- | ||
| // TESTRESPONSE[s/\.\.\./"took": $body.took,"timed_out": false,"_shards": $body._shards,"hits": $body.hits,/] | ||
|
|
||
| As you can see, the aggregation will return a calculated value for each percentile |
There was a problem hiding this comment.
Looks like copy pasta leftovers :)
| ==== Script | ||
|
|
||
| The boxplot metric supports scripting. For example, if our load times | ||
| are in milliseconds but we want percentiles calculated in seconds, we could use |
There was a problem hiding this comment.
hmm, I am not sure how to call them here. Technically they are percentiles :) Values?
There was a problem hiding this comment.
Whoops, yeah this is probably fine. Was thinking it was a copy/paste leftover.
|
|
||
| <1> Compression controls memory usage and approximation error | ||
|
|
||
| The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the |
There was a problem hiding this comment.
I'm assuming this is the same as the Percentiles page? Maybe we should link to it instead? Or maybe there's a fancy way to embed a snippet from a different page? @nik9000 do you happen to know?
There was a problem hiding this comment.
| iw.addDocument(singleton(new NumericDocValuesField("number", 1))); | ||
| }, (Consumer<InternalGlobal>) global -> { | ||
| assertEquals(1.0, global.getDocCount(), 2); | ||
| assertEquals(2, global.getDocCount()); |
| import java.util.List; | ||
| import java.util.Map; | ||
|
|
||
| public class BoxplotAggregator extends NumericMetricsAggregator.MultiValue { |
There was a problem hiding this comment.
Out of curiosity, did you look into extending AbstractTDigestPercentilesAggregator? It might avoid some code duplication, but otoh it probably does more than you want/need. No strong feelings here.... the percentile/ranks code is already a complicated hierarchy so might be easier to just keep this separate.
There was a problem hiding this comment.
I wonder if it'd be cleaner to make a class they could both delegate their common behavior to? It'd be cool to do that in a follow up change if it is tricky.
There was a problem hiding this comment.
Yeah, noodled on this a bit and it's probably too much for this PR (especially with #51887 potentially in the works)
The main difficulty is that percentiles/ranks share TDigest and HDRHisto agg base classes, and this would just share with TDigest, so it'd be a fork in the tree in the middle somewhere.
AbstractTDigestBase --> AbstractTDigestPercentiles --> Percentiles
\--> Ranks
\--> Boxplot
Unsure if the extra complexity is worth it or not. 🤷♂️ But probably too much complexity for this PR.
There was a problem hiding this comment.
I think it is up to @imotov but I'm fine waiting on it.
There was a problem hiding this comment.
I looked at it when it was still IQR and it was immediate dead-end from the inheritance perspective because of IQR was a single value/multivalue agg. It might be a bit easier now but I agree with @nik9000 and still think we should do some refactoring by composition (instead of inheritance) here. However, I would prefer to do it in another PR.
| if (super.equals(obj) == false) return false; | ||
|
|
||
| InternalBoxplot that = (InternalBoxplot) obj; | ||
| return Objects.equals(state, that.state); |
There was a problem hiding this comment.
Should we include format in this equality check?
Philosophically, I wonder if we should check equality of the boxplot components instead of the state? Like, are two boxplots identical if min/max/q1/q2/q3 are the same but the rest of the distribution is different?
I really dunno heh :)
There was a problem hiding this comment.
It is already included in super.
| expected.add(input.state()); | ||
| } | ||
| assertNotNull(expected); | ||
| // The final calculated result may very depending on the order, which requires higher delta |
There was a problem hiding this comment.
Is this comment leftover? E.g. the delta looks like it's zero?
|
|
||
| <1> Compression controls memory usage and approximation error | ||
|
|
||
| The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the |
There was a problem hiding this comment.
|
|
||
| import org.elasticsearch.search.aggregations.metrics.NumericMetricsAggregation; | ||
|
|
||
| public interface Boxplot extends NumericMetricsAggregation.MultiValue { |
There was a problem hiding this comment.
Hmm, I'm not sure tbh. We might want to keep them around as a bridge to the HLRC? E.g. InternalMin and ParsedMin both implement Min, and it might help us make sure they don't diverge?
Or maybe useless bloat now that the TC is gone and we can get rid of them... but probably better to investigate that in a separate PR and apply it to all aggs if we decide we can remove them?
There was a problem hiding this comment.
The HLRC won't have Boxplot on the classpath because it doesn't have the analytics module. With top_metrics I just pointed folks to ParsedTopMetrics. I'm fine either way. It is just that when I made my new agg I figured I could save myself a little work an not make it.
There was a problem hiding this comment.
Argh, you're right. Keep forgetting this is in xpack. 👍 probably not necessary given the current situation
| BoxplotAggregationBuilder> { | ||
| public static final String NAME = "boxplot"; | ||
|
|
||
| public static final ParseField COMPRESSION_FIELD = new ParseField("compression"); |
There was a problem hiding this comment.
I wonder if it'd be better to use the field from the PercentilesAggregationBuilder. That at least is a hint that it means the same thing.
| static { | ||
| PARSER = new ObjectParser<>(BoxplotAggregationBuilder.NAME); | ||
| ValuesSourceParserHelper.declareNumericFields(PARSER, true, true, false); | ||
| PARSER.declareDouble((builder, compression) -> builder.compression = compression, COMPRESSION_FIELD); |
There was a problem hiding this comment.
I think if you've declared the setter you may as well use it here.
There was a problem hiding this comment.
So BoxplotAggregationBuilder::compression
There was a problem hiding this comment.
Also that'd gt your validation too.
| import java.util.List; | ||
| import java.util.Map; | ||
|
|
||
| public class BoxplotAggregator extends NumericMetricsAggregator.MultiValue { |
There was a problem hiding this comment.
I wonder if it'd be cleaner to make a class they could both delegate their common behavior to? It'd be cool to do that in a follow up change if it is tricky.
|
|
||
| enum Metrics { | ||
|
|
||
| min, max, q1, q2, q3; |
There was a problem hiding this comment.
Could you name these in SHOUTING_SNAKE_CASE like normal Java enums? Is there a super compelling reason not to?
| @Override | ||
| public double value(String name) { | ||
| Metrics metrics = Metrics.valueOf(name); | ||
| switch (metrics) { |
There was a problem hiding this comment.
Would you be making a private method on Metrics that takes this as a parameter and calls the right method instead?
| return new InternalBoxplot(name, merged, format, pipelineAggregators(), metaData); | ||
| } | ||
|
|
||
| static class Fields { |
There was a problem hiding this comment.
I thought that these Fields objects had gone out of favor. But I'm not really in the loop.
|
@elasticmachine test this please |
|
I was wondering if |
|
@jtibshirani excellent idea! It doesn't support it at the moment. I will open a follow up PR to add this support. |
polyfractal
left a comment
There was a problem hiding this comment.
I'm happy with it, lgtm!
Refactors BoxplotAggregator to reuse parts that are already available in AbstractTDigestPercentilesAggregator Follow up for elastic#51948
Relates: elastic/elasticsearch#51948 This commit implements the boxplot aggregation. Integration tests run against XPackCluster because it requires a license to use.
Adds a
boxplotaggregation that calculates min, max, medium and the firstand the third quartiles of the given data set.
Closes #33112