Implement stats aggregation for string terms#47468
Conversation
|
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
StringStatsAggregatorTests#testSingleValuedFieldFormatter fails because of elastic#47469
polyfractal
left a comment
There was a problem hiding this comment.
Left some comments, think it looks good!
Needs some documentation + doc tests. Let me know if you have questions about the doc tests, they are a bit funky :)
| import org.elasticsearch.xpack.analytics.stringstats.StringStatsAggregationBuilder; | ||
|
|
||
| public class DataScienceAggregationBuilders { | ||
| public class AnalyticsAggregationBuilders { |
There was a problem hiding this comment.
Whoops, good catch, thanks for the fix :)
There was a problem hiding this comment.
And the embarrassing typo below heh :)
| } | ||
|
|
||
| static class Fields { | ||
| public static final String COUNT = "count"; |
There was a problem hiding this comment.
Let's use ParseField for these. ParseField has some extra functionality to handle deprecations/renaming, in case we ever decide to change the string values.
| case avg_length: return this.getAvgLength(); | ||
| case entropy: return this.getEntropy(); | ||
| default: | ||
| throw new IllegalArgumentException("Unknown value [" + name + "] in common stats aggregation"); |
There was a problem hiding this comment.
"common stats" left over from a different agg?
There was a problem hiding this comment.
Right, it should read string stats. Fixed.
...ytics/src/main/java/org/elasticsearch/xpack/analytics/stringstats/StringStatsAggregator.java
Show resolved
Hide resolved
| // Parse string chars and count occurrences | ||
| for (Character c : valueStr.toCharArray()) { | ||
| LongArray occ = charOccurrences.get(c); | ||
| final long overSize = BigArrays.overSize(bucket + 1); |
There was a problem hiding this comment.
It's not a terribly expensive call, but we should be able to move this up and out of the valuesCount loop I think? That way we only calculate the bigarrays size once instead of for each character in each value?
|
@elasticmachine run elasticsearch-ci/packaging-sample-matrix |
polyfractal
left a comment
There was a problem hiding this comment.
Code LGTM, pending docs. /cc @not-napoleon who I volunteered to review the docs while I'm out 😁
|
@elasticmachine run elasticsearch-ci/default-distro |
|
@elasticmachine run elasticsearch-ci/default-distro |
not-napoleon
left a comment
There was a problem hiding this comment.
Couple of nits, but looks good to me overall.
| * @return A map with the character as key and the probability of | ||
| * this character to occur as value. The map is ordered by frequency descending. | ||
| */ | ||
| public Map<String, Double> getDistribution() { |
There was a problem hiding this comment.
Nit: Does this need to be public? Looked to me like it was only called within the package
|
|
||
| public Object value(String name) { | ||
| Metrics metrics = Metrics.valueOf(name); | ||
| switch (metrics) { |
There was a problem hiding this comment.
Bit of a nit, but I don't love switching on an enum. If someone later adds a field to the enum, they need to remember to also update this switch. IMHO, a better solution would be to put a method on the enum getFieldValue(InternalStringStats stats) which could then call the appropriate getter and return the value. That way, any new enum value would need to implement the method for it to compile.
|
|
||
| @Override | ||
| public ScoreMode scoreMode() { | ||
| return valuesSource != null && valuesSource.needsScores() ? ScoreMode.COMPLETE : ScoreMode.COMPLETE_NO_SCORES; |
There was a problem hiding this comment.
Please add some parenthesis around the predicate here. Having to remember that && is higher precedence than ?: is unnecessary cognitive load, so parenthesis will make it more readable, even if they are technically redundant.
|
@elasticmachine run elasticsearch-ci/bwc |
Backport of #47468 to 7.x This PR adds a new metric aggregation called string_stats that operates on string terms of a document and returns the following: min_length: The length of the shortest term max_length: The length of the longest term avg_length: The average length of all terms distribution: The probability distribution of all characters appearing in all terms entropy: The total Shannon entropy value calculated for all terms This aggregation has been implemented as an analytics plugin.
This PR adds a new metric aggregation called
string_statsthat operates on string terms of a document and returns the following:min_length: The length of the shortest termmax_length: The length of the longest termavg_length: The average length of all termsdistribution: The probability distribution of all characters appearing in all termsentropy: The total Shannon entropy value calculated for all termsThis aggregation has been implemented as an analytics plugin.