Improve synthetic source for tdigest field#138121
Conversation
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
JonasKunz
left a comment
There was a problem hiding this comment.
LGTM, just a nit and more of a question about min/max estimation, but that can be addressed in a follow-up if needed.
| - match: | ||
| _source: | ||
| latency: | ||
| # Note that we're storing 0.1 as the min, even though it's count is 0. |
There was a problem hiding this comment.
Does it make sense to include centroids in the min/max estimations in the estimation when their count is zero?
Good that you raised this here, I didn't notice it before.
I would assume that we should exclude it from the estimations, which would be consistent to the current min/max queryDSL aggregations on the histogram field I think?
We don't store the empty centroids, so min/max in queryDSL will return the smallest/highest populated centroid.
There was a problem hiding this comment.
My thinking was that if the user sent in an explicit min that did not correspond to a centroid, we'd store that and it would have the same inconsistency with the QueryDSL implementation.
I also don't want to spend a lot of time on this. The empty bucket handling is an extreme edge case. Our expectation is that users will be sending in valid t-digests, which should never produce empty buckets. To get an empty bucket, the user has to not only not use the t-digest algorithm, but also do something which doesn't really make sense for most histogram types. Frankly, I'm not even sure we should treat these as valid inputs, but that's what the histogram field does and for now I'm trying to keep this as compatible as possible.
I don't think we need to decide this on this PR. Technically, this behavior was added in #137982, I just called it out in this test. We can discuss and change it later if we want to.
|
|
||
| private static final ParseField COUNTS_FIELD = new ParseField(COUNTS_NAME); | ||
| private static final ParseField CENTROIDS_FIELD = new ParseField(CENTROIDS_NAME); | ||
| private static final ParseField TOTAL_COUNT_FIELD = new ParseField(TOTAL_COUNT_FIELD_NAME); |
There was a problem hiding this comment.
You can also remove TOTAL_COUNT_FIELD_NAME from TDigestFieldMapper, as it is unused now.
Follow up to #137982 to support returning min, max, and sum in the synthetic source results. This also drops support for sending the count as a parameter (and thus doesn't include the total count in the synthetic source result), which matches the behavior of the exponential histogram field.