
Add summary metrics for tdigest fields#137982

Merged
not-napoleon merged 26 commits into elastic:main from not-napoleon:t-digest-sum-etc
Nov 14, 2025

Conversation

@not-napoleon
Member

Add support for storing summary metrics (min, max, sum, and count) alongside the t-digest histogram values. These values can be sent by the user, or computed at parse time. As of this PR, there is no validation of user-sent values for these summary metrics, although we should add some validation in the future.
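For illustration only (the index and field names here are made up; only the `tdigest` type name comes from this PR's tests), a mapping using the new field type might look like:

```json
PUT metrics-test
{
  "mappings": {
    "properties": {
      "latency": {
        "type": "tdigest"
      }
    }
  }
}
```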

@not-napoleon not-napoleon added >non-issue test-release Trigger CI checks against release build :StorageEngine/Mapping The storage related side of mappings v9.3.0 labels Nov 12, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

return new ParsedHistogram(centroids, counts, count, sum, min, max);
}

private static ArrayList<Long> getLongs(String mappedFieldName, XContentParser parser) throws IOException {
Member

Let's rename to getCounts, since the error message hardcodes COUNTS_FIELD below.

return counts;
}

private static ArrayList<Double> getDoubles(String mappedFieldName, XContentParser parser) throws IOException {
Member

Ditto, rename to getCentroids?

public void testSyntheticSourceKeepArrays() {
// The mapper expects to parse an array of values by default, it's not compatible with array of arrays.
}
/*
Member

Remove?

Member Author

We still need this unfortunately. This overrides a test in the parent class which validates that synthetic source works correctly for multi-valued inputs. But this field (like the original histogram field) does not support multi-valued input, so the parent test will always throw a parsing error. We cover making sure that the correct error is thrown in testParseArrayValue already, and I'd rather not override the intended purpose of this test to do that.

type: "tdigest"
- do:
headers:
Authorization: "Basic eF9wYWNrX3Jlc3RfdXNlcjp4LXBhY2stdGVzdC1wYXNzd29yZA==" # run as x_pack_rest_user, i.e. the test setup superuser
Member

Why is this needed?

Member Author

Because this is in x-pack, and it's available only under the basic license, not the OSS license. This is the same as the existing histogram field (which, indeed, is where I copied this key from).

Member Author

Actually, upon further experimentation I'm not sure this is still true. I'm going to try taking it out and see what happens.

- '{"index": {"_id": 1}}'
- '{"latency": {"centroids" : [0.1, 0.2, 0.3, 0.4, 0.5], "counts" : [3, 7, 23, 12, 6]}}'
- '{"index": {"_id": 2}}'
- '{"latency": {"centroids" : [0, 0.1, 0.2, 0.3, 0.4, 0.5], "counts" : [3, 2, 5, 10, 1, 8]}}'
Member

Add one where the min,max,sum,count values are passed externally?
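Such a doc might look like this (the numeric values are made up for illustration, and the summary key names assume the parse fields are `min`, `max`, `sum`, and `count`):

```
- '{"index": {"_id": 3}}'
- '{"latency": {"centroids" : [0.1, 0.2, 0.3], "counts" : [3, 7, 23], "min": 0.05, "max": 0.35, "sum": 8.7, "count": 33}}'
```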

Member

@kkrik-es kkrik-es left a comment

Looks good, just a few nits.

Contributor

@romseygeek romseygeek left a comment

LGTM, I left a question about lazy calculations but no need for another review.

private static final ParseField MIN_FIELD = new ParseField(MIN_FIELD_NAME);

/**
* A parsed histogram field, can represent either a T-Digest
Contributor

This only represents T-Digest now I think?

if (centroids != null && centroids.isEmpty() == false) {
return centroids.get(centroids.size() - 1);
}
return Double.NaN;
Contributor

Is there a reason to always calculate this lazily? If not, I think this would be easier to reason about if the record used primitive types (e.g. long rather than Long) and we had a factory method that calculated summary values when they are not present.
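A minimal sketch of that suggestion (names and shapes are illustrative, not the PR's actual code): a record with primitive summary fields, plus a factory that derives any values the user did not send from the centroids and counts.

```java
import java.util.List;

// Sketch only: a record with primitive summary fields. The factory fills in
// missing summary values from the buckets, so downstream code never deals
// with nulls or lazy computation.
record ParsedHistogram(List<Double> centroids, List<Long> counts,
                       long count, double sum, double min, double max) {

    static ParsedHistogram of(List<Double> centroids, List<Long> counts,
                              Long count, Double sum, Double min, Double max) {
        long totalCount = 0;
        double weightedSum = 0;
        for (int i = 0; i < centroids.size(); i++) {
            totalCount += counts.get(i);
            weightedSum += centroids.get(i) * counts.get(i);
        }
        boolean empty = centroids.isEmpty();
        return new ParsedHistogram(
            centroids, counts,
            count != null ? count : totalCount,
            sum != null ? sum : weightedSum,
            // centroids are sorted, so the first/last approximate min/max
            min != null ? min : (empty ? Double.NaN : centroids.get(0)),
            max != null ? max : (empty ? Double.NaN : centroids.get(centroids.size() - 1))
        );
    }
}
```

Note that the derived sum and min/max are approximations taken from the centroids, while an explicitly sent value would be exact; that distinction comes up again in the discussion below.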

streamOutput.writeDouble(parsedHistogram.min());
streamOutput.writeDouble(parsedHistogram.max());
streamOutput.writeDouble(parsedHistogram.sum());
streamOutput.writeLong(parsedHistogram.count());
Contributor

Storing these in a single BinaryDocValues field prevents us from optimizing queries to load only the required data. E.g. a MIN(histogram) will have inferior performance to MIN(some_double_column) because it (a) loads more data and (b) needs to extract the values from the bytes instead of being able to operate directly on the memory as is.

In addition, this will use more storage, because it can't use our optimized numeric doc values encodings (at least until we have some compression on BinaryDocValues).

I guess this is a known tradeoff for now and something we accept to move faster? Nothing will prevent us from optimizing this later.
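For context on point (b): doubles stored as numeric doc values can be compared directly in their encoded form, because Lucene maps them to sortable longs (the transform `NumericUtils.sortableDoubleBits` applies). A self-contained sketch of that transform:

```java
// Sketch of the sortable-bits mapping applied to doubles stored as numeric
// doc values: after this transform, signed-long order matches double order,
// so aggregations can compare raw encoded values without decoding.
final class SortableBits {
    static long sortableDoubleBits(double value) {
        long bits = Double.doubleToLongBits(value);
        // Positive doubles keep their bits; negative doubles get their
        // 63 magnitude bits flipped so larger magnitudes sort lower.
        return bits ^ ((bits >> 63) & 0x7fffffffffffffffL);
    }
}
```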

Member Author

> Nothing will prevent us from optimizing this later.

I mean, we would still need to support this layout if we release it and change it later. If we want separate sub fields, I'd rather do that from the start. That said, I don't think it's a terribly big chunk of work, so I can try to get it in this PR.

Contributor

It will be a lot more work to support in ES|QL, because if I'm not mistaken, with just a single BinaryDocValue you wouldn't be required to add a new block type, which was the heaviest part of the exponential_histogram ES|QL work.

But like you said, it is a tradeoff: if we decide to do this later, we'll have to keep the code for the "old" encoding around.

Member

Yeah, I missed this part. I agree with Jonas; we either add them separately (so as to get improved query performance), or we don't add them at all and calculate the values over the centroids.

streamOutput.writeDouble(parsedHistogram.min());
streamOutput.writeDouble(parsedHistogram.max());
streamOutput.writeDouble(parsedHistogram.sum());
streamOutput.writeLong(parsedHistogram.count());
Contributor

Also is there a reason why we store totalCount, even though it can be directly derived from the buckets which we have to load anyway?

@Override
public Long count() {
if (count != null) {
return count;
Contributor

Is there a reason why we have a separate count which can be different from the sum of the counts of the buckets?

Are we sure that all the algorithms you are planning to have working on this type will handle this well?

For exponential histograms, I intentionally do not have a count field. Instead it is implicitly there as the sum of all buckets.

Member Author

Why do we accept any of these values explicitly? They can all be computed from the histogram (more or less). Honestly, always computing them would make more sense to me, but I had the impression from the exponential histogram work that we needed to allow users to specify them for some reason.

Member Author

FWIW, Prometheus does seem to support having an explicit count field different from the sum of the bucket counts:

> An observation of NaN goes into no bucket, but increments the count of observations. This implies that the count of observations can be greater than the sum of all buckets (negative, positive, and zero buckets), and the difference is the number of NaN observations.

(see https://prometheus.io/docs/specs/native_histograms/)

Contributor

@JonasKunz JonasKunz Nov 14, 2025

sum, min and max can only be approximated from the buckets, while (so I thought) count can be computed exactly.
The benefit of storing these is that, independent of the actual resolution of the histogram, AVG, MIN and MAX queries will give you exact results, just like for aggregate metric double.
In OpenTelemetry, the sum, min and max are captured as extra values and sent via OTLP for that reason.

In fact, not having them has been a problem in the past when putting OTLP data into the histogram field:
The default configuration in OpenTelemetry is to use histograms with fixed buckets, where the buckets are the prometheus defaults: [0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10].
So the MIN, MAX and AVG estimated from the buckets would always be capped to [0.005, 10], which is pretty terrible in my opinion.

> FWIW, Prometheus does seem to support having an explicit count field different from the sum of the bucket counts

Interesting, I wasn't aware of that. Though I find it a questionable choice: that means avg = sum/count would be distorted towards zero by NaN observations. IMO it would be better to have a dedicated NaN bucket, if desired, which doesn't count towards the total count.

Anyway, I would say we can leave this edge case aside and say only finite observations are supported. I didn't find it in the OpenTelemetry spec, but that's what the OpenTelemetry SDK does:
https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/metrics/src/main/java/io/opentelemetry/sdk/metrics/internal/aggregator/DoubleBase2ExponentialHistogramAggregator.java#L219
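A hypothetical illustration of the capping problem described above (bucket layout from the prometheus defaults quoted earlier; everything else is made up): min/max estimated from fixed buckets can never be more precise than the bucket edges.

```java
// Sketch: estimating max from fixed-boundary buckets. The best we can do is
// report the upper bound of the highest non-empty bucket, so the estimate is
// always snapped to a bucket edge, unlike an explicitly stored max.
final class BucketEstimate {
    // Upper bounds of the default prometheus buckets (from the comment above).
    static final double[] UPPER_BOUNDS = {
        0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10
    };

    static double estimateMax(long[] counts) {
        for (int i = counts.length - 1; i >= 0; i--) {
            if (counts[i] > 0) {
                return UPPER_BOUNDS[i];
            }
        }
        return Double.NaN; // no observations
    }
}
```

For example, an observation of 7.3 lands in the (5, 7.5] bucket, so the estimated max is 7.5, while a stored explicit max would report 7.3 exactly.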

@not-napoleon
Member Author

I've taken @JonasKunz 's advice and moved the summary data into separate numeric doc values fields. That has some side effects for synthetic source. Specifically, it's no longer trivial to read the summary values while synthesizing the source, and recomputing them can result in value drift that makes the tests flap. This seems very solvable, but it isn't on the critical path to getting percentile support in ESQL, so I'm leaving it for now and will come back to it after we release the tech preview.

@not-napoleon not-napoleon removed the test-release Trigger CI checks against release build label Nov 13, 2025
private static class InternalTDigestValue extends HistogramValue {
double value;
long count;
double min;
Contributor

Should probably remove these members as you don't support reading them from the separate doc values yet?

@not-napoleon not-napoleon enabled auto-merge (squash) November 14, 2025 15:42
@not-napoleon not-napoleon merged commit 396b1dc into elastic:main Nov 14, 2025
34 checks passed
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 16, 2025
* main: (135 commits)
  Mute org.elasticsearch.upgrades.IndexSortUpgradeIT testIndexSortForNumericTypes {upgradedNodes=1} elastic#138130
  Mute org.elasticsearch.upgrades.IndexSortUpgradeIT testIndexSortForNumericTypes {upgradedNodes=2} elastic#138129
  Mute org.elasticsearch.search.basic.SearchWithRandomDisconnectsIT testSearchWithRandomDisconnects elastic#138128
  [DiskBBQ] avoid EsAcceptDocs bug by calling cost before building iterator (elastic#138127)
  Log NOT_PREFERRED shard movements (elastic#138069)
  Improve bulk loading of binary doc values (elastic#137995)
  Add internal action for getting inference fields and inference results for those fields (elastic#137680)
  Address issue with DateFieldMapper#isFieldWithinQuery(...) (elastic#138032)
  WriteLoadConstraintDecider: Have separate rate limiting for canRemain and canAllocate decisions (elastic#138067)
  Adding NodeContext to TransportBroadcastByNodeAction (elastic#138057)
  Mute org.elasticsearch.simdvec.ESVectorUtilTests testSoarDistanceBulk elastic#138117
  Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeIT test elastic#137909
  Backport batched_response_might_include_reduction_failure version to 8.19 (elastic#138046)
  Add summary metrics for tdigest fields (elastic#137982)
  Add gp-llm-v2 model ID and inference endpoint (elastic#138045)
  Various tracing fixes (elastic#137908)
  [ML] Fixing KDE evaluate() to return correct ValueAndMagnitude object (elastic#128602)
  Mute org.elasticsearch.xpack.shutdown.NodeShutdownIT testStalledShardMigrationProperlyDetected elastic#115697
  [ML] Fix Flaky Audit Message Assertion in testWithDatastream for RegressionIT and ClassificationIT (elastic#138065)
  [ML] Fix Non-Deterministic Training Set Selection in RegressionIT testTwoJobsWithSameRandomizeSeedUseSameTrainingSet (elastic#138063)
  ...

# Conflicts:
#	rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search.vectors/200_dense_vector_docvalue_fields.yml
not-napoleon added a commit that referenced this pull request Nov 17, 2025
Follow up to #137982 to support returning min, max, and sum in the synthetic source results. This also drops support for sending the count as a parameter (and thus doesn't include the total count in the synthetic source result), which matches the behavior of the exponential histogram field.