ESQL: Be careful with duplicate doc ids by nik9000 · Pull Request #142055 · elastic/elasticsearch

nik9000 · 2026-02-06T22:59:29Z

ESQL's DocVector is fine having duplicate doc ids inside it. Except some consumers aren't! It's much easier to optimize some of the loaders if there aren't any duplicate doc ids.

This adds a flag, mayContainDuplicates, to DocVector. When it's false we assert that there aren't any duplicate docs. When it's true we allow duplicates. Callers that need to optimize can check it before running the no-dups path.
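
A minimal sketch of the idea, not the code from this PR: SketchDocVector and SketchLoader are hypothetical names, and the caching in the general path is only one plausible reason a consumer might care about duplicates. The sketch throws where the PR uses assert, so the check also runs without -ea.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a doc vector; NOT the real DocVector API. */
final class SketchDocVector {
    final int[] docs;
    final boolean mayContainDuplicates;

    SketchDocVector(int[] docs, boolean mayContainDuplicates) {
        // The PR uses `assert` here; the sketch throws so it works without -ea.
        if (mayContainDuplicates == false && hasDuplicates(docs)) {
            throw new IllegalArgumentException("duplicate doc ids but mayContainDuplicates is false");
        }
        this.docs = docs;
        this.mayContainDuplicates = mayContainDuplicates;
    }

    static boolean hasDuplicates(int[] docs) {
        Set<Integer> seen = new HashSet<>();
        for (int doc : docs) {
            if (seen.add(doc) == false) {
                return true;
            }
        }
        return false;
    }
}

/** Hypothetical consumer that checks the flag before taking the no-dups path. */
final class SketchLoader {
    int reads = 0; // counts simulated reads from storage

    long[] load(SketchDocVector vector, long[] storage) {
        long[] result = new long[vector.docs.length];
        if (vector.mayContainDuplicates == false) {
            // No-dups path: every doc is read exactly once, no bookkeeping needed.
            for (int i = 0; i < result.length; i++) {
                result[i] = read(storage, vector.docs[i]);
            }
            return result;
        }
        // General path: remember values so a repeated doc is read only once.
        Map<Integer, Long> cache = new HashMap<>();
        for (int i = 0; i < result.length; i++) {
            int doc = vector.docs[i];
            Long cached = cache.get(doc);
            if (cached == null) {
                cached = read(storage, doc);
                cache.put(doc, cached);
            }
            result[i] = cached;
        }
        return result;
    }

    private long read(long[] storage, int doc) {
        reads++;
        return storage[doc];
    }
}
```

Both paths return the same values; the flag only decides how much bookkeeping the consumer has to do.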

While building this I consolidated a bunch of the ctor calls for DocVector into a pattern like this:

new DocVector(refCounteds, shards, segments, docs,
  DocVector.config().mayContainDuplicates()
)

config() handles all of the optional parameters builder-style.
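
For readers unfamiliar with the pattern, here is a rough sketch of a builder-style options object. SketchConfig, SketchVector, and allowsDuplicates are hypothetical names, and the default of "no duplicates" is an assumption of the sketch; the real DocVector.config() may look different.

```java
/** Hypothetical options holder; the real DocVector.config() may differ. */
final class SketchConfig {
    // Default value is an assumption of this sketch.
    private boolean mayContainDuplicates = false;

    static SketchConfig config() {
        return new SketchConfig();
    }

    /** Opt in to duplicates; returns this so calls can be chained. */
    SketchConfig mayContainDuplicates() {
        this.mayContainDuplicates = true;
        return this;
    }

    boolean allowsDuplicates() {
        return mayContainDuplicates;
    }
}

/** Hypothetical vector whose constructor takes required data plus a config. */
final class SketchVector {
    final int[] docs;
    final boolean mayContainDuplicates;

    // Required data comes first; everything optional travels in the config.
    SketchVector(int[] docs, SketchConfig config) {
        this.docs = docs;
        this.mayContainDuplicates = config.allowsDuplicates();
    }
}
```

The benefit is that adding another optional parameter later means adding one method to the config rather than another constructor overload.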

While doing this I noticed that ResultBuilderForDoc didn't track its memory! It should! So I wrote a DocVector.FixedBuilder class and plugged it in. It's easier to use than what we had and it tracks memory.
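
To illustrate what "tracks memory" can mean for a fixed-size builder, here is a small sketch under stated assumptions. SketchBreaker and SketchFixedBuilder are hypothetical; they are not Elasticsearch's CircuitBreaker or the real DocVector.FixedBuilder, and the sketch ignores handing the reserved bytes over to the built vector.

```java
/** Hypothetical byte-counting breaker; not Elasticsearch's CircuitBreaker. */
final class SketchBreaker {
    private final long limit;
    private long used;

    SketchBreaker(long limit) {
        this.limit = limit;
    }

    void reserve(long bytes) {
        if (used + bytes > limit) {
            throw new IllegalStateException("would use " + (used + bytes) + " bytes, limit is " + limit);
        }
        used += bytes;
    }

    void release(long bytes) {
        used -= bytes;
    }

    long used() {
        return used;
    }
}

/** Hypothetical fixed-size builder that accounts for its array up front. */
final class SketchFixedBuilder implements AutoCloseable {
    private final SketchBreaker breaker;
    private final long reserved;
    private final int[] docs;
    private int next;
    private boolean closed;

    SketchFixedBuilder(SketchBreaker breaker, int size) {
        this.breaker = breaker;
        this.reserved = (long) size * Integer.BYTES;
        breaker.reserve(reserved); // fail before allocating if over the limit
        this.docs = new int[size];
    }

    SketchFixedBuilder append(int doc) {
        docs[next++] = doc;
        return this;
    }

    int[] build() {
        if (next != docs.length) {
            throw new IllegalStateException("expected " + docs.length + " docs but got " + next);
        }
        return docs;
    }

    @Override
    public void close() {
        // Release exactly once, even if close() is called again.
        if (closed == false) {
            closed = true;
            breaker.release(reserved);
        }
    }
}
```

Because the size is known up front, the whole reservation happens in the constructor, and close() only has to give it back.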


elasticsearchmachine · 2026-02-06T22:59:54Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

nik9000 · 2026-02-06T22:59:49Z

test/framework/src/main/java/org/elasticsearch/test/BreakerTestUtil.java

-        throws E {
-
+    public static ByteSizeValue findBreakerLimit(ByteSizeValue tooBigToBreak, CheckedConsumer<ByteSizeValue, Exception> c)
+        throws Exception {


Made it easier to find a bug.

nik9000 · 2026-02-06T23:00:33Z

x-pack/plugin/esql/compute/src/main/java/org/elasticsearch/compute/data/Block.java

-     * TODO: pass BlockFactory
     */
-    Block filter(int... positions);
+    Block filter(int... positions); // TODO mayContainDuplicates


In a follow-up I'll deal with duplicates here. Right now you can pass 1, 1, 1 as the positions and the result will contain duplicates. Whether that's allowed should be a parameter we pass in.

nik9000 · 2026-02-06T23:01:26Z

.../esql/compute/src/main/java/org/elasticsearch/compute/operator/topn/ResultBuilderForDoc.java

    @Override
    public void close() {
-        // TODO Memory accounting
+        builder.close();


nik9000 · 2026-02-06T23:03:02Z

@parkertimmins, I didn't plug this into the values reader, but figured this was enough for a Friday afternoon.

dnhatn

LGTM. Thanks Nik!

As of elastic#142055 we track when `DocVector` contains duplicates. In that PR we said "if you call `filter`, then the result may contain duplicates." This PR adds a flag to `filter` saying "the result of this `filter` may contain duplicates." It's quite common for `filter` calls not to create duplicates and most callers can pass `false`. That'll allow block loaders that follow these `filter` calls to use faster paths.
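
A rough sketch of how such a flag on `filter` could behave. `SketchFilterable` is a hypothetical class, and the parameter name and order are assumptions; the signature in the follow-up PR may differ.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical filterable doc list; NOT the real Block or DocVector API. */
final class SketchFilterable {
    final int[] docs;
    final boolean mayContainDuplicates;

    SketchFilterable(int[] docs, boolean mayContainDuplicates) {
        this.docs = docs;
        this.mayContainDuplicates = mayContainDuplicates;
    }

    /**
     * Keep the docs at the given positions. The caller states whether the
     * positions may repeat; name and order of the flag are assumptions.
     */
    SketchFilterable filter(boolean positionsMayRepeat, int... positions) {
        if (positionsMayRepeat == false && hasRepeats(positions)) {
            throw new IllegalArgumentException("positions repeat but caller promised they would not");
        }
        int[] filtered = new int[positions.length];
        for (int i = 0; i < positions.length; i++) {
            filtered[i] = docs[positions[i]];
        }
        // Existing duplicates survive filtering; repeated positions can add new ones.
        return new SketchFilterable(filtered, mayContainDuplicates || positionsMayRepeat);
    }

    private static boolean hasRepeats(int[] positions) {
        Set<Integer> seen = new HashSet<>();
        for (int position : positions) {
            if (seen.add(position) == false) {
                return true;
            }
        }
        return false;
    }
}
```

A caller that filters with distinct positions passes `false` and keeps the result eligible for the faster no-dups paths.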

nik9000 requested review from dnhatn and parkertimmins February 6, 2026 22:59

nik9000 added the >bug, :Analytics/ES|QL (AKA ESQL), and v9.4.0 labels Feb 6, 2026

elasticsearchmachine added the Team:Analytics label (Meta label for analytical engine team (ESQL/Aggs/Geo)) Feb 6, 2026

nik9000 mentioned this pull request Feb 6, 2026

ESQL: Load many fields column-at-a-time #141926

Merged

dnhatn approved these changes Feb 6, 2026

nik9000 merged commit e857700 into elastic:main Feb 7, 2026
35 checks passed

nik9000 mentioned this pull request Feb 8, 2026

ESQL: Opt into dupes when filtering #142088

Merged

leontyevdv mentioned this pull request Mar 2, 2026

Set DocVector.mayContainDuplicates flag to test deduplication #143375

Merged

nik9000 merged 1 commit into elastic:main from nik9000:doc_flag

3 participants
