Skip to content

ESQL: Be careful with duplicate doc ids#142055

Merged
nik9000 merged 1 commit intoelastic:mainfrom
nik9000:doc_flag
Feb 7, 2026
Merged

ESQL: Be careful with duplicate doc ids#142055
nik9000 merged 1 commit intoelastic:mainfrom
nik9000:doc_flag

Conversation

@nik9000
Copy link
Copy Markdown
Member

@nik9000 nik9000 commented Feb 6, 2026

ESQL's DocVector is fine having duplicate doc ids inside it. Except some consumers aren't! It's much easy to optimize some of the loaders if there aren't any duplicate docs ids.

This adds a flag, mayContainDuplicates to DocVector. When it's false we assert that there aren't any duplicate docs. When it's true we allow duplicates. Callers that need to optimize can check if before running the no-dups-path.

While building this I consolidated a bunch of the ctor calls for DocVector into a pattern like this:

new DocVector(refCounteds, shards, segments, docs,
  DocVector.config().mayContainDuplicates()
)

config() handles all of the optional parameters builder-style.

While doing this I noticed that ResultBuilderForDoc didn't track it's memory! It should! So I wrote a DocVector.FixedBuilder class and plugged it in. It's easier to use than what we had and it tracks memory.

ESQL's `DocVector` is *fine* having duplicate doc ids inside it. Except
some consumers aren't! It's much easy to optimize some of the loaders if
there aren't any duplicate docs ids.

This adds a flag, `mayContainDuplicates` to `DocVector`. When it's
`false` we `assert` that there aren't any duplicate docs. When it's
`true` we allow duplicates. Callers that need to optimize can check if
before running the no-dups-path.

While building this I consolidated a bunch of the ctor calls for
`DocVector` into a pattern like this:
```
new DocVector(refCounteds, shards, segments, docs,
  DocVector.config().mayContainDuplicates()
)
```

`config()` handles all of the optional parameters builder-style.

While doing this I noticed that `ResultBuilderForDoc` didn't track it's
memory! It should! So I wrote a `DocVector.FixedBuilder` class and
plugged it in. It's easier to use than what we had and it tracks memory.
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Feb 6, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

throws E {

public static ByteSizeValue findBreakerLimit(ByteSizeValue tooBigToBreak, CheckedConsumer<ByteSizeValue, Exception> c)
throws Exception {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made it easier to find a bug.

* TODO: pass BlockFactory
*/
Block filter(int... positions);
Block filter(int... positions); // TODO mayContainDuplicates
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a follow-up I'll deal with duplicates here. Right now you can write 1, 1, 1 and it'll contain duplicates. This should be a parameter we pass in.

@Override
public void close() {
// TODO Memory accounting
builder.close();
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ooof.

@nik9000
Copy link
Copy Markdown
Member Author

nik9000 commented Feb 6, 2026

@parkertimmins , I didn't plug this into the values reader, but figured this was enough for a Friday afternoon.

Copy link
Copy Markdown
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks Nik!

@nik9000 nik9000 merged commit e857700 into elastic:main Feb 7, 2026
35 checks passed
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request Feb 8, 2026
As of elastic#142055 we track when `DocVector` contains duplicates. In that PR
we said "if you call `filter`, then the result may contain duplicates."
This PR adds a flag to `filter` saying "the result of this `filter` may
contain duplicates." It's quite common for `filter` calls not to create
duplicates and most callers can pass `false`. That'll allow block
loaders that follow these `filter` calls to use faster paths.
nik9000 added a commit that referenced this pull request Feb 9, 2026
As of #142055 we track when `DocVector` contains duplicates. In that PR
we said "if you call `filter`, then the result may contain duplicates."
This PR adds a flag to `filter` saying "the result of this `filter` may
contain duplicates." It's quite common for `filter` calls not to create
duplicates and most callers can pass `false`. That'll allow block
loaders that follow these `filter` calls to use faster paths.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants