Skip to content

Sort binary doc values by default#140244

Merged
Kubik42 merged 3 commits intoelastic:mainfrom
Kubik42:sorted-binary-doc-values
Jan 8, 2026
Merged

Sort binary doc values by default#140244
Kubik42 merged 3 commits intoelastic:mainfrom
Kubik42:sorted-binary-doc-values

Conversation

@Kubik42
Copy link
Copy Markdown
Contributor

@Kubik42 Kubik42 commented Jan 6, 2026

MultiValuedSortedBinaryDocValues's name is misleading as it doesn't actually sort anything. It also violates the contract imposed by SortedBinaryDocValues (parent class).

Now, sorting values may not actually be necessary as sometimes the underlying binary array is created in sorted order during indexing (example code). However, more often than not, its not (example code).

I found that leaving things as is makes introducing binary doc values into match only text difficult due to match only text fields having a dedicated wrapper that sorts stored fields. Leaving things as is means we need to introduce a similar abstraction to StoredFieldSortedBinaryIndexFieldData, which is a lot of extra code simply to make up for the fact that MultiValuedSortedBinaryDocValues doesn't sort things despite "Sorted" literally being in its name.

This PR address this inconsistency. All values in MultiValuedBinaryDocValuesField are now sorted by default.

@Kubik42 Kubik42 added >non-issue Team:StorageEngine :StorageEngine/Mapping The storage related side of mappings labels Jan 6, 2026
@Kubik42 Kubik42 force-pushed the sorted-binary-doc-values branch from 3679bd0 to 966199d Compare January 6, 2026 23:58
@Kubik42 Kubik42 force-pushed the sorted-binary-doc-values branch from 966199d to 35e72a4 Compare January 6, 2026 23:59
@Kubik42 Kubik42 marked this pull request as ready for review January 7, 2026 02:29
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@martijnvg
Copy link
Copy Markdown
Member

I don't think I'm onboard with this change. I agree that sometimes SeparateCounts or IntegratedCounts are not confirming to SortedBinaryDocValues abstraction. So maybe we should just not allow custom collections to be provided in MultiValuedBinaryDocValuesField subclasses? Maybe we should just fix it to a List and then later in MultiValuedBinaryDocValuesField#binaryValue() we just sort it.

@Kubik42
Copy link
Copy Markdown
Contributor Author

Kubik42 commented Jan 7, 2026

We can't do that because our tests will fail. We expect different things across different fields and across different use cases. Some expect a deduped sorted list, other a sorted list thats not deduped, and others an unsorted list. We're inconsistent. We either need to agree to be consistent and introduce a breaking change or we deal with the mess.

@Kubik42
Copy link
Copy Markdown
Contributor Author

Kubik42 commented Jan 7, 2026

We synced offline to agree to sort at index time and adjust any tests that may fail.

@Kubik42 Kubik42 force-pushed the sorted-binary-doc-values branch from 35e72a4 to 4338523 Compare January 7, 2026 16:52
protected void writeLenAndValues(BytesStreamOutput out) throws IOException {
// sort the ArrayList variant of the collection prior to serializing it into a binary array
if (values instanceof ArrayList<BytesRef>) {
((ArrayList<BytesRef>) values).sort(Comparator.naturalOrder());
Copy link
Copy Markdown
Contributor Author

@Kubik42 Kubik42 Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

list.sort() is more efficient than Collections.sort() since it sorts in place

@Kubik42 Kubik42 changed the title Sorted binary doc values by default Sort binary doc values by default Jan 7, 2026
@Kubik42 Kubik42 force-pushed the sorted-binary-doc-values branch 2 times, most recently from 42a1347 to 023790d Compare January 7, 2026 19:17
@Kubik42 Kubik42 force-pushed the sorted-binary-doc-values branch from 023790d to 2207387 Compare January 7, 2026 23:27
private final boolean store;
private final boolean index;
private final Integer ignoreAbove;
private final boolean fallbackUsesBinaryDocValues;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is needed for annotated text. If we plan on using binary doc values for fallback annotated text fields, then this can later be removed

Copy link
Copy Markdown
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

}

tasks.named("yamlRestCompatTestTransform").configure { task ->
task.skipTest("match_only_text/10_basic/synthetic_source match_only_text as multi-field with ignored keyword as parent", "ignore_above now uses sorted binary doc values")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether we can use replaceValueInMatch here instead of skipTest? Then we can still run the test with previous assertion.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this doesn't work, then please ignore my comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this but couldn't get it working

@Kubik42 Kubik42 merged commit 37b2414 into elastic:main Jan 8, 2026
3 of 6 checks passed
@Kubik42 Kubik42 deleted the sorted-binary-doc-values branch January 8, 2026 18:59
jimczi pushed a commit to jimczi/elasticsearch that referenced this pull request Jan 12, 2026
* Store fallback annotated text fields in binary doc values

* Address feedback
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants