Skip to content

Store ignored malformed fields in binary doc values#142357

Merged
Kubik42 merged 3 commits intoelastic:mainfrom
Kubik42:ignore-malformed-binary-doc-values
Feb 25, 2026
Merged

Store ignored malformed fields in binary doc values#142357
Kubik42 merged 3 commits intoelastic:mainfrom
Kubik42:ignore-malformed-binary-doc-values

Conversation

@Kubik42
Copy link
Copy Markdown
Contributor

@Kubik42 Kubik42 commented Feb 12, 2026

@Kubik42 Kubik42 force-pushed the ignore-malformed-binary-doc-values branch 3 times, most recently from 96c4b41 to 9dade60 Compare February 13, 2026 20:56
@Kubik42 Kubik42 force-pushed the ignore-malformed-binary-doc-values branch 3 times, most recently from 178d3ff to 057c7ae Compare February 20, 2026 22:59
task.skipTest("search.vectors/42_knn_search_bbq_flat/Vector rescoring has same scoring as exact search for kNN section", "scores have changed slightly with native implementations")
task.skipTest("search.vectors/41_knn_search_bbq_hnsw/Vector rescoring has same scoring as exact search for kNN section", "scores have changed slightly with native implementations")
task.skipTest(
"get/100_synthetic_source/fields with ignore_malformed",
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't use replaceValueInMatch here due to how the yaml is structured - fields with ignore_malformed is large and has many match assertions that specifically check _source.ip. Using replaceValueInMatch here means all of those assertions will be changed, which is not what we want.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether instead of skipping the test entirely can we maybe use removeMatch?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that wouldn't work well - each yaml test uses a nested format for all of its match assertions (ex. _source: { ip: [...] }). removeMatch() looks only at the name of the top-level key, (ex. _source), so using it here effectively removes all match assertion and renders the tests useless. This applies to all the test I've skipped in this PR. We would need to unnest some of the match assertions and then backport that in order for the yamlRestCompatTests to pass

import java.io.IOException;

public final class BinaryDocValuesSyntheticFieldLoaderLayer implements CompositeSyntheticFieldLoader.DocValuesLayer {
public class BinaryDocValuesSyntheticFieldLoaderLayer implements CompositeSyntheticFieldLoader.DocValuesLayer {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to removed final in order to override BinaryDocValuesSyntheticFieldLoaderLayer.writeValue() to support malformed values. By default writeValue() treats BytesRef values as raw UTF8 strings. This is not the case for malformed values, which can be of any type. For malformed values, we override writeValue() like so:

        protected void writeValue(XContentBuilder b, BytesRef value) throws IOException {
            XContentDataHelper.decodeAndWrite(b, value);
        }

}

@Override
protected void writeValue(XContentBuilder b, BytesRef value) throws IOException {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is that override of writeValue() that I mentioned above

}

/**
* Encodes an XContentBuilder's content, normalizing to JSON regardless of the builder's original content type.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is primarily here for the tests, although its also important to deliver a consistent experience to customers.

Why this is needed for tests: our tests randomize the content type, meanwhile the assertions don't take that into account (ex. yamlRestTests). As a result, the assertion must either be consistent across all content types or we need to change how we store values. For example, we could store things in insertion order as opposed to sorting every time. That said, in an older PR (I can dig it up if needed), we specifically decided to sort everything.

return malformedLoader;
}
return fieldLoader;
}
Copy link
Copy Markdown
Contributor Author

@Kubik42 Kubik42 Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is kind of ugly. Most of the other mappers use CompositeSyntheticFieldLoader, which abstracts away how 2+ loaders are handled. I think its worth refactoring SortedNumericDocValuesSyntheticFieldLoader into a Layer that can be used by CompositeSyntheticFieldLoader, but in another PR.

@Kubik42 Kubik42 force-pushed the ignore-malformed-binary-doc-values branch from 057c7ae to fa3ffde Compare February 24, 2026 01:35
@Kubik42 Kubik42 marked this pull request as ready for review February 24, 2026 07:43
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

Copy link
Copy Markdown
Contributor

@jordan-powers jordan-powers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but @martijnvg should probably have a look too

Copy link
Copy Markdown
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @Kubik42! I left a question.


private static void saveToBinaryDocValues(DocumentParserContext context, String fieldPath, BytesRef encoded) {
final String fieldName = name(fieldPath);
MultiValuedBinaryDocValuesField.SeparateCount.addToSeparateCountMultiBinaryFieldInDoc(context.doc(), fieldName, encoded, true);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use IntegratedCount here? I think this is what we use for ignore above?

I think integrated count is a better match here, since we don't optimize for fast reading. We're ok with reading value count as part of the actual values. And SeperateCount adds an additional doc value field for all fields that will have malformed values.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm ok, I initially chose SeperateCount due to optimizations it makes for single-valued fields.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes separate count fields make sense for mapped fields, but this like a side field that will not get queried (except when source gets generated) and so optimizing for less doc values fields and storage is better.

task.skipTest("search.vectors/42_knn_search_bbq_flat/Vector rescoring has same scoring as exact search for kNN section", "scores have changed slightly with native implementations")
task.skipTest("search.vectors/41_knn_search_bbq_hnsw/Vector rescoring has same scoring as exact search for kNN section", "scores have changed slightly with native implementations")
task.skipTest(
"get/100_synthetic_source/fields with ignore_malformed",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder whether instead of skipping the test entirely can we maybe use removeMatch?

task.skipTest("aggregate-metrics/30_sum_agg/Test sum agg with query", "Default metric cannot be defined, so the query is empty")
task.skipTest("aggregate-metrics/50_value_count_agg/Test value_count agg with query", "Default metric cannot be defined, so the query is empty")
// Malformed values are now stored in binary doc values (sorted) instead of stored fields (insertion order).
task.skipTest(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same reason as in the other file?

@Kubik42 Kubik42 force-pushed the ignore-malformed-binary-doc-values branch from 7eb8e23 to 8ce768b Compare February 24, 2026 23:35
Copy link
Copy Markdown
Member

@martijnvg martijnvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@Kubik42 Kubik42 merged commit f094941 into elastic:main Feb 25, 2026
3 of 6 checks passed
@Kubik42 Kubik42 deleted the ignore-malformed-binary-doc-values branch February 25, 2026 16:21
Kubik42 added a commit to Kubik42/elasticsearch that referenced this pull request Feb 25, 2026
Kubik42 added a commit to Kubik42/elasticsearch that referenced this pull request Feb 25, 2026
Kubik42 added a commit to Kubik42/elasticsearch that referenced this pull request Feb 25, 2026
Kubik42 added a commit that referenced this pull request Feb 25, 2026
* Store ignored malformed fields in binary doc values (#142357)

This reverts commit 26ca5c3.

* Remove extra newline
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants