Store ignored malformed fields in binary doc values#142357
Store ignored malformed fields in binary doc values#142357Kubik42 merged 3 commits intoelastic:mainfrom
Conversation
96c4b41 to
9dade60
Compare
178d3ff to
057c7ae
Compare
| task.skipTest("search.vectors/42_knn_search_bbq_flat/Vector rescoring has same scoring as exact search for kNN section", "scores have changed slightly with native implementations") | ||
| task.skipTest("search.vectors/41_knn_search_bbq_hnsw/Vector rescoring has same scoring as exact search for kNN section", "scores have changed slightly with native implementations") | ||
| task.skipTest( | ||
| "get/100_synthetic_source/fields with ignore_malformed", |
There was a problem hiding this comment.
Can't use replaceValueInMatch here due to how the yaml is structured - fields with ignore_malformed is large and has many match assertions that specifically check _source.ip. Using replaceValueInMatch here means all of those assertions will be changed, which is not what we want.
There was a problem hiding this comment.
I wonder whether instead of skipping the test entirely can we maybe use removeMatch?
There was a problem hiding this comment.
that wouldn't work well - each yaml test uses a nested format for all of its match assertions (ex. _source: { ip: [...] }). removeMatch() looks only at the name of the top-level key, (ex. _source), so using it here effectively removes all match assertion and renders the tests useless. This applies to all the test I've skipped in this PR. We would need to unnest some of the match assertions and then backport that in order for the yamlRestCompatTests to pass
| import java.io.IOException; | ||
|
|
||
| public final class BinaryDocValuesSyntheticFieldLoaderLayer implements CompositeSyntheticFieldLoader.DocValuesLayer { | ||
| public class BinaryDocValuesSyntheticFieldLoaderLayer implements CompositeSyntheticFieldLoader.DocValuesLayer { |
There was a problem hiding this comment.
Needed to removed final in order to override BinaryDocValuesSyntheticFieldLoaderLayer.writeValue() to support malformed values. By default writeValue() treats BytesRef values as raw UTF8 strings. This is not the case for malformed values, which can be of any type. For malformed values, we override writeValue() like so:
protected void writeValue(XContentBuilder b, BytesRef value) throws IOException {
XContentDataHelper.decodeAndWrite(b, value);
}
| } | ||
|
|
||
| @Override | ||
| protected void writeValue(XContentBuilder b, BytesRef value) throws IOException { |
There was a problem hiding this comment.
here is that override of writeValue() that I mentioned above
| } | ||
|
|
||
| /** | ||
| * Encodes an XContentBuilder's content, normalizing to JSON regardless of the builder's original content type. |
There was a problem hiding this comment.
This is primarily here for the tests, although its also important to deliver a consistent experience to customers.
Why this is needed for tests: our tests randomize the content type, meanwhile the assertions don't take that into account (ex. yamlRestTests). As a result, the assertion must either be consistent across all content types or we need to change how we store values. For example, we could store things in insertion order as opposed to sorting every time. That said, in an older PR (I can dig it up if needed), we specifically decided to sort everything.
| return malformedLoader; | ||
| } | ||
| return fieldLoader; | ||
| } |
There was a problem hiding this comment.
this is kind of ugly. Most of the other mappers use CompositeSyntheticFieldLoader, which abstracts away how 2+ loaders are handled. I think its worth refactoring SortedNumericDocValuesSyntheticFieldLoader into a Layer that can be used by CompositeSyntheticFieldLoader, but in another PR.
057c7ae to
fa3ffde
Compare
|
Pinging @elastic/es-storage-engine (Team:StorageEngine) |
jordan-powers
left a comment
There was a problem hiding this comment.
LGTM, but @martijnvg should probably have a look too
|
|
||
| private static void saveToBinaryDocValues(DocumentParserContext context, String fieldPath, BytesRef encoded) { | ||
| final String fieldName = name(fieldPath); | ||
| MultiValuedBinaryDocValuesField.SeparateCount.addToSeparateCountMultiBinaryFieldInDoc(context.doc(), fieldName, encoded, true); |
There was a problem hiding this comment.
Should we use IntegratedCount here? I think this is what we use for ignore above?
I think integrated count is a better match here, since we don't optimize for fast reading. We're ok with reading value count as part of the actual values. And SeperateCount adds an additional doc value field for all fields that will have malformed values.
There was a problem hiding this comment.
hmm ok, I initially chose SeperateCount due to optimizations it makes for single-valued fields.
There was a problem hiding this comment.
I think it makes separate count fields make sense for mapped fields, but this like a side field that will not get queried (except when source gets generated) and so optimizing for less doc values fields and storage is better.
| task.skipTest("search.vectors/42_knn_search_bbq_flat/Vector rescoring has same scoring as exact search for kNN section", "scores have changed slightly with native implementations") | ||
| task.skipTest("search.vectors/41_knn_search_bbq_hnsw/Vector rescoring has same scoring as exact search for kNN section", "scores have changed slightly with native implementations") | ||
| task.skipTest( | ||
| "get/100_synthetic_source/fields with ignore_malformed", |
There was a problem hiding this comment.
I wonder whether instead of skipping the test entirely can we maybe use removeMatch?
| task.skipTest("aggregate-metrics/30_sum_agg/Test sum agg with query", "Default metric cannot be defined, so the query is empty") | ||
| task.skipTest("aggregate-metrics/50_value_count_agg/Test value_count agg with query", "Default metric cannot be defined, so the query is empty") | ||
| // Malformed values are now stored in binary doc values (sorted) instead of stored fields (insertion order). | ||
| task.skipTest( |
There was a problem hiding this comment.
same reason as in the other file?
7eb8e23 to
8ce768b
Compare
This reverts commit 26ca5c3.
This reverts commit 26ca5c3.
Addresses https://github.com/elastic/logs-program/issues/41