Synthetic source: load text from stored fields #87480
Conversation
Adds support for loading `text` fields that have `store: true`. We could likely load *any* stored fields, but I wanted to blaze the trail using something fairly useful.
Force push incoming to resolve merge conflicts.

cloud deploy robot, please build me an image

run elasticsearch-ci/part-2
```diff
  * Write values for this document.
  */
-void write(XContentBuilder b) throws IOException;
+void write(FieldsVisitor fieldsVisitor, XContentBuilder b) throws IOException;
```
I don't think I actually need fieldsVisitor here - I think advanceToDoc can grab it.
Yep. I wonder if we can avoid having it as a parameter in any of these methods and instead pass it to StoredFieldSourceLoader implementations directly? Having a method param that is only used by a specific subset of implementations feels off to me.
I could move it to the leaf method pretty easily. But it's kind of tricky because you have to advance the state in a specific way. And holding on to a reference to the thing for a while feels like it is more "at a distance". Like, we take a docId as a parameter, but we only use it if we're using doc values.
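The alternative being weighed here can be sketched roughly: instead of threading the `FieldsVisitor` through every `write` call, a per-leaf loader captures it once and `advanceToDoc` pulls the stored values from it. This is a hypothetical, simplified sketch — plain Java collections stand in for Lucene's stored-field machinery, and the class and method names are illustrative, not Elasticsearch's actual API:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the "visitor" (here a simple map of docId -> stored
// values) is captured when the per-leaf loader is built, so write() needs
// no visitor parameter. Names are illustrative, not Elasticsearch's.
interface SyntheticFieldLoader {
    void advanceToDoc(int docId);   // position on a document
    void write(StringBuilder b);    // emit values for the current document
}

class StoredFieldSourceLoader implements SyntheticFieldLoader {
    private final Map<Integer, List<String>> storedFields; // stand-in for FieldsVisitor
    private List<String> current = List.of();

    StoredFieldSourceLoader(Map<Integer, List<String>> storedFields) {
        this.storedFields = storedFields;
    }

    @Override
    public void advanceToDoc(int docId) {
        // The loader, not the caller, knows how to pull from the visitor.
        current = storedFields.getOrDefault(docId, List.of());
    }

    @Override
    public void write(StringBuilder b) {
        for (String value : current) {
            b.append(value).append(';');
        }
    }
}
```

The trade-off the comment describes is visible here: the visitor reference lives "at a distance" inside the loader, while `docId` is a parameter even though only some implementations would use it.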
@elasticmachine retest this please

@romseygeek I think this is ready for another round when you are ready for it!
romseygeek left a comment
I like the API! I left a few questions.
```java
        "field [" + name() + "] of type [" + typeName() + "] doesn't support synthetic source because it doesn't have doc values"
    );
}
if (fieldType().ignoreAbove() != Defaults.IGNORE_ABOVE) {
```
Does ignore_above not work if stored=true?
It doesn't store the field if it is above ignore_above.
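That behavior — values over `ignore_above` are neither indexed nor stored, so there is nothing left to rebuild synthetic `_source` from — can be illustrated with a toy filter. This is an assumption-laden sketch with a hypothetical helper name, not Elasticsearch's implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the ignore_above interaction described above: values
// longer than the limit are simply dropped at index time, so a later
// synthetic-source rebuild from stored fields cannot recover them.
// Hypothetical helper, not Elasticsearch code.
final class IgnoreAboveDemo {
    static List<String> storedValues(List<String> values, int ignoreAbove) {
        List<String> stored = new ArrayList<>();
        for (String v : values) {
            if (v.length() <= ignoreAbove) {
                stored.add(v); // longer values are silently ignored
            }
        }
        return stored;
    }
}
```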
```java
public abstract class SortedNumericDocValuesSyntheticFieldLoader implements SourceLoader.SyntheticFieldLoader {
    private final String name;
    private final String simpleName;
    private CheckedConsumer<XContentBuilder, IOException> writer = b -> {};
```
This reads a bit weirdly to me; does it make more sense to leave `write` as abstract and just overload it in the two implementations?
Those are per-segment writers. I'll see if I can make it less janky.
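The pattern in the quoted snippet — a mutable `writer` that starts as a no-op and is swapped in per segment — can be sketched like this. Names are hypothetical, and `Consumer<StringBuilder>` stands in for the real `CheckedConsumer<XContentBuilder, IOException>`:

```java
import java.util.List;
import java.util.function.Consumer;

// Sketch of the per-segment writer pattern: write() always delegates to
// whatever writer the most recently loaded segment installed, and is a
// harmless no-op before any segment has been seen. Illustrative only.
class PerSegmentWriterDemo {
    private Consumer<StringBuilder> writer = b -> {}; // no-op until a leaf is loaded

    void loadLeaf(List<String> segmentValues) {
        // Each segment installs its own writer over its own values.
        writer = b -> segmentValues.forEach(v -> b.append(v).append(','));
    }

    void write(StringBuilder b) {
        writer.accept(b);
    }
}
```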
server/src/main/java/org/elasticsearch/index/mapper/SourceLoader.java (outdated, resolved)
```diff
-    && kwd.hasNormalizer() == false
-    && kwd.fieldType().ignoreAbove() == KeywordFieldMapper.Defaults.IGNORE_ABOVE) {
+    if (kwd.hasNormalizer() == false
+        && kwd.fieldType().ignoreAbove() == KeywordFieldMapper.Defaults.IGNORE_ABOVE
```
Should this work with ignore_above=true and stored=true on the keyword subfield?
Same deal. We don't store the field if it is above ignore_above.
@romseygeek, I pushed a patch to remove the weird thing. I think it's more like what we want when we want to support

run elasticsearch-ci/bwc
romseygeek left a comment
I think there are more cleanups to do around stored field loading, but this is a great start. Thanks for all the iterations!
Woooh! Thanks for all the iterations too. I think we got something much nicer through them.

I'll work on adding some docs for this after I cover
When I added support for stored fields to synthetic _source (#87480) I accidentally caused a performance regression. Our friends working on building the nightly charts for tsdb caught it. It looked like:

```
| 50th percentile latency   | default_1k | 20.1228 | 41.289  | 21.1662  | ms | +105.18% |
| 90th percentile latency   | default_1k | 26.7402 | 42.5878 | 15.8476  | ms | +59.27%  |
| 99th percentile latency   | default_1k | 37.0881 | 45.586  | 8.49786  | ms | +22.91%  |
| 99.9th percentile latency | default_1k | 43.7346 | 48.222  | 4.48742  | ms | +10.26%  |
| 100th percentile latency  | default_1k | 46.057  | 56.8676 | 10.8106  | ms | +23.47%  |
```

This fixes the regression and puts us in line with how we were:

```
| 50th percentile latency   | default_1k | 20.1228 | 24.023  | 3.90022  | ms | +19.38%  |
| 90th percentile latency   | default_1k | 26.7402 | 29.7841 | 3.04392  | ms | +11.38%  |
| 99th percentile latency   | default_1k | 37.0881 | 36.8038 | -0.28428 | ms | -0.77%   |
| 99.9th percentile latency | default_1k | 43.7346 | 39.0192 | -4.71531 | ms | -10.78%  |
| 100th percentile latency  | default_1k | 46.057  | 42.9181 | -3.13889 | ms | -6.82%   |
```

A 20% bump in the 50th percentile latency isn't great, but it's about four microseconds per document, which is acceptable.
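The closing arithmetic can be sanity-checked: the 50th percentile moved from roughly 20.12 ms to 24.02 ms, and assuming the `default_1k` operation indexes 1,000 documents per request (an assumption inferred from the track name, not confirmed by the source), that works out to about 3.9 µs of extra latency per document:

```java
// Sanity check of the "about four microseconds per document" claim.
// Assumes default_1k means 1,000 documents per bulk request (inferred
// from the track name, not confirmed by the source).
class LatencyMath {
    static double extraMicrosPerDoc(double beforeMs, double afterMs, int docsPerRequest) {
        double extraMs = afterMs - beforeMs;      // extra ms per request
        return extraMs / docsPerRequest * 1000.0; // convert to microseconds per doc
    }
}
```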
Adds support for loading `text` and `keyword` fields that have `store: true`. We could likely load any stored fields, but I wanted to blaze the trail using something fairly useful.