Enhances exists queries to reduce need for _field_names#26930
Enhances exists queries to reduce need for _field_names#26930colings86 merged 8 commits intoelastic:masterfrom colings86:enhance/existsQuery
_field_names#26930Conversation
jpountz
left a comment
There was a problem hiding this comment.
Thanks for working on it.
One concern I have is that the impl for existsQuery is shared in a super class while the logic for adding mappers is in concrete impls. I think it makes it very easy to inherit the default impl for the exists query and forget to add a _field_names field at index time if necessary?
There was a problem hiding this comment.
haha good spot, this was so that the rest client didn't time out when I was debugging rest tests, shouldn't have been committed.
There was a problem hiding this comment.
I expected binary fields to require changes?
There was a problem hiding this comment.
Then shouldn't it create field names fields when it is stored or has doc values?
There was a problem hiding this comment.
You are right, I've made changes to support this
There was a problem hiding this comment.
why do we need this if statement?
There was a problem hiding this comment.
@jpountz and I discussed this and we agreed that for this PR we will keep the behaviour that exists queries will only work for fields that are indexed, stored, or have doc values (or a combination of those) so this PR can be merged into 7.0 and 6.x. I'll make a followup PR for 7.0 only that changes things to allow exists to work for any field that is mapped.
There was a problem hiding this comment.
Hmm but then that if statement is still buggy since it should also check for hasDocValues? Actually I'd be in favour of removing it entirely since you seem to have that logic on the call sites already anyway (only adding this field if there are no doc values)?
There was a problem hiding this comment.
I've moved this logic outside of this method, the callers now check to ensure if field names field should be created
There was a problem hiding this comment.
can you leave a comment that the logic sits in other impls?
There was a problem hiding this comment.
you could return a match_all if the field is enabled and match_none otherwise?
There was a problem hiding this comment.
I threw an exception here to mirror the behaviour of termQuery(). I assumed we did this in term query to discourage people from performing queries on _source and assumed we would want to do this here too for the same reason?
There was a problem hiding this comment.
would it make sense to create a helper method that creates a query for a given field?
There was a problem hiding this comment.
Now that I think more about it, I think this impl has potential problems with multiple types, since getObjectMapper will return the mapper of one of the types, meaning it might miss sub fields if they only exist in another type (those mapper classes are per-type on the contrary to the field type classes). So maybe we should resolve fields that match field + ".*" and build a bool query out of it (and no need for recursion since I would expect the glob to match fields of sub objects too)
There was a problem hiding this comment.
maybe use NormFieldExistsQuery since we are on a 7.1 snapshot now?
There was a problem hiding this comment.
I'll do this as a followup PR as it will require the Lucene 7.1 snapshot to be updated too
|
@jpountz I pushed a commit that addresses your comments. Could you take another look? |
jpountz
left a comment
There was a problem hiding this comment.
I left some more comments. I think we also need to fix the _field_names docs to remove the example that directly queries _field_names?
There was a problem hiding this comment.
Then shouldn't it create field names fields when it is stored or has doc values?
There was a problem hiding this comment.
Hmm but then that if statement is still buggy since it should also check for hasDocValues? Actually I'd be in favour of removing it entirely since you seem to have that logic on the call sites already anyway (only adding this field if there are no doc values)?
|
@jpountz I pushed another commit. This addresses your remaining comments including removing the implementation of |
jpountz
left a comment
There was a problem hiding this comment.
It looks good to me overall! I have two questions left:
- should we document the breaking change about term queries on
_field_namesor maybe even try to maintain backward compatibility? In any cases we should at least remove the term query example in the docs? - Doc value iterators are only good for sure if they were written with Lucene 7. So I'm wondering that we might still want to write field names entries if the index was created before 6.0 as well as only use DocValuesFieldExistsQuery if the index was created on or after 6.0?
There was a problem hiding this comment.
this will always be false?
Apart from going back to always writing all field names to the
Ok so this would mean checking the indexCreatedVersion at both index time and query time to determine how to index/query the data? It would also mean the index-time performance improvements would only be visible for indexes created after 6.0? |
I think it is fine. It would also be symmetric since we add StringField instances manually at index time rather than calling FieldNamesFieldMapper.parse?
Correct. |
There was a problem hiding this comment.
Probably worth a short comment about how we only need it in this case.
There was a problem hiding this comment.
Good point, will add
There was a problem hiding this comment.
Should you override parseCreateField to throw an IllegalStateException or something? I don't think we use it any more.
There was a problem hiding this comment.
We didn't use it until I pushed the latest commits which put the BWC code in to make sure _field_names is still used for indexes created before 6.1. I will make sure I change parseCreateField to throw an exception when I remove the BWC code though. Thanks for pointing this out.
Before this change we wrote the name all the fields in a document to a `_field_names` field and then implemented exists queries as a term query on this field. The problem with this approach is that it bloats the index and also affects indexing performance. This change adds a new method `existsQuery()` to `MappedFieldType` which is implemented by each sub-class. For most field types if doc values are available a `DocValuesFieldExistsQuery` is used, falling back to using `_field_names` if doc values are disabled. Note that only fields where no doc values are available are written to `_field_names`. Closes #26770
These values will need to be changed after backporting this PR to 6.x
* Enhances exists queries to reduce need for `_field_names` Before this change we wrote the name all the fields in a document to a `_field_names` field and then implemented exists queries as a term query on this field. The problem with this approach is that it bloats the index and also affects indexing performance. This change adds a new method `existsQuery()` to `MappedFieldType` which is implemented by each sub-class. For most field types if doc values are available a `DocValuesFieldExistsQuery` is used, falling back to using `_field_names` if doc values are disabled. Note that only fields where no doc values are available are written to `_field_names`. Closes #26770 * Addresses review comments * Addresses more review comments * implements existsQuery explicitly on every mapper * Reinstates ability to perform term query on `_field_names` * Added bwc depending on index created version * Review Comments * Skips tests that are not supported in 6.1.0 These values will need to be changed after backporting this PR to 6.x
* master: Enhances exists queries to reduce need for `_field_names` (elastic#26930) Added new terms_set query Set request body to required to reflect the code base (elastic#27188) Update Docker docs for 6.0.0-rc2 (elastic#27166) Add version 6.0.0 Docs: restore now fails if it encounters incompatible settings (elastic#26933) Convert index blocks to cluster block exceptions (elastic#27050) [DOCS] Link remote info API in Cross Cluster Search docs page Fix Laplace scorer to multiply by alpha (and not add) (elastic#27125) [DOCS] Clarify migrate guide and search request validation Raise IllegalArgumentException if query validation failed (elastic#26811) prevent duplicate fields when mixing parent and root nested includes (elastic#27072) TopHitsAggregator must propagate calls to `setScorer`. (elastic#27138)
* master: Remove checkpoint tracker bit sets setting Fix stable BWC branch detection logic Fix logic detecting unreleased versions Enhances exists queries to reduce need for `_field_names` (elastic#26930) Added new terms_set query Set request body to required to reflect the code base (elastic#27188) Update Docker docs for 6.0.0-rc2 (elastic#27166) Add version 6.0.0 Docs: restore now fails if it encounters incompatible settings (elastic#26933) Convert index blocks to cluster block exceptions (elastic#27050) [DOCS] Link remote info API in Cross Cluster Search docs page Fix Laplace scorer to multiply by alpha (and not add) (elastic#27125) [DOCS] Clarify migrate guide and search request validation Raise IllegalArgumentException if query validation failed (elastic#26811) prevent duplicate fields when mixing parent and root nested includes (elastic#27072) TopHitsAggregator must propagate calls to `setScorer`. (elastic#27138)
* master: Lazy initialize checkpoint tracker bit sets Remove checkpoint tracker bit sets setting Fix stable BWC branch detection logic Fix logic detecting unreleased versions Enhances exists queries to reduce need for `_field_names` (#26930) Added new terms_set query Set request body to required to reflect the code base (#27188) Update Docker docs for 6.0.0-rc2 (#27166) Add version 6.0.0 Docs: restore now fails if it encounters incompatible settings (#26933) Convert index blocks to cluster block exceptions (#27050) [DOCS] Link remote info API in Cross Cluster Search docs page Fix Laplace scorer to multiply by alpha (and not add) (#27125) [DOCS] Clarify migrate guide and search request validation Raise IllegalArgumentException if query validation failed (#26811) prevent duplicate fields when mixing parent and root nested includes (#27072) TopHitsAggregator must propagate calls to `setScorer`. (#27138)
* 6.x: Lazy initialize checkpoint tracker bit sets Remove checkpoint tracker bit sets setting Tests: Fix FullClusterRestartIT.testSnapshotRestore test failing in 6.x (#27218) Fix FullClusterRestartIT using lenient booleans with 6.0 Fix stable BWC branch detection logic Add version 6.0.0 Fix logic detecting unreleased versions Skips exists query tests on unsupported versions Enhances exists queries to reduce need for `_field_names` (#26930) Added new terms_set query Set request body to required to reflect the code base (#27188) Update Docker docs for 6.0.0-rc2 (#27166) Docs: restore now fails if it encounters incompatible settings (#26933) Convert index blocks to cluster block exceptions (#27050) [DOCS] Link remote info API in Cross Cluster Search docs page prevent duplicate fields when mixing parent and root nested includes (#27072) TopHitsAggregator must propagate calls to `setScorer`. (#27138)
|
This is triggering a few thousand |
Before this change we wrote the name all the fields in a document to a
_field_namesfield and then implemented exists queries as a term query on this field. The problem with this approach is that it bloats the index and also affects indexing performance.This change adds a new method
existsQuery()toMappedFieldTypewhich is implemented by each sub-class. For most field types if doc values are available aDocValuesFieldExistsQueryis used, falling back to using_field_namesif doc values are disabled. Note that only fields where no doc values are available are written to_field_names.Closes #26770