Enhances exists queries to reduce need for `_field_names` by colings86 · Pull Request #26930 · elastic/elasticsearch

colings86 · 2017-10-09T14:49:04Z

Before this change we wrote the name all the fields in a document to a _field_names field and then implemented exists queries as a term query on this field. The problem with this approach is that it bloats the index and also affects indexing performance.

This change adds a new method existsQuery() to MappedFieldType which is implemented by each sub-class. For most field types if doc values are available a DocValuesFieldExistsQuery is used, falling back to using _field_names if doc values are disabled. Note that only fields where no doc values are available are written to _field_names.

Closes #26770

jpountz

Thanks for working on it.

One concern I have is that the impl for existsQuery is shared in a super class while the logic for adding mappers is in concrete impls. I think it makes it very easy to inherit the default impl for the exists query and forget to add a _field_names field at index time if necessary?

jpountz · 2017-10-09T21:49:09Z

client/rest/src/main/java/org/elasticsearch/client/RestClientBuilder.java

haha good spot, this was so that the rest client didn't time out when I was debugging rest tests, shouldn't have been committed.

jpountz · 2017-10-09T21:49:32Z

core/src/main/java/org/elasticsearch/index/mapper/BinaryFieldMapper.java

I expected binary fields to require changes?

see #26930 (comment)

Then shouldn't it create field names fields when it is stored or has doc values?

You are right, I've made changes to support this

jpountz · 2017-10-09T21:51:24Z

core/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

why do we need this if statement?

@jpountz and I discussed this and we agreed that for this PR we will keep the behaviour that exists queries will only work for fields that are indexed, stored, or have doc values (or a combination of those) so this PR can be merged into 7.0 and 6.x. I'll make a followup PR for 7.0 only that changes things to allow exists to work for any field that is mapped.

Hmm but then that if statement is still buggy since it should also check for hasDocValues? Actually I'd be in favour of removing it entirely since you seem to have that logic on the call sites already anyway (only adding this field if there are no doc values)?

I've moved this logic outside of this method, the callers now check to ensure if field names field should be created

jpountz · 2017-10-09T21:51:56Z

core/src/main/java/org/elasticsearch/index/mapper/FieldNamesFieldMapper.java

can you leave a comment that the logic sits in other impls?

jpountz · 2017-10-09T21:56:55Z

core/src/main/java/org/elasticsearch/index/mapper/SourceFieldMapper.java

you could return a match_all if the field is enabled and match_none otherwise?

I threw an exception here to mirror the behaviour of termQuery(). I assumed we did this in term query to discourage people from performing queries on _source and assumed we would want to do this here too for the same reason?

jpountz · 2017-10-09T21:59:13Z

core/src/main/java/org/elasticsearch/index/query/ExistsQueryBuilder.java

would it make sense to create a helper method that creates a query for a given field?

jpountz · 2017-10-09T22:05:26Z

core/src/main/java/org/elasticsearch/index/query/ExistsQueryBuilder.java

Now that I think more about it, I think this impl has potential problems with multiple types, since getObjectMapper will return the mapper of one of the types, meaning it might miss sub fields if they only exist in another type (those mapper classes are per-type on the contrary to the field type classes). So maybe we should resolve fields that match field + ".*" and build a bool query out of it (and no need for recursion since I would expect the glob to match fields of sub objects too)

jpountz · 2017-10-09T22:07:15Z

core/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

maybe use NormFieldExistsQuery since we are on a 7.1 snapshot now?

I'll do this as a followup PR as it will require the Lucene 7.1 snapshot to be updated too

colings86 · 2017-10-10T11:55:44Z

@jpountz I pushed a commit that addresses your comments. Could you take another look?

jpountz

I left some more comments. I think we also need to fix the _field_names docs to remove the example that directly queries _field_names?

jpountz · 2017-10-10T14:33:01Z

core/src/main/java/org/elasticsearch/index/mapper/BinaryFieldMapper.java

Then shouldn't it create field names fields when it is stored or has doc values?

jpountz · 2017-10-10T14:44:12Z

core/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

Hmm but then that if statement is still buggy since it should also check for hasDocValues? Actually I'd be in favour of removing it entirely since you seem to have that logic on the call sites already anyway (only adding this field if there are no doc values)?

jpountz · 2017-10-10T14:45:49Z

core/src/main/java/org/elasticsearch/index/mapper/SourceFieldMapper.java

jpountz · 2017-10-10T14:45:54Z

core/src/main/java/org/elasticsearch/index/mapper/TextFieldMapper.java

colings86 · 2017-10-11T11:28:51Z

@jpountz I pushed another commit. This addresses your remaining comments including removing the implementation of existsQuery from MappedFieldType and instead requires (and implements) the method on the subclasses

jpountz

It looks good to me overall! I have two questions left:

should we document the breaking change about term queries on _field_names or maybe even try to maintain backward compatibility? In any cases we should at least remove the term query example in the docs?
Doc value iterators are only good for sure if they were written with Lucene 7. So I'm wondering that we might still want to write field names entries if the index was created before 6.0 as well as only use DocValuesFieldExistsQuery if the index was created on or after 6.0?

jpountz · 2017-10-11T13:22:07Z

core/src/main/java/org/elasticsearch/index/mapper/CompletionFieldMapper.java

this will always be false?

jpountz · 2017-10-11T13:23:28Z

core/src/main/java/org/elasticsearch/index/mapper/GeoShapeFieldMapper.java

always false?

jpountz · 2017-10-11T13:24:20Z

core/src/main/java/org/elasticsearch/index/mapper/RoutingFieldMapper.java

always false?

colings86 · 2017-10-11T14:36:12Z

should we document the breaking change about term queries on _field_names or maybe even try to maintain backward compatibility? In any cases we should at least remove the term query example in the docs?

Apart from going back to always writing all field names to the _field_names field the only way I see to maintain backward compatibility for term queries on _field_names is to have FieldNamesFieldMapper.termQuery() retrieve the MappedFieldType for the field being queried and call existsQuery() on it. At the moment the implementations of existsQuerycall FieldNamesFieldMapper.termQuery() so if we want to do this we would need to have the existsQuery() implementations instead create a TermQuery directly otherwise we would end up with a circular call chain. Does this sound ok to you?

Doc value iterators are only good for sure if they were written with Lucene 7. So I'm wondering that we might still want to write field names entries if the index was created before 6.0 as well as only use DocValuesFieldExistsQuery if the index was created on or after 6.0?

Ok so this would mean checking the indexCreatedVersion at both index time and query time to determine how to index/query the data? It would also mean the index-time performance improvements would only be visible for indexes created after 6.0?

jpountz · 2017-10-13T10:41:30Z

Does this sound ok to you?

I think it is fine. It would also be symmetric since we add StringField instances manually at index time rather than calling FieldNamesFieldMapper.parse?

Ok so this would mean checking the indexCreatedVersion at both index time and query time to determine how to index/query the data? It would also mean the index-time performance improvements would only be visible for indexes created after 6.0?

Correct.

nik9000 · 2017-10-16T14:52:15Z

core/src/main/java/org/elasticsearch/index/mapper/BinaryFieldMapper.java

Probably worth a short comment about how we only need it in this case.

Good point, will add

nik9000 · 2017-10-16T15:09:54Z

core/src/main/java/org/elasticsearch/index/mapper/FieldNamesFieldMapper.java

Should you override parseCreateField to throw an IllegalStateException or something? I don't think we use it any more.

We didn't use it until I pushed the latest commits which put the BWC code in to make sure _field_names is still used for indexes created before 6.1. I will make sure I change parseCreateField to throw an exception when I remove the BWC code though. Thanks for pointing this out.

colings86 · 2017-10-24T09:20:06Z

@nik9000 @jpountz thanks for the reviews so far. I've pushed some more commits, would you be able to take another look?

jpountz

LGTM

jpountz · 2017-10-31T16:37:18Z

core/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java

indentation

Before this change we wrote the name all the fields in a document to a `_field_names` field and then implemented exists queries as a term query on this field. The problem with this approach is that it bloats the index and also affects indexing performance. This change adds a new method `existsQuery()` to `MappedFieldType` which is implemented by each sub-class. For most field types if doc values are available a `DocValuesFieldExistsQuery` is used, falling back to using `_field_names` if doc values are disabled. Note that only fields where no doc values are available are written to `_field_names`. Closes #26770

These values will need to be changed after backporting this PR to 6.x

* Enhances exists queries to reduce need for `_field_names` Before this change we wrote the name all the fields in a document to a `_field_names` field and then implemented exists queries as a term query on this field. The problem with this approach is that it bloats the index and also affects indexing performance. This change adds a new method `existsQuery()` to `MappedFieldType` which is implemented by each sub-class. For most field types if doc values are available a `DocValuesFieldExistsQuery` is used, falling back to using `_field_names` if doc values are disabled. Note that only fields where no doc values are available are written to `_field_names`. Closes #26770 * Addresses review comments * Addresses more review comments * implements existsQuery explicitly on every mapper * Reinstates ability to perform term query on `_field_names` * Added bwc depending on index created version * Review Comments * Skips tests that are not supported in 6.1.0 These values will need to be changed after backporting this PR to 6.x

* master: Enhances exists queries to reduce need for `_field_names` (elastic#26930) Added new terms_set query Set request body to required to reflect the code base (elastic#27188) Update Docker docs for 6.0.0-rc2 (elastic#27166) Add version 6.0.0 Docs: restore now fails if it encounters incompatible settings (elastic#26933) Convert index blocks to cluster block exceptions (elastic#27050) [DOCS] Link remote info API in Cross Cluster Search docs page Fix Laplace scorer to multiply by alpha (and not add) (elastic#27125) [DOCS] Clarify migrate guide and search request validation Raise IllegalArgumentException if query validation failed (elastic#26811) prevent duplicate fields when mixing parent and root nested includes (elastic#27072) TopHitsAggregator must propagate calls to `setScorer`. (elastic#27138)

* master: Remove checkpoint tracker bit sets setting Fix stable BWC branch detection logic Fix logic detecting unreleased versions Enhances exists queries to reduce need for `_field_names` (elastic#26930) Added new terms_set query Set request body to required to reflect the code base (elastic#27188) Update Docker docs for 6.0.0-rc2 (elastic#27166) Add version 6.0.0 Docs: restore now fails if it encounters incompatible settings (elastic#26933) Convert index blocks to cluster block exceptions (elastic#27050) [DOCS] Link remote info API in Cross Cluster Search docs page Fix Laplace scorer to multiply by alpha (and not add) (elastic#27125) [DOCS] Clarify migrate guide and search request validation Raise IllegalArgumentException if query validation failed (elastic#26811) prevent duplicate fields when mixing parent and root nested includes (elastic#27072) TopHitsAggregator must propagate calls to `setScorer`. (elastic#27138)

* master: Lazy initialize checkpoint tracker bit sets Remove checkpoint tracker bit sets setting Fix stable BWC branch detection logic Fix logic detecting unreleased versions Enhances exists queries to reduce need for `_field_names` (#26930) Added new terms_set query Set request body to required to reflect the code base (#27188) Update Docker docs for 6.0.0-rc2 (#27166) Add version 6.0.0 Docs: restore now fails if it encounters incompatible settings (#26933) Convert index blocks to cluster block exceptions (#27050) [DOCS] Link remote info API in Cross Cluster Search docs page Fix Laplace scorer to multiply by alpha (and not add) (#27125) [DOCS] Clarify migrate guide and search request validation Raise IllegalArgumentException if query validation failed (#26811) prevent duplicate fields when mixing parent and root nested includes (#27072) TopHitsAggregator must propagate calls to `setScorer`. (#27138)

* 6.x: Lazy initialize checkpoint tracker bit sets Remove checkpoint tracker bit sets setting Tests: Fix FullClusterRestartIT.testSnapshotRestore test failing in 6.x (#27218) Fix FullClusterRestartIT using lenient booleans with 6.0 Fix stable BWC branch detection logic Add version 6.0.0 Fix logic detecting unreleased versions Skips exists query tests on unsupported versions Enhances exists queries to reduce need for `_field_names` (#26930) Added new terms_set query Set request body to required to reflect the code base (#27188) Update Docker docs for 6.0.0-rc2 (#27166) Docs: restore now fails if it encounters incompatible settings (#26933) Convert index blocks to cluster block exceptions (#27050) [DOCS] Link remote info API in Cross Cluster Search docs page prevent duplicate fields when mixing parent and root nested includes (#27072) TopHitsAggregator must propagate calls to `setScorer`. (#27138)

Bek · 2020-11-07T17:29:29Z

This is triggering a few thousand DocValuesFieldExistsQuery for us when using exists for a field type object because objects under the field have even more deeply nested documents. Queries taking a second before are taking around 30-40 seconds. Profiling is indicating that "exists": "user" causes the DocValuesFieldExistsQuery to be called on each and all subfields as if "exists": "user.*" is intended. "exists": "user.id" returns without performance degradation. Is this the expected behavior?

colings86 added :Query DSL >enhancement v7.0.0 WIP labels Oct 9, 2017

colings86 self-assigned this Oct 9, 2017

colings86 requested a review from jpountz October 9, 2017 14:49

jpountz reviewed Oct 9, 2017

View reviewed changes

jpountz reviewed Oct 10, 2017

View reviewed changes

jpountz approved these changes Oct 11, 2017

View reviewed changes

nik9000 reviewed Oct 16, 2017

View reviewed changes

jpountz approved these changes Oct 31, 2017

View reviewed changes

core/src/main/java/org/elasticsearch/index/mapper/FieldMapper.java Outdated

Copy link
Copy Markdown

Contributor

jpountz Oct 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation

colings86 added 8 commits November 1, 2017 09:04

Addresses review comments

d98886e

Addresses more review comments

6d254cb

implements existsQuery explicitly on every mapper

ad2ed1d

Reinstates ability to perform term query on _field_names

40b7878

Added bwc depending on index created version

9778fcb

Review Comments

a49bf89

Skips tests that are not supported in 6.1.0

b4bfc79

These values will need to be changed after backporting this PR to 6.x

colings86 merged commit 99aca9c into elastic:master Nov 1, 2017

colings86 deleted the enhance/existsQuery branch November 1, 2017 15:30

colings86 added :Search/Search Search-related issues that do not fall into other categories and removed WIP labels Nov 2, 2017

jpountz mentioned this pull request Nov 2, 2017

Disallow disabling _field_names #27239

Closed

clintongormley added v6.1.0 and removed :Query DSL labels Nov 6, 2017

jimczi mentioned this pull request Mar 21, 2018

Allow enabling doc_values on meta fields (specifically _field_names) #29191

Closed

colings86 mentioned this pull request May 25, 2018

Disabling _field_names don't work? #30862

Closed

bczifra mentioned this pull request Jun 13, 2018

{"query_string":{"query":"_exists_:(foo bar)"}} default operator behavior is different between 2.4.4 and 6.2.3 #31297

Closed

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

jpountz mentioned this pull request Mar 11, 2019

Update the how-to section of the docs for 7.0: #37717

Merged

jimczi mentioned this pull request Jun 5, 2019

Deprecate _field_names disabling #42854

Merged

keenangraham mentioned this pull request Nov 9, 2020

Optimize exists queries on object fields #64837

Closed

Conversation

colings86 commented Oct 9, 2017

Uh oh!

jpountz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

colings86 commented Oct 10, 2017

Uh oh!

jpountz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

colings86 commented Oct 11, 2017

Uh oh!

jpountz left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

colings86 commented Oct 11, 2017

Uh oh!

jpountz commented Oct 13, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!