Use the primary_term field to identify parent documents by s1monw · Pull Request #27469 · elastic/elasticsearch

s1monw · 2017-11-21T09:03:32Z

This change stops indexing the _primary_term field for nested documents
to allow fast retrieval of parent documents. Today we create a docvalues
field for children to ensure we have a dense data structure on disk. Yet,
since we only use the primary term to tie-break when we see the same
seqID on indexing, having a dense data structure is less important. We can
use this now to improve nested docs performance and its memory footprint.

Relates to #24362

bleskes

This looks good to me. I can't speak to the implications on the Lucene side, such as whether we want to index an illegal value (0) for the primary term versus using the mentioned bitset and the exists query. I presume the bitset will be faster for reads and probably smaller index-wise (as the doc values will compress better).

bleskes · 2017-11-21T09:17:47Z

core/src/main/java/org/elasticsearch/index/mapper/SeqNoFieldMapper.java

            doc.add(seqID.seqNo);
            doc.add(seqID.seqNoDocValue);
-            doc.add(seqID.primaryTerm);
+            if (includePrimaryTerm) {


Can we add a comment saying that primary terms are used to distinguish between top-level (parent) docs and nested ones?

+1 will do that
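To make the idea in this thread concrete, here is a minimal sketch in plain Java. It is not Elasticsearch or Lucene code: the class `PrimaryTermParentSketch`, the map-based "document", and the methods `buildBlock` and `docsWithPrimaryTerm` are all hypothetical stand-ins. It assumes the usual block layout (nested children first, parent last) and models the `includePrimaryTerm` flag from the diff: when the flag is false, only the parent carries a `_primary_term` value, so a simple "field exists" check selects exactly the parent documents.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model (not Elasticsearch or Lucene code):
// a "document" is just a map from field name to a numeric doc value.
final class PrimaryTermParentSketch {
    static final String SEQ_NO = "_seq_no";
    static final String PRIMARY_TERM = "_primary_term";

    /**
     * Builds one block of documents: nested children first, the parent last.
     * includePrimaryTerm mirrors the flag in the diff; when false, children
     * get no _primary_term value at all.
     */
    static List<Map<String, Long>> buildBlock(int numChildren, long seqNo, long primaryTerm,
                                              boolean includePrimaryTerm) {
        List<Map<String, Long>> block = new ArrayList<>();
        for (int i = 0; i < numChildren; i++) {
            Map<String, Long> child = new HashMap<>();
            child.put(SEQ_NO, seqNo);
            if (includePrimaryTerm) {
                // Old behaviour in this model: children carry the field too.
                child.put(PRIMARY_TERM, primaryTerm);
            }
            block.add(child);
        }
        Map<String, Long> parent = new HashMap<>();
        parent.put(SEQ_NO, seqNo);
        parent.put(PRIMARY_TERM, primaryTerm); // the parent always has a primary term
        block.add(parent);
        return block;
    }

    /** An "exists" filter: positions of documents that have a _primary_term value. */
    static List<Integer> docsWithPrimaryTerm(List<Map<String, Long>> segment) {
        List<Integer> hits = new ArrayList<>();
        for (int docId = 0; docId < segment.size(); docId++) {
            if (segment.get(docId).containsKey(PRIMARY_TERM)) {
                hits.add(docId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Map<String, Long>> segment = new ArrayList<>();
        segment.addAll(buildBlock(2, 10L, 1L, false)); // docs 0-1 children, doc 2 parent
        segment.addAll(buildBlock(3, 11L, 1L, false)); // docs 3-5 children, doc 6 parent
        System.out.println(docsWithPrimaryTerm(segment)); // prints [2, 6]
    }
}
```

With `includePrimaryTerm` set to true, the same check matches every document in the block, which is why indices written the old way cannot use the field to tell parents from children.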

martijnvg

LGTM! I left a few small comments.

martijnvg · 2017-11-21T09:52:24Z

core/src/test/java/org/apache/lucene/search/QueriesTests.java

        // This is a custom query that extends AutomatonQuery and want to make sure the equals method works
-        assertEquals(Queries.newNonNestedFilter(), Queries.newNonNestedFilter());
-        assertEquals(Queries.newNonNestedFilter().hashCode(), Queries.newNonNestedFilter().hashCode());
+        Version version = VersionUtils.randomVersion(random());


instead of a random version maybe also test both pre 7.0.0 and post 7.0.0 specifically?

I added a test that checks all versions

martijnvg · 2017-11-21T09:53:15Z

...src/test/java/org/elasticsearch/search/aggregations/bucket/nested/NestedAggregatorTests.java


                BooleanQuery.Builder bq = new BooleanQuery.Builder();
-                bq.add(Queries.newNonNestedFilter(), BooleanClause.Occur.MUST);
+                bq.add(Queries.newNonNestedFilter(Version.CURRENT), BooleanClause.Occur.MUST);


randomize version? The test should be able to handle both the old and new way.

martijnvg · 2017-11-21T10:00:58Z

core/src/main/java/org/elasticsearch/index/mapper/SeqNoFieldMapper.java

        assert seqID != null;
-        for (int i = 1; i < context.docs().size(); i++) {
+        int numDocs = context.docs().size();
+        final Version versionCreated = context.mapperService().getIndexSettings().getIndexVersionCreated();


Not related to this change, but I think we should add a QueryShardContext#getIndexVersionCreated() helper method that does this: mapperService().getIndexSettings().getIndexVersionCreated().

Arg... I meant ParseContext#getIndexVersionCreated()

I will open a follow-up, but I think we need a more common class across all index-level contexts, maybe a base class?

jpountz

The change looks good to me, thanks for tackling it.

Regarding implications, my understanding is that primary term lookups are rare, so this change should not slow down indexing, even if it might make primary term lookups slower. The way things are currently designed, doc value lookups may be linear (this typically happens in 2 cases: if the field is sparse or if splitting into blocks proves to give better compression) but with a very high constant factor of 2^16. So say you have a segment with 100M documents (which is pretty large for a segment), Lucene will be looping over about 1.5k blocks of documents (100M / 2^16 ≈ 1,526) until it reaches the right one.
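The arithmetic behind the "1.5k blocks" figure can be checked with a small sketch. The block size of 2^16 documents comes from the comment above; the class name `DocValuesBlockEstimate`, the method `blocksToScan`, and the choice to round up and treat a full scan as the worst case are assumptions made for illustration, not Lucene's actual lookup code.

```java
// Hypothetical back-of-the-envelope helper, not Lucene code.
final class DocValuesBlockEstimate {
    // Block size of 2^16 documents, taken from the review comment.
    static final int BLOCK_SHIFT = 16;

    /** Worst-case number of blocks to step over in a segment with maxDoc documents (rounded up). */
    static long blocksToScan(long maxDoc) {
        long blockSize = 1L << BLOCK_SHIFT;
        return (maxDoc + blockSize - 1) >>> BLOCK_SHIFT;
    }

    public static void main(String[] args) {
        // 100M documents / 65,536 documents per block
        System.out.println(blocksToScan(100_000_000L)); // prints 1526
    }
}
```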

s1monw · 2017-11-21T13:36:52Z

@jpountz @martijnvg please take another look; I had to work around a Lucene bug.

martijnvg

Still LGTM

* master: (41 commits)
  [Test] Fix AggregationsTests#testFromXContentWithRandomFields
  [DOC] Fix mathematical representation on interval (range) (elastic#27450)
  Update version check for CCS optional remote clusters
  Bump BWC version to 6.1.0 for elastic#27469
  Adapt rest test BWC version after backport
  Fix dynamic mapping update generation. (elastic#27467)
  Use the primary_term field to identify parent documents (elastic#27469)
  Move composite aggregation to core (elastic#27474)
  Fix test BWC version after backport
  Protect shard splitting from illegal target shards (elastic#27468)
  Cross Cluster Search: make remote clusters optional (elastic#27182)
  [Docs] Fix broken bulleted lists (elastic#27470)
  Move resync request serialization assertion
  Fix resync request serialization
  Fix issue where pages aren't released (elastic#27459)
  Add YAML REST tests for filters bucket agg (elastic#27128)
  Remove tcp profile from low level nio channel (elastic#27441)
  [TEST] Fix `GeoShapeQueryTests#testPointsOnly` failure
  Transition transport apis to use void listeners (elastic#27440)
  AwaitsFix GeoShapeQueryTests#testPointsOnly elastic#27454
  ...

s1monw added :Nested Docs >enhancement v6.1.0 v7.0.0 labels Nov 21, 2017

s1monw requested review from bleskes, jpountz and martijnvg November 21, 2017 09:03

add missing file

b65adf5

bleskes approved these changes Nov 21, 2017

martijnvg approved these changes Nov 21, 2017

jpountz approved these changes Nov 21, 2017

s1monw added 3 commits November 21, 2017 11:37

apply review comments

d1c98f8

add workaround for lucene bug

bcbfe46

reference lucene issue

05cd180

martijnvg approved these changes Nov 21, 2017

s1monw merged commit 5a0b6d1 into elastic:master Nov 21, 2017

s1monw added a commit that referenced this pull request Nov 21, 2017

Bump BWC version to 6.1.0 for #27469

cc78b24

clintongormley added :Search/Search Search-related issues that do not fall into other categories and removed :Nested Docs labels Feb 14, 2018

jeffreynscrbdee mentioned this pull request Sep 25, 2018

Extra Lucene DocValueExistsQuery fired - due to nested mapping and primary_terms #34067

Closed

romseygeek mentioned this pull request Jul 17, 2019

Don't use TypeField for nested filters #44482

Merged


6 participants