Use the primary_term field to identify parent documents#27469
Use the primary_term field to identify parent documents#27469s1monw merged 5 commits intoelastic:masterfrom
Conversation
This change stops indxing the `_primary_term` field for nested documents to allow fast retrieval of parent documents. Today we create a docvalues field for children to ensure we have a dense datastructure on disk. Yet, since we only use the primary term to tie-break on when we see the same seqID on indexing having a dense datastructure is less important. We can use this now to improve the nested docs performance and it's memory footprint. Relates to elastic#24362
bleskes
left a comment
There was a problem hiding this comment.
This looks good to me. I can't speak to the implications on the lucene side - like whether we want to index an illegal value (0) for the primary term vs using the mentioned bitset and the exists query. I presume the bit set will be faster for reads and probably smaller index wise (as the doc values will be compressed better).
| doc.add(seqID.seqNo); | ||
| doc.add(seqID.seqNoDocValue); | ||
| doc.add(seqID.primaryTerm); | ||
| if (includePrimaryTerm) { |
There was a problem hiding this comment.
can we add a comment saying that primary terms are used to distinguish between top level (parent) docs and nested ones.
martijnvg
left a comment
There was a problem hiding this comment.
LGTM! I left a few small comments.
| // This is a custom query that extends AutomatonQuery and want to make sure the equals method works | ||
| assertEquals(Queries.newNonNestedFilter(), Queries.newNonNestedFilter()); | ||
| assertEquals(Queries.newNonNestedFilter().hashCode(), Queries.newNonNestedFilter().hashCode()); | ||
| Version version = VersionUtils.randomVersion(random()); |
There was a problem hiding this comment.
instead of a random version maybe also test both pre 7.0.0 and post 7.0.0 specifically?
There was a problem hiding this comment.
I added a test that checks all versions
|
|
||
| BooleanQuery.Builder bq = new BooleanQuery.Builder(); | ||
| bq.add(Queries.newNonNestedFilter(), BooleanClause.Occur.MUST); | ||
| bq.add(Queries.newNonNestedFilter(Version.CURRENT), BooleanClause.Occur.MUST); |
There was a problem hiding this comment.
randomize version? The test should be able to handle both the old and new way.
| assert seqID != null; | ||
| for (int i = 1; i < context.docs().size(); i++) { | ||
| int numDocs = context.docs().size(); | ||
| final Version versionCreated = context.mapperService().getIndexSettings().getIndexVersionCreated(); |
There was a problem hiding this comment.
not related to this change, but I think we should have add QueryShardContext#getIndexVersionCreated() helper method that does this: mapperService().getIndexSettings().getIndexVersionCreated().
There was a problem hiding this comment.
Arg... I meant ParseContext#getIndexVersionCreated()
There was a problem hiding this comment.
I will open a followup but I think we need to have a more common class across all index level contexts maybe a base class?
jpountz
left a comment
There was a problem hiding this comment.
The change looks good to me, thanks for tackling it.
Regarding implications, my understanding is that primary term lookups are rare so this change should not slow down indexing, even it might make primary term lookups slower. The current way that things are designed, doc value lookups may be linear (this typically happens in 2 cases: if the field is sparse of if splitting into blocks proves to give better compression) but with a very high constant-factor of 2^16. So say you have a segment with 100M documents (which is pretty large for a segment), Lucene will be looping over 1.5k blocks of documents until it reaches the right one.
|
@jpountz @martijnvg please take another look I had to work around a lucene bug |
This change stops indexing the `_primary_term` field for nested documents to allow fast retrieval of parent documents. Today we create a docvalues field for children to ensure we have a dense datastructure on disk. Yet, since we only use the primary term to tie-break on when we see the same seqID on indexing having a dense datastructure is less important. We can use this now to improve the nested docs performance and it's memory footprint. Relates to #24362
* master: (41 commits) [Test] Fix AggregationsTests#testFromXContentWithRandomFields [DOC] Fix mathematical representation on interval (range) (elastic#27450) Update version check for CCS optional remote clusters Bump BWC version to 6.1.0 for elastic#27469 Adapt rest test BWC version after backport Fix dynamic mapping update generation. (elastic#27467) Use the primary_term field to identify parent documents (elastic#27469) Move composite aggregation to core (elastic#27474) Fix test BWC version after backport Protect shard splitting from illegal target shards (elastic#27468) Cross Cluster Search: make remote clusters optional (elastic#27182) [Docs] Fix broken bulleted lists (elastic#27470) Move resync request serialization assertion Fix resync request serialization Fix issue where pages aren't released (elastic#27459) Add YAML REST tests for filters bucket agg (elastic#27128) Remove tcp profile from low level nio channel (elastic#27441) [TEST] Fix `GeoShapeQueryTests#testPointsOnly` failure Transition transport apis to use void listeners (elastic#27440) AwaitsFix GeoShapeQueryTests#testPointsOnly elastic#27454 ...
This change stops indxing the
_primary_termfield for nested documentsto allow fast retrieval of parent documents. Today we create a docvalues
field for children to ensure we have a dense datastructure on disk. Yet,
since we only use the primary term to tie-break on when we see the same
seqID on indexing having a dense datastructure is less important. We can
use this now to improve the nested docs performance and it's memory footprint.
Relates to #24362