Use a `_recovery_source` if source is omitted or modified (#31106)

s1monw merged 11 commits into `elastic:ccr`.

Conversation
Today, if a user omits the `_source` entirely or modifies it on indexing, we have no way to re-create the document after it has been added. This is a problem for CCR and for recovery based on soft deletes, which we are going to make the default. This change adds an additional recovery source if the source is disabled or modified, which is only kept around until the document leaves the retention policy window. This change also adds a merge policy that efficiently removes this extra source on merge for all documents that are live and no longer inside the retention policy window.
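The pruning rule described above can be sketched stand-alone; the class and method names here are hypothetical illustrations, not the actual Elasticsearch implementation. A document keeps its `_recovery_source` unless it is live and has already left the retention window:

```java
import java.util.BitSet;

// Simplified sketch: a document keeps its _recovery_source while it is
// either soft-deleted (still needed for replaying operations) or still
// inside the retention policy window. Only live docs outside the window
// have their extra source pruned on merge.
final class RecoverySourcePruneSketch {
    static BitSet docsToKeepRecoverySource(BitSet liveDocs, BitSet inRetentionWindow, int maxDoc) {
        BitSet keep = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            boolean live = liveDocs.get(doc);
            boolean retained = inRetentionWindow.get(doc);
            // prune only when the doc is live AND outside the retention window
            if (!(live && !retained)) {
                keep.set(doc);
            }
        }
        return keep;
    }
}
```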
|
Pinging @elastic/es-distributed |
jpountz
left a comment
I like the approach.
Since I'd expect most documents to have dropped their recovery source already, I'm wondering whether it would be a bit more efficient to compute the bitset of documents for which we need to retain the recovery source, rather than the bitset of those to drop.
```java
public NumericDocValues getNumeric(FieldInfo field) throws IOException {
    NumericDocValues numeric = super.getNumeric(field);
    if (recoverySourceField.equals(field.name)) {
        return new FilterNumericDocValues(numeric) {
```
Maybe leave comments explaining why we don't need to check whether `numeric` and `docValuesReader` are null.
```java
if (recoverySource == null) {
    return false;
}
if (tombstoneDV.docID() > segmentDocId) {
    recoverySource = leafReader.getNumericDocValues(SourceFieldMapper.RECOVERY_SOURCE_NAME);
}
return tombstoneDV.advanceExact(segmentDocId);
```
Same here tombstoneDV -> recoverySource
```java
if (originalSource != null && source != originalSource && context.indexSettings().isSoftDeleteEnabled()) {
    // if we omitted source or modified it we add the _recovery_source to ensure we have it for ops based recovery
    BytesRef ref = source.toBytesRef();
```
Should we store the original source here?
LOL yeah we should :D - I will fix
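A hedged sketch of the fix agreed on above (a hypothetical helper, not the real mapper code): when indexing has modified the source, the `_recovery_source` should carry the original bytes, so operation-based recovery can re-create the document exactly:

```java
// Hypothetical illustration of the fix discussed above: when the source was
// modified during indexing, the _recovery_source must hold the ORIGINAL
// bytes, not the modified ones, so replay re-creates the document exactly.
final class RecoverySourceSketch {
    static byte[] recoverySourceBytes(byte[] originalSource, byte[] source) {
        if (originalSource != null && originalSource != source) {
            return originalSource; // source was modified: keep the original
        }
        return source; // source unchanged (or no original recorded)
    }
}
```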
```java
import java.util.function.Supplier;

final class RecoverySourcePruneMergePolicy extends OneMergeWrappingMergePolicy {
    RecoverySourcePruneMergePolicy(String recoverySourceField, Supplier<Query> retentionPolicySupplier, MergePolicy in) {
```
retentionPolicySupplier is confusing. It's a prune query supplier.
```java
}

// pkg private for testing
static CodecReader wrapReader(String recoverySourceField, CodecReader reader, Supplier<Query> retentionPolicySupplier)
```
Same here: retentionPolicySupplier
jpountz
left a comment
LGTM. I left a suggestion for an improvement.
```java
if (scorer != null) {
    return new SourcePruningFilterCodecReader(recoverySourceField, reader, BitSet.of(scorer.iterator(), reader.maxDoc()));
} else {
    return new SourcePruningFilterCodecReader(recoverySourceField, reader, new BitSet.MatchNoBits(reader.maxDoc()));
```
s/BitSet.MatchNoBits/Bits.MatchNoBits/
```java
NumericDocValues numeric = super.getNumeric(field);
if (recoverySourceField.equals(field.name)) {
    assert numeric != null : recoverySourceField + " must have numeric DV but was null";
    return new FilterNumericDocValues(numeric) {
```
If recoverySourceToKeep was a bitset, we could do a leap frog, which would be faster if recoverySourceToKeep is sparse.
```java
// Note: Lucene's BitSetIterator also takes a cost argument, omitted in the
// original suggestion; the approximate cardinality is a reasonable choice.
final ConjunctionDISI intersection = ConjunctionDISI.intersectIterators(
        Arrays.asList(numeric, new BitSetIterator(recoverySourceToKeep, recoverySourceToKeep.approximateCardinality())));
return new FilterNumericDocValues(numeric) {
    @Override
    public int nextDoc() throws IOException {
        return intersection.nextDoc();
    }
};
```
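The leap-frog that `ConjunctionDISI` performs can be illustrated with a small dependency-free sketch (hypothetical names, plain arrays instead of Lucene iterators): both sides advance to or past each other's position until they agree on a doc ID, so a sparse side lets the intersection skip large ranges cheaply:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal leap-frog intersection of two sorted, strictly increasing doc-ID
// sequences, mimicking what ConjunctionDISI does: the lagging side advances
// until both land on the same doc, so a sparse side dominates the cost.
final class LeapFrog {
    static int[] intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                out.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++; // advance the lagging iterator
            } else {
                j++;
            }
        }
        return out.stream().mapToInt(Integer::intValue).toArray();
    }
}
```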
dnhatn
left a comment
It's nice that we now use a single retention query for both MPs.
```java
// pkg private for testing
static CodecReader wrapReader(String recoverySourceField, CodecReader reader, Supplier<Query> retainSourceQuerySupplier)
        throws IOException {
    NumericDocValues recovery_source = reader.getNumericDocValues(recoverySourceField);
```
```java
if (softDeleteEnabled) {
    iwc.setSoftDeletesField(Lucene.SOFT_DELETE_FIELD);
    mergePolicy = new SoftDeletesRetentionMergePolicy(Lucene.SOFT_DELETE_FIELD, this::softDeletesRetentionQuery, mergePolicy);
    mergePolicy = new RecoverySourcePruneMergePolicy(SourceFieldMapper.RECOVERY_SOURCE_NAME, this::softDeletesRetentionQuery,
```
Nice, a single retention query for both 💯
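The point above, one retention query feeding both merge policies, can be illustrated with a stand-alone decorator sketch; the `PolicySketch` interface and all names here are hypothetical, not Lucene's `MergePolicy` API:

```java
import java.util.function.Supplier;

// Hypothetical stand-in for the merge-policy chain: both wrappers receive
// the SAME retention-query supplier, so soft-delete retention and
// recovery-source pruning always agree on the retention window.
interface PolicySketch {
    String describe();
}

final class BasePolicy implements PolicySketch {
    public String describe() { return "tiered"; }
}

final class SoftDeletesRetentionSketch implements PolicySketch {
    private final PolicySketch in;
    private final Supplier<String> retentionQuery;
    SoftDeletesRetentionSketch(PolicySketch in, Supplier<String> retentionQuery) {
        this.in = in;
        this.retentionQuery = retentionQuery;
    }
    public String describe() { return "retain[" + retentionQuery.get() + "](" + in.describe() + ")"; }
}

final class RecoverySourcePruneWrapper implements PolicySketch {
    private final PolicySketch in;
    private final Supplier<String> retentionQuery;
    RecoverySourcePruneWrapper(PolicySketch in, Supplier<String> retentionQuery) {
        this.in = in;
        this.retentionQuery = retentionQuery;
    }
    public String describe() { return "prune[" + retentionQuery.get() + "](" + in.describe() + ")"; }
}
```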
This PR integrates Lucene soft-deletes (LUCENE-8200) into Elasticsearch. Highlights of this PR include:

- Replace hard-deletes with soft-deletes in InternalEngine
- Use `_recovery_source` if `_source` is disabled or modified (#31106)
- Soft-deletes retention policy based on the global checkpoint (#30335)
- Read operation history from Lucene instead of the translog (#30120)
- Use Lucene history in peer-recovery (#30522)

Relates #30086
Closes #29530

This work has been done by the whole team; however, these individuals (lexical order) have made significant contributions in coding and reviewing:

Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Boaz Leskes <b.leskes@gmail.com>
Co-authored-by: Jason Tedor <jason@tedor.me>
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co>
Co-authored-by: Simon Willnauer <simonw@apache.org>
It's not fully tested and needs some cleanups, but I wanted to put it out here for discussion.