Integrates soft-deletes into Elasticsearch #33222
Conversation
This PR integrates Lucene soft-deletes (LUCENE-8200) into Elasticsearch. Highlights of this PR include:
1. Replace hard-deletes with soft-deletes in InternalEngine
2. Use _recovery_source if _source is disabled or modified (elastic#31106)
3. Soft-deletes retention policy based on the global checkpoint (elastic#30335)
4. Read operation history from Lucene instead of the translog (elastic#30120)
5. Use Lucene history in peer-recovery (elastic#30522)

This work has been done by the whole team; however, these individuals (in lexical order) made significant contributions in coding and reviewing:

Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Boaz Leskes <b.leskes@gmail.com>
Co-authored-by: Jason Tedor <jason@tedor.me>
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co>
Co-authored-by: Simon Willnauer <simonw@apache.org>
Pinging @elastic/es-distributed

@elastic/es-distributed Can you please also have a look? Thank you!
We should enable this by default in 7.0 in a follow-up.
s1monw left a comment

Looks awesome. Great piece of work! I left some comments; most of them are nits, but there are a couple of minor issues.
```java
}
// TODO: Avoid recalculating numDocs every time.
int numDocs = 0;
for (int i = 0; i < hardLiveDocs.length(); i++) {
```
we are still waiting for a Lucene release to fix this, right? This is https://issues.apache.org/jira/browse/LUCENE-8458 ?
if so, please put a comment in here referencing the issue
Yes, but there is something not clear on my side. I will add a comment then make the change in a follow-up so we can clarify things.
what is not clear on your side?
```java
/**
 * Specifies if the index should use soft-delete instead of hard-delete for update/delete operations.
 */
public static final Setting<Boolean> INDEX_SOFT_DELETES_SETTING =
```
I wonder if we should use a validator here like we use here https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/cluster/metadata/IndexMetaData.java#L200 to validate that the index this is set on is on version 7.0 or higher? I think this would prevent setting this on other indices and prevent confusion?
I tried but was unable to get it done with the current Validator. I think we need to extend the Validator to pass an entire Settings instance to make this possible (previous discussion #25560 (comment)). Would it be okay if I do this in a follow-up?
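For context, the kind of check being discussed could be sketched in plain Java as below. The class and method names (`SoftDeletesSettingValidator`, `validate`) are hypothetical, not the actual Elasticsearch Validator API, which at the time could not see the whole Settings instance:

```java
// Hypothetical sketch: reject enabling soft-deletes on an index created
// before 7.0, as the reviewer suggests. Not the real Elasticsearch API.
public final class SoftDeletesSettingValidator {

    // Assumed minimum major version for soft-deletes (per the suggestion above).
    static final int MIN_MAJOR_VERSION = 7;

    public static void validate(boolean softDeletesEnabled, int indexCreatedMajorVersion) {
        if (softDeletesEnabled && indexCreatedMajorVersion < MIN_MAJOR_VERSION) {
            throw new IllegalArgumentException(
                "[index.soft_deletes.enabled] requires an index created on version "
                    + MIN_MAJOR_VERSION + ".0 or later, but this index was created on version "
                    + indexCreatedMajorVersion);
        }
    }
}
```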
```java
 * documents increases the chance of operation-based recoveries and allows querying a longer history of documents.
 * If soft-deletes is enabled, an engine by default will retain all operations up to the global checkpoint.
 **/
public static final Setting<Long> INDEX_SOFT_DELETES_RETENTION_OPERATIONS_SETTING =
```
should we only allow setting this one if soft deletes are enabled?
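A dependency check like the one suggested could look roughly as follows. The helper and its map-based view of the settings are illustrative, not the actual Setting validation plumbing; only the two setting key names mirror the real ones:

```java
import java.util.Map;

// Illustrative sketch: reject the retention setting unless soft-deletes is
// enabled. Not the actual Elasticsearch validation code.
public final class RetentionSettingCheck {
    static final String SOFT_DELETES_ENABLED = "index.soft_deletes.enabled";
    static final String RETENTION_OPERATIONS = "index.soft_deletes.retention.operations";

    public static void check(Map<String, String> settings) {
        boolean softDeletesEnabled =
            Boolean.parseBoolean(settings.getOrDefault(SOFT_DELETES_ENABLED, "false"));
        if (settings.containsKey(RETENTION_OPERATIONS) && softDeletesEnabled == false) {
            throw new IllegalArgumentException("setting [" + RETENTION_OPERATIONS
                + "] requires [" + SOFT_DELETES_ENABLED + "] to be enabled");
        }
    }
}
```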
```java
/**
 * Creates a new history snapshot from Lucene for reading operations whose seqno is in the requested seqno range (both inclusive)
 */
public abstract Translog.Snapshot newLuceneChangesSnapshot(String source, MapperService mapperService,
```
can we just call this newChangesSnapshot ?
```java
builder.add(new DocValuesFieldExistsQuery(recoverySourceField), BooleanClause.Occur.FILTER);
builder.add(retainSourceQuerySupplier.get(), BooleanClause.Occur.FILTER);
IndexSearcher s = new IndexSearcher(reader);
s.setQueryCache(null);
```
I know this is not necessary per se, but I think we should call s.rewrite(builder.build()) and pass the result to s.createWeight(...). We missed this in Lucene as well, i.e. if you passed a prefix query to this it would fail. I realized this yesterday when I worked on something else. :)
```java
// TODO: We don't have timestamps for Index operations in Lucene yet; we need to loosen this check when the timestamp is missing.
// We don't store versionType in the Lucene index, so we need to exclude it from this check.
final boolean sameOp;
if (newOp instanceof Translog.Index && prvOp instanceof Translog.Index) {
```
can we maybe fix the equals method of Operation rather than this? We can do it in a follow-up, no worries. I think versionType is maybe not needed for comparison?
I removed versionType from the comment. I will move this comparison to the Operation's equals.
```java
try (Translog.Snapshot snapshot =
         newLuceneChangesSnapshot(source, mapperService, Math.max(0, startingSeqNo), Long.MAX_VALUE, false)) {
    return snapshot.totalOperations();
} catch (IOException ex) {
```
should we catch Exception here instead of IOException and check if we need to fail?
I removed this catch and added a tragic check when we create a new snapshot.
```java
@Override
public void afterRefresh(boolean didRefresh) {
    if (didRefresh) {
        refreshedCheckpoint.set(pendingCheckpoint);
```
can we add an assertion that we actually set the pendingCheckpoint?
There is a bug in the current implementation. I added a test and fixed this in 4ed7907
```java
 */
final long lastRefreshedCheckpoint() {
    return lastRefreshedCheckpointListener.refreshedCheckpoint.get();
}
```
```java
translogDeletionPolicy.setMinTranslogGenerationForRecovery(minRequiredGen);

softDeletesPolicy.setLocalCheckpointOfSafeCommit(
    Long.parseLong(safeCommit.getUserData().get(SequenceNumbers.LOCAL_CHECKPOINT_KEY)));
```
Do we need to care about commits that don't have values for this key?
We ensure that all index commits have this key.
```java
private static final class IndexingStrategy {
private void addStaleDocs(final List<ParseContext.Document> docs, final IndexWriter indexWriter) throws IOException {
    assert softDeleteEnabled : "Add history documents but soft-deletes is disabled";
    docs.forEach(d -> d.add(softDeletesField));
```
nit: a for loop would feel more natural here?
```java
@Override
public Closeable acquireRetentionLockForPeerRecovery() {
    final Closeable translogLock = translog.acquireRetentionLock();
    final Releasable softDeletesLock = softDeletesPolicy.acquireRetentionLock();
```
do we need to take care of releasing the translog lock if acquiring the soft deletes lock fails?
Great catch. We should either acquire both locks or neither of them.
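The fix being agreed on here is the classic acquire-both-or-neither pattern: if acquiring the second resource throws, release the first before propagating. A stand-alone sketch with plain `AutoCloseable` stand-ins for the translog and soft-deletes retention locks (not the real engine code):

```java
import java.util.function.Supplier;

// Sketch of acquire-both-or-neither: if acquiring the second lock throws,
// release the first before propagating, so no retention lock is leaked.
public final class RetentionLocks {

    public static AutoCloseable acquireBoth(Supplier<AutoCloseable> firstLock,
                                            Supplier<AutoCloseable> secondLock) {
        final AutoCloseable first = firstLock.get();
        final AutoCloseable second;
        try {
            second = secondLock.get();
        } catch (RuntimeException e) {
            try {
                first.close();
            } catch (Exception inner) {
                e.addSuppressed(inner);
            }
            throw e;
        }
        // Closing the combined lock releases both, even if one close() throws.
        return () -> {
            try (AutoCloseable release = first) {
                second.close();
            }
        };
    }
}
```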
```java
final Query rangeQuery = LongPoint.newRangeQuery(SeqNoFieldMapper.NAME, fromSeqNo, toSeqNo);
final Sort sortedBySeqNoThenByTerm = new Sort(
    new SortedNumericSortField(SeqNoFieldMapper.NAME, SortField.Type.LONG),
    new SortedNumericSortField(SeqNoFieldMapper.PRIMARY_TERM_NAME, SortField.Type.LONG, true)
```
you could use regular SortField instances since these fields are single-valued
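To illustrate why the snapshot sorts this way: ascending seq_no orders the history, and a descending primary term puts the most authoritative copy of a duplicated seq_no first. A plain-Java stand-in for the Lucene `Sort` shown above; the `Op` record is illustrative, not an Elasticsearch type:

```java
import java.util.Arrays;
import java.util.Comparator;

// Plain-Java illustration of the snapshot sort order: ascending seq_no,
// then descending primary term, so for duplicate seq_nos the operation
// from the highest (most authoritative) term is seen first.
public final class OperationOrder {

    public record Op(long seqNo, long primaryTerm) {}

    public static final Comparator<Op> BY_SEQ_NO_THEN_TERM_DESC =
        Comparator.comparingLong(Op::seqNo)
            .thenComparing(Comparator.comparingLong(Op::primaryTerm).reversed());

    public static Op[] sorted(Op... ops) {
        Op[] copy = Arrays.copyOf(ops, ops.length);
        Arrays.sort(copy, BY_SEQ_NO_THEN_TERM_DESC);
        return copy;
    }
}
```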
```java
}

private TopDocs searchOperations(ScoreDoc after) throws IOException {
    final Query rangeQuery = LongPoint.newRangeQuery(SeqNoFieldMapper.NAME, fromSeqNo, toSeqNo);
```
I'm not sure how much it would help, but when after is not null, we could cast it to a FieldDoc to extract its seq_no and use it as a lower bound of the range query. That would help skip documents that have already been visited more efficiently.
Yes, I use "lastSeenSeqNo + 1" as the lower bound.
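The idea can be shown with a small model: after each batch, advance the lower bound of the range to lastSeenSeqNo + 1 instead of rescanning from fromSeqNo. This is a plain-Java sketch over a sorted array, not the Lucene implementation (and note a later comment in this thread points out that duplicate seq_nos make the "+ 1" an off-by-one):

```java
import java.util.ArrayList;
import java.util.List;

// Model of batched reads over a seq_no range, advancing the lower bound to
// lastSeenSeqNo + 1 between batches so visited documents are not rescanned.
public final class SeqNoPaging {

    /** Reads seq_nos in [fromSeqNo, toSeqNo] (both inclusive) in batches. */
    public static List<Long> readAll(long[] sortedSeqNos, long fromSeqNo, long toSeqNo, int batchSize) {
        List<Long> out = new ArrayList<>();
        long lower = fromSeqNo;
        while (lower <= toSeqNo) {
            int taken = 0;
            for (long seqNo : sortedSeqNos) {
                if (seqNo >= lower && seqNo <= toSeqNo) {
                    out.add(seqNo);
                    taken++;
                    if (taken == batchSize) {
                        break;
                    }
                }
            }
            if (taken == 0) {
                break; // no more operations in range
            }
            lower = out.get(out.size() - 1) + 1; // lastSeenSeqNo + 1
        }
        return out;
    }
}
```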
```java
if (fromSeqNo < 0 || toSeqNo < 0 || fromSeqNo > toSeqNo) {
    throw new IllegalArgumentException("Invalid range; from_seqno [" + fromSeqNo + "], to_seqno [" + toSeqNo + "]");
}
if (searchBatchSize < 0) {
```
I think we need to reject 0 too?
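Combining the range check from the snippet with the reviewer's point, the argument validation could be sketched as below; the `<= 0` batch-size check is the suggested tightening, since a zero batch size would make no progress:

```java
// Sketch of snapshot argument validation, with the batch size check
// tightened from "< 0" to "<= 0" as suggested above.
public final class SnapshotArgs {

    public static void validate(long fromSeqNo, long toSeqNo, int searchBatchSize) {
        if (fromSeqNo < 0 || toSeqNo < 0 || fromSeqNo > toSeqNo) {
            throw new IllegalArgumentException(
                "Invalid range; from_seqno [" + fromSeqNo + "], to_seqno [" + toSeqNo + "]");
        }
        if (searchBatchSize <= 0) {
            throw new IllegalArgumentException(
                "Search batch size must be positive [" + searchBatchSize + "]");
        }
    }
}
```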
Today we add a NoOp to Lucene and the translog if we fail to process an indexing operation. However, for failed delete operations we only add NoOps to the translog. In order to have a complete history in Lucene, we should add NoOps for failed delete operations to both Lucene and the translog. Relates elastic#29530
Hmm. CI failed due to #32299. @elasticmachine retest this please.

@elasticmachine run sample packaging tests please

@elasticmachine retest this please.

Another watcher test failure. @elasticmachine test this please.
Revert to correct co-author tags. This reverts commit 6dd0aa5.
This PR integrates Lucene soft-deletes (LUCENE-8200) into Elasticsearch. Highlights of this PR include:
- Replace hard-deletes with soft-deletes in InternalEngine
- Use _recovery_source if _source is disabled or modified (elastic#31106)
- Soft-deletes retention policy based on the global checkpoint (elastic#30335)
- Read operation history from Lucene instead of the translog (elastic#30120)
- Use Lucene history in peer-recovery (elastic#30522)

Relates elastic#30086
Closes elastic#29530

This work has been done by the whole team; however, these individuals (in lexical order) made significant contributions in coding and reviewing:

Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Boaz Leskes <b.leskes@gmail.com>
Co-authored-by: Jason Tedor <jason@tedor.me>
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co>
Co-authored-by: Simon Willnauer <simonw@apache.org>
* 6.x: Mute test watcher usage stats output [Rollup] Fix FullClusterRestart test TEST: Disable soft-deletes in ParentChildTestCase TEST: Disable randomized soft-deletes settings Integrates soft-deletes into Elasticsearch (#33222) drop `index.shard.check_on_startup: fix` (#32279) Fix AwaitsFix issue number Mute SmokeTestWatcherWithSecurityIT testsi [DOCS] Moves ml folder from x-pack/docs to docs (#33248) TEST: mute more SmokeTestWatcherWithSecurityIT tests [DOCS] Move rollup APIs to docs (#31450) [DOCS] Rename X-Pack Commands section (#33005) Fixes SecurityIntegTestCase so it always adds at least one alias (#33296) TESTS: Fix Random Fail in MockTcpTransportTests (#33061) (#33307) MINOR: Remove Dead Code from PathTrie (#33280) (#33306) Fix pom for build-tools (#33300) Lazy evaluate java9home (#33301) SQL: test coverage for JdbcResultSet (#32813) Work around to be able to generate eclipse projects (#33295) Different handling for security specific errors in the CLI. Fix for #33230 (#33255) [ML] Refactor delimited file structure detection (#33233) SQL: Support multi-index format as table identifier (#33278) Enable forbiddenapis server java9 (#33245) [MUTE] SmokeTestWatcherWithSecurityIT flaky tests Add region ISO code to GeoIP Ingest plugin (#31669) (#33276) Don't be strict for 6.x Update serialization versions for custom IndexMetaData backport Replace IndexMetaData.Custom with Map-based custom metadata (#32749) Painless: Fix Bindings Bug (#33274) SQL: prevent duplicate generation for repeated aggs (#33252) TEST: Mute testMonitorClusterHealth Fix serialization of empty field capabilities response (#33263) Fix nested _source retrieval with includes/excludes (#33180) [DOCS] TLS file resources are reloadable (#33258) Watcher: Ensure TriggerEngine start replaces existing watches (#33157) Ignore module-info in jar hell checks (#33011) Fix docs build after #33241 [DOC] Repository GCS ADC not supported (#33238) Upgrade to latest Gradle 4.10 (#32801) Fix/30904 cluster formation 
part2 (#32877) Move file-based discovery to core (#33241) HLRC: add client side RefreshPolicy (#33209) [Kerberos] Add unsupported languages for tests (#33253) Watcher: Reload properly on remote shard change (#33167) Fix classpath security checks for external tests. (#33066) [Rollup] Only allow aggregating on multiples of configured interval (#32052) Added deprecation warning for rescore in scroll queries (#33070) Apply settings filter to get cluster settings API (#33247) [Rollup] Re-factor Rollup Indexer into a generic indexer for re-usability (#32743) HLRC: create base timed request class (#33216) HLRC: Use Optional in validation logic (#33104) Painless: Add Bindings (#33042)
We can have multiple documents in Lucene with the same seq_no (for parent-child documents, or without rollback). In this case, using "lastSeenSeqNo + 1" is an off-by-one error, as it may miss some documents. This error merely affects the `skippedOperations` contract. See: #33222 (comment) Closes #33318
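The off-by-one can be reproduced with a tiny model: resume each batch at lastSeenSeqNo + 1 over a list that contains a duplicated seq_no, and one copy of the duplicate at a batch boundary is silently skipped. Plain-Java sketch, not the snapshot code itself:

```java
// Demonstrates the off-by-one described above: with duplicate seq_nos
// (e.g. parent-child documents), resuming at lastSeenSeqNo + 1 after a
// batch boundary skips the remaining copies of that seq_no.
public final class DuplicateSeqNoSkip {

    /** Counts operations visited when resuming each batch at lastSeen + 1. */
    public static int visitWithLowerBoundResume(long[] sortedSeqNos, int batchSize) {
        int visited = 0;
        long lower = Long.MIN_VALUE;
        while (true) {
            int taken = 0;
            long lastSeen = Long.MIN_VALUE;
            for (long seqNo : sortedSeqNos) {
                if (seqNo >= lower) {
                    visited++;
                    taken++;
                    lastSeen = seqNo;
                    if (taken == batchSize) {
                        break;
                    }
                }
            }
            if (taken == 0) {
                return visited;
            }
            lower = lastSeen + 1; // skips later duplicates of lastSeen!
        }
    }
}
```

With seq_nos {1, 2, 2, 3} and a batch size of 2, the first batch ends on the first copy of seq_no 2, and resuming at 3 skips the second copy, so only three of the four operations are visited.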
Today we don't store the auto-generated timestamp of append-only operations in Lucene, and we assign -1 to every index operation constructed from LuceneChangesSnapshot. This looks innocent, but it generates duplicate documents on a replica if a retry append-only arrives first via peer recovery and then the original append-only arrives via replication. Since the retry append-only (delivered via recovery) does not have a timestamp, the replica will happily optimize the original request while it should not. This change transmits the maximum auto-generated timestamp from the primary to replicas before the translog phase in peer recovery. This timestamp prevents replicas from optimizing append-only requests if retry counterparts have been processed. Relates #33656 Relates #33222
This change enables soft-deletes by default on ES 7.0.0 or later. Relates #33222 Co-authored-by: Jason Tedor <jason@tedor.me>
This PR integrates Lucene soft-deletes (LUCENE-8200) into Elasticsearch.
Highlights of this PR include:
- Replace hard-deletes with soft-deletes in InternalEngine
- Use _recovery_source if _source is disabled or modified (#31106)
- Soft-deletes retention policy based on the global checkpoint (#30335)
- Read operation history from Lucene instead of the translog (#30120)
- Use Lucene history in peer-recovery (#30522)

These pieces were reviewed already in the feature branch, but we would like to give them an extra look before pulling into the upstream.
Relates #30086
Closes #29530
This work has been done by the whole team; however, these individuals
(in lexical order) made significant contributions in coding and reviewing:
Co-authored-by: Adrien Grand <jpountz@gmail.com>
Co-authored-by: Boaz Leskes <b.leskes@gmail.com>
Co-authored-by: Jason Tedor <jason@tedor.me>
Co-authored-by: Martijn van Groningen <martijn.v.groningen@gmail.com>
Co-authored-by: Nhat Nguyen <nhat.nguyen@elastic.co>
Co-authored-by: Simon Willnauer <simonw@apache.org>