Retain soft-deleted documents for rollback #31846
dnhatn wants to merge 23 commits into elastic:ccr
Conversation
An operation whose seqno is greater than the global checkpoint is subject to being undone when the primary fails over. If that operation updates or deletes existing documents in Lucene, those documents are also subject to being undone. Thus, we need to retain them during merges until they are no longer subject to rollback.
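The retention rule can be sketched as a small predicate. This is an illustrative Python model of the behavior described in this PR, not the actual Lucene/Elasticsearch code; the `updated_by_seqno` name mirrors the field this PR introduces, and the function name is made up:

```python
# Model of the retention rule (an illustrative sketch, not engine code).
# A soft-deleted document must survive merges while the operation that
# superseded it can still be rolled back, i.e. while that operation's
# seqno is above the global checkpoint.

def should_retain(doc_seqno, updated_by_seqno, min_retained_seqno, global_checkpoint):
    """updated_by_seqno: seqno of the op that superseded this doc, or None."""
    if doc_seqno >= min_retained_seqno:
        return True  # inside the normal soft-deletes retention window
    if updated_by_seqno is not None and updated_by_seqno > global_checkpoint:
        return True  # the superseding op may still be undone on failover
    return False
```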
Pinging @elastic/es-distributed
I looked at this and I was a bit unhappy that we had to make the resolving methods more complicated by introducing another wrapper class. There are some bwc complications that we agreed to address in a backport PR only.
@bleskes I've updated the resolveDocSeqNo method. Can you have a look? Thank you.
```java
} else {
    plan = IndexingStrategy.processNormally(opVsLucene == OpVsLuceneDocStatus.LUCENE_DOC_NOT_FOUND,
        index.seqNo(), index.version());
    plan = IndexingStrategy.processAsStaleOp(softDeleteEnabled, index.seqNo(), index.version(), prevSeqNo.getAsLong());
```
can you add an assertion that loads the primary term and checks equality?
I added this assertion but it was violated. I added a TODO in the resolveDocSeqNo method.
```java
this.indexIntoLucene = indexIntoLucene;
this.addStaleOpToLucene = addStaleOpToLucene;
this.seqNoOfNewerVersion = seqNoOfNewerVersion;
assert addStaleOpToLucene == false || seqNoOfNewerVersion >= 0 : "stale op [" + seqNoForIndexing + "] with invalid newer seqno";
```
also assert it's higher than the seqNoForIndexing?
also assert that if addStaleOpToLucene is true, seqNoOfNewerVersion is unassigned? (move the assertion next to the other ones please)
```java
final long versionForIndexing;
final boolean indexIntoLucene;
final boolean addStaleOpToLucene;
final long seqNoOfNewerVersion; // the seqno of the newer copy of this _uid if it exists; otherwise -1
```
maybe call this seqNoOfNewerDocIfStale? (trying to avoid the word version as it's confusing)
```java
} else {
    plan = DeletionStrategy.processNormally(opVsLucene == OpVsLuceneDocStatus.LUCENE_DOC_NOT_FOUND,
        delete.seqNo(), delete.version());
    plan = DeletionStrategy.processAsStaleOp(softDeleteEnabled, false, delete.seqNo(), delete.version());
```
Same comment about adding an assertion that loads the term when the seq no is equal.
We have to wait for the actual rollback to add the assertion here. I added a TODO in resolveDocSeqNo.
jpountz left a comment:
The parts that I understand look good; @bleskes kindly provided context so that I better understand this change. I just left a note that the new retention query might be slow. There are potential ways we could reduce the slowdown, such as adding a small cache of the max value of the UPDATED_BY_SEQNO_NAME field as a follow-up.
```java
.add(LongPoint.newRangeQuery(SeqNoFieldMapper.NAME, getMinRetainedSeqNo(), Long.MAX_VALUE), BooleanClause.Occur.SHOULD)
.add(NumericDocValuesField.newSlowRangeQuery(SeqNoFieldMapper.UPDATED_BY_SEQNO_NAME,
    globalCheckpointSupplier.getAsLong() + 1, Long.MAX_VALUE), BooleanClause.Occur.SHOULD)
.build();
```
adding a doc-value query as a SHOULD clause might mean we need to do a linear scan to find matches of this query. We don't have much of a choice, so I would still let the change in and benchmark, but we might still want to add a comment here.
@bleskes The retention is tested in Engine#testForceMergeWithSoftDeletesRetention.
```java
Map<Long, Long> seqNos = readUpdatedBySeqNos.apply(engine);
assertThat(seqNos, hasEntry(0L, 1L)); // 0 -> 1
assertThat(seqNos, hasEntry(1L, 4L)); // 1 -> 3 -> 4
assertThat(seqNos, hasEntry(2L, 3L)); // 2 -> 3 (stale)
```
+1. Can we also have stale delete operations?
why don't we have an entry for seq#3? Also check that the map size is 3?
```java
engine.index(replicaIndexForDoc(createParsedDoc("1", null), 5, 4, false));
Map<Long, Long> seqNos = readUpdatedBySeqNos.apply(engine);
assertThat(seqNos, hasEntry(0L, 1L)); // 0 -> 1
assertThat(seqNos, hasEntry(1L, 4L)); // 1 -> 3 -> 4
```
```diff
@@ -1314,7 +1297,8 @@ private DeleteResult deleteInLucene(Delete delete, DeletionStrategy plan)
 if (plan.addStaleOpToLucene || plan.currentlyDeleted) {
```
do we need UPDATED_BY_SEQNO_NAME to be updated here too?
Good question. A regular delete tombstone is soft-deleted before it is indexed into Lucene, so the softUpdateDocument method won't affect that document. I think we should not add the updated_by_seqno field to stale deletes, to keep delete tombstones consistent (i.e., no delete tombstone carries an updated_by_seqno value).
However, I found a case that might be an issue if we remove the safe commit. It happens as follows:
- Index a doc (version=1) (seq#0)
- Delete that doc (seq#1)
- Index that doc (version=2) (seq#2)
Suppose the global checkpoint is 1 and seq#2 is in the danger zone, and we trigger a force merge. The problem is that seq#2 only updates the updated_by_seqno of seq#0, because the delete is invisible while seq#0 is still visible (until a refresh). A force merge will reclaim the delete and retain seq#0 and seq#2. Rolling back seq#2 would then restore seq#0, which is incorrect.
We can avoid this issue if a document is invisible immediately after it's soft-deleted.
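This failure mode can be reproduced with a small Python model. This is an illustrative sketch of the semantics discussed above, not engine code; the function name and tuple encoding are made up. Seq#2 marks only the visible copy (seq#0), while the invisible delete tombstone (seq#1) never receives updated_by_seqno:

```python
# Simulate which docs a force merge would keep under the retention rule
# "keep if seqno >= global_checkpoint + 1, or updated_by_seqno > global_checkpoint".

def surviving_seqnos(docs, global_checkpoint):
    # docs: list of (seqno, updated_by_seqno or None)
    min_retained = global_checkpoint + 1
    return [s for s, upd in docs
            if s >= min_retained or (upd is not None and upd > global_checkpoint)]

docs = [
    (0, 2),     # index v1: the only copy visible to the searcher, marked by seq#2
    (1, None),  # delete tombstone: indexed pre-soft-deleted, never marked
    (2, None),  # index v2: in the danger zone
]
kept = surviving_seqnos(docs, global_checkpoint=1)
assert kept == [0, 2]  # the delete is reclaimed while seq#0 survives, so
                       # rolling back seq#2 would wrongly restore seq#0
```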
We discussed this yesterday, and it has to do with the fact that Lucene treats a document that is indexed as soft-deleted (i.e., the tombstone) differently than a doc that is indexed and later soft-deleted while it is still in the indexing buffer. As a workaround we decided to detect this case and force a refresh to maintain semantics. This will allow this work to continue while we work on a solution in Lucene.
```java
engine.refresh("test");
Map<Long, Long> seqNos = readUpdatedBySeqNos.apply(engine);
assertThat(seqNos, hasEntry(0L, 2L)); // 0 -> 1 -> 2
assertThat(seqNos, hasEntry(1L, 2L)); // 1 -> 2
```
There was a problem hiding this comment.
assert that the total size is what we expect?
Sorry, I have no idea why I missed the tests. I left some comments there.
I have another approach that requires some changes:
Suppose we have three operations on a single document id, index-1 (i1), delete-2 (d2), and index-3 (i3), with the following processing orders:

- a. i1, d2, i3 -> updated_by_seqno { 2, _, _ }
- c. d2, i1, i3 -> updated_by_seqno { 2, _, _ }
- e. i3, i1, d2 -> updated_by_seqno { 3, 3, _ }
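The orderings above follow a simple marking rule, sketched here in Python. This is a model inferred from the listed examples, not the actual implementation; `updated_by_marks` and the op encoding are hypothetical. An op that supersedes the current copy marks it (unless the current copy is a regular delete tombstone, which never carries the field), and a stale op is marked with the newest seqno seen so far:

```python
def updated_by_marks(ops):
    """ops: (seqno, kind) pairs for one doc id in processing order,
    kind in {'index', 'delete'}. Returns {seqno: updated_by_seqno}."""
    marks = {}
    newest_seqno, newest_kind = None, None
    for seqno, kind in ops:
        if newest_seqno is None or seqno > newest_seqno:
            if newest_kind == 'index':
                marks[newest_seqno] = seqno  # supersede the visible copy
            # a regular delete tombstone never receives the field, so no mark
            newest_seqno, newest_kind = seqno, kind
        else:
            marks[seqno] = newest_seqno  # stale op, marked on arrival
    return marks

i1, d2, i3 = (1, 'index'), (2, 'delete'), (3, 'index')
assert updated_by_marks([i1, d2, i3]) == {1: 2}        # case a
assert updated_by_marks([d2, i1, i3]) == {1: 2}        # case c
assert updated_by_marks([i3, i1, d2]) == {1: 3, 2: 3}  # case e
```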
I think we are good now with these changes. WDYT?
Discussed with Boaz, we agreed not to proceed with this approach because:
We will instead go with Simon's approach which re-adds a delete tombstone with
@bleskes It's ready for another go. Can you please have a look? Thank you!
@dnhatn and I discussed the current approach and sadly it still has some bugs. Consider the following indexing order for the same doc: index v1 (seq 2), delete (seq 4), and index (seq 10). With the current approach, which loads the current doc version from Lucene with a normal searcher, when index (seq 10) comes in it will not see the delete (seq 4), and will not add an extra tombstone doc with update_by_seq_no set to 10. We can fix this by using a reader that exposes tombstone docs, but then we have another problem: if the global checkpoint is 6, the merge policy is free to reclaim the delete tombstone (it's not marked with update by seq no yet) while leaving index v1 (seq 2) in the index. This will trick the rollback into thinking index v1 is the rollback doc.

We've had some ideas on other approaches, but it has become clear that our testing is lacking, as we keep finding issues during reviews rather than through failing tests. Nhat and I had an idea on how to write a test that should cover all these edge cases, and that's going to be the first goal now (i.e., having a test that fails for this issue and also for the problems we found with previous iterations). We're going to give this a day or two, and if we fail to write a test we feel confident about, we're going to invest in a TLA+ model before proceeding. We chose to wait with the TLA+ model as it will take some time to write and will be based on our current model of Lucene (which may be flawed; we are still working on mapping it out) rather than using Lucene itself.
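The seq 2 / seq 4 / seq 10 scenario above can be checked numerically with a small model. This is an illustrative sketch of the described semantics only, with made-up names; it is not engine code. Because the normal searcher hides the delete at seq 4, only index v1 (seq 2) gets marked with 10, and with global checkpoint 6 the merge may drop the tombstone while keeping v1:

```python
def retained_after_merge(docs, global_checkpoint):
    # docs: (seqno, updated_by_seqno or None); keep a doc if it is above the
    # retention floor, or was superseded by an op that may still be rolled back
    floor = global_checkpoint + 1
    return [s for s, upd in docs
            if s >= floor or (upd is not None and upd > global_checkpoint)]

docs = [
    (2, 10),    # index v1: marked, as the only copy the normal searcher sees
    (4, None),  # delete: hidden from the normal searcher, never marked
    (10, None), # index v2
]
assert retained_after_merge(docs, global_checkpoint=6) == [2, 10]
# The seq-4 tombstone is reclaimable while v1 (seq 2) survives, so a
# rollback of seq 10 would mistake v1 for the doc to restore.
```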
@bleskes I've added a test which simulates Lucene merge in f6487e1. This test is able to detect the current issue. Below is an example that it found. Can you please have a look? (/cc @ywelsch)
```java
List<Engine.Operation> operations = new ArrayList<>();
int numOps = scaledRandomIntBetween(10, 200);
for (int seqNo = 0; seqNo < numOps; seqNo++) {
    String id = Integer.toString(between(1, 5));
```
why not use the random history generator? maybe we should extend it if we need to?
I adjusted the history generator method and used it here.
```java
.put(IndexSettings.INDEX_SOFT_DELETES_SETTING.getKey(), true);
final IndexMetaData indexMetaData = IndexMetaData.builder(defaultSettings.getIndexMetaData()).settings(settings).build();
final IndexSettings indexSettings = IndexSettingsModule.newIndexSettings(indexMetaData);
realisticShuffleOperations(operations);
```
why be realistic here? Pure random is a better test, no?
I ran several hundred iterations with pure random shuffling but got no failure. The reason is that the operations were shuffled so much that the local checkpoint did not advance enough to allow us to reclaim documents or tombstones. I then introduced this shuffle method.
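A bounded shuffle of this kind can be sketched as follows. This is a hypothetical stand-in for realisticShuffleOperations, written in Python for brevity; the window size is an assumption. Operations are shuffled only within small consecutive blocks, so seqnos stay roughly in order and the local checkpoint can keep advancing:

```python
import random

def realistic_shuffle(ops, window=5):
    """Shuffle only within consecutive windows of `window` operations."""
    result = list(ops)
    for start in range(0, len(result), window):
        block = result[start:start + window]
        random.shuffle(block)
        result[start:start + window] = block
    return result
```

Because no operation moves more than `window - 1` positions, enough low seqnos are processed early for the local checkpoint to advance, which is what makes documents and tombstones reclaimable during the test.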
```java
}
processedOps.put(op.seqNo(), op);
if (between(1, 20) == 1) {
    assertDocumentsForRollback(engine, globalCheckpoint, processedOps);
```
An interesting case is when the updated_by_seqno is updated twice. This method needs to refresh, and if we refresh after every operation, that case will disappear.
```java
if (between(1, 5) == 1) {
    engine.maybePruneDeletes();
}
if (between(1, 20) == 1) {
```
@dnhatn I think that test is good. I would prefer to change it a bit to not rely on the engine at all in its expectations: it knows the series of indexing operations it has performed and thus can figure out what the "rollback" version should be. I also left some questions about things that weren't clear to me.
Sorry, I got confused: the test already keeps track of indexed ops on its own and uses that to compute expected rollback targets.
@dnhatn I think that we can close this PR for now?
I am closing this as we are going to implement a commit-based rollback. |