CCR: Replicate existing ops with old term on follower #34412
dnhatn merged 15 commits into elastic:master
Conversation
Since #34288, we might hit a deadlock if the FollowTask has more fetchers
than writers. This can happen in the following scenario:
Suppose the leader has two operations [seq#0, seq#1]; the FollowTask has
two fetchers and one writer.
1. The FollowTask issues two concurrent fetch requests: {from_seq_no: 0,
num_ops:1} and {from_seq_no: 1, num_ops:1} to read seq#0 and seq#1
respectively.
2. The second request, which fetches seq#1, completes first and triggers
a write request containing only seq#1.
3. The primary of a follower fails after it has replicated seq#1 to
replicas.
4. Since the old primary did not respond, the FollowTask issues another
write request containing seq#1 (a resend of the previous write request).
5. The new primary has seq#1 already; thus it won't replicate seq#1 to
replicas but will wait for the global checkpoint to advance to at least
seq#1.
The problem is that the FollowTask has only one writer, and that writer
is waiting for seq#0, which won't be delivered until the writer
completes.
This PR proposes to delay write requests when there is a gap in the
write-buffer. With this change, if a writer is waiting for seq_no N,
then all the operations below N have been delivered or are scheduled for
delivery by other writers.
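Purely as illustration of the proposed gap check (not the actual ShardFollowNodeTask code), a minimal sketch: ops are only handed to a writer while they form a contiguous run above the highest seq_no already handed off. The names `buffer`, `lastHandedOffSeqNo`, `pollContiguousOps`, and `Op` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class WriteBufferSketch {
    // ops fetched from the leader, ordered by seq_no
    private final PriorityQueue<Op> buffer = new PriorityQueue<>(Comparator.comparingLong(Op::seqNo));
    private long lastHandedOffSeqNo = -1; // highest seq_no already handed to a writer

    /** Hands out ops only while they form a contiguous run; a gap delays the writer. */
    synchronized List<Op> pollContiguousOps(int maxOps) {
        final List<Op> ops = new ArrayList<>();
        while (ops.size() < maxOps && buffer.isEmpty() == false
                && buffer.peek().seqNo() == lastHandedOffSeqNo + 1) {
            lastHandedOffSeqNo = buffer.peek().seqNo();
            ops.add(buffer.poll());
        }
        return ops; // empty when seq#(lastHandedOffSeqNo + 1) has not been fetched yet
    }

    interface Op {
        long seqNo();
    }
}
```

In the deadlock scenario above, the lone writer would never be handed seq#1 alone: the poll returns empty until seq#0 arrives, so the writer stays free to deliver seq#0 first.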
Pinging @elastic/es-distributed
Another approach is to let the following primary wait for the advancement of the global checkpoint only if its local checkpoint is at least the waiting-for global checkpoint. Otherwise, it returns the unapplied operations to the FollowTask without waiting. In the latter case, the FollowTask puts the unapplied operations back into the buffer, then delivers the head of the buffer (i.e., the operations before the waiting-for GCP), which is the current behavior.
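A rough sketch of that alternative, with hypothetical names throughout (this is not the real transport code): the primary only registers a global-checkpoint listener when it has locally applied everything up to the requested seq_no; otherwise it hands the unapplied ops straight back.

```java
// Hypothetical stand-ins for the real shard/transport plumbing; illustrative only.
final class WaitOrReturnSketch {
    long localCheckpoint;

    void handleWriteRequest(long maxSeqNoOfRequest, Runnable respondApplied, Runnable respondWithUnapplied) {
        if (localCheckpoint >= maxSeqNoOfRequest) {
            // safe to wait: all ops up to the requested seq_no are applied locally,
            // so the global checkpoint can advance without another write request
            addGlobalCheckpointListener(maxSeqNoOfRequest, respondApplied);
        } else {
            // don't block the writer: hand the unapplied ops back so the FollowTask
            // re-buffers them and delivers the head of the buffer first
            respondWithUnapplied.run();
        }
    }

    void addGlobalCheckpointListener(long waitingForGcp, Runnable onAdvance) {
        // stub: the real engine would notify once the GCP reaches waitingForGcp
    }
}
```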
 * Looks up the primary term for a given seq_no in the provided directory reader. The caller must ensure that an operation with the
 * given {@code seqNo} exists in the provided {@code directoryReader}; otherwise this method will throw {@link IllegalStateException}.
 */
public static long lookupPrimaryTerm(final DirectoryReader directoryReader, final long seqNo) throws IOException {
@jimczi Could you please have a look at this Lucene code.
since this wraps all docs live I am not sure it should live here in this class?
@bleskes This is ready. Could you please give it a shot?
int docId;
while ((docId = docIdSetIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
    // make sure to skip the non-root nested documents
    if (primaryTermDV.advanceExact(docId - leaf.docBase) && primaryTermDV.longValue() > 0) {
The docIdSetIterator returns the leaf doc id, so you can use it directly to advance the primaryTermDV?
final Query query = LongPoint.newExactQuery(SeqNoFieldMapper.NAME, seqNo);
final Weight weight = searcher.createWeight(query, ScoreMode.COMPLETE_NO_SCORES, 1.0f);
// iterate backwards since the existing operation is likely in the most recent segments.
for (int i = reader.leaves().size() - 1; i >= 0; i--) {
is this optimization really relevant here? I wonder if we can't just do an ordinary search and then look up the leaf reader based on the top hits?
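For illustration, a sketch of what that simpler variant could look like. The literal field values "_seq_no" and "_primary_term" stand in for SeqNoFieldMapper.NAME and SeqNoFieldMapper.PRIMARY_TERM_NAME so the snippet is self-contained; `ReaderUtil.subIndex` maps the global doc id back to its leaf.

```java
import java.io.IOException;
import java.util.OptionalLong;

import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;

// A sketch only, not the PR's implementation.
final class TopHitsLookupSketch {
    static OptionalLong lookupPrimaryTerm(DirectoryReader reader, long seqNo) throws IOException {
        final IndexSearcher searcher = new IndexSearcher(reader);
        final TopDocs topDocs = searcher.search(LongPoint.newExactQuery("_seq_no", seqNo), 1);
        if (topDocs.scoreDocs.length == 0) {
            return OptionalLong.empty();
        }
        final int globalDocId = topDocs.scoreDocs[0].doc;
        final LeafReaderContext leaf = reader.leaves().get(ReaderUtil.subIndex(globalDocId, reader.leaves()));
        final NumericDocValues primaryTermDV = leaf.reader().getNumericDocValues("_primary_term");
        // note: a non-root nested doc could be the top hit; it carries no primary term,
        // which is why the PR's loop filters on longValue() > 0
        if (primaryTermDV != null && primaryTermDV.advanceExact(globalDocId - leaf.docBase)) {
            return OptionalLong.of(primaryTermDV.longValue());
        }
        return OptionalLong.empty();
    }
}
```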
if (seqNo <= engineConfig.getGlobalCheckpointSupplier().getAsLong()) {
    return OptionalLong.empty();
} else {
    final long term = VersionsAndSeqNoResolver.lookupPrimaryTerm(searcher.getDirectoryReader(), seqNo);
I think we can just put lookupPrimaryTerm in this class.
@s1monw I have addressed your comments. Could you please have another look?
@bleskes should also look at this.
bleskes left a comment
Looks great. I left some comments. The important one is about the testing.
if (failure.getExistingPrimaryTerm().isPresent()) {
    appliedOperations.add(rewriteOperationWithPrimaryTerm(sourceOp, failure.getExistingPrimaryTerm().getAsLong()));
} else {
    assert targetOp.seqNo() <= primary.getGlobalCheckpoint() : targetOp.seqNo() + " > " + primary.getGlobalCheckpoint();
I think this should also throw an exception so replication stops and we'll know about it (keeping the assertion is fine for testing too).
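A sketch of promoting that assertion to a hard failure, reusing the names from the snippet above; the exact exception type is an assumption, not the one the PR ended up with.

```java
// Sketch only: an existing op above the global checkpoint must have a resolvable primary
// term; failing loudly stops replication instead of silently diverging.
static void failIfTermUnresolvable(Translog.Operation targetOp, long globalCheckpoint) {
    if (targetOp.seqNo() > globalCheckpoint) {
        assert false : targetOp.seqNo() + " > " + globalCheckpoint; // keep for tests
        throw new IllegalStateException("can't resolve existing primary term for op ["
            + targetOp.seqNo() + "] above the global checkpoint [" + globalCheckpoint + "]");
    }
}
```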
.../java/org/elasticsearch/xpack/ccr/index/engine/AlreadyProcessedFollowingEngineException.java
            return OptionalLong.of(primaryTermDV.longValue());
        }
    }
    assert false : "seq_no[" + seqNo + "] does not have primary_term";
can we show how many docs we found? This assumes 0, but it might be 2.
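Assembling the review fragments above into one minimal sketch of the lookup, folding in the point about leaf-relative doc ids and a matched-doc count in the failure message. The string constants stand in for SeqNoFieldMapper.NAME and SeqNoFieldMapper.PRIMARY_TERM_NAME; this is a sketch, not the merged code.

```java
import java.io.IOException;

import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.NumericDocValues;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.Scorer;
import org.apache.lucene.search.Weight;

final class PrimaryTermLookupSketch {
    static long lookupPrimaryTerm(final DirectoryReader reader, final long seqNo) throws IOException {
        final IndexSearcher searcher = new IndexSearcher(reader);
        searcher.setQueryCache(null); // no point in caching a one-off lookup
        final Query query = LongPoint.newExactQuery("_seq_no", seqNo);
        final Weight weight = searcher.createWeight(query, ScoreMode.COMPLETE_NO_SCORES, 1.0f);
        int matchingDocs = 0;
        // iterate backwards since the existing operation is likely in the most recent segments
        for (int i = reader.leaves().size() - 1; i >= 0; i--) {
            final LeafReaderContext leaf = reader.leaves().get(i);
            final Scorer scorer = weight.scorer(leaf);
            if (scorer == null) {
                continue;
            }
            final NumericDocValues primaryTermDV = leaf.reader().getNumericDocValues("_primary_term");
            final DocIdSetIterator docIdSetIterator = scorer.iterator();
            int docId;
            while ((docId = docIdSetIterator.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                matchingDocs++;
                // the per-leaf scorer returns segment-relative doc ids, so advance directly;
                // non-root nested docs carry no primary term and fail the > 0 check
                if (primaryTermDV != null && primaryTermDV.advanceExact(docId) && primaryTermDV.longValue() > 0) {
                    return primaryTermDV.longValue();
                }
            }
        }
        throw new IllegalStateException("seq_no[" + seqNo + "] does not have a primary_term, matched ["
            + matchingDocs + "] docs");
    }
}
```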
assertThat(result.getResultType(), equalTo(Engine.Result.Type.FAILURE));
assertThat(result.getFailure(), instanceOf(AlreadyProcessedFollowingEngineException.class));
AlreadyProcessedFollowingEngineException failure = (AlreadyProcessedFollowingEngineException) result.getFailure();
assertThat(failure.getExistingPrimaryTerm().getAsLong(), equalTo(operationWithTerms.get(op.seqNo())));
this seems to mean we never deliver ops below the global checkpoint (and flush, etc.). Can we extend the test to cover that too?
@bleskes I've addressed your comments. Would you please take another look?
This PR adds an assertion that the existing operation is equal to the operation being processed, except for the primary term. Relates elastic#34412
Since #34288, we might hit a deadlock if the FollowTask has more fetchers than writers, as described in the scenario above. This PR proposes to replicate existing operations with the old primary term (instead of the current term) on the follower. In particular, when the following primary detects that it has already processed an operation, it looks up the term of the existing operation with the same seq_no in the Lucene index, then rewrites the operation with that old term before replicating it to the following replicas. This approach is wait-free but requires soft-deletes on the follower. Relates #34288
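A condensed sketch of the primary-side handling described here (illustrative, not the exact merged code; `rewriteOperationWithPrimaryTerm` is the helper from the review snippet above, and AlreadyProcessedFollowingEngineException is the CCR exception added by this PR):

```java
// Sketch only. Needs: java.util.*, org.elasticsearch.index.engine.Engine,
// org.elasticsearch.index.shard.IndexShard, org.elasticsearch.index.translog.Translog.
static List<Translog.Operation> applyOnFollowingPrimary(IndexShard primary,
                                                        List<Translog.Operation> sourceOps) throws IOException {
    final List<Translog.Operation> appliedOperations = new ArrayList<>();
    for (Translog.Operation sourceOp : sourceOps) {
        final Engine.Result result = primary.applyTranslogOperation(sourceOp, Engine.Operation.Origin.PRIMARY);
        if (result.getResultType() == Engine.Result.Type.SUCCESS) {
            appliedOperations.add(sourceOp);
        } else if (result.getFailure() instanceof AlreadyProcessedFollowingEngineException) {
            final AlreadyProcessedFollowingEngineException failure =
                (AlreadyProcessedFollowingEngineException) result.getFailure();
            if (failure.getExistingPrimaryTerm().isPresent()) {
                // already processed but still above the global checkpoint: replicate the op
                // with the term of the existing copy so replicas index the same operation
                appliedOperations.add(rewriteOperationWithPrimaryTerm(sourceOp, failure.getExistingPrimaryTerm().getAsLong()));
            }
            // at or below the global checkpoint: every in-sync copy already has the op; skip it
        }
    }
    return appliedOperations; // replicated as-is to the following replicas
}
```

Because a skipped or rewritten operation never blocks on the global checkpoint, the lone-writer deadlock from the scenario above cannot occur; the lookup of the old term relies on soft-deletes keeping the existing copy in the Lucene index.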
Since elastic#34412 and elastic#34474, a follower must have soft-deletes enabled to work correctly. This change requires soft-deletes on the follower. Relates elastic#34412 Relates elastic#34474