Use follower primary term when applying operations by jasontedor · Pull Request #31113 · elastic/elasticsearch

jasontedor · 2018-06-05T18:55:54Z

The primary shard copy on the following has authority of the replication operations that occur on the following side in cross-cluster replication. Yet today we are using the primary term directly from the operations on the leader side. Instead we should be replacing the primary term on the following side with the primary term of the primary on the following side. This commit does this by copying the translog operations with the corrected primary term. This ensures that we use this primary term while applying the operations on the primary, and when replicating them across to the replica (where the replica request was carrying the primary term of the primary shard copy on the follower).

elasticmachine · 2018-06-05T18:55:58Z

Pinging @elastic/es-distributed

jasontedor · 2018-06-05T21:17:35Z

run gradle build tests

* ccr: [DOCS] Creates rest-api folder in docs [Rollup] Disallow index patterns that match the rollup index (elastic#30491) Replace exact numDocs by soft-del count in SegmentInfo (elastic#31086) Upgrade to Lucene-7.4.0-snapshot-0a7c3f462f (elastic#31073) Add cors support to NioHttpServerTransport (elastic#30827) [DOCS] Fixes security example (elastic#31082) Allow terms query in _rollup_search (elastic#30973) Removing erroneous repeat Adapt bwc versions after backporting elastic#30983 to 6.4 [Tests] Muting RatedRequestsTests#testXContentParsingIsNotLenient TEST: Retry synced-flush if ongoing ops on primary (elastic#30978) Fix docs build. Only auto-update license signature if all nodes ready (elastic#30859) Add BlobContainer.writeBlobAtomic() (elastic#30902) Add a doc value format to binary fields. (elastic#30860)

bleskes

LGTM. Left some minor comments

bleskes · 2018-06-06T05:54:54Z

test/framework/src/main/java/org/elasticsearch/index/shard/IndexShardTestCase.java

+     *                another shard)
     */
    protected IndexShard newShard(boolean primary) throws IOException {
        ShardRouting shardRouting = TestShardRouting.newShardRouting(new ShardId("index", "_na_", 0), randomAlphaOfLength(10), primary,


call newShard(primary, Settings.EMPTY, new InternalEngineFactory()?

I pushed 8086d49.

bleskes · 2018-06-06T06:00:53Z

...rc/main/java/org/elasticsearch/xpack/ccr/action/bulk/TransportBulkShardOperationsAction.java

+                            primary.getPrimaryTerm(),
+                            index.version(),
+                            index.versionType(),
+                            index.source().toBytesRef().bytes,


I think this is dangerous - it's correct now since the way we stream guarantee that the bytes ref is offseted with 0 and has a length equal to the bytes array but this isn't guaranteed by the API. I think we should honor the API or, if this turns out difficult assert at the very least.

I pushed 071a4a3.

bleskes · 2018-06-06T06:02:14Z

...ugin/ccr/src/test/java/org/elasticsearch/xpack/ccr/action/bulk/BulkShardOperationsTests.java

+        final IndexShard followerPrimary = newStartedShard(true, settings, new FollowingEngineFactory());
+
+        // we use this primary on the operations yet we expect the applied operations to have the primary term of the follower
+        final long primaryTerm = randomLongBetween(1, followerPrimary.getPrimaryTerm());


nit: we can do this fully random - there's no relationship between the primary term of the leader and the follower

The randomness was not meant to imply there was such a relationship. 😇

I pushed 0a8499b.

bleskes · 2018-06-06T06:02:57Z

...ugin/ccr/src/test/java/org/elasticsearch/xpack/ccr/action/bulk/BulkShardOperationsTests.java

+            }
+        }
+
+        for (final Translog.Operation operation : result.replicaRequest().getOperations()) {


jasontedor · 2018-06-06T11:09:08Z

run gradle build tests

The primary shard copy on the following has authority of the replication operations that occur on the following side in cross-cluster replication. Yet today we are using the primary term directly from the operations on the leader side. Instead we should be replacing the primary term on the following side with the primary term of the primary on the following side. This commit does this by copying the translog operations with the corrected primary term. This ensures that we use this primary term while applying the operations on the primary, and when replicating them across to the replica (where the replica request was carrying the primary term of the primary shard copy on the follower).

Today we rewrite the operations from the leader with the term of the following primary because the follower should own its history. The problem is that a newly promoted primary may re-assign its term to operations which were replicated to replicas before by the previous primary. If this happens, some operations with the same seq_no may be assigned different terms. This is not good for the future optimistic locking using a combination of seqno and term. This change ensures that the primary of a follower only processes an operation if that operation was not processed before. The skipped operations are guaranteed to be delivered to replicas via either primary-replica resync or peer-recovery. However, the primary must not acknowledge until the global checkpoint is at least the highest seqno of all skipped ops (i.e., they all have been processed on every replica). Relates #31751 Relates #31113

jasontedor added review :Distributed/CCR Issues around the Cross Cluster State Replication features labels Jun 5, 2018

jasontedor requested review from bleskes and martijnvg June 5, 2018 18:55

bleskes approved these changes Jun 6, 2018

View reviewed changes

jasontedor added 3 commits June 6, 2018 05:16

More randomness

0a8499b

More reuse

8086d49

Safety first!

071a4a3

jasontedor merged commit bf1152f into elastic:ccr Jun 6, 2018

jasontedor deleted the follower-primary-term branch June 6, 2018 15:13

dnhatn mentioned this pull request Oct 4, 2018

CCR: Following primary should process operations once #34288

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use follower primary term when applying operations#31113

Use follower primary term when applying operations#31113
jasontedor merged 5 commits intoelastic:ccrfrom
jasontedor:follower-primary-term

jasontedor commented Jun 5, 2018

Uh oh!

elasticmachine commented Jun 5, 2018

Uh oh!

jasontedor commented Jun 5, 2018

Uh oh!

bleskes left a comment

Uh oh!

bleskes Jun 6, 2018

Uh oh!

jasontedor Jun 6, 2018

Uh oh!

bleskes Jun 6, 2018

Uh oh!

jasontedor Jun 6, 2018

Uh oh!

bleskes Jun 6, 2018

Uh oh!

jasontedor Jun 6, 2018

Uh oh!

bleskes Jun 6, 2018

Uh oh!

jasontedor commented Jun 6, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jasontedor commented Jun 5, 2018

Uh oh!

elasticmachine commented Jun 5, 2018

Uh oh!

jasontedor commented Jun 5, 2018

Uh oh!

bleskes left a comment

Choose a reason for hiding this comment

Uh oh!

bleskes Jun 6, 2018

Choose a reason for hiding this comment

Uh oh!

jasontedor Jun 6, 2018

Choose a reason for hiding this comment

Uh oh!

bleskes Jun 6, 2018

Choose a reason for hiding this comment

Uh oh!

jasontedor Jun 6, 2018

Choose a reason for hiding this comment

Uh oh!

bleskes Jun 6, 2018

Choose a reason for hiding this comment

Uh oh!

jasontedor Jun 6, 2018

Choose a reason for hiding this comment

Uh oh!

bleskes Jun 6, 2018

Choose a reason for hiding this comment

Uh oh!

jasontedor commented Jun 6, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants