Remove TranslogRecoveryPerformer #24858
Force-pushed 945a7d7 to b0ad489
Force-pushed b0ad489 to ca5b9f9
/**
 * Returns statistics object for the translog. Used during translog recovery, see also {@link Engine#recoverFromTranslog()}
 */
public RecoveryState.Translog getTranslogStats() {
nit: call this (and derivatives) getRecoveryTranslogStats?
for (Translog.Operation op : operations) {
    Engine.Operation engineOp = indexShard().convertToEngineOp(op, Engine.Operation.Origin.PEER_RECOVERY);
    if (engineOp instanceof Engine.Index && ((Engine.Index) engineOp).parsedDoc().dynamicMappingsUpdate() != null) {
        translog.decrementRecoveredOperations(completedOps); // clean-up stats
I think this is a tricky place to put this - it doesn't really know what the retry semantics are (that we always retry a full batch). This is why we had the BatchOperationException. If we want to remove it from the "official exception list" (+1 on that), we can still make BatchOperationException a dedicated non-ElasticsearchException by always rethrowing its cause.
can we also have mapping updates for deletes? I wonder if that is the case now that we allow type introduction for deletes too?!
I think the mapping is tricky too, I am not sure if we can hit it because of a broken mapping or anything in which case we should fail the recovery? Maybe we can use DelayRecoveryException for this purpose instead? it's really nothing else but a delay?
I've changed the flow of this method to first do the conversion for all the operations in the batch and then only proceed with the actual indexing once we've confirmed that there were no mapping updates. This makes the BatchOperationException obsolete.
@s1monw yes, the method TransportShardBulkAction.executeDeleteRequestOnReplica currently has other mapping conditions than our recovery code here. The main motivation for this PR was to change some of the internal APIs to address the divergence between recovery and replication code. We can fix the actual divergences in a follow-up.
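The revised two-pass flow described in this thread (convert and validate the whole batch before applying anything, so a partial batch never needs rollback bookkeeping) can be sketched as follows. This is a minimal, self-contained illustration with hypothetical stand-in types, not the actual Elasticsearch classes:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for Translog.Operation / Engine.Operation.
record TranslogOp(String id, boolean needsMappingUpdate) {}
record EngineOp(String id) {}

class MappingRequiredException extends RuntimeException {
    MappingRequiredException(String id) {
        super("mapping update required for " + id);
    }
}

public class TwoPassBatch {

    // Pass 1: convert the whole batch up front; abort before indexing anything
    // if any operation would require a dynamic mapping update. Because nothing
    // has been applied yet, no per-batch rollback of stats is needed.
    static int indexBatch(List<TranslogOp> ops) {
        List<EngineOp> engineOps = new ArrayList<>();
        for (TranslogOp op : ops) {
            if (op.needsMappingUpdate()) {
                throw new MappingRequiredException(op.id());
            }
            engineOps.add(new EngineOp(op.id()));
        }
        // Pass 2: apply the already-validated operations.
        int applied = 0;
        for (EngineOp engineOp : engineOps) {
            applied++; // stand-in for engine.index(engineOp)
        }
        return applied;
    }

    public static void main(String[] args) {
        System.out.println(indexBatch(List.of(
                new TranslogOp("a", false), new TranslogOp("b", false)))); // prints 2
    }
}
```

If any operation in the batch needs a mapping update, the exception escapes before a single operation reaches the engine, which is what makes a dedicated batch-level exception type unnecessary.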
…he recovery logic from a snapshot
the latest commit pushed me over to +1...
s1monw left a comment
first I got excited, I thought @ywelsch found a way to encapsulate the translog recovery entirely inside the engine... Well, I am still excited since it's moving things into the right places IMO. I also like getting rid of another exception. I left some suggestions, thanks @ywelsch for cleaning things up
@Override
-   public long indexTranslogOperations(
-       List<Translog.Operation> operations, int totalTranslogOps) throws TranslogRecoveryPerformer.BatchOperationException {
+   public long indexTranslogOperations(List<Translog.Operation> operations, int totalTranslogOps) throws MapperException, IOException {
any chance we can get a unittest for this?
There is already extensive test coverage for this (e.g. test subclasses of ESIndexLevelReplicationTestCase).
    exception);
final RecoveryState.Translog translog = recoveryTarget.state().getTranslog();
translog.decrementRecoveredOperations(exception.completedOperations()); // do the maintenance and roll back completed ops
logger.trace("delaying recovery due to missing mapping changes", exception);
I am leaning towards making this a debug or remove it entirely. I really think it's worth a debug statement
ok, I've changed it to "debug" level in 6535937
    return mapperService.documentMapperWithAutoCreate(type); // protected for testing
}

public Engine.Operation convertToEngineOp(Translog.Operation operation, Engine.Operation.Origin origin) {
this class seems simple enough to be unittested? Maybe we can add a test based on IndexShardTestCase
    return translogOpToEngineOpConverter.convertToEngineOp(operation, origin);
}

private int runTranslogRecovery(Engine engine, Translog.Snapshot snapshot) throws IOException {
any chance we can add a test for this to IndexShardTests
    return new Engine.Index(uid, doc, seqNo, primaryTerm, version, versionType, origin, startTime, autoGeneratedIdTimestamp, isRetry);
}

public Engine.Result applyOperation(Engine.Operation operation) throws IOException {
please add a javadoc to this.
…eryperformer-out-of-engine-config
    exception);
final RecoveryState.Translog translog = recoveryTarget.state().getTranslog();
translog.decrementRecoveredOperations(exception.completedOperations()); // do the maintenance and roll back completed ops
logger.debug("delaying recovery due to missing mapping changes", exception);
unrelated, but can this still happen today? do we want to assert here?
I think this can still happen (although very rarely). In theory, we could avoid this by doing something similar to calling recoveryTarget.ensureClusterStateVersion(currentClusterStateVersion) before phase2. The tricky bit is that the current primary might have indexed something based on a mapping change that has not been fully applied yet (i.e. it is being applied, but not yet available under ClusterService.state(); in this case we would rather want to know about the pre-applied state).
    throw new IndexShardNotRecoveringException(shardId, indexShard().state());
}
// first convert all translog operations to engine operations to check for mapping updates
List<Engine.Operation> engineOps = operations.stream().map(
Splits TranslogRecoveryPerformer into three parts:
This makes it possible for peer recovery to use the same IndexShard interface as bulk shard requests (i.e. Engine operations instead of Translog operations). It also pushes the "fail on bad mapping" logic outside of IndexShard. Future pull requests could unify the BulkShard and peer recovery path even more.
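As a rough sketch of the direction this description points at (hypothetical, heavily simplified stand-in types, not the real IndexShard API): peer recovery first converts translog operations into engine operations and then funnels them through the same apply entry point that bulk shard requests use.

```java
import java.util.List;

// Hypothetical, heavily simplified stand-ins for the real classes.
record TranslogOp(String uid, String source) {}
record EngineOp(String uid, String source, String origin) {}

class Shard {
    int applied = 0;

    // Single entry point shared by the replication (bulk) path and peer recovery.
    void applyOperation(EngineOp op) {
        applied++; // stand-in for the engine actually indexing the operation
    }

    // Peer recovery converts translog operations to engine operations
    // and then reuses the same applyOperation entry point.
    EngineOp convertToEngineOp(TranslogOp op, String origin) {
        return new EngineOp(op.uid(), op.source(), origin);
    }

    int recoverFromTranslog(List<TranslogOp> ops) {
        for (TranslogOp op : ops) {
            applyOperation(convertToEngineOp(op, "PEER_RECOVERY"));
        }
        return applied;
    }
}

public class ShardSketch {
    public static void main(String[] args) {
        Shard shard = new Shard();
        System.out.println(shard.recoverFromTranslog(
                List.of(new TranslogOp("1", "{}"), new TranslogOp("2", "{}")))); // prints 2
    }
}
```

The point of the shape above is that the "fail on bad mapping" decision no longer lives inside the shard: both callers hand the shard already-converted engine operations, so replication and recovery can share one code path.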