Non-blocking primary relocation hand-off by ywelsch · Pull Request #19013 · elastic/elasticsearch

ywelsch · 2016-06-21T20:05:48Z

Concurrent primary relocation and indexing can currently lead to a deadlock, because indexing operations block on a (bounded) thread pool during the hand-off phase between the old and the new primary. This PR stops blocking indexing operations: operations that cannot be executed during relocation hand-off are put in a queue and executed once relocation completes.

Relates to #18553, #15900.
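
The queueing idea described above can be illustrated with a small sketch. This is not the code from this PR: the class name `HandOffGate` and its methods are hypothetical, and it assumes that a failed `tryAcquire` can only mean a hand-off is in progress (which holds as long as the permit total is `Integer.MAX_VALUE`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch of a non-blocking operations gate; not the class added by this PR. */
public class HandOffGate {
    // Assumption: effectively unlimited permits, so tryAcquire() only fails during a hand-off.
    private static final int TOTAL_PERMITS = Integer.MAX_VALUE;
    private final Semaphore permits = new Semaphore(TOTAL_PERMITS);
    private final List<Consumer<Runnable>> delayed = new ArrayList<>();

    /**
     * Runs the operation right away with a release handle, or queues it if a hand-off
     * currently holds all permits. The calling thread never waits for the hand-off.
     */
    public void acquire(Consumer<Runnable> operation) {
        synchronized (this) {
            if (!permits.tryAcquire()) {
                delayed.add(operation);
                return;
            }
        }
        // The operation must call the release handle exactly once when it is done.
        operation.accept(permits::release);
    }

    /** Waits for in-flight operations to drain, runs the hand-off, then replays queued operations. */
    public void blockOperations(long timeout, TimeUnit unit, Runnable handOff) {
        try {
            if (!permits.tryAcquire(TOTAL_PERMITS, timeout, unit)) {
                throw new IllegalStateException("timed out waiting for in-flight operations");
            }
            try {
                handOff.run();
            } finally {
                permits.release(TOTAL_PERMITS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for in-flight operations", e);
        } finally {
            List<Consumer<Runnable>> queued;
            synchronized (this) {
                queued = new ArrayList<>(delayed);
                delayed.clear();
            }
            for (Consumer<Runnable> operation : queued) {
                acquire(operation);
            }
        }
    }
}
```

Only the thread performing the hand-off waits here; an indexing thread that arrives during the hand-off returns immediately, and its operation runs later on the thread that finishes the hand-off. A real implementation would likely dispatch the queued operations to a thread pool instead.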

bleskes · 2016-06-27T14:31:18Z

core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java

-            try {
-                if (primaryShardReference.isRelocated()) {
+
+            Callback<Throwable> onFailure = t -> {


All these small callbacks make me think we need an AsyncPrimaryAction which has the channel and the task as fields and implements ActionListener<PrimaryShardReference>, so we can pass it as a value to the callbacks?

I would very much prefer an AsyncPrimaryAction (especially if it extends AbstractRunnable, which makes failure handling simpler).
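
The shape being discussed could look roughly like the sketch below. It is only an illustration under stated assumptions: `Listener`, `FailureRoutingRunnable`, and `PrimaryRef` are hypothetical stand-ins for `ActionListener`, `AbstractRunnable`, and `PrimaryShardReference`, and the string "channel" replaces the real transport channel.

```java
import java.util.function.Consumer;

/** Hypothetical sketch of the AsyncPrimaryAction idea; all nested types are stand-ins. */
public class AsyncPrimaryActionSketch {
    /** Stand-in for ActionListener. */
    public interface Listener<T> {
        void onResponse(T value);
        void onFailure(Exception e);
    }

    /** Stand-in for AbstractRunnable: any exception thrown by doRun() is routed to onFailure. */
    public abstract static class FailureRoutingRunnable implements Runnable {
        protected abstract void doRun() throws Exception;
        public abstract void onFailure(Exception e);

        @Override
        public final void run() {
            try {
                doRun();
            } catch (Exception e) {
                onFailure(e);
            }
        }
    }

    /** Stand-in for PrimaryShardReference. */
    public static final class PrimaryRef {
        final boolean relocated;
        public PrimaryRef(boolean relocated) { this.relocated = relocated; }
    }

    /** One object carries the request and channel, and is itself the callback for lock acquisition. */
    public static final class AsyncPrimaryAction extends FailureRoutingRunnable implements Listener<PrimaryRef> {
        private final String request;
        private final Consumer<String> channel;
        private final Consumer<Listener<PrimaryRef>> acquirePrimary;

        public AsyncPrimaryAction(String request, Consumer<String> channel,
                                  Consumer<Listener<PrimaryRef>> acquirePrimary) {
            this.request = request;
            this.channel = channel;
            this.acquirePrimary = acquirePrimary;
        }

        @Override
        protected void doRun() {
            acquirePrimary.accept(this); // pass ourselves as the listener, no extra lambdas needed
        }

        @Override
        public void onResponse(PrimaryRef ref) {
            // A relocated primary forwards the request to the new primary instead of executing it.
            channel.accept((ref.relocated ? "forwarded: " : "executed: ") + request);
        }

        @Override
        public void onFailure(Exception e) {
            channel.accept("failed: " + e.getMessage()); // single failure path for all errors
        }
    }
}
```

A single `onFailure` satisfies both the listener and the runnable contract, so failures from scheduling and from lock acquisition end up in the same place.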

bleskes · 2016-06-27T15:31:29Z

Thx @ywelsch. I left some comments that I think will simplify things. My main concern here is the extra IndexShardOperationsLock wrapper around SuspendableRefContainer. I'm not sure we need an extra abstraction instead of making SuspendableRefContainer implement the API we need (or just renaming it). It makes things more complex. For example, IndexShardOperationsLock assumes that the only reason why tryAcquire can fail is that a block operation is going on. This is true now, but only because we use Integer.MAX_VALUE as the total number of operations. Someone can change that and not realize the implications it has.
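
The hidden assumption mentioned above can be shown with a plain `java.util.concurrent.Semaphore`. This is a minimal sketch, not code from the PR; `PermitAssumptionDemo` and `failsWithoutBlock` are hypothetical names.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical demo: when can tryAcquire() fail although no block operation is running? */
public class PermitAssumptionDemo {
    /**
     * Simulates in-flight operations that each hold one permit and never release it,
     * then reports whether one more tryAcquire() fails. No block is ever requested here.
     */
    public static boolean failsWithoutBlock(int totalPermits, int inFlightOperations) {
        Semaphore permits = new Semaphore(totalPermits);
        for (int i = 0; i < inFlightOperations; i++) {
            if (!permits.tryAcquire()) {
                return true; // ran out of permits before all in-flight operations started
            }
        }
        return !permits.tryAcquire();
    }
}
```

With `Integer.MAX_VALUE` permits, `failsWithoutBlock(Integer.MAX_VALUE, 1000)` is false, so a failure can be read as "a block is in progress". With a small total such as `failsWithoutBlock(2, 2)` the result is true, and that reading would be wrong.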

ywelsch · 2016-06-29T06:10:49Z

@bleskes I've updated the PR with the following main changes:

- introduced AsyncPrimaryAction and used the original flow of the method to distinguish isRelocated on the PrimaryShardReference.
- inlined SuspendableRefContainer into IndexShardOperationsLock.
- simplified lock acquisition retry (eliminating the loop) to remove some of the non-blocking fanciness.

Please have another look.

bleskes · 2016-06-30T12:05:41Z

core/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java

@@ -157,7 +160,7 @@ protected void resolveRequest(MetaData metaData, IndexMetaData indexMetaData, Re

    /**
     * Synchronous replica operation on nodes with replica copies. This is done under the lock form


form -> from

ywelsch added the >enhancement, review, resiliency, :Distributed/Recovery, and v5.0.0-alpha4 labels Jun 21, 2016

ywelsch assigned bleskes Jun 21, 2016

ywelsch force-pushed the fix/relocation-handoff-deadlock branch 2 times, most recently from a964b78 to 4ae3d5f, June 21, 2016 20:23

clintongormley added v5.0.0-alpha5 and removed v5.0.0-alpha4 labels Jun 22, 2016

bleskes reviewed Jun 27, 2016

bleskes reviewed Jun 30, 2016

ywelsch merged 1 commit into elastic:master from ywelsch:fix/relocation-handoff-deadlock
