Skip to content

Non-blocking primary relocation hand-off#19013

Merged
ywelsch merged 1 commit intoelastic:masterfrom
ywelsch:fix/relocation-handoff-deadlock
Jul 2, 2016
Merged

Non-blocking primary relocation hand-off#19013
ywelsch merged 1 commit intoelastic:masterfrom
ywelsch:fix/relocation-handoff-deadlock

Conversation

@ywelsch
Copy link
Copy Markdown
Contributor

@ywelsch ywelsch commented Jun 21, 2016

Primary relocation and indexing concurrently can currently lead to a deadlock situation as indexing operations are blocked on a (bounded) thread pool during the hand-off phase between old and new primary. This PR replaces blocking of indexing operations by putting operations that cannot be executed during relocation hand-off in a queue to be executed once relocation completes.

Relates to #18553, #15900.

@ywelsch ywelsch added >enhancement review resiliency :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. v5.0.0-alpha4 labels Jun 21, 2016
@ywelsch ywelsch force-pushed the fix/relocation-handoff-deadlock branch 2 times, most recently from a964b78 to 4ae3d5f Compare June 21, 2016 20:23
try {
if (primaryShardReference.isRelocated()) {

Callback<Throwable> onFailure = t -> {
Copy link
Copy Markdown
Contributor

@bleskes bleskes Jun 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all these small call backs make me think we need AsyncPrimaryAction which has the channel and the task as fields and implements ActionListener<PrimaryShardReference> so we can pass it as a value to the call backs?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would very much prefer an AsyncPrimaryAction (especially so that it implements AbstractRunnable, making failures simpler).

@bleskes
Copy link
Copy Markdown
Contributor

bleskes commented Jun 27, 2016

Thx @ywelsch . I left some comments that I think will simplify things. My main concern here is the extra IndexShardOperationsLock wrapper around SuspendableRefContainer. I'm not sure we need an extra abstraction instead of making SuspendableRefContainer implement the API we need (or just rename it). It makes things more complex, for example, IndexShardOperationsLock assumes that the only reason why tryAcquire can fail is that a block operation is going on. This is true now, but only because we use Integer.MAX_VALUE as the total amount of operations. Some one can change that and not realize the implications it has.

@ywelsch
Copy link
Copy Markdown
Contributor Author

ywelsch commented Jun 29, 2016

@bleskes I've updated the PR with the following main changes:

  • introduced AsyncPrimaryAction and use the original flow of the method to distinguish isRelocated on the PrimaryShardReference.
  • inlined SuspendableRefContainer into IndexShardOperationsLock.
  • simplified lock acquisition retry (eliminating the loop) to remove some of the non-blocking fanciness.
    Please have another look.

@@ -157,7 +160,7 @@ protected void resolveRequest(MetaData metaData, IndexMetaData indexMetaData, Re

/**
* Synchronous replica operation on nodes with replica copies. This is done under the lock form
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

form -> from

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement resiliency v5.0.0-alpha5

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants