Prevent deadlock by using separate schedulers by jakelandis · Pull Request #48697 · elastic/elasticsearch

jakelandis · 2019-10-30T16:21:23Z

Currently the BulkProcessor class uses a single scheduler to schedule
flushes and retries. Functionally these are very different concerns, but
sharing one scheduler can result in a deadlock. Specifically, the single
shared scheduler can kick off a flush task, which only finishes when the bulk
that is being flushed finishes. If (for whatever reason) any items in
that bulk fail, it will (by default) schedule a retry. However, that retry
will never run its task, since the flush task is consuming the one and
only thread available from the shared scheduler.
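
To make the failure mode concrete, here is a minimal sketch that reproduces the same shape of deadlock using only JDK classes. It is not the BulkProcessor code: the class and method names are hypothetical, a plain single-thread `ScheduledExecutorService` stands in for the shared scheduler, and a timeout replaces the real unbounded wait so that the demo terminates.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class SharedSchedulerDeadlock {

    /**
     * Schedules a "flush" that queues a "retry" on the SAME single-thread
     * scheduler and then waits for it. Returns true only if the retry ran
     * while the flush was still waiting.
     */
    public static boolean retryRunsOnSharedScheduler(long timeoutMillis) {
        ScheduledExecutorService shared = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch retryDone = new CountDownLatch(1);
        CountDownLatch flushDone = new CountDownLatch(1);
        AtomicBoolean retryRan = new AtomicBoolean(false);
        try {
            shared.schedule(() -> {
                // The bulk "failed": queue a retry on the scheduler we are running on.
                shared.schedule(retryDone::countDown, 10, TimeUnit.MILLISECONDS);
                try {
                    // The flush only finishes once the bulk (including its retry) finishes.
                    // The timeout stands in for the real, unbounded wait.
                    retryRan.set(retryDone.await(timeoutMillis, TimeUnit.MILLISECONDS));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    flushDone.countDown();
                }
            }, 0, TimeUnit.MILLISECONDS);
            flushDone.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            shared.shutdownNow();
        }
        return retryRan.get();
    }

    public static void main(String[] args) {
        System.out.println("retry ran: " + retryRunsOnSharedScheduler(500));
    }
}
```

The retry cannot start while the flush occupies the only worker thread, so the wait always times out here. Without the timeout, the flush would wait forever.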

Since the BulkProcessor is mostly client-based code, the client can
provide their own scheduler. As-is, the scheduler would require
at minimum 2 worker threads to avoid the potential deadlock. Since the
number of threads is a configuration option in the scheduler, the code
cannot enforce this 2-worker rule until runtime. For this reason, this
commit splits the single task scheduler into 2 schedulers. This eliminates
the potential for the flush task to block the retry task and removes this
deadlock scenario.
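
The idea behind the split can be sketched with plain JDK executors. This is an illustration under assumptions, not the code in this PR: the class and method names are hypothetical, and two single-thread `ScheduledExecutorService` instances stand in for the flush and retry schedulers.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class SeparateSchedulers {

    /**
     * Schedules a "flush" on one single-thread scheduler and its "retry" on
     * another. Returns true if the retry ran while the flush was waiting.
     */
    public static boolean retryRunsWithSeparateSchedulers(long timeoutMillis) {
        ScheduledExecutorService flushScheduler = Executors.newSingleThreadScheduledExecutor();
        ScheduledExecutorService retryScheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch retryDone = new CountDownLatch(1);
        CountDownLatch flushDone = new CountDownLatch(1);
        AtomicBoolean retryRan = new AtomicBoolean(false);
        try {
            flushScheduler.schedule(() -> {
                // The bulk "failed": the retry goes to its own scheduler,
                // so it does not need the thread this flush is holding.
                retryScheduler.schedule(retryDone::countDown, 10, TimeUnit.MILLISECONDS);
                try {
                    retryRan.set(retryDone.await(timeoutMillis, TimeUnit.MILLISECONDS));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    flushDone.countDown();
                }
            }, 0, TimeUnit.MILLISECONDS);
            flushDone.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            flushScheduler.shutdownNow();
            retryScheduler.shutdownNow();
        }
        return retryRan.get();
    }

    public static void main(String[] args) {
        System.out.println("retry ran: " + retryRunsWithSeparateSchedulers(2000));
    }
}
```

Each scheduler still has only one thread, yet the flush can no longer starve the retry, which is why the split works regardless of how many threads a client configures.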

This commit also deprecates the Java APIs that presume a single scheduler,
and updates any internal code to no longer use those APIs.

Fixes #47599

Note - #41451 fixed the general case where a bulk that fails and is retried
can result in a deadlock. This fix should address that case as well as
the case when a bulk failure from the flush needs to be retried.

This should be considered for backporting to 6.x. Thoughts?

elasticmachine · 2019-10-30T16:21:25Z

Pinging @elastic/es-core-features (:Core/Features/Java High Level REST Client)

hub-cap · 2019-10-30T19:49:43Z

I think it would be nice to have this in 6.8, since there are some incompatibilities between different client and server versions until we get things fully split up. The change looks good to me, but I'm not a subject matter expert in the bulk stuff, so I'll decline to add a proper review.

martijnvg · 2019-10-31T09:40:36Z

> This should be considered for backporting to 6.x. Thoughts?

👍 Yes, given the severity of the issues this can cause.

martijnvg

LGTM

jakelandis added >bug :Core/Features/Java High Level REST Client :Distributed/Watcher v8.0.0 v7.6.0 labels Oct 30, 2019

jakelandis requested a review from martijnvg October 30, 2019 16:21

jakelandis mentioned this pull request Oct 30, 2019

Fix BulkProcessor deadlock when bulk requests fail (#47599) #48013

Closed

fix javadoc

bec2893

martijnvg approved these changes Oct 31, 2019

View reviewed changes

jakelandis added v6.8.5 v7.5.0 labels Oct 31, 2019

jakelandis merged commit c38079d into elastic:master Oct 31, 2019

jakelandis deleted the bulk_processor_deadlock_47599 branch October 31, 2019 18:02

jakelandis added the backport pending label Oct 31, 2019

jakelandis mentioned this pull request Nov 11, 2019

[6.8] Prevent deadlock by using separate schedulers (#48697) #48963

Merged

jakelandis mentioned this pull request Nov 11, 2019

[7.x] Prevent deadlock by using separate schedulers (#48697) #48964

Merged

jakelandis mentioned this pull request Nov 11, 2019

[7.5] Prevent deadlock by using separate schedulers (#48697) #48965

Merged

jakelandis removed the backport pending label Nov 12, 2019

suxinglee mentioned this pull request Dec 2, 2019

[FLINK-11046][elasticsearch] Bump elasticsearch-rest-high-level-client to 7.5.0 in connectors apache/flink#10385

Closed

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

jakelandis mentioned this pull request Apr 7, 2020

BulkProcessor can deadlock when bulk requests fail #47599

Closed

tankilo mentioned this pull request Apr 13, 2020

Java application using BulkProcessing hangs for threads deadlocked. #44556

Closed

bruckner mentioned this pull request Jul 3, 2020

Bulk processor concurrent requests #41451

Merged

zifeihan mentioned this pull request Nov 2, 2020

Fix deadlock problem when using elasticsearch-client-7.0.0 apache/skywalking#5775

Merged

2 tasks

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

jakelandis merged 2 commits into elastic:master from jakelandis:bulk_processor_deadlock_47599