Relocation and indexing concurrently can lead to a deadlock #18553

@brwe

Elasticsearch version: master (f210605)

Edit: pushed a test to master with an AwaitsFix, see below. Reproduces for me on spinning disk.

This is what happens:

Four nodes: node_t0, node_t1, node_t2, node_t3

node_t1 has P0 and R1 and node_t0 has P1 and R0

An index request for P0 comes in, executes on node_t1 and sends a replication request to node_t0.
An index request for P1 comes in, executes on node_t0 and sends a replication request to node_t1.
Each primary operation holds one shard reference (by which I mean a permit in the SuspendableRefContainer of the shard).
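
To make the "shard reference" wording concrete, here is a minimal sketch of the permit idea. It assumes a plain `java.util.concurrent.Semaphore` as a stand-in; `ShardPermits`, its method names and the permit count are hypothetical and are not the real SuspendableRefContainer API.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the shard's reference container (assumption, not the real class).
final class ShardPermits {
    static final int TOTAL_PERMITS = 1024; // arbitrary count chosen for the sketch
    private final Semaphore permits = new Semaphore(TOTAL_PERMITS, true);

    // An indexing operation takes one permit for as long as it is in flight.
    boolean tryAcquireOne() {
        return permits.tryAcquire();
    }

    void releaseOne() {
        permits.release();
    }

    // Blocking the shard means taking every permit; this only succeeds
    // once no operation holds a permit any more.
    boolean tryBlockAll(long timeoutMillis) {
        try {
            return permits.tryAcquire(TOTAL_PERMITS, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    void unblockAll() {
        permits.release(TOTAL_PERMITS);
    }

    public static void main(String[] args) {
        ShardPermits shard = new ShardPermits();
        shard.tryAcquireOne(); // primary operation waiting for its replica
        System.out.println("block while one permit is held: " + shard.tryBlockAll(50));
        shard.releaseOne();    // replica answered, primary operation finishes
        System.out.println("block after release: " + shard.tryBlockAll(50));
    }
}
```

The point of the sketch is only that a single outstanding permit is enough to keep the "acquire everything" call from succeeding.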

Before the replication requests arrive at their targets (trapped in the network, queued in a thread pool, ...):

P0 starts relocating from node_t1 to node_t2
P1 starts relocating from node_t0 to node_t3

Before relocation finishes, we try to set the state of the IndexShard to relocated and block all further incoming requests by acquiring all shard references.
We do this on both nodes. Neither succeeds in acquiring all references, because each primary request still holds one while it waits for its replica to respond.
In the meantime, new indexing requests come in and try to acquire shard references, but they have to wait too because the relocation is queued ahead of them.
Hence, they block the indexing thread pool.
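
The "queued ahead of them" part can be reproduced in isolation. This is a sketch under the assumption that the permits behave like a fair `java.util.concurrent.Semaphore`; `QueuedBlockSketch` and `newRequestGetsPermit` are hypothetical names, and the real container may be implemented differently.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

final class QueuedBlockSketch {
    static final int ALL = 16; // arbitrary permit count for the sketch

    // Returns true if a new single-permit request obtained a permit
    // while an "acquire all" request was already queued.
    static boolean newRequestGetsPermit(long waitMillis) {
        Semaphore permits = new Semaphore(ALL, true); // fair: waiters are served in FIFO order
        permits.acquireUninterruptibly();             // in-flight primary operation holds one permit

        // The relocation asks for every permit and has to queue.
        Thread relocation = new Thread(() -> {
            try {
                permits.acquire(ALL);
                permits.release(ALL);
            } catch (InterruptedException e) {
                // interrupted during cleanup of the sketch
            }
        });
        relocation.setDaemon(true);
        relocation.start();
        while (!permits.hasQueuedThreads()) {
            Thread.yield(); // wait until the relocation is actually queued
        }

        boolean acquired = false;
        try {
            // The timed tryAcquire honours fairness, so this request waits behind
            // the relocation even though ALL - 1 permits are still free.
            acquired = permits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            relocation.interrupt();
        }
        if (acquired) {
            permits.release();
        }
        return acquired;
    }

    public static void main(String[] args) {
        System.out.println("new request got a permit: " + newRequestGetsPermit(100));
    }
}
```

In the real scenario the waiting request has no timeout and sits on an indexing thread, which is how the pool fills up.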

Now the replication requests arrive at their respective targets. But because the indexing thread pool is full on both nodes, they have to wait as well.

Therefore, the primary requests on node_t0 and node_t1 never release their shard references, the relocations can never finish, and the new indexing requests never finish either.
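
The whole cycle can be modelled in a few lines. This is a simplified sketch, not the AwaitsFix test mentioned above: it assumes one fair `Semaphore` per primary shard and an indexing pool with a single thread per node, and every name in it (`RelocationDeadlockSketch`, `Node`, `runScenario`) is hypothetical.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class RelocationDeadlockSketch {
    static final int ALL = 1024; // arbitrary permit count for the sketch

    // Hypothetical node model: permits of its primary shard plus a one-thread indexing pool.
    static final class Node {
        final Semaphore permits = new Semaphore(ALL, true);
        final ExecutorService indexingPool = Executors.newFixedThreadPool(1);
        Thread relocation;
    }

    static boolean finished(Future<?> f, long waitMillis) {
        try {
            f.get(waitMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException | ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Returns how many of the two replica operations completed within waitMillis each.
    static int runScenario(long waitMillis) {
        Node t0 = new Node();
        Node t1 = new Node();
        Node[] nodes = {t0, t1};
        try {
            // 1. Each primary operation holds one permit while it waits for its replica.
            for (Node n : nodes) {
                n.permits.acquireUninterruptibly();
            }
            // 2. Relocation on each node queues up to take all permits.
            for (Node n : nodes) {
                n.relocation = new Thread(() -> {
                    try {
                        n.permits.acquire(ALL);
                    } catch (InterruptedException e) {
                        // interrupted during cleanup of the sketch
                    }
                });
                n.relocation.setDaemon(true);
                n.relocation.start();
                while (!n.permits.hasQueuedThreads()) {
                    Thread.yield();
                }
            }
            // 3. A new indexing request occupies the only indexing thread, waiting behind the relocation.
            for (Node n : nodes) {
                n.indexingPool.submit(() -> {
                    n.permits.acquire();
                    n.permits.release();
                    return null;
                });
            }
            // 4. Replica operations arrive; completing one would release the other node's primary permit.
            Future<?> replicaOnT0 = t0.indexingPool.submit(() -> t1.permits.release());
            Future<?> replicaOnT1 = t1.indexingPool.submit(() -> t0.permits.release());
            int completed = 0;
            if (finished(replicaOnT0, waitMillis)) completed++;
            if (finished(replicaOnT1, waitMillis)) completed++;
            return completed;
        } finally {
            for (Node n : nodes) {
                n.indexingPool.shutdownNow();
                if (n.relocation != null) n.relocation.interrupt();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("replica operations completed: " + runScenario(200));
    }
}
```

In this model the replica tasks never get a thread, so neither primary permit is released. The sketch only terminates because it gives up after the timeout and interrupts everything.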

Labels: :Distributed/Recovery (anything around constructing a new shard, either from a local or a remote source), >bug, v5.0.0-alpha5