Elasticsearch version: master (f210605)
Edit: pushed a test to master with an AwaitsFix, see below. Reproduces for me on spinning disk.
This is what happens:
four nodes: node_t0, node_t1, node_t2, node_t3
node_t1 has P0 and R1 and node_t0 has P1 and R0
Index request for P0 comes in, executes on node_t1 and sends replication request to node_t0.
Index request for P1 comes in, executes on node_t0 and sends replication request to node_t1.
Both will hold one shard reference (by which I mean a permit in SuspendableRefContainer of the shard).
Before the replication requests arrive at their targets (trapped in the network, queued in a thread pool, ...):
P0 starts relocating from node_t1 to node_t2
P1 starts relocating from node_t0 to node_t3
Before relocation finishes, we try to set the state of IndexShard to relocated and block all further incoming requests by trying to acquire all shard references.
We do this on both nodes. Neither succeeds in acquiring all shard references because the two primary requests are still waiting for their replicas to succeed.
In the meantime, new indexing requests come in and try to acquire shard references, but they have to wait too because the relocation is queued before them.
Hence, they block the indexing thread pool.
Now, the replication requests arrive at their respective targets. But because the indexing thread pool is full on both nodes they too will have to wait.
Therefore, the primary requests on node_t0 and node_t1 will never release their shard references, the relocations can never finish, and the new indexing requests never finish either.
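
To make the cycle easier to reason about, here is a minimal stand-alone sketch of the scenario. It is not Elasticsearch code: a fair `java.util.concurrent.Semaphore` stands in for SuspendableRefContainer, a fixed thread pool stands in for the indexing thread pool, and the class name, `TOTAL_PERMITS`, the pool sizes and the 500 ms timeout are all made up for illustration. Only the primary shard's permits are modelled on each node; the replica operation is assumed to need nothing but a free indexing thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical model of the report above, not Elasticsearch code.
final class RelocationDeadlockSketch {
    static final int TOTAL_PERMITS = 100; // assumed size of the shard's permit pool

    /** One node: permits of its primary shard plus a bounded indexing pool. */
    static final class Node {
        // Fair semaphore: later acquirers queue behind a waiting "acquire all".
        final Semaphore permits = new Semaphore(TOTAL_PERMITS, true);
        final ExecutorService indexPool;

        Node(int poolSize) {
            indexPool = Executors.newFixedThreadPool(poolSize, r -> {
                Thread t = new Thread(r);
                t.setDaemon(true);
                return t;
            });
        }
    }

    static void acquireThenRelease(Semaphore s, int n) {
        try {
            s.acquire(n);
            s.release(n);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if both replica operations ran within the timeout, false if the nodes are stuck. */
    static boolean replicasComplete(int poolSize, int newRequestsPerNode) {
        Node[] nodes = { new Node(poolSize), new Node(poolSize) };
        Thread[] relocations = new Thread[2];
        CountDownLatch replicasDone = new CountDownLatch(2);
        try {
            for (int i = 0; i < 2; i++) {
                Node node = nodes[i];
                // The primary operation holds one permit until its replica operation has run.
                node.permits.acquire();
                // The relocation tries to take every permit and blocks on the one held above.
                relocations[i] = new Thread(() -> acquireThenRelease(node.permits, TOTAL_PERMITS));
                relocations[i].setDaemon(true);
                relocations[i].start();
                while (!node.permits.hasQueuedThreads()) {
                    Thread.sleep(1);
                }
                // New indexing requests queue behind the relocation, each pinning a pool thread.
                for (int r = 0; r < newRequestsPerNode; r++) {
                    node.indexPool.execute(() -> acquireThenRelease(node.permits, 1));
                }
                int expectedWaiters = 1 + Math.min(poolSize, newRequestsPerNode);
                while (node.permits.getQueueLength() < expectedWaiters) {
                    Thread.sleep(1);
                }
            }
            // The replication requests arrive and need an indexing thread on the other node.
            for (int i = 0; i < 2; i++) {
                Node primary = nodes[1 - i];
                nodes[i].indexPool.execute(() -> {
                    primary.permits.release(); // replica done, so the primary releases its permit
                    replicasDone.countDown();
                });
            }
            return replicasDone.await(500, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            // Tear down so a stuck run does not leak blocked threads.
            for (int i = 0; i < 2; i++) {
                nodes[i].indexPool.shutdownNow();
                if (relocations[i] != null) {
                    relocations[i].interrupt();
                }
            }
        }
    }
}
```

In this model, `replicasComplete(2, 2)` fills both pools with waiting requests and the replica operations never get a thread, while `replicasComplete(2, 1)` leaves one thread free per node and everything drains. The second call is only a control for the sketch, not a proposed fix.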