Processing pending deletes can block shard initialisation for 30 minutes

Test failure: http://build-us-00.elastic.co/job/es_core_master_window-2008/2553/testReport/junit/org.elasticsearch.indices.state/RareClusterStateIT/testDeleteCreateInOneBulk/

The test fails due to a race in acquiring `ShardLock` locks. 

When an index is deleted, an asynchronous process is started to process pending deletes on the shards of that index. This process first acquires all `ShardLock` locks for the given index in numeric shard order. Meanwhile, the new index may already have been created, and some of its shard locks may already be held by shard creation in `IndicesClusterStateService.applyInitializingShard`. For example, shard 0 is locked by `processPendingDeletes` while shard 1 is locked by `applyInitializingShard`. In that case, `processPendingDeletes` cannot lock shard 1 and blocks, still holding the lock on shard 0 while it waits (up to 30 minutes). This means that shard 0 cannot be initialised for up to 30 minutes.
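
The interleaving can be reproduced outside Elasticsearch with a minimal sketch. Everything in it is hypothetical and only loosely modelled on the behaviour described above: `ShardLockSketch`, `tryLockShard`, `unlockShard` and `lockAll` are illustrative names, not the `NodeEnvironment` API, one binary `Semaphore` per shard stands in for `ShardLock`, and the timeouts are shortened from 30 minutes to milliseconds. Releasing the already-held locks when `lockAll` gives up is also an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ShardLockSketch {
    // One binary semaphore per shard, standing in for the per-shard lock.
    private final Semaphore[] shardLocks;

    public ShardLockSketch(int numShards) {
        shardLocks = new Semaphore[numShards];
        for (int i = 0; i < numShards; i++) {
            shardLocks[i] = new Semaphore(1);
        }
    }

    /** Tries to lock a single shard, waiting at most timeoutMillis. */
    public boolean tryLockShard(int shard, long timeoutMillis) throws InterruptedException {
        return shardLocks[shard].tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public void unlockShard(int shard) {
        shardLocks[shard].release();
    }

    /**
     * Locks all shards in numeric order, keeping earlier locks while waiting
     * for later ones. Returns -1 on success (all locks stay held), or the
     * number of the first shard that could not be locked, in which case the
     * locks taken so far are released (an assumption of this sketch).
     */
    public int lockAll(long timeoutMillis) throws InterruptedException {
        List<Integer> held = new ArrayList<>();
        for (int shard = 0; shard < shardLocks.length; shard++) {
            if (tryLockShard(shard, timeoutMillis)) {
                held.add(shard);
            } else {
                for (int h : held) {
                    unlockShard(h);
                }
                return shard;
            }
        }
        return -1;
    }

    public static void main(String[] args) throws Exception {
        ShardLockSketch locks = new ShardLockSketch(2);

        // Shard creation for the new index already holds shard 1.
        locks.tryLockShard(1, 0);

        // The pending-delete task takes shard 0, then waits on shard 1.
        Thread deleter = new Thread(() -> {
            try {
                locks.lockAll(500);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        deleter.start();
        Thread.sleep(100); // give the deleter time to take shard 0

        // While the deleter waits, shard 0 cannot be locked for initialisation.
        System.out.println("shard 0 lockable while delete waits: " + locks.tryLockShard(0, 50));

        deleter.join();
        System.out.println("shard 0 lockable after delete gives up: " + locks.tryLockShard(0, 50));
    }
}
```

The first check in `main` depends on thread timing, so treat it as an illustration rather than a test. The point is that the wait on shard 1 happens while shard 0 is still held, so the timeout of the delete task becomes the delay for initialising shard 0.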

Interesting bits of stack trace:

```
"elasticsearch[node_t1][generic][T#2]" ID=602 TIMED_WAITING on java.util.concurrent.Semaphore$NonfairSync@2fc45c3b
    at sun.misc.Unsafe.park(Native Method)
    - timed waiting on java.util.concurrent.Semaphore$NonfairSync@2fc45c3b
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409)
    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:555)
    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:485)
    at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:429)
    at org.elasticsearch.indices.IndicesService.processPendingDeletes(IndicesService.java:649)
    at org.elasticsearch.cluster.action.index.NodeIndexDeletedAction.lockIndexAndAck(NodeIndexDeletedAction.java:101)
    at org.elasticsearch.cluster.action.index.NodeIndexDeletedAction.access$300(NodeIndexDeletedAction.java:46)
    at org.elasticsearch.cluster.action.index.NodeIndexDeletedAction$1.doRun(NodeIndexDeletedAction.java:90)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    Locked synchronizers:
    - java.util.concurrent.ThreadPoolExecutor$Worker@b17810e


"elasticsearch[node_t1][clusterService#updateTask][T#1]" ID=591 TIMED_WAITING on java.util.concurrent.Semaphore$NonfairSync@7fdcd730
    at sun.misc.Unsafe.park(Native Method)
    - timed waiting on java.util.concurrent.Semaphore$NonfairSync@7fdcd730
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409)
    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:555)
    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:485)
    at org.elasticsearch.index.IndexService.createShard(IndexService.java:234)
    - locked org.elasticsearch.index.IndexService@707e1798
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:628)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:528)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:185)
    - locked java.lang.Object@773b911a
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:517)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    Locked synchronizers:
    - java.util.concurrent.ThreadPoolExecutor$Worker@26f887da
```

