Skip to content

Processing pending deletes can block shard initialisation for 30 minutes #14932

@ywelsch

Description

@ywelsch

Test failure: http://build-us-00.elastic.co/job/es_core_master_window-2008/2553/testReport/junit/org.elasticsearch.indices.state/RareClusterStateIT/testDeleteCreateInOneBulk/

The test fails due to a race in acquiring ShardLock locks.

When an index is deleted, an asynchronous process is started to process pending deletes on shards of that index. This process first acquires all ShardLock locks for the given index in numeric shard order. Meanwhile, the new index can already have been created, and some shard locks can already be held due to shard creation in IndicesClusterStateService.applyInitializingShard. For example, shard 0 is locked by processPendingDeletes but shard 1 is locked by applyInitializingShard. In that case, processPendingDeletes cannot lock shard 1 and blocks (and will hold lock on shard 0 for 30 minutes). This means that shard 0 cannot be initialised for 30 minutes.

Interesting bits of stack trace:

"elasticsearch[node_t1][generic][T#2]" ID=602 TIMED_WAITING on java.util.concurrent.Semaphore$NonfairSync@2fc45c3b
    at sun.misc.Unsafe.park(Native Method)
    - timed waiting on java.util.concurrent.Semaphore$NonfairSync@2fc45c3b
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409)
    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:555)
    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:485)
    at org.elasticsearch.env.NodeEnvironment.lockAllForIndex(NodeEnvironment.java:429)
    at org.elasticsearch.indices.IndicesService.processPendingDeletes(IndicesService.java:649)
    at org.elasticsearch.cluster.action.index.NodeIndexDeletedAction.lockIndexAndAck(NodeIndexDeletedAction.java:101)
    at org.elasticsearch.cluster.action.index.NodeIndexDeletedAction.access$300(NodeIndexDeletedAction.java:46)
    at org.elasticsearch.cluster.action.index.NodeIndexDeletedAction$1.doRun(NodeIndexDeletedAction.java:90)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    Locked synchronizers:
    - java.util.concurrent.ThreadPoolExecutor$Worker@b17810e


"elasticsearch[node_t1][clusterService#updateTask][T#1]" ID=591 TIMED_WAITING on java.util.concurrent.Semaphore$NonfairSync@7fdcd730
    at sun.misc.Unsafe.park(Native Method)
    - timed waiting on java.util.concurrent.Semaphore$NonfairSync@7fdcd730
    at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
    at java.util.concurrent.Semaphore.tryAcquire(Semaphore.java:409)
    at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:555)
    at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:485)
    at org.elasticsearch.index.IndexService.createShard(IndexService.java:234)
    - locked org.elasticsearch.index.IndexService@707e1798
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:628)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:528)
    at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:185)
    - locked java.lang.Object@773b911a
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:517)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:231)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:194)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    Locked synchronizers:
    - java.util.concurrent.ThreadPoolExecutor$Worker@26f887da

Metadata

Metadata

Assignees

No one assigned

    Labels

    :Distributed/EngineAnything around managing Lucene and the Translog in an open shard.>bug>test-failureTriaged test failures from CI

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions