Generic threads deadlock related to ILMHistoryStore indexing #68468

@DaveCTurner

Description

Elasticsearch version (bin/elasticsearch --version): 7.10.1

Plugins installed: Cloud

JVM version (java -version): 15.0.1+9

OS version (uname -a if on a Unix-like system): Cloud

Description of the problem including expected versus actual behavior:

All generic threads are waiting with the following stack trace:

at java.util.concurrent.locks.LockSupport.park(Ljava/lang/Object;)V (LockSupport.java:211)                                                                                                             
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Ljava/util/concurrent/locks/AbstractQueuedSynchronizer$Node;IZZZJ)I (AbstractQueuedSynchronizer.java:714)                             
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(I)V (AbstractQueuedSynchronizer.java:937)                                                                                             
at java.util.concurrent.locks.ReentrantLock$Sync.lock()V (ReentrantLock.java:153)                                                                                                                      
at java.util.concurrent.locks.ReentrantLock.lock()V (ReentrantLock.java:322)                                                                                                                           
at org.elasticsearch.action.bulk.BulkProcessor.internalAdd(Lorg/elasticsearch/action/DocWriteRequest;)V (BulkProcessor.java:379)                                                                       
at org.elasticsearch.action.bulk.BulkProcessor.add(Lorg/elasticsearch/action/DocWriteRequest;)Lorg/elasticsearch/action/bulk/BulkProcessor; (BulkProcessor.java:361)                                   
at org.elasticsearch.action.bulk.BulkProcessor.add(Lorg/elasticsearch/action/index/IndexRequest;)Lorg/elasticsearch/action/bulk/BulkProcessor; (BulkProcessor.java:347)                                
at org.elasticsearch.xpack.ilm.history.ILMHistoryStore.lambda$putAsync$0(Lorg/elasticsearch/action/index/IndexRequest;Lorg/elasticsearch/xpack/ilm/history/ILMHistoryItem;)V (ILMHistoryStore.java:150)
at org.elasticsearch.xpack.ilm.history.ILMHistoryStore$$Lambda$8091+0x0000000801eda370.run()V (Unknown Source)                                                                                         
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run()V (ThreadContext.java:678)                                                                                    
at java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (ThreadPoolExecutor.java:1130)                                                                 
at java.util.concurrent.ThreadPoolExecutor$Worker.run()V (ThreadPoolExecutor.java:630)                                                                                                                 
at java.lang.Thread.run()V (Thread.java:832)                                                                                                                                                           

Meanwhile, one of the scheduler threads holds the lock on which they are waiting, and is itself blocked here:

jdk.internal.misc.Unsafe.park(ZJ)V (Native Method)                                                                                                                     
java.util.concurrent.locks.LockSupport.park(Ljava/lang/Object;)V (LockSupport.java:211)                                                                                
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Ljava/util/concurrent/locks/AbstractQueuedSynchronizer$Node;IZZZJ)I (AbstractQueuedSynchronizer.java:714)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(I)V (AbstractQueuedSynchronizer.java:1046)                                            
java.util.concurrent.Semaphore.acquire()V (Semaphore.java:318)                                                                                                         
org.elasticsearch.action.bulk.BulkRequestHandler.execute(Lorg/elasticsearch/action/bulk/BulkRequest;J)V (BulkRequestHandler.java:59)                                   
org.elasticsearch.action.bulk.BulkProcessor.execute(Lorg/elasticsearch/action/bulk/BulkRequest;J)V (BulkProcessor.java:454)                                            
org.elasticsearch.action.bulk.BulkProcessor.execute()V (BulkProcessor.java:463)                                                                                        
org.elasticsearch.action.bulk.BulkProcessor.access$400(Lorg/elasticsearch/action/bulk/BulkProcessor;)V (BulkProcessor.java:54)                                         
org.elasticsearch.action.bulk.BulkProcessor$Flush.run()V (BulkProcessor.java:503)                                                                                      
org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun()V (Scheduler.java:213)                                                                              
org.elasticsearch.common.util.concurrent.AbstractRunnable.run()V (AbstractRunnable.java:37)                                                                            
java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object; (Executors.java:515)                                                                           
java.util.concurrent.FutureTask.run()V (FutureTask.java:264)                                                                                                           
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()V (ScheduledThreadPoolExecutor.java:304)                                                     
java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (ThreadPoolExecutor.java:1130)                                    
java.util.concurrent.ThreadPoolExecutor$Worker.run()V (ThreadPoolExecutor.java:630)                                                                                    
java.lang.Thread.run()V (Thread.java:832)                                                                                                                              

The semaphore on which it is waiting is apparently held by another ongoing flush. I haven't chased this any further, but I could believe that the ongoing flush needs a generic thread to make progress.
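The cycle described above can be sketched with a minimal, hypothetical Java program. The lock, semaphore, and single-threaded pool stand in for BulkProcessor's internal mutex, its concurrent-request permits, and a saturated generic pool; none of the names below come from the Elasticsearch codebase, and this is only an illustration of the pattern, not the actual code path:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlockSketch {
    public static void main(String[] args) throws Exception {
        ReentrantLock mutex = new ReentrantLock();   // stands in for BulkProcessor's internal lock
        Semaphore permits = new Semaphore(1);        // stands in for the concurrent-request permits
        ExecutorService generic = Executors.newSingleThreadExecutor(); // a saturated "generic" pool

        permits.acquire(); // an in-flight bulk request holds the only permit

        // Scheduler thread: the periodic flush takes the lock, then blocks acquiring a permit.
        Thread scheduler = new Thread(() -> {
            mutex.lock();
            try {
                permits.acquire(); // never succeeds: releasing needs a generic thread
                permits.release();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                mutex.unlock();
            }
        });
        scheduler.start();
        Thread.sleep(200); // let the flush take the lock first

        // Generic thread: an ILM history put blocks waiting for the lock...
        generic.submit(() -> { mutex.lock(); mutex.unlock(); });
        // ...so the queued completion that would release the permit never runs.
        generic.submit(permits::release);

        Thread.sleep(500); // give everything time to (fail to) make progress
        System.out.println("lock free: " + (!mutex.isLocked()));
        System.out.println("permit free: " + (permits.availablePermits() > 0));

        scheduler.interrupt(); // break the cycle so the demo can exit
        generic.shutdownNow();
    }
}
```

Neither check succeeds: the flush holds the lock while waiting for a permit, every generic thread is queued behind that lock, and the task that would release the permit is stuck behind them, which matches the three-way wait seen in the stack traces.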

Steps to reproduce:

Unknown.

Provide logs (if relevant):

Not available, but I can share a heap dump privately.

Workaround:

The deadlocked node must be restarted; it will not recover on its own. If the issue persists, the problematic component can be disabled by setting indices.lifecycle.history_index_enabled: false in the elasticsearch.yml file on each master-eligible node and then restarting them all.
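For reference, the workaround setting as it would appear in elasticsearch.yml on each master-eligible node:

```yaml
# Disable ILM history indexing to avoid the deadlocked ILMHistoryStore
indices.lifecycle.history_index_enabled: false
```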

Labels

:Data Management/ILM+SLM, >bug, Team:Data Management (obsolete)
