Description
Bug report
Hi! I hit a deadlock using the AWS Batch executor on 25.10.3. Specifically, I believe the issue is related to #6729, which was backported to 25.10.3 in f59137e.
That PR also puts quite a bit more load on `DescribeJobs`. Specifically, it calls `describeJob` for every task:

`nextflow/plugins/nf-amazon/src/main/nextflow/cloud/aws/batch/AwsBatchTaskHandler.groovy`, line 932 in 887443e:

```groovy
def job = describeJob(jobId)
```
Unlike the task polling supervisor, it does not pass in a context, so these calls are not batched:
`nextflow/plugins/nf-amazon/src/main/nextflow/cloud/aws/batch/AwsBatchTaskHandler.groovy`, lines 194 to 203 in 887443e:

```groovy
if( context ) {
    // check if this response is cached in the batch collector
    if( context.contains(jobId) ) {
        log.trace "[AWS BATCH] hit cache for describe job=$jobId"
        return context.get(jobId)
    }
    log.trace "[AWS BATCH] missed cache for describe job=$jobId"
    // get next 100 job ids for which it's required to check the status
    batchIds = context.getBatchFor(jobId, 100)
}
```
AWS doesn't publish a prescribed rate limit for `DescribeJobs`, but it would be nice if these calls were batched, since they also go through the throttle-wrapped AWS client. That said, this isn't the chief issue here.
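For illustration, client-side batching amounts to partitioning the pending job ids into chunks of up to 100 (the `DescribeJobs` per-call limit) and issuing one call per chunk, caching the results. A minimal sketch in plain Java, with `describeBatch` standing in for the actual AWS client call (names here are hypothetical, not Nextflow's):

```java
import java.util.*;
import java.util.function.Function;

public class DescribeJobsBatcher {
    static final int MAX_BATCH = 100; // DescribeJobs accepts at most 100 job ids per call

    // Partition ids into chunks and call the describe function once per chunk,
    // caching each result so later lookups do not trigger another API call.
    static Map<String, String> describeAll(List<String> jobIds,
                                           Function<List<String>, Map<String, String>> describeBatch) {
        Map<String, String> cache = new HashMap<>();
        for (int i = 0; i < jobIds.size(); i += MAX_BATCH) {
            List<String> chunk = jobIds.subList(i, Math.min(i + MAX_BATCH, jobIds.size()));
            cache.putAll(describeBatch.apply(chunk));
        }
        return cache;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 250; i++) ids.add("job-" + i);
        int[] calls = {0};
        Map<String, String> result = describeAll(ids, chunk -> {
            calls[0]++; // count simulated API round-trips
            Map<String, String> m = new HashMap<>();
            for (String id : chunk) m.put(id, "SUCCEEDED");
            return m;
        });
        System.out.println(calls[0] + " calls for " + result.size() + " jobs");
    }
}
```

With 250 jobs, this does 3 API calls instead of 250 individual `describeJob` calls.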
Expected behavior and actual behavior
I'd expect my pipeline to run to completion. Instead, it deadlocks.
Steps to reproduce the problem
I have a pipeline that kicks off 176 tasks. It then pairs the results, kicking off another 88. On 25.10.3, it deadlocks before submitting any of the 88. Unfortunately, that pipeline is very complex and I don't have a minimal reproducible example, but I have other resources that should help.
Program output
I generated a stack dump with `jstack`. Admittedly, I used Claude to analyze it:
All 10 `AWSBatch-executor` threads are deadlocked waiting for each other. Each of `AWSBatch-executor-1` through `10` shows the same stack:

```
ThrottlingExecutor$Recoverable.call()   ← running a task submission
  → ParallelPollingMonitor$1.invoke()
  → submit0() → submit()
  → notifyTaskSubmit()
  → getTraceRecord()                    ← called during submission!
  → getNumSpotInterruptions()
  → describeJob()
  → ClientProxyThrottler.invokeMethod()
  → doInvoke1()
  → FutureTask.get()                    ← BLOCKED waiting for result
```
The deadlock cycle:
1. Pool has 10 threads (likely availableProcessors * 5 on a 2-core machine)
2. All 10 threads are executing task submissions
3. During submission, getTraceRecord() is called (from Session.notifyTaskSubmit())
4. getTraceRecord() calls getNumSpotInterruptions() → describeJob() → submits to the same executor and blocks
5. No threads available to execute the submitted describeJob tasks
6. Deadlock
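The cycle above is a classic self-submission starvation deadlock: a task running on a bounded pool submits a new task to the same pool and blocks on its result. A minimal, self-contained sketch (a single-thread pool stands in for the saturated 10-thread pool; a timeout is used so the demo terminates instead of hanging):

```java
import java.util.concurrent.*;

public class SelfSubmitDeadlock {
    public static void main(String[] args) throws Exception {
        // One worker stands in for the saturated 10-thread pool:
        // every worker is busy, and each busy worker submits a new
        // task to the same pool and blocks on its result.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        Future<String> outer = pool.submit(() -> {
            // This inner task can never start: the only worker is us.
            Future<String> inner = pool.submit(() -> "describeJob result");
            return inner.get(); // blocks forever -> deadlock
        });
        try {
            outer.get(2, TimeUnit.SECONDS);
            System.out.println("completed");
        } catch (TimeoutException e) {
            System.out.println("deadlocked: worker is waiting on a task the pool can never run");
        }
        pool.shutdownNow();
    }
}
```

With a 10-thread pool the same thing happens as soon as all 10 workers are inside a submission at once, which is exactly what the stack dump shows.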
The "Task monitor" thread is also blocked on `checkIfRunning()` → `describeJob()`, waiting for a worker thread that will never be free.
In any case, this explanation matches the behaviour I saw, and reinstating the `isCompleted()` check in `getNumSpotInterruptions` fixes the issue.
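The guard I'm describing has roughly this shape (a hypothetical sketch with stand-in fields, not the actual Nextflow code): skip the `describeJob` call entirely until the task has completed, so the trace record can be built during submission without touching the throttled client.

```java
public class SpotInterruptionsGuard {
    // Hypothetical stand-ins for the handler's state and the AWS call
    boolean completed = false;
    int describeCalls = 0;

    Integer getNumSpotInterruptions() {
        if (!completed) return null; // guard: no API call while the task is still running
        describeCalls++;             // simulated describeJob round-trip
        return 0;
    }

    public static void main(String[] args) {
        SpotInterruptionsGuard h = new SpotInterruptionsGuard();
        System.out.println(h.getNumSpotInterruptions()); // task not finished: no call made
        h.completed = true;
        System.out.println(h.getNumSpotInterruptions()); // safe to query now
        System.out.println("describeJob calls: " + h.describeCalls);
    }
}
```

The point of the guard is that `getTraceRecord()` called from `notifyTaskSubmit()` returns immediately instead of submitting a blocking `describeJob` to the already-saturated executor.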
Environment
- Nextflow version: 25.10.3
- Java version: 21
- Operating system: linux
- Bash version: N/A
Critically: I was running on a 2 core AWS instance. I believe this limits the number of workers available and can make it easier to hit deadlocks.
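For reference, the pool sizing from the stack analysis works out as follows on such an instance (assuming the `availableProcessors * 5` heuristic inferred above; I haven't confirmed this is the exact formula Nextflow uses):

```java
public class PoolSize {
    public static void main(String[] args) {
        int cores = 2;                 // a 2-core AWS instance, as in this report
        int poolSize = cores * 5;      // assumed availableProcessors * 5 heuristic
        System.out.println("worker threads: " + poolSize);
    }
}
```

Ten workers means only ten concurrent submissions are needed to saturate the pool, which 176 parallel tasks easily provide.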
Additional context
Can we backport a fix to check isCompleted() again? I'm not sure if Google executors or other executors have the same issue and have fixed it differently.