
Deadlock on 25.10.3 on AWS Batch #6802

@jchorl

Description


Bug report

Hi! I hit a deadlock using the AWS Batch executor on 25.10.3. Specifically, I believe the issue is related to #6729, which was backported to 25.10.3 in f59137e.

That PR also puts quite a bit more load on DescribeJobs: it calls describeJob for every task.

Unlike the task polling supervisor, it does not pass in a context, so these calls are not batched:

```groovy
if( context ) {
    // check if this response is cached in the batch collector
    if( context.contains(jobId) ) {
        log.trace "[AWS BATCH] hit cache for describe job=$jobId"
        return context.get(jobId)
    }
    log.trace "[AWS BATCH] missed cache for describe job=$jobId"
    // get next 100 job ids for which it's required to check the status
    batchIds = context.getBatchFor(jobId, 100)
}
```

AWS doesn't publish a rate limit for DescribeJobs, but it would be nice if these calls were batched, since they also go through the throttle-wrapped AWS client. That's not the chief issue here, though.
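For illustration, batching here would mean grouping pending job IDs so one DescribeJobs request covers up to 100 tasks instead of one request per task. A minimal sketch of the chunking logic (plain Java, not the actual Nextflow code; the 100-ID limit matches what the BatchContext excerpt above requests via getBatchFor):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchIds {
    // Partition pending job ids into chunks of at most `size` elements,
    // so a single DescribeJobs call can cover many tasks at once.
    static List<List<String>> partition(List<String> ids, int size) {
        List<List<String>> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += size)
            out.add(new ArrayList<>(ids.subList(i, Math.min(i + size, ids.size()))));
        return out;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 250; i++) ids.add("job-" + i);
        // 250 pending jobs collapse into 3 API calls instead of 250
        System.out.println(partition(ids, 100).size());
    }
}
```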

Expected behavior and actual behavior

I'd expect my pipeline to run to completion. Instead, it deadlocks.

Steps to reproduce the problem

I have a pipeline that kicks off 176 tasks, then pairs the results and kicks off another 88. On 25.10.3 it deadlocks before submitting any of the 88. Unfortunately that pipeline is very complex and I don't have a minimal reproducible example, but the resources below should help.

Program output

I generated a stack dump with jstack:

thread_dump.txt

Admittedly, I used Claude to analyze it:

```
All 10 AWSBatch-executor threads are deadlocked waiting for each other:

AWSBatch-executor-1 through 10:
  ThrottlingExecutor$Recoverable.call()        ← Running a task submission
    → ParallelPollingMonitor$1.invoke()
    → submit0() → submit()
    → notifyTaskSubmit()
    → getTraceRecord()                          ← Called during submission!
    → getNumSpotInterruptions()
    → describeJob()
    → ClientProxyThrottler.invokeMethod()
    → doInvoke1()
    → FutureTask.get()                          ← BLOCKED waiting for result

The deadlock cycle:
1. Pool has 10 threads (likely availableProcessors * 5 on a 2-core machine)
2. All 10 threads are executing task submissions
3. During submission, getTraceRecord() is called (from Session.notifyTaskSubmit())
4. getTraceRecord() calls getNumSpotInterruptions() → describeJob() → submits to the same executor and blocks
5. No threads are available to execute the submitted describeJob tasks
6. Deadlock

The "Task monitor" thread is also blocked on checkIfRunning() → describeJob(), waiting for a thread that will never be free.
```

Regardless, this explanation matches the behaviour I saw, and reinstating the isCompleted() check in getNumSpotInterruptions() fixes the issue.
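The thread-starvation pattern described in the analysis can be reproduced in isolation. A minimal sketch (plain Java rather than Nextflow code; a 1-thread pool stands in for the 10-worker AWSBatch-executor pool, and a bounded wait replaces the indefinite FutureTask.get() purely so the demo terminates):

```java
import java.util.concurrent.*;

public class SelfSubmitDeadlock {
    public static void main(String[] args) throws Exception {
        // Single worker, standing in for the fully-occupied executor pool
        ExecutorService pool = Executors.newFixedThreadPool(1);

        Future<String> outer = pool.submit(() -> {
            // The worker itself submits follow-up work to the SAME pool
            // (analogous to describeJob() being invoked during submission)...
            Future<String> inner = pool.submit(() -> "describeJob result");
            // ...and blocks waiting for it. With every worker busy doing
            // exactly this, the inner task can never start.
            try {
                return inner.get(2, TimeUnit.SECONDS); // bounded, for the demo
            } catch (TimeoutException e) {
                return "DEADLOCK";
            }
        });

        System.out.println(outer.get()); // prints "DEADLOCK"
        pool.shutdownNow();
    }
}
```

With the real code's unbounded FutureTask.get(), the same cycle hangs forever instead of timing out.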

Environment

  • Nextflow version: 25.10.3
  • Java version: 21
  • Operating system: linux
  • Bash version: N/A

Critically, I was running on a 2-core AWS instance. I believe this limits the number of worker threads available and makes the deadlock easier to hit.
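If the pool-sizing rule guessed in the analysis (workers = availableProcessors * 5) is right, it is easy to check what a given machine would get; a quick probe, with the multiplier being an assumption from the stack-dump analysis rather than a confirmed Nextflow constant:

```java
public class PoolSize {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Assumed sizing rule: cores * 5, i.e. 10 workers on a 2-core
        // instance — a pool small enough for 176 submissions to exhaust.
        System.out.println(cores * 5);
    }
}
```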

Additional context

Can we backport a fix that checks isCompleted() again? I'm not sure whether the Google executors or other executors have the same issue, or have fixed it differently.
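As I understand it, the guard being asked for looks roughly like this (a hypothetical sketch with stubbed-out methods, not the actual Nextflow implementation): the spot-interruption count is only fetched once the task has completed, so nothing on the submission path ever re-enters the throttled executor.

```java
public class SpotGuard {
    static boolean completed = false;   // stub for the task's isCompleted() state
    static int describeJobCalls = 0;

    // Stub standing in for the throttle-wrapped AWS describeJob call
    static int describeJob() {
        describeJobCalls++;
        return 2; // pretend the job recorded 2 spot interruptions
    }

    // Sketch of the reinstated guard: skip the AWS lookup entirely
    // unless the task has already completed.
    static int getNumSpotInterruptions() {
        if (!completed)
            return 0;       // submission path: no describeJob(), no deadlock
        return describeJob();
    }

    public static void main(String[] args) {
        System.out.println(getNumSpotInterruptions()); // during submission: 0
        completed = true;
        System.out.println(getNumSpotInterruptions()); // after completion: 2
        System.out.println(describeJobCalls);          // API was hit only once
    }
}
```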
