Sometimes worker ProcessExitedException kills manager even with pmap with retry_check #36709

@oxinabox

I have a weird thing happening.
I am using pmap for local distributed parallelism on very large machines (96 CPU cores, 384GB of RAM).

For my task it is hard to know in advance how much memory it will need.
Memory bounds how much I can parallelize it.
To handle this I start 1 worker per core (96) and let the out-of-memory (OOM) killer kill them off until there is enough memory.
This generally works great: I normally achieve 90% memory utilization and, depending on the version of the task, 20-70 parallel workers remain.

I use Parallelism.jl's robust_pmap,
which is just a thin wrapper around pmap with retry_check set to retry on a bunch of different error conditions, including ProcessExitedException.
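For readers unfamiliar with this setup, here is a minimal sketch of what such a wrapper might look like. It is not the actual Parallelism.jl source: the names `should_retry` and `retrying_pmap` and the `num_retries` keyword are hypothetical, and the predicate only checks for `ProcessExitedException` rather than the full set of conditions. As far as I understand pmap, `retry_check` only takes effect when `retry_delays` is non-empty, so the sketch sets both.

```julia
using Distributed

# Hypothetical predicate: retry only when the failure is a lost worker.
# pmap calls retry_check with (delay_state, exception) and expects a Bool.
should_retry(state, err) = err isa ProcessExitedException

# Hypothetical thin wrapper around pmap (illustration only).
function retrying_pmap(f, xs; num_retries::Int=3)
    return pmap(f, xs;
        retry_check=should_retry,
        # A non-empty retry_delays is what enables retrying at all.
        retry_delays=ExponentialBackOff(n=num_retries),
    )
end
```

Called as `retrying_pmap(x -> x^2, 1:4)`, this behaves like a plain pmap unless a call fails with an error that `should_retry` accepts.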

Recently I have started to see, a few times, a ProcessExitedException for one of the workers take down the manager,
killing my whole program.
That shouldn't be possible, since I retry on those.

Things that have changed recently include:

  • I am now working on a more memory-intensive version of the problem, so a lot more workers get killed. If this comes down to chance, I now have a lot more lotto tickets.
  • I started using CachingPools in robust_pmap, even though I am not actually using large closures this time (only small ones).
    Preliminary testing suggests that if I go back to not using a caching pool and cut things down so fewer workers have to be killed, the problem goes away.

The error is being thrown in https://github.com/JuliaLang/julia/blob/release-1.3/base/asyncmap.jl#L178
from the anonymous function in the foreach. It is being returned from the fetch (not thrown by the fetch, or the stack trace would show it) and then thrown.
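To make the "returned, then thrown" distinction concrete, here is a simplified sketch of the pattern. It is not the Base source; `run_and_rethrow` is a hypothetical helper. Each task captures its failure as a return value, so `fetch` succeeds, and the exception is only thrown afterwards by the consumer, which is why the stack trace points at the foreach rather than at the fetch.

```julia
# Hypothetical helper illustrating the pattern (not the asyncmap.jl code).
function run_and_rethrow(jobs)
    tasks = map(jobs) do job
        @async try
            job()
        catch err
            err  # the exception becomes the task's return value
        end
    end
    foreach(tasks) do t
        v = fetch(t)                 # succeeds; v may be an exception object
        v isa Exception && throw(v)  # thrown here, so the trace starts here
    end
    return nothing
end
```

With this shape, whatever originally raised the exception is no longer visible in the stack trace, which matches what I am seeing.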

So I am wondering if the error is happening somewhere else in the pmap machinery that is outside of the retry wrapper that retry_check sets up,
since the OOM killer can strike at any time.

Labels: needs more info, parallelism
