I have a weird thing happening.
I am using `pmap` for local distributed parallelism on very large machines (96 CPU cores, 384 GB of RAM).
For my task it is hard to know in advance how much memory it will need, and memory is what bounds how much I can parallelize it.
To handle this I have taken the approach of starting 1 worker per core (96), and then letting the out-of-memory (OOM) killer kill them off until the remainder have enough memory.
This generally works great: I normally achieve about 90% memory utilization, with 20-70 parallel workers remaining, depending on the version of the task.
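For concreteness, the setup looks roughly like this (the `include`d file is a placeholder for however the task code actually gets loaded):

```julia
using Distributed

# Start one worker per core and rely on the kernel's OOM killer to trim
# the pool until the surviving workers fit in memory.
addprocs(96)
@everywhere include("my_task.jl")  # hypothetical: load the task code on every worker
```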
I use Parallelism.jl's `robust_pmap`,
which is just a thin wrapper around `pmap` with `retry_check` set to retry on a bunch of different error conditions, including `ProcessExitedException`.
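Roughly, such a wrapper looks like the sketch below (this is my paraphrase, not the actual Parallelism.jl code; the names are illustrative):

```julia
using Distributed

# Treat errors that look like a killed worker as retryable, including
# the case where they arrive wrapped in a RemoteException.
is_deadworker(e) = e isa ProcessExitedException ||
                   (e isa RemoteException && is_deadworker(e.captured.ex))

function my_robust_pmap(f, args...; retries=5)
    pmap(f, args...;
         retry_check  = (state, e) -> is_deadworker(e),
         retry_delays = ExponentialBackOff(n=retries))
end
```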
Recently I have started to see, a few times, a `ProcessExitedException` for one of the workers take down the manager, killing my whole program.
That shouldn't be possible, since I retry on those errors.
Things that have changed recently include:
- I am now working on a more memory-intensive version of the problem, so a lot more workers get killed. If this is a matter of chance, I now hold a lot more lottery tickets.
- I started using `CachingPool`s in `robust_pmap`, even though I am not actually using large closures this time (only small ones); see the sketch below.
Preliminary testing suggests that if I go back to not using a caching pool, and cut things down so that fewer workers have to be killed, the problem goes away.
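The caching-pool change amounts to something like this (`process` and `inputs` are placeholders for the real task):

```julia
using Distributed

# A CachingPool sends each closure to a worker once and reuses it,
# instead of re-serializing it for every task.
pool = CachingPool(workers())
results = pmap(x -> process(x), pool, inputs)
clear!(pool)  # drop the cached closures on the workers when done
```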
The error is being thrown at https://github.com/JuliaLang/julia/blob/release-1.3/base/asyncmap.jl#L178, from the anonymous function in the `foreach`. It is being returned from the `fetch` (not thrown by the `fetch`, or the stacktrace would show it) and then thrown.
So I am wondering if the error is happening somewhere else in the `pmap` machinery, outside of the retry wrapper that `retry_check` sets up, since the OOM killer can strike at any time.
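A minimal way to probe that hypothesis might be something like the following (the worker count, sleeps, and the kill mechanism are all stand-ins for the real OOM behaviour):

```julia
using Distributed
addprocs(8)

# Retry on ProcessExitedException, map over a CachingPool, and abruptly
# kill one worker partway through to imitate the OOM killer, then see
# whether the exception still escapes pmap.
pool  = CachingPool(workers())
check = (state, e) -> e isa ProcessExitedException

victim = last(workers())
@async (sleep(2); remotecall(exit, victim))  # hard-kill one worker mid-run

results = pmap(pool, 1:2_000;
               retry_delays = fill(1.0, 5),
               retry_check  = check) do i
    sleep(0.01)
    i^2
end
```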