I have a weird thing happening.
I am using `pmap` for local distributed parallelism on very large machines (96 CPU cores, 384 GB of RAM).
For my task it is hard to know in advance how much memory it will need, and memory is what bounds how much I can parallelize it.
To handle this I have taken the approach of starting 1 worker per core (96), and then letting the out-of-memory (OOM) killer kill them off until the remainder have enough memory.
This generally works great: I normally achieve about 90% memory utilization, with 20-70 parallel workers remaining, depending on the version of the task.
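For concreteness, the setup looks roughly like this (the `include`d file is a placeholder for however the task code actually gets loaded):

```julia
using Distributed

# Start one worker per core and rely on the kernel's OOM killer to trim
# the pool until the surviving workers fit in memory.
addprocs(96)
@everywhere include("my_task.jl")  # hypothetical: load the task code on every worker
```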
I use Parallelism.jl's `robust_pmap`,
which is just a thin wrapper around `pmap` with `retry_check` set to retry on a bunch of different error conditions, including `ProcessExitedException`.
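Roughly, such a wrapper looks like the sketch below (this is my paraphrase, not the actual Parallelism.jl code; the names are illustrative):

```julia
using Distributed

# Treat errors that look like a killed worker as retryable, including
# the case where they arrive wrapped in a RemoteException.
is_deadworker(e) = e isa ProcessExitedException ||
                   (e isa RemoteException && is_deadworker(e.captured.ex))

function my_robust_pmap(f, args...; retries=5)
    pmap(f, args...;
         retry_check  = (state, e) -> is_deadworker(e),
         retry_delays = ExponentialBackOff(n=retries))
end
```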
Recently I have started to see, a few times, a `ProcessExitedException` for one of the workers take down the manager, killing my whole program.
That shouldn't be possible, since I retry on those errors.
Things that have changed recently include:
- I am now working on a more memory-intensive version of the problem, so a lot more workers get killed. If this is a matter of chance, I now hold a lot more lottery tickets.
- I started using `CachingPool`s in `robust_pmap`, even though I am not actually using large closures this time (only small ones); see the sketch below.
Preliminary testing suggests that if I go back to not using a caching pool, and cut things down so that fewer workers have to be killed, the problem goes away.
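The caching-pool change amounts to something like this (`process` and `inputs` are placeholders for the real task):

```julia
using Distributed

# A CachingPool sends each closure to a worker once and reuses it,
# instead of re-serializing it for every task.
pool = CachingPool(workers())
results = pmap(x -> process(x), pool, inputs)
clear!(pool)  # drop the cached closures on the workers when done
```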
The error is being thrown at https://github.com/JuliaLang/julia/blob/release-1.3/base/asyncmap.jl#L178, from the anonymous function in the `foreach`. It is being returned from the `fetch` (not thrown by the `fetch`, or the stacktrace would show it) and then thrown.
So I am wondering if the error is happening somewhere else in the `pmap` machinery, outside of the retry wrapper that `retry_check` sets up, since the OOM killer can strike at any time.
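A minimal way to probe that hypothesis might be something like the following (the worker count, sleeps, and the kill mechanism are all stand-ins for the real OOM behaviour):

```julia
using Distributed
addprocs(8)

# Retry on ProcessExitedException, map over a CachingPool, and abruptly
# kill one worker partway through to imitate the OOM killer, then see
# whether the exception still escapes pmap.
pool  = CachingPool(workers())
check = (state, e) -> e isa ProcessExitedException

victim = last(workers())
@async (sleep(2); remotecall(exit, victim))  # hard-kill one worker mid-run

results = pmap(pool, 1:2_000;
               retry_delays = fill(1.0, 5),
               retry_check  = check) do i
    sleep(0.01)
    i^2
end
```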