
help tracking thread explosions? #2398

@danpf

Description

I'm having a really weird problem with dask on my SLURM cluster. I'm using dask-jobqueue, but I'm confident that's not the problem.

Actual problem:
SLURM is unable to submit jobs to some nodes because my jobs are generating a highly excessive number of processes (threads).

Brief workflow:

  1. submit a script to SLURM
  2. the script from (1) creates a SLURMCluster (and a dask Client)
  3. (2) submits jobs to SLURM, which start the actual workers
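For context, step (2) looks roughly like this. This is a minimal configuration sketch using dask-jobqueue's `SLURMCluster`; the partition name and resource numbers below are placeholders, not my real settings:

```python
# Minimal sketch of steps (2)-(3); all values are placeholders.
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    queue="normal",     # placeholder partition name
    cores=4,            # threads per worker job
    processes=2,        # worker processes per job
    memory="8GB",
)
cluster.scale(jobs=10)  # step (3): submit the actual worker jobs to SLURM
client = Client(cluster)
```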

Description of problem:
I will attach 2 logfiles from `pstree -l -c -u danpf -a`.
The logfile normal_op.txt is from normal operation.
Could someone help me figure out what all of these extra python processes are?
Why does each forkserver need 6 threads, and why does each worker need 13 threads?
That works out to a node processor load of roughly 13 per worker, which is pretty high. Ideally it would be around 6-7 (or so my admin tells me; they would prefer it to be as low as possible).
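For anyone cross-checking these counts: the per-process thread numbers that pstree reports can be reproduced from inside Python with the standard library alone (nothing dask-specific here; the comment about worker internals is my understanding, not a measured breakdown):

```python
import threading
import time

def thread_count() -> int:
    """Threads currently alive in this process (what pstree sums per PID)."""
    return threading.active_count()

baseline = thread_count()

# Start a few short-lived threads and watch the count climb; a dask worker
# similarly layers an event loop and service threads on top of its task pool,
# which is presumably where the ~13 threads/worker figure accumulates.
sleepers = [threading.Thread(target=time.sleep, args=(0.5,)) for _ in range(4)]
for t in sleepers:
    t.start()
peak = thread_count()

for t in sleepers:
    t.join()

print(baseline, peak)  # peak == baseline + 4 while the sleepers are alive
```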

However, a worker sometimes fails, which seems to occur for various reasons (including but not limited to):

  1. the main scheduler encountering an exception and dying
  2. a SIGKILL command
  3. a worker dying for some other reason (not too sure; I get a lot of weird dask errors sometimes)

If I'm ssh'd into a node and track the number of processes/threads when this happens, I see the thread count explode from ~300-500 to ~7,000-8,000. The output of the pstree command above, taken after I've killed the master (main Client) process, is in cancel_op.txt.

At this point I can't even ssh into the node because there are too many processes, and any time SLURM tries to submit something to that node, the submitted command dies almost instantly. (I don't know whether this is limited to my submissions, or whether I'm actually causing other people's submissions to die as well.)

I guess the questions I would like answered (or help answering) are:

  1. What are all of these threads in the first place? (Normal per-worker load appears to be ~13 threads/processes.) Is there a way to slim this down?
  2. What could be causing these `forkserver` processes to generate so many threads when they lose contact with the master process?
  3. Would a different multiprocessing start method help with this? It wouldn't appear so from what I've read, but I figured I would ask (it looks like spawn, fork, and forkserver are the only options).
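On question (3), the standard library confirms those are indeed the only start methods; here is a small introspection sketch (pure stdlib, nothing dask-specific):

```python
import multiprocessing as mp

# The stdlib offers at most three start methods ("fork", "spawn",
# "forkserver"); the pstree logs above show forkserver processes, and
# which methods are available is platform-dependent.
methods = mp.get_all_start_methods()
print(methods)  # on Linux: ['fork', 'spawn', 'forkserver']

# The process-wide default (changeable once, early, via mp.set_start_method):
print(mp.get_start_method())

# A per-use context selects a method without touching the global default:
ctx = mp.get_context("spawn")
```

Separately, if some of the extra threads turn out to be BLAS/OpenMP pools, setting `OMP_NUM_THREADS=1` (and the MKL/OpenBLAS equivalents) in the worker environment is a common way to slim them down; I don't know yet whether that applies here.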

normal_op.txt
cancel_op.txt

Thanks for any help!!!
Dan
