
help tracking thread explosions? #2398

@danpf

Description

I'm having a really weird problem with dask on my SLURM cluster. I'm using dask-jobqueue, but I'm confident that's not the problem.

Actual problem:
SLURM is unable to submit jobs to some nodes because my jobs are generating a highly excessive number of processes (threads).

Brief workflow:

  1. submit a script to SLURM
  2. the script from (1) creates a SLURMCluster (and a dask Client)
  3. (2) submits jobs to SLURM, which start the actual workers
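For context, step (2) looks roughly like this. This is a minimal configuration sketch using dask-jobqueue's `SLURMCluster`; the partition name and resource numbers below are placeholders, not my real settings:

```python
# Minimal sketch of steps (2)-(3); all values are placeholders.
from dask_jobqueue import SLURMCluster
from dask.distributed import Client

cluster = SLURMCluster(
    queue="normal",     # placeholder partition name
    cores=4,            # threads per worker job
    processes=2,        # worker processes per job
    memory="8GB",
)
cluster.scale(jobs=10)  # step (3): submit the actual worker jobs to SLURM
client = Client(cluster)
```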

Description of problem:
I will attach 2 logfiles from `pstree -l -c -u danpf -a`.
The logfile normal_op.txt is from normal operation.
Could someone help me figure out what all of these extra python processes are?
Why does each forkserver need 6 threads, and why does each worker need 13 threads?
That works out to a node processor load of roughly 13 per worker, which is pretty high. Ideally it would be around 6-7 (or so my admin tells me; they would prefer it to be as low as possible).
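For anyone cross-checking these counts: the per-process thread numbers that pstree reports can be reproduced from inside Python with the standard library alone (nothing dask-specific here; the comment about worker internals is my understanding, not a measured breakdown):

```python
import threading
import time

def thread_count() -> int:
    """Threads currently alive in this process (what pstree sums per PID)."""
    return threading.active_count()

baseline = thread_count()

# Start a few short-lived threads and watch the count climb; a dask worker
# similarly layers an event loop and service threads on top of its task pool,
# which is presumably where the ~13 threads/worker figure accumulates.
sleepers = [threading.Thread(target=time.sleep, args=(0.5,)) for _ in range(4)]
for t in sleepers:
    t.start()
peak = thread_count()

for t in sleepers:
    t.join()

print(baseline, peak)  # peak == baseline + 4 while the sleepers are alive
```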

However, a worker sometimes fails, which seems to occur for various reasons (including but not limited to):

  1. the main scheduler encountering an exception and dying
  2. a SIGKILL command
  3. a worker dying for some other reason (not too sure; I get a lot of weird dask errors sometimes)

If I'm ssh'd into a node and track the number of processes/threads when this happens, I see the thread count explode from ~300-500 to ~7,000-8,000. The output of the pstree command above, taken after I've killed the master (main Client) process, is in cancel_op.txt.

At this point I can't even ssh into the node because there are too many processes, and any time SLURM tries to submit something to that node, the submitted command dies almost instantly. (I don't know whether this is limited to my submissions, or whether I'm actually causing other people's submissions to die as well.)

I guess the questions I would like answered (or help answering) are:

  1. What are all of these threads in the first place? (Normal per-worker load appears to be ~13 threads/processes.) Is there a way to slim this down?
  2. What could be causing these `forkserver` processes to generate so many threads when they lose contact with the master process?
  3. Would a different multiprocessing start method help with this? It wouldn't appear so from what I've read, but I figured I would ask (it looks like spawn, fork, and forkserver are the only options).
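On question (3), the standard library confirms those are indeed the only start methods; here is a small introspection sketch (pure stdlib, nothing dask-specific):

```python
import multiprocessing as mp

# The stdlib offers at most three start methods ("fork", "spawn",
# "forkserver"); the pstree logs above show forkserver processes, and
# which methods are available is platform-dependent.
methods = mp.get_all_start_methods()
print(methods)  # on Linux: ['fork', 'spawn', 'forkserver']

# The process-wide default (changeable once, early, via mp.set_start_method):
print(mp.get_start_method())

# A per-use context selects a method without touching the global default:
ctx = mp.get_context("spawn")
```

Separately, if some of the extra threads turn out to be BLAS/OpenMP pools, setting `OMP_NUM_THREADS=1` (and the MKL/OpenBLAS equivalents) in the worker environment is a common way to slim them down; I don't know yet whether that applies here.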

normal_op.txt
cancel_op.txt

Thanks for any help!!!
Dan
