NonDaemonicSpawnProcess hangs at exit #1497

@albertz

One worker looks like this:

Thread 1130071 (idle): "MainThread"
    __call__ (returnn/returnn/util/multi_proc_non_daemonic_spawn.py:145)

This is the atexit handler. It is stuck at this line:

                os.waitpid(self.proc_pid, 0)

So it hangs waiting for some sub proc, after it has sent SIGINT to it.
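To illustrate why a blocking `os.waitpid(pid, 0)` after SIGINT can hang forever, and one possible way around it, here is a minimal sketch. It is not the code in `multi_proc_non_daemonic_spawn.py`; the helper name `wait_with_escalation` and its `timeout` and `poll_interval` parameters are made up for this example. It polls with `os.WNOHANG` and falls back to SIGKILL if the child does not exit in time.

```python
import os
import signal
import time


def wait_with_escalation(pid, timeout=2.0, poll_interval=0.05):
    """Hypothetical helper: send SIGINT, poll for exit, fall back to SIGKILL.

    Returns the name of the signal that finally made the child exit.
    """
    os.kill(pid, signal.SIGINT)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # With WNOHANG, waitpid returns (0, 0) right away if the child
        # is still running, instead of blocking like waitpid(pid, 0).
        done_pid, _status = os.waitpid(pid, os.WNOHANG)
        if done_pid == pid:
            return "SIGINT"
        time.sleep(poll_interval)
    # The child ignored SIGINT or is stuck in its own cleanup.
    # SIGKILL cannot be caught or ignored by the child.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)
    return "SIGKILL"
```

Whether killing the sub proc is acceptable here is a separate question, since SIGKILL on a worker leaves its own children (e.g. the MPD workers under a TDL worker) without a parent to clean them up.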

Looking at the proc tree (after I sent a few SIGINTs to some of the MPD workers, which are now in defunct state):

1130071          \_ python3.11
1130231          |   \_ python3.11
1130232          |   \_ watch memory
1130350          |   \_ MPD worker 0 <defunct>
1130353          |   \_ MPD worker 1 <defunct>
1130354          |   \_ MPD worker 2 <defunct>
1130355          |   \_ MPD worker 3 <defunct>
1130811          |   \_ python3.11
1131110          |   \_ MPD worker 0
1131111          |   \_ MPD worker 1
1131112          |   \_ MPD worker 2
1131114          |   \_ MPD worker 3
1131433          |   \_ MPD worker 0
1131434          |   \_ MPD worker 1
1131435          |   \_ MPD worker 2
1131467          |   \_ MPD worker 3 <defunct>
1131835          |   \_ TDL worker 0
1132208          |   |   \_ MPD worker 0
1132314          |   |   \_ MPD worker 1
1132420          |   |   \_ MPD worker 2
1132528          |   |   \_ MPD worker 3
1136336          |   \_ TDL worker 0
1136702          |   |   \_ MPD worker 0
1136806          |   |   \_ MPD worker 1
1136929          |   |   \_ MPD worker 2
1137038          |   |   \_ MPD worker 3
1137253          |   \_ TDL worker 0
1137614          |       \_ MPD worker 0
1137718          |       \_ MPD worker 1
1137845          |       \_ MPD worker 2
1137969          |       \_ MPD worker 3

As the main proc hangs in waitpid, maybe it is waiting for one of the TDL workers.

The last TDL worker:

$ py-spy dump -p 1137253
Process 1137253: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=24, pipe_handle=184) --multiprocessing-fork
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/python@3.11/3.11.2_1/bin/python3.11)

Thread 1137253 (idle): "MainThread"
    poll (multiprocessing/popen_fork.py:27)
    wait (multiprocessing/popen_fork.py:43)
    join (multiprocessing/process.py:149)
    join (returnn/returnn/util/multi_proc_non_daemonic_spawn.py:66)
    _exit_function (multiprocessing/util.py:357)
    _bootstrap (multiprocessing/process.py:317)
    _main (multiprocessing/spawn.py:133)
    spawn_main (multiprocessing/spawn.py:120)
    <module> (<string>:1)
Thread 1138171 (idle): "Thread-1 (_serve)"
    accept (socket.py:294)
    accept (multiprocessing/connection.py:608)
    accept (multiprocessing/connection.py:462)
    _serve (multiprocessing/resource_sharer.py:138)
    run (threading.py:975)
    _bootstrap_inner (threading.py:1038)
    _bootstrap (threading.py:995)

So the TDL worker is itself waiting (in join) for one of its sub procs, the MPD workers. But those all look like this:

Thread 1137967 (idle): "MainThread"
    _recv (multiprocessing/connection.py:378)
    _recv_bytes (multiprocessing/connection.py:413)
    recv (multiprocessing/connection.py:249)
    _worker_proc_loop (returnn/returnn/datasets/multi_proc.py:240)
    run (multiprocessing/process.py:108)
    _bootstrap (multiprocessing/process.py:314)
    _main (multiprocessing/spawn.py:133)
    spawn_main (multiprocessing/spawn.py:120)
    <module> (<string>:1)
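One possible reading of these traces is a mutual wait: the parent blocks in `join` while each MPD worker blocks in `recv`, waiting for a message that never comes. The following is a minimal sketch of a worker loop that cannot get stuck this way as long as the parent either sends an exit message or closes its end of the pipe. It is not the actual `_worker_proc_loop` from `returnn/datasets/multi_proc.py`; the function `worker_loop`, the `"exit"` message and the `results` list are assumptions for illustration, and a thread stands in for the worker process to keep the example small.

```python
import multiprocessing
import threading


def worker_loop(conn, results):
    """Hypothetical worker loop: handle messages until told to exit
    or until the other end of the pipe is closed."""
    while True:
        try:
            msg = conn.recv()  # blocks, like recv in the trace above
        except EOFError:
            # Raised once the other end is closed and nothing is left to read.
            results.append("eof")
            break
        if msg == "exit":
            results.append("exit")
            break
        results.append(("handled", msg))


if __name__ == "__main__":
    parent_conn, child_conn = multiprocessing.Pipe()
    results = []
    worker = threading.Thread(target=worker_loop, args=(child_conn, results))
    worker.start()
    parent_conn.send("work")
    parent_conn.close()  # closing before join is what unblocks recv
    worker.join(timeout=5)
    print(results)
```

With real processes, the `EOFError` path only triggers when no other process still holds a copy of the parent's end of the pipe open, so sending an explicit exit message before `join` is the more robust of the two.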

Originally posted by @albertz in #1496 (comment)
