NonDaemonicSpawnProcess hangs at exit #1497
Closed
Description
One worker looks like this:
Thread 1130071 (idle): "MainThread"
__call__ (returnn/returnn/util/multi_proc_non_daemonic_spawn.py:145)
This is the atexit handler. It hangs at this line:
os.waitpid(self.proc_pid, 0)
So it hangs waiting for some sub proc, after it has sent SIGINT to it.
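For illustration, a blocking `os.waitpid(pid, 0)` after SIGINT can wait forever if the sub proc never reacts to the signal. A minimal sketch of a bounded wait with escalation follows. This is not the code in `multi_proc_non_daemonic_spawn.py`; the function name `terminate_child` and the parameters `grace_seconds` and `poll_interval` are hypothetical.

```python
import os
import signal
import time


def terminate_child(pid: int, grace_seconds: float = 5.0, poll_interval: float = 0.05) -> int:
    """Send SIGINT to a child, wait up to grace_seconds, then escalate to SIGKILL.

    Hypothetical helper for illustration. Returns the raw wait status
    as reported by os.waitpid.
    """
    os.kill(pid, signal.SIGINT)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        # With WNOHANG, waitpid returns (0, 0) while the child is still running.
        done_pid, status = os.waitpid(pid, os.WNOHANG)
        if done_pid == pid:
            return status
        time.sleep(poll_interval)
    # The child did not exit in time: SIGKILL cannot be caught or ignored.
    os.kill(pid, signal.SIGKILL)
    _, status = os.waitpid(pid, 0)
    return status
```

The trade-off is that a sub proc killed this way gets no chance to clean up its own children, so it may only be appropriate as a last resort at exit.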
Looking at the proc tree (after I sent a few SIGINTs to some of the MPD workers, which are now in defunct state):
1130071 \_ python3.11
1130231 | \_ python3.11
1130232 | \_ watch memory
1130350 | \_ MPD worker 0 <defunct>
1130353 | \_ MPD worker 1 <defunct>
1130354 | \_ MPD worker 2 <defunct>
1130355 | \_ MPD worker 3 <defunct>
1130811 | \_ python3.11
1131110 | \_ MPD worker 0
1131111 | \_ MPD worker 1
1131112 | \_ MPD worker 2
1131114 | \_ MPD worker 3
1131433 | \_ MPD worker 0
1131434 | \_ MPD worker 1
1131435 | \_ MPD worker 2
1131467 | \_ MPD worker 3 <defunct>
1131835 | \_ TDL worker 0
1132208 | | \_ MPD worker 0
1132314 | | \_ MPD worker 1
1132420 | | \_ MPD worker 2
1132528 | | \_ MPD worker 3
1136336 | \_ TDL worker 0
1136702 | | \_ MPD worker 0
1136806 | | \_ MPD worker 1
1136929 | | \_ MPD worker 2
1137038 | | \_ MPD worker 3
1137253 | \_ TDL worker 0
1137614 | \_ MPD worker 0
1137718 | \_ MPD worker 1
1137845 | \_ MPD worker 2
1137969 | \_ MPD worker 3
As the main proc hangs in waitpid, maybe it is waiting for some TDL worker.
The last TDL worker:
$ py-spy dump -p 1137253
Process 1137253: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=24, pipe_handle=184) --multiprocessing-fork
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/python@3.11/3.11.2_1/bin/python3.11)
Thread 1137253 (idle): "MainThread"
poll (multiprocessing/popen_fork.py:27)
wait (multiprocessing/popen_fork.py:43)
join (multiprocessing/process.py:149)
join (returnn/returnn/util/multi_proc_non_daemonic_spawn.py:66)
_exit_function (multiprocessing/util.py:357)
_bootstrap (multiprocessing/process.py:317)
_main (multiprocessing/spawn.py:133)
spawn_main (multiprocessing/spawn.py:120)
<module> (<string>:1)
Thread 1138171 (idle): "Thread-1 (_serve)"
accept (socket.py:294)
accept (multiprocessing/connection.py:608)
accept (multiprocessing/connection.py:462)
_serve (multiprocessing/resource_sharer.py:138)
run (threading.py:975)
_bootstrap_inner (threading.py:1038)
_bootstrap (threading.py:995)
So this one also waits for some sub proc. But its sub procs all look like this:
Thread 1137967 (idle): "MainThread"
_recv (multiprocessing/connection.py:378)
_recv_bytes (multiprocessing/connection.py:413)
recv (multiprocessing/connection.py:249)
_worker_proc_loop (returnn/returnn/datasets/multi_proc.py:240)
run (multiprocessing/process.py:108)
_bootstrap (multiprocessing/process.py:314)
_main (multiprocessing/spawn.py:133)
spawn_main (multiprocessing/spawn.py:120)
<module> (<string>:1)
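These workers sit in a blocking `recv`, so they only wake up when a message or EOF arrives on the connection. A minimal sketch of a worker loop that cannot block indefinitely is shown below. It is not the actual `_worker_proc_loop` in `returnn/datasets/multi_proc.py`; the `"exit"` message, the `handle` callback, and the `should_stop` hook are assumptions made for illustration.

```python
from typing import Any, Callable


def worker_loop(
    conn,
    handle: Callable[[Any], Any],
    should_stop: Callable[[], bool] = lambda: False,
    poll_seconds: float = 1.0,
) -> str:
    """Serve requests from conn until asked to exit, EOF, or should_stop() is true.

    conn is a multiprocessing.connection.Connection.
    Returns the reason the loop ended: "exit", "eof", or "stopped".
    """
    while True:
        # poll() with a timeout, so the loop regularly regains control
        # instead of blocking forever inside recv().
        if not conn.poll(poll_seconds):
            if should_stop():
                return "stopped"
            continue
        try:
            msg = conn.recv()
        except EOFError:
            # All other ends of the connection were closed.
            return "eof"
        if msg == "exit":  # hypothetical shutdown message
            return "exit"
        conn.send(handle(msg))
```

One possible `should_stop` inside a spawned worker is `lambda: not multiprocessing.parent_process().is_alive()` (available since Python 3.8), which would let the worker exit once its parent is gone. Note that EOF only arrives once every copy of the other end is closed, which may not hold if another process inherited that handle.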
Originally posted by @albertz in #1496 (comment)