[train] Add Torch process group shutdown timeout#56182

Merged
justinvyu merged 2 commits into ray-project:master from TimothySeah:tseah/shutdown-process-workaround on Sep 3, 2025

Conversation

Contributor

@TimothySeah commented Sep 3, 2025

Summary

Shutting down a healthy Torch process group can hang. We sometimes want to do this anyway, for example to restart a group of workers after an async checkpoint upload fails. This change adds a shutdown timeout as a workaround until we figure out how to avoid the hang entirely.

When the timeout is hit, `before_worker_group_shutdown` finishes and the workers are then killed by `ray.kill`: https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/state.py#L127.
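The pattern described above, running a potentially-hanging shutdown call under a deadline and falling back to force-killing the workers, can be sketched as follows. `shutdown_with_timeout` is a hypothetical helper for illustration, not the PR's actual implementation; the real code lives in Ray Train's worker group shutdown path.

```python
import threading
import time


def shutdown_with_timeout(shutdown_fn, timeout_s=30.0):
    """Run a potentially-hanging shutdown call on a daemon thread.

    Returns True if shutdown_fn completed within timeout_s seconds,
    False if the call was abandoned (the caller would then fall back
    to a hard kill, e.g. ray.kill on each worker actor).
    """
    done = threading.Event()

    def _target():
        shutdown_fn()  # e.g. torch.distributed.destroy_process_group()
        done.set()

    # daemon=True so an abandoned, hung shutdown cannot block interpreter exit.
    t = threading.Thread(target=_target, daemon=True)
    t.start()
    t.join(timeout=timeout_s)
    return done.is_set()


# A shutdown that returns promptly succeeds within the timeout.
print(shutdown_with_timeout(lambda: None, timeout_s=1.0))            # True
# A shutdown that hangs past the timeout is abandoned.
print(shutdown_with_timeout(lambda: time.sleep(5), timeout_s=0.2))   # False
```

The daemon-thread approach cannot interrupt the hung call itself; it only lets the coordinator stop waiting, which is why the fallback of `ray.kill`-ing the worker processes is still needed.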

Testing

Unit tests

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah marked this pull request as ready for review September 3, 2025 03:06
@TimothySeah requested a review from a team as a code owner September 3, 2025 03:06
@ray-gardener bot added the train (Ray Train Related Issue) label Sep 3, 2025
Contributor

@justinvyu left a comment


Thanks!

@justinvyu enabled auto-merge (squash) September 3, 2025 17:00
@github-actions bot added the go (add ONLY when ready to merge, run all tests) label Sep 3, 2025
@justinvyu merged commit b738755 into ray-project:master Sep 3, 2025
7 checks passed
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025

Labels

go: add ONLY when ready to merge, run all tests
train: Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants