[train] Add Torch process group shutdown timeout#56182

Merged
justinvyu merged 2 commits into ray-project:master from TimothySeah:tseah/shutdown-process-workaround on Sep 3, 2025

Conversation

Contributor

@TimothySeah commented Sep 3, 2025

Summary

Shutting down a healthy Torch process group can hang. We sometimes want to do this anyway, for example to restart a group of workers after an async checkpoint upload fails. This change adds a shutdown timeout as a workaround until we figure out how to avoid the hang entirely.

When the timeout is hit, `before_worker_group_shutdown` finishes and the workers are then killed by `ray.kill`: https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/worker_group/state.py#L127.
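The pattern described above, running a potentially-hanging shutdown call under a deadline and falling back to force-killing the workers, can be sketched as follows. `shutdown_with_timeout` is a hypothetical helper for illustration, not the PR's actual implementation; the real code lives in Ray Train's worker group shutdown path.

```python
import threading
import time


def shutdown_with_timeout(shutdown_fn, timeout_s=30.0):
    """Run a potentially-hanging shutdown call on a daemon thread.

    Returns True if shutdown_fn completed within timeout_s seconds,
    False if the call was abandoned (the caller would then fall back
    to a hard kill, e.g. ray.kill on each worker actor).
    """
    done = threading.Event()

    def _target():
        shutdown_fn()  # e.g. torch.distributed.destroy_process_group()
        done.set()

    # daemon=True so an abandoned, hung shutdown cannot block interpreter exit.
    t = threading.Thread(target=_target, daemon=True)
    t.start()
    t.join(timeout=timeout_s)
    return done.is_set()


# A shutdown that returns promptly succeeds within the timeout.
print(shutdown_with_timeout(lambda: None, timeout_s=1.0))            # True
# A shutdown that hangs past the timeout is abandoned.
print(shutdown_with_timeout(lambda: time.sleep(5), timeout_s=0.2))   # False
```

The daemon-thread approach cannot interrupt the hung call itself; it only lets the coordinator stop waiting, which is why the fallback of `ray.kill`-ing the worker processes is still needed.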

Testing

Unit tests

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah marked this pull request as ready for review September 3, 2025 03:06
@TimothySeah requested a review from a team as a code owner September 3, 2025 03:06
@ray-gardener bot added the train (Ray Train Related Issue) label Sep 3, 2025
Contributor

@justinvyu left a comment


Thanks!

@justinvyu enabled auto-merge (squash) September 3, 2025 17:00
@github-actions bot added the go (add ONLY when ready to merge, run all tests) label Sep 3, 2025
@justinvyu merged commit b738755 into ray-project:master Sep 3, 2025
7 checks passed
sampan-s-nayak pushed a commit to sampan-s-nayak/ray that referenced this pull request Sep 8, 2025
jugalshah291 pushed a commit to jugalshah291/ray_fork that referenced this pull request Sep 11, 2025
wyhong3103 pushed a commit to wyhong3103/ray that referenced this pull request Sep 12, 2025
dstrodtman pushed a commit that referenced this pull request Oct 6, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025

Labels

go: add ONLY when ready to merge, run all tests
train: Ray Train Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants