Set all_reduce_token to None when exiting#6247

Merged
JackCaoG merged 1 commit into pytorch:master from yitongh:fix_token on Jan 5, 2024

Conversation

@yitongh (Contributor) commented Jan 2, 2024

This PR fixes #6246.

@JackCaoG (Collaborator) commented Jan 2, 2024

I am curious why reduce_token matters when we exit pytorch/xla?

@JackCaoG (Collaborator) commented Jan 2, 2024

Oh ok, I saw #6246, my bad. Can you add the test case you mentioned in the issue as a separate test? An example is https://github.com/pytorch/xla/blob/master/test/test_mp_collective_permute.py, which is run via

run_test "$CDIR/test_mp_collective_permute.py"
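
For illustration only, a minimal sketch of what such an mp all_reduce test could look like, modeled on test_mp_collective_permute.py. The torch_xla calls used are existing APIs, but this particular test body and its assertion are assumptions, not the test actually added in this PR:

import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
  device = xm.xla_device()
  world_size = xm.xrt_world_size()
  # Each process contributes its ordinal; after the all_reduce every
  # replica should hold the sum 0 + 1 + ... + (world_size - 1).
  value = torch.tensor([float(index)], device=device)
  reduced = xm.all_reduce(xm.REDUCE_SUM, value)
  expected = world_size * (world_size - 1) / 2
  assert reduced.item() == expected, f'{reduced.item()} != {expected}'
  # Returning here lets the normal exit path handle the all-reduce token,
  # which is the situation this PR is about.


if __name__ == '__main__':
  xmp.spawn(_mp_fn, args=())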

@yitongh (Contributor, Author) commented Jan 3, 2024

@JackCaoG I have added a test case, but I cannot run it. The program hangs after setting GPU_NUM_DEVICES=2. Other mp test cases also hang.

@JackCaoG (Collaborator) commented Jan 3, 2024

let's see if CI will be able to run it then.

@JackCaoG (Collaborator) commented Jan 3, 2024

CI seems to be happy; if you can fix the linter, I can help you land it.

@JackCaoG self-requested a review on January 3, 2024 at 23:30
@yitongh (Contributor, Author) commented Jan 4, 2024

@JackCaoG I have fixed the linter; sorry, I forgot about this.

@JackCaoG merged commit f9c12fc into pytorch:master on Jan 5, 2024
@yitongh deleted the fix_token branch on January 5, 2024 at 01:56
@vanbasten23 (Collaborator) commented

> @JackCaoG I have added a test case, but I cannot run it. The program hangs after setting GPU_NUM_DEVICES=2. Other mp test cases also hang.

This seems to be a new issue; I have also observed the same in #6320. Before this change, none of the mp test cases hung. @yitongh @JackCaoG

@JackCaoG (Collaborator) commented

@vanbasten23 does this PR break multi-device GPU training or just the test?

@JackCaoG (Collaborator) commented

@vanbasten23 let's revert this PR. @ManfeiBai can you also help revert this in the 2.2 release branch? The release date is approaching, and I don't want to take this risk.

Comment thread on torch_xla/__init__.py:

def _prepare_to_exit():
  device = _XLAC._xla_get_default_device()
Collaborator:

Does this always return the same device regardless of different processes in the pool?

Collaborator:

I think it will return the device that belongs to the current process, assuming each process only has one device.

Collaborator:

Then I'm not sure why it will break other test cases...

Contributor (Author):

I think the error is because the ComputationClient has already exited in atexit._run_exitfuncs when using xla_multiprocessing, whereas the client still exists when using torchrun. Therefore, with xla_multiprocessing, this invocation results in the creation of a new PjRtComputationClient, causing a hang.

So the better solution is to set the all-reduce token in PrepareToExit, like this:

diff --git a/torch_xla/__init__.py b/torch_xla/__init__.py
index 8d4997e28..d753f8f7c 100644
--- a/torch_xla/__init__.py
+++ b/torch_xla/__init__.py
@@ -148,8 +148,6 @@ _aws_ec2_inf_trn_init()


 def _prepare_to_exit():
-  device = _XLAC._xla_get_default_device()
-  _XLAC._set_all_reduce_token(device, None)
   _XLAC._prepare_to_exit()
   if int(os.environ.get('PT_XLA_DEBUG', '0')):
     _summarize_fn_tracker()
diff --git a/torch_xla/csrc/init_python_bindings.cpp b/torch_xla/csrc/init_python_bindings.cpp
index 3281f0e9a..b255bb043 100644
--- a/torch_xla/csrc/init_python_bindings.cpp
+++ b/torch_xla/csrc/init_python_bindings.cpp
@@ -97,6 +97,8 @@ void PrepareToExit() {
   runtime::ComputationClient* client =
       runtime::GetComputationClientIfInitialized();
   if (client != nullptr) {
+    auto xla_device = GetDeviceOrCurrent("");
+    SetAllReduceToken(xla_device, nullptr);
     XLAGraphExecutor::Get()->WaitDeviceOps({});
   }
 }
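
With this change, the token is cleared only inside the client != nullptr branch, i.e. while the ComputationClient is known to still be initialized, so exiting never triggers the creation of a new PjRtComputationClient.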


Development

Successfully merging this pull request may close these issues.

SIGSEGV when exiting the dataloader in the middle of training
