Not creating a coordinator service for the single processing job. by vanbasten23 · Pull Request #6023 · pytorch/xla

vanbasten23 · 2023-12-04T23:42:38Z

Currently we create a coordinator service even when running a single process on CUDA. I observed that when the single process runs for a long time (>1h), the coordinator service may crash. A single-process job doesn't need the coordinator service at all.
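As a rough illustration of the idea (a sketch, not the code from this PR): assuming the decision is keyed on the global world size, the coordinator is simply never constructed for a single-process job. `SketchCoordinator` and `MaybeCreateCoordinator` are hypothetical names introduced here, and the actual condition in the PR may differ.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for the coordinator wrapper; not the torch_xla class.
struct SketchCoordinator {
  int global_rank;
  int global_world_size;
  std::string address;  // "host:port" the coordination service would use
};

// Assumption: the decision is keyed on the global world size.
// Returns nullptr for a single-process job, so no service is ever started.
std::unique_ptr<SketchCoordinator> MaybeCreateCoordinator(
    int global_rank, int global_world_size, const std::string& master_addr,
    const std::string& port) {
  if (global_world_size <= 1) {
    return nullptr;
  }
  return std::make_unique<SketchCoordinator>(SketchCoordinator{
      global_rank, global_world_size, master_addr + ":" + port});
}
```

Callers then have to tolerate a missing coordinator, which is the point the review thread further down discusses.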

Test plan:

PJRT_DEVICE=CUDA GPU_NUM_DEVICES=1 python pytorch/xla/test/pjrt/test_runtime_gpu.py
PJRT_DEVICE=CUDA GPU_NUM_DEVICES=2 python pytorch/xla/test/pjrt/test_runtime_gpu.py
PJRT_DEVICE=CUDA torchrun --nnodes 1 --nproc-per-node 2 pytorch/xla/test/pjrt/test_torchrun.py
PJRT_DEVICE=CUDA python pytorch/xla/test/test_operations.py MpDecoratorTest.test_mp_decorator

jonb377 · 2023-12-05T18:44:37Z

@vanbasten23 Is the crash only in the case of single-process workloads? I wonder if it's a more general issue that we should handle. Agree that we don't need a coordinator for distributed kv store in single-process though.

vanbasten23 · 2023-12-06T23:51:05Z

> @vanbasten23 Is the crash only in the case of single-process workloads? I wonder if it's a more general issue that we should handle. Agree that we don't need a coordinator for distributed kv store in single-process though.

It fails in the persistent cache test https://gist.github.com/vanbasten23/2cb90b2f72a40ef965b965bc12bc5ded. I fixed your comment and let me rerun.

jonb377 · 2023-12-07T00:24:42Z

> @vanbasten23 Is the crash only in the case of single-process workloads? I wonder if it's a more general issue that we should handle. Agree that we don't need a coordinator for distributed kv store in single-process though.

> It fails in the persistent cache test https://gist.github.com/vanbasten23/2cb90b2f72a40ef965b965bc12bc5ded. I fixed your comment and let me rerun.

@vanbasten23 I've merged the persistent cache change. If you want to test the single-device test with this fix, you can rebase on master and modify the single-device test to run on GPU here

jonb377 · 2023-12-07T00:25:44Z

+          global_process_rank, global_world_size, master_addr, port);
+      std::shared_ptr<xla::DistributedRuntimeClient> distributed_client =
+          coordinator_->GetClient();
+      if (distributed_client != nullptr) {


I wonder if we should be doing an XLA_CHECK(distributed_client != nullptr) here - is there a case where we want the ComputationClient creation to succeed without a DistributedRuntimeClient?

vanbasten23 · Dec 7, 2023

I like the idea!

will-cromar · Dec 7, 2023

Isn't this case already checked in GetClient?
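To make the trade-off in this thread concrete, here is a minimal sketch under stated assumptions. If the accessor itself fails loudly when no client exists, a second null check at the call site is redundant. `SketchClient` and `SketchCoordinatorWithClient` are hypothetical types, and a thrown exception stands in for an `XLA_CHECK`-style assertion. This is not the torch_xla implementation.

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical stand-in for the distributed runtime client.
struct SketchClient {};

// Hypothetical coordinator whose accessor enforces the non-null invariant.
class SketchCoordinatorWithClient {
 public:
  explicit SketchCoordinatorWithClient(bool with_client)
      : client_(with_client ? std::make_shared<SketchClient>() : nullptr) {}

  // Throws instead of returning nullptr, so callers can use the result
  // without repeating the null check.
  std::shared_ptr<SketchClient> GetClient() const {
    if (client_ == nullptr) {
      throw std::runtime_error("distributed client was not initialized");
    }
    return client_;
  }

 private:
  std::shared_ptr<SketchClient> client_;
};
```

With this shape, the remaining question is only whether a coordinator exists at all, which is what a single-process job would skip.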

jonb377

LGTM, thanks!

JackCaoG · 2023-12-07T18:13:59Z

Do we need this in 2.2?

vanbasten23 · 2023-12-07T19:40:03Z

> Do we need this in 2.2?

We don't need this for 2.2. This PR is more of an optimization.

vanbasten23 changed the title ~~Not creating the coordinator servie for single process.~~ Not creating a coordinator service for the single processing job. Dec 4, 2023

vanbasten23 requested review from jonb377 and will-cromar December 4, 2023 23:51

jonb377 reviewed Dec 5, 2023

Comment thread torch_xla/csrc/runtime/xla_coordinator.cc Outdated

jonb377 mentioned this pull request Dec 6, 2023

Enable persistent compilation caching #5804

Merged

vanbasten23 force-pushed the notCreateCoordinatorServiceWhenNumNodesIsOne branch from 269554a to 4c856b3 Compare December 6, 2023 23:45

jonb377 reviewed Dec 7, 2023

vanbasten23 added 4 commits December 7, 2023 06:08

Not creating the coordinator servie for single process.

9255bec

fix comments

2c6c93b

revert a unwanted change

9f8c2a8

fix linter

8b0f9f5

vanbasten23 force-pushed the notCreateCoordinatorServiceWhenNumNodesIsOne branch from 5235c90 to 8b0f9f5 Compare December 7, 2023 06:12

will-cromar approved these changes Dec 7, 2023

jonb377 approved these changes Dec 7, 2023

fix one last comment

a5ec7e8

vanbasten23 merged commit 467c18f into master Dec 8, 2023

ysiraichi mentioned this pull request Dec 8, 2023

Use subprocess for checking XLA supported devices. #5960

Closed

jonb377 mentioned this pull request Dec 8, 2023

Re-enable single-device GPU persistent cache test #6080

Draft

lsy323 pushed a commit that referenced this pull request Dec 11, 2023

Revert "Not creating a coordinator service for the single processing job. (#6023)"

74b99ec

This reverts commit 467c18f.

lsy323 pushed a commit that referenced this pull request Dec 11, 2023

Revert "Not creating a coordinator service for the single processing job. (#6023)"

c4a3008

This reverts commit 467c18f.

chunnienc pushed a commit to chunnienc/xla that referenced this pull request Dec 14, 2023

Not creating a coordinator service for the single processing job. (pytorch#6023)

91da6ac

golechwierowicz pushed a commit that referenced this pull request Jan 12, 2024

Not creating a coordinator service for the single processing job. (#6023)

e156086

bhavya01 pushed a commit that referenced this pull request Apr 22, 2024

Not creating a coordinator service for the single processing job. (#6023)

8352978

vanbasten23 merged 5 commits into master from notCreateCoordinatorServiceWhenNumNodesIsOne
