Fix global_device_count(), local_device_count() for single process on CUDA #6022
vanbasten23 merged 33 commits into master
Conversation
```cpp
std::optional<std::set<int>> allowed_devices;
if (global_world_size > 1) {
  allowed_devices =
      std::make_optional<std::set<int>>(std::set{local_process_rank});
```
nit: do you still need `make_optional` here, or can you directly assign? e.g. `allowed_devices = std::set{local_process_rank}`
Thanks for the review!
A few tests are failing:
Failed at The test is doing an all-reduce (
I think we can disable this test because it tests the class
```diff
- auto allowed_devices =
-     std::make_optional<std::set<int>>(std::set{local_process_rank});
+ std::optional<std::set<int>> allowed_devices;
+ if (global_world_size > 1) {
```
Synced offline - this is great for single-host single-process development, but in cases where there is a single process per host, this would break in a multihost environment. Outside of SPMD, I'm not aware of a use case for a multihost environment with a single process per host (cc @JackCaoG)
Since we don't officially support SPMD on GPU at the moment, this looks fine to me for now. Once we decide on the right entrypoint for SPMD, we'll need to revisit this.
(force-pushed from 54bad35 to d0faac5)
Hi @JackCaoG, this PR fixes
@vanbasten23 can you rebase and rerun the CI?
(force-pushed from 3292787 to fe03a80)
will-cromar left a comment:
LGTM once you fix the formatting. Thanks!
```python
@unittest.skipIf(xr.device_type() == 'CUDA',
                 'Parallelism for DataParallel uses multi-threads. But cuda assumes one GPU device per process instead of relying on threads.')
```
Not for this PR, but IMO we should just delete these tests. Do we support DataParallel anymore @JackCaoG?
(force-pushed from fe03a80 to b98ef93)
(force-pushed from cbeabb8 to 6c2b64f)
```cpp
kv_store = xla::GetDistributedKeyValueStore(distributed_client,
                                            /*key_prefix=*/"gpu:");
std::optional<std::set<int>> allowed_devices;
bool spmd = sys_util::GetEnvBool("XLA_USE_SPMD", false);
```
Conditioning on SPMD mode here could cause issues using xr.use_spmd() after the runtime has been initialized.
Is it correct to say that allowed_devices is only needed in the MP case? If so, can we invert the logic to check for MP using one of the env vars instead of checking for SPMD mode?
Is there ever a reason to call xr.use_spmd() after the runtime is initialized? In any case, I think we can also assume that if LOCAL_WORLD_SIZE=1, then we can use all of the devices (which should be compatible with SPMD)
AFAIK there's not a strong use case at the moment, but for example our unit tests will check xr.global_runtime_device_count() before calling xr.use_spmd(). Keeping the runtime independent of SPMD mode was something we wanted to maintain, cc @yeounoh
That's a good point!
> Conditioning on SPMD mode here could cause issues using xr.use_spmd() after the runtime has been initialized.

This seems to be a downside of using `xr.use_spmd()` as opposed to an env flag `XLA_USE_SPMD=1`. The latter is less flexible but less error-prone: it guarantees we use SPMD mode from the beginning. The former may also affect other SPMD special cases: the user runs some PyTorch ops, then calls `xr.use_spmd()`, then continues to do something else.

> can we invert the logic to check for MP using one of the env vars instead of checking for SPMD mode?

> I think we can also assume that if LOCAL_WORLD_SIZE=1, then we can use all of the devices

I'm thinking about the case where the user has 2 GPU machines, wants to use 1 GPU device on each machine, and wants to do multi-host training. In that case (multi-host, single-process), each process has access to all devices and I guess the user can still do multi-host training.
> I think we can also assume that if LOCAL_WORLD_SIZE=1, then we can use all of the devices

Perhaps we also need to check GLOBAL_WORLD_SIZE:

```
if LOCAL_WORLD_SIZE == 1:
    if GLOBAL_WORLD_SIZE > 1:  # multi-host, single process per host
        initialize coordinator service
    else:  # single-host, single-process
        do nothing
else:  # multi-process, single-host or multi-host
    initialize coordinator service
    allowed_devices = {current_device}
```
```cpp
std::unique_ptr<XlaCoordinator> SetKeyValueCallback(
    int global_process_rank, int global_world_size,
    std::unique_ptr<XlaCoordinator> coordinator,
```
Why do we need the coordinator as input here?
We need it to get the DistributedRuntimeClient and later create the kv_store below.
It looks like it's being recreated on L60 - should we just make this function return the new value?
It looks like a bunch of tests are failing with error
One of the examples is
The error doesn't exist on the current master branch (01/19, after the OpenXLA pin update). Also, the error doesn't exist on the feature branch before the pin update: #6346. Probably something happened during the pin update.
(force-pushed from 960d609 to aee08df)
jonb377 left a comment:
Looking good, thanks Xiongfei!
```python
# if self.n_devices>=4, mesh=(2, 2)
# if self.n_devices>=2, mesh=(2,1)
# if self.n_devices=1, mesh=(1,1)
```
Thanks for generalizing these tests!
Could we change these comments to e.g. `# if self.n_devices==4, mesh=(2, 2)`? Other device counts will have different meshes.
```cpp
if (local_world_size == 1) {
  if (global_world_size > 1) {
    coordinator = SetGpuClientKVCallBack(global_process_rank,
                                         global_world_size, kv_store);
  }
} else {
  allowed_devices = std::set{local_process_rank};
  coordinator = SetGpuClientKVCallBack(global_process_rank,
                                       global_world_size, kv_store);
}
```
```cpp
std::shared_ptr<xla::KeyValueStoreInterface> kv_store;
if (global_world_size > 1) {
  // Use the distributed key-value store from DistributedRuntimeClient.
  coordinator = std::make_unique<XlaCoordinator>(
      global_process_rank, global_world_size, master_addr, port);
  std::shared_ptr<xla::DistributedRuntimeClient> distributed_client =
      coordinator->GetClient();
  kv_store = xla::GetDistributedKeyValueStore(distributed_client,
                                              /*key_prefix=*/"gpu:");
}
TF_VLOG(3) << "Getting StreamExecutorGpuClient for node_id="
           << global_process_rank << ", num_nodes=" << global_world_size;
```
I think we could simplify the logic here somewhat. We want to restrict allowed_devices if local_world_size > 1 and create the coordinator if global_world_size > 1. I'm assuming local_world_size > 1 => global_world_size > 1; would this be equivalent?

```cpp
if (local_world_size > 1) {
  allowed_devices = std::set{local_process_rank};
}
if (global_world_size > 1) {
  // We can keep the old initialization block here and remove `SetGpuClientKVCallBack`
}
```
Thanks for the review!
```diff
  TF_VLOG(INFO) << "OpSharding (ShardingType: " << sharding_type << "):\n"
-               << sharding.DebugString();
+               << sharding.DebugString()
+               << ", sharding.type()=" << sharding.type();
```
DebugString should include the type?
Actually, it doesn't. The DebugString is empty in that case.
This PR fixes global_device_count() and local_device_count() for single process on CUDA so that they return all GPU devices on the current host. Before this PR, both APIs always returned 1, as reported in this issue. global_runtime_device_count is not fixed since it seems it's only used in the SPMD case, and it has been fixed in another PR.
Test:
Note: here is the behavior of torch.cuda.device_count() in the multi-host case