
[RLlib] Issues: 17397, 17425, 16715, 17174. When on driver, Torch|TFPolicy should not use ray.get_gpu_ids() (b/c no GPUs assigned by ray).#17444

Merged
sven1977 merged 28 commits into ray-project:master from sven1977:fix_torch_policy_get_gpu_ids_error
Aug 2, 2021

Conversation

@sven1977
Contributor

@sven1977 sven1977 commented Jul 29, 2021

Issues: 17397, 17425, 16715, 17174. When on driver, Torch|TFPolicy should not use ray.get_gpu_ids() (b/c no GPUs assigned by ray).


Why are these changes needed?

Related issue number

Closes #17397
Closes #17425
Closes #16715
Closes #17174

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(
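
The fix described in the title can be sketched as a small branching helper: on the driver, Ray assigns no GPUs, so ray.get_gpu_ids() would come back empty and the policy should enumerate local devices instead; on a remote worker, the Ray-assigned IDs are authoritative. This is a minimal illustration, not RLlib's actual code; the function name and signature are hypothetical.

```python
def resolve_gpu_ids(on_worker, ray_gpu_ids, local_device_count):
    """Hypothetical sketch of the driver-vs-worker branching in this PR."""
    if on_worker:
        # Remote worker: Ray assigned these GPU IDs to this process.
        return list(ray_gpu_ids)
    # Driver: Ray assigns no GPUs here, so enumerate local devices
    # (e.g. via torch.cuda.device_count()) instead of ray.get_gpu_ids().
    return list(range(local_device_count))
```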

Comment on lines +156 to +157
from ray.worker import global_worker
if global_worker.mode == 1:
Contributor

global_worker can be None sometimes right?

also, can we use the "WORKER_MODE" enum?

Contributor Author

done.
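
The enum-based check the reviewer asks for can be sketched with stand-in constants. The mode values below mirror ray 1.x's ray.worker module for illustration only (they are an assumption here, not an import of the real library); the None guard addresses the reviewer's first point.

```python
# Stand-in constants; in ray 1.x these live in ray.worker.
SCRIPT_MODE = 0
WORKER_MODE = 1
LOCAL_MODE = 2

class FakeWorker:
    """Minimal stand-in for ray's global_worker, for illustration."""
    def __init__(self, mode):
        self.mode = mode

def on_remote_worker(global_worker):
    # global_worker can be None before ray.init(), so guard first,
    # then compare against the named enum instead of the magic number 1.
    return global_worker is not None and global_worker.mode == WORKER_MODE
```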

@sven1977
Copy link
Contributor Author

Also added a few test cases and better error messages for tf and torch.

Comment on lines +176 to +179
elif len(gpu_ids) < num_gpus:
raise ValueError(
"TFPolicy was not able to find enough GPU IDs! Found "
f"{gpu_ids}, but num_gpus={num_gpus}.")
Contributor

I think we should use if len(self.devices) > 0 below. This condition fails on num_gpus=0.5; for i, _ in enumerate(...) if i < num_gpus can handle fractional GPUs.

Contributor Author

Not sure this is necessary:

E.g. if num_gpus=0.5 and gpu_ids=["/physical_device:gpu:0"],

then this tf check would pass, no? (And the error would not be raised.)

Also:

self.devices = [f"/gpu:{i}" for i, _ in enumerate(gpu_ids) if i < num_gpus]

would still generate a device list with exactly 1 GPU in it despite num_gpus being 0.5.
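
The point above can be checked directly: with a fractional num_gpus and one visible device, the comprehension still yields one device, so the condition does not reject the fractional request.

```python
# Reproducing the device-list comprehension from the diff with the
# example values from the comment above.
gpu_ids = ["/physical_device:gpu:0"]
num_gpus = 0.5
devices = [f"/gpu:{i}" for i, _ in enumerate(gpu_ids) if i < num_gpus]
# 0 < 0.5 is True, so the fractional request still maps to one
# physical GPU, as described above.
```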

Comment on lines +162 to +165
elif len(gpu_ids) < num_gpus:
raise ValueError(
"TorchPolicy was not able to find enough GPU IDs! Found "
f"{gpu_ids}, but num_gpus={num_gpus}.")
Contributor

@XuehaiPan XuehaiPan Jul 30, 2021

I think we should use if len(self.devices) > 0 below. Same reason: fractional GPUs.

Comment on lines +572 to +581
if policy_config["framework"] in ["tf2", "tf", "tfe"]:
if len(get_tf_gpu_devices()) < num_gpus:
raise RuntimeError(
f"Not enough GPUs found for num_gpus={num_gpus}! "
f"Found only these IDs: {get_tf_gpu_devices()}.")
elif policy_config["framework"] == "torch":
if torch.cuda.device_count() < num_gpus:
raise RuntimeError(
f"Not enough GPUs found ({torch.cuda.device_count()}) "
f"for num_gpus={num_gpus}!")
Contributor

Maybe add math.ceil(num_gpus) to handle fractional GPUs.

Contributor Author

Great catch @XuehaiPan! This would indeed fail for fractional numbers due to the range not handling floats. I will update.
Running some last tests now on a multi-GPU machine.
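
The math.ceil fix being discussed can be sketched as follows. This is a simplified stand-in for the check in the diff (the function name is hypothetical): rounding num_gpus up to an integer first makes the length comparison behave sensibly for fractional values and keeps range() from receiving a float.

```python
import math

def check_enough_gpus(gpu_ids, num_gpus):
    """Sketch of the fractional-GPU-safe check discussed above."""
    # num_gpus=0.5 still needs one physical device; range(0.5) would
    # raise a TypeError, so round up before comparing or iterating.
    needed = math.ceil(num_gpus)
    if len(gpu_ids) < needed:
        raise RuntimeError(
            f"Not enough GPUs found for num_gpus={num_gpus}! "
            f"Found only these IDs: {gpu_ids}.")
    # Build the device list from the integer number of GPUs needed.
    return [f"/gpu:{i}" for i in range(needed)]
```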

@sven1977
Contributor Author

Running the new test case on local laptop (no GPUs) and 4-GPU machine looks all ok now.
The added test checks whether:

  • direct Trainer (on driver w/o ray.tune) compiles or throws expected error, if num_gpus=0|0.1|1|8.
  • tune.run() runs ok with num_gpus=0|0.1|1|8.
  • all of the above run ok in local_mode=True (no GPUs used!)
  • all of the above run ok with _fake_gpus=True in the config.
  • all frameworks are tested.

Comment on lines +557 to +558
ray.worker._mode() != ray.worker.LOCAL_MODE and \
not policy_config.get("_fake_gpus"):
Contributor

nice. let's file a feature request for a better way of detecting local mode (on #api-changes)

Contributor Author

Will do, you don't like ray.worker._mode() != ray.worker.LOCAL_MODE? :D

@akshaygh0sh

Any idea on when this pull request will be merged/authorized?

@sven1977
Contributor Author

sven1977 commented Aug 2, 2021

All tests are passing now, including the new one, which tests all combinations of num_gpus, num_gpus_per_worker, framework, _fake_gpus, tune.run-vs-direct-RLlib on both CPU and 4-GPU machines.

num_gpus = config["num_gpus"]
else:
num_gpus = config["num_gpus_per_worker"]
gpu_ids = list(range(torch.cuda.device_count()))
Contributor

I don't think we can get all devices directly here.
Imagine that we run the driver on a 5-GPU node and a remote worker is also scheduled to that node: all 5 devices would be visible to the remote worker, which I don't think makes sense.

Contributor

The environment variable CUDA_VISIBLE_DEVICES is set for the remote worker. torch.cuda.device_count() will respect the CUDA_VISIBLE_DEVICES and return the number of CUDA visible GPUs.
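
The behavior described above can be illustrated without torch by parsing the environment variable directly. This stdlib-only sketch (the helper name is hypothetical) mirrors what torch.cuda.device_count() effectively respects: Ray sets CUDA_VISIBLE_DEVICES per remote worker, so a worker on a 5-GPU node only sees the IDs Ray assigned to it.

```python
import os

def visible_gpu_count():
    """Count GPUs visible under CUDA_VISIBLE_DEVICES (illustrative)."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return None   # unrestricted: all physical devices are visible
    if visible.strip() == "":
        return 0      # explicitly no devices assigned
    # Ray assigns a comma-separated list of device IDs, e.g. "0,2".
    return len(visible.split(","))
```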

@sven1977 sven1977 deleted the fix_torch_policy_get_gpu_ids_error branch June 2, 2023 20:15