
[RLlib] Issues: 17397, 17425, 16715, 17174. When on driver, Torch|TFPolicy should not use ray.get_gpu_ids() (b/c no GPUs assigned by ray).#17444

Merged
sven1977 merged 28 commits into ray-project:master from sven1977:fix_torch_policy_get_gpu_ids_error
Aug 2, 2021

Conversation

@sven1977
Contributor

@sven1977 sven1977 commented Jul 29, 2021

Issues: 17397, 17425, 16715, 17174. When on driver, Torch|TFPolicy should not use ray.get_gpu_ids() (b/c no GPUs assigned by ray).


Why are these changes needed?

Related issue number

Closes #17397
Closes #17425
Closes #16715
Closes #17174

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(
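
The fix described in the title can be sketched as a small branching helper: on the driver, Ray assigns no GPUs, so ray.get_gpu_ids() would come back empty and the policy should enumerate local devices instead; on a remote worker, the Ray-assigned IDs are authoritative. This is a minimal illustration, not RLlib's actual code; the function name and signature are hypothetical.

```python
def resolve_gpu_ids(on_worker, ray_gpu_ids, local_device_count):
    """Hypothetical sketch of the driver-vs-worker branching in this PR."""
    if on_worker:
        # Remote worker: Ray assigned these GPU IDs to this process.
        return list(ray_gpu_ids)
    # Driver: Ray assigns no GPUs here, so enumerate local devices
    # (e.g. via torch.cuda.device_count()) instead of ray.get_gpu_ids().
    return list(range(local_device_count))
```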

Comment on lines +156 to +157
from ray.worker import global_worker
if global_worker.mode == 1:
Contributor

global_worker can be None sometimes right?

also, can we use the "WORKER_MODE" enum?

Contributor Author

done.
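
The enum-based check the reviewer asks for can be sketched with stand-in constants. The mode values below mirror ray 1.x's ray.worker module for illustration only (they are an assumption here, not an import of the real library); the None guard addresses the reviewer's first point.

```python
# Stand-in constants; in ray 1.x these live in ray.worker.
SCRIPT_MODE = 0
WORKER_MODE = 1
LOCAL_MODE = 2

class FakeWorker:
    """Minimal stand-in for ray's global_worker, for illustration."""
    def __init__(self, mode):
        self.mode = mode

def on_remote_worker(global_worker):
    # global_worker can be None before ray.init(), so guard first,
    # then compare against the named enum instead of the magic number 1.
    return global_worker is not None and global_worker.mode == WORKER_MODE
```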

@sven1977
Copy link
Contributor Author

Also added a few test cases and better error messages for tf and torch.

Comment on lines +176 to +179
elif len(gpu_ids) < num_gpus:
raise ValueError(
"TFPolicy was not able to find enough GPU IDs! Found "
f"{gpu_ids}, but num_gpus={num_gpus}.")
Contributor

I think we should use if len(self.devices) > 0 below. This condition fails on num_gpus=0.5; for i, _ in enumerate(...) if i < num_gpus can handle fractional GPUs.

Contributor Author

Not sure this is necessary:

E.g. if num_gpus=0.5 and gpu_ids=["/physical_device:gpu:0"],

then this tf check would pass, no? (And the error would not be raised.)

Also:

self.devices = [f"/gpu:{i}" for i, _ in enumerate(gpu_ids) if i < num_gpus]

would still generate a device list with exactly 1 GPU in it despite num_gpus being 0.5.
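
The point above can be checked directly: with a fractional num_gpus and one visible device, the comprehension still yields one device, so the condition does not reject the fractional request.

```python
# Reproducing the device-list comprehension from the diff with the
# example values from the comment above.
gpu_ids = ["/physical_device:gpu:0"]
num_gpus = 0.5
devices = [f"/gpu:{i}" for i, _ in enumerate(gpu_ids) if i < num_gpus]
# 0 < 0.5 is True, so the fractional request still maps to one
# physical GPU, as described above.
```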

Comment on lines +162 to +165
elif len(gpu_ids) < num_gpus:
raise ValueError(
"TorchPolicy was not able to find enough GPU IDs! Found "
f"{gpu_ids}, but num_gpus={num_gpus}.")
Contributor

@XuehaiPan XuehaiPan Jul 30, 2021

I think we should use if len(self.devices) > 0 below. Same reason: fractional GPUs.

Comment on lines +572 to +581
if policy_config["framework"] in ["tf2", "tf", "tfe"]:
if len(get_tf_gpu_devices()) < num_gpus:
raise RuntimeError(
f"Not enough GPUs found for num_gpus={num_gpus}! "
f"Found only these IDs: {get_tf_gpu_devices()}.")
elif policy_config["framework"] == "torch":
if torch.cuda.device_count() < num_gpus:
raise RuntimeError(
f"Not enough GPUs found ({torch.cuda.device_count()}) "
f"for num_gpus={num_gpus}!")
Contributor

Maybe add math.ceil(num_gpus) to handle fractional GPUs.

Contributor Author

Great catch @XuehaiPan! This would indeed fail for fractional numbers due to the range not handling floats. I will update.
Running some last tests now on a multi-GPU machine.
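
The math.ceil fix being discussed can be sketched as follows. This is a simplified stand-in for the check in the diff (the function name is hypothetical): rounding num_gpus up to an integer first makes the length comparison behave sensibly for fractional values and keeps range() from receiving a float.

```python
import math

def check_enough_gpus(gpu_ids, num_gpus):
    """Sketch of the fractional-GPU-safe check discussed above."""
    # num_gpus=0.5 still needs one physical device; range(0.5) would
    # raise a TypeError, so round up before comparing or iterating.
    needed = math.ceil(num_gpus)
    if len(gpu_ids) < needed:
        raise RuntimeError(
            f"Not enough GPUs found for num_gpus={num_gpus}! "
            f"Found only these IDs: {gpu_ids}.")
    # Build the device list from the integer number of GPUs needed.
    return [f"/gpu:{i}" for i in range(needed)]
```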

@sven1977
Contributor Author

Running the new test case on local laptop (no GPUs) and 4-GPU machine looks all ok now.
The added test checks whether:

  • direct Trainer (on driver w/o ray.tune) compiles or throws expected error, if num_gpus=0|0.1|1|8.
  • tune.run() runs ok with num_gpus=0|0.1|1|8.
  • all of the above run ok in local_mode=True (no GPUs used!)
  • all of the above run ok with _fake_gpus=True in the config.
  • all frameworks are tested.

Comment on lines +557 to +558
ray.worker._mode() != ray.worker.LOCAL_MODE and \
not policy_config.get("_fake_gpus"):
Contributor

nice. let's file a feature request for a better way of detecting local mode (on #api-changes)

Contributor Author

Will do, you don't like ray.worker._mode() != ray.worker.LOCAL_MODE? :D

@akshaygh0sh

Any idea on when this pull request will be merged/authorized?

@sven1977
Contributor Author

sven1977 commented Aug 2, 2021

All tests are passing now, including the new one, which tests all combinations of num_gpus, num_gpus_per_worker, framework, _fake_gpus, tune.run-vs-direct-RLlib on both CPU and 4-GPU machines.

num_gpus = config["num_gpus"]
else:
num_gpus = config["num_gpus_per_worker"]
gpu_ids = list(range(torch.cuda.device_count()))
Contributor

I don't think we can get all devices directly here.
Imagine that we run the driver on a 5-GPU node and a remote worker is also scheduled to that node: all 5 devices would be visible to the remote worker, which I don't think makes sense.

Contributor

The environment variable CUDA_VISIBLE_DEVICES is set for the remote worker. torch.cuda.device_count() will respect the CUDA_VISIBLE_DEVICES and return the number of CUDA visible GPUs.
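
The behavior described above can be illustrated without torch by parsing the environment variable directly. This stdlib-only sketch (the helper name is hypothetical) mirrors what torch.cuda.device_count() effectively respects: Ray sets CUDA_VISIBLE_DEVICES per remote worker, so a worker on a 5-GPU node only sees the IDs Ray assigned to it.

```python
import os

def visible_gpu_count():
    """Count GPUs visible under CUDA_VISIBLE_DEVICES (illustrative)."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return None   # unrestricted: all physical devices are visible
    if visible.strip() == "":
        return 0      # explicitly no devices assigned
    # Ray assigns a comma-separated list of device IDs, e.g. "0,2".
    return len(visible.split(","))
```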

@sven1977 sven1977 deleted the fix_torch_policy_get_gpu_ids_error branch June 2, 2023 20:15