Description
What is the problem?
ray.get_gpu_ids() returns an empty list on my machine when I use TorchPolicy with config['num_gpus'] set. As a result, self.devices is empty and self.devices[0] raises an IndexError when running TorchPolicy on GPUs:
ray/rllib/policy/torch_policy.py
Lines 154 to 159 in 1f35470
    gpu_ids = ray.get_gpu_ids()
    self.devices = [
        torch.device("cuda:{}".format(i))
        for i, id_ in enumerate(gpu_ids) if i < config["num_gpus"]
    ]
    self.device = self.devices[0]
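To illustrate the failure mode, here is a minimal sketch of the same list comprehension with a guard for the empty case. It is not RLlib's actual fix: `select_device_names` is a hypothetical helper, it works on plain device-name strings so it runs without torch or ray installed, and the CPU fallback is my assumption about what a reasonable default might be.

```python
from typing import List, Sequence


def select_device_names(gpu_ids: Sequence, num_gpus: float) -> List[str]:
    """Hypothetical helper (not part of RLlib).

    Mirrors the comprehension quoted above from torch_policy.py, but
    returns a CPU device name instead of an empty list.
    """
    names = [
        "cuda:{}".format(i)
        for i, _ in enumerate(gpu_ids) if i < num_gpus
    ]
    # With an empty gpu_ids the comprehension yields [], so indexing [0]
    # would raise IndexError. Falling back to "cpu" is an assumption here.
    return names or ["cpu"]


if __name__ == "__main__":
    print(select_device_names([], 1))      # ['cpu']
    print(select_device_names([0, 1], 1))  # ['cuda:0']
```

Each returned name could be passed to torch.device(...), so the first entry would always exist even when ray.get_gpu_ids() reports no GPUs.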
This issue can be reproduced on multiple machines.
Ray version and other system information (Python version, TensorFlow version, OS):
My Runtime Environment
Machine 1:
- OS version: Ubuntu 20.04 LTS
- Python version: 3.8.10
- Ray version: 1.5.0 from PyPI (tested with nightly build as well)
- PyTorch version: 1.9.0
- NVIDIA driver version: 470.57.02
- CUDA version: 11.1.1
Machine 2:
- OS version: Ubuntu 16.04 LTS
- Python version: 3.7.10
- Ray version: 1.5.0 from PyPI (tested with nightly build as well)
- PyTorch version: 1.4.0
- NVIDIA driver version: 430.64
- CUDA version: 10.0.0
Same issue on Windows: https://discuss.ray.io/t/error-with-torch-policy-and-ray-get-gpu-ids-on-windows/2711
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
conda create --name test python=3.8 --yes
conda activate test
pip3 install https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
python3 -c 'import ray; print(ray.get_gpu_ids())'
nvidia-smi --list-gpus
If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.