
[Rllib] TorchPolicy and TFPolicy cannot find any GPUs #17397

@XuehaiPan

What is the problem?

ray.get_gpu_ids() returns an empty list on my machine when I use TorchPolicy with config['num_gpus'] set. Because self.devices is then empty, TorchPolicy raises an IndexError at self.devices[0] when running on GPUs:

    gpu_ids = ray.get_gpu_ids()
    self.devices = [
        torch.device("cuda:{}".format(i))
        for i, id_ in enumerate(gpu_ids) if i < config["num_gpus"]
    ]
    self.device = self.devices[0]
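
To make the failure mode concrete, here is a minimal sketch of the same selection logic. It assumes plain device-name strings in place of torch.device objects so that it runs without PyTorch or Ray installed. The helper names select_device_names and pick_primary_device are hypothetical and are not part of RLlib, and the CPU fallback is only one possible way to guard the lookup, not what RLlib does.

```python
from typing import List, Sequence


def select_device_names(gpu_ids: Sequence, num_gpus: int) -> List[str]:
    """Mirror the list comprehension in the snippet above.

    Plain strings stand in for torch.device objects (an assumption made
    so the sketch has no third-party dependencies).
    """
    return [
        "cuda:{}".format(i)
        for i, _ in enumerate(gpu_ids)
        if i < num_gpus
    ]


def pick_primary_device(gpu_ids: Sequence, num_gpus: int) -> str:
    """Hypothetical guard: fall back to CPU when no GPU ids were reported."""
    names = select_device_names(gpu_ids, num_gpus)
    if not names:
        # An empty gpu_ids list yields an empty device list, so indexing
        # [0] directly (as in the snippet above) would raise IndexError.
        return "cpu"
    return names[0]


if __name__ == "__main__":
    # With an empty id list, the unguarded lookup fails.
    try:
        select_device_names([], num_gpus=1)[0]
    except IndexError as exc:
        print("unguarded lookup failed:", exc)

    # The guarded lookup returns a usable device name instead.
    print(pick_primary_device([], num_gpus=1))
    print(pick_primary_device([0, 1], num_gpus=1))
```

With an empty id list the device list is empty regardless of num_gpus, which is why the unguarded self.devices[0] lookup fails.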

This issue can be reproduced on multiple machines. Ray version and other system information (Python version, TensorFlow version, OS):

My Runtime Environment

Machine 1:
  • OS version: Ubuntu 20.04 LTS
  • Python version: 3.8.10
  • Ray version: 1.5.0 from PyPI (tested with nightly build as well)
  • PyTorch version: 1.9.0
  • NVIDIA driver version: 470.57.02
  • CUDA version: 11.1.1
Machine 2:
  • OS version: Ubuntu 16.04 LTS
  • Python version: 3.7.10
  • Ray version: 1.5.0 from PyPI (tested with nightly build as well)
  • PyTorch version: 1.4.0
  • NVIDIA driver version: 430.64
  • CUDA version: 10.0.0

Same issue on Windows: https://discuss.ray.io/t/error-with-torch-policy-and-ray-get-gpu-ids-on-windows/2711

Reproduction (REQUIRED)

Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):

conda create --name test python=3.8 --yes
conda activate test
pip3 install https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-2.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl
python3 -c 'import ray; print(ray.get_gpu_ids())'
nvidia-smi --list-gpus

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Labels: bug (something that is supposed to be working, but isn't), triage (needs triage, e.g. priority, bug/not-bug, and owning component)
