[rllib] Training crashes because get_gpu_ids() returns empty list #16715

@cassidylaidlaw

What is the problem?

When running a simple RLlib training script, almost identical to the example here, I get the following error:

Traceback (most recent call last):
  File "test.py", line 12, in <module>
    trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 123, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 562, in __init__
    super().__init__(config, logger_creator)
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/tune/trainable.py", line 100, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 722, in setup
    self._init(self.config, self.env_creator)
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 147, in _init
    self.workers = self._make_workers(
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 804, in _make_workers
    return WorkerSet(
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 92, in __init__
    self._local_worker = self._make_worker(
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 368, in _make_worker
    worker = cls(
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 545, in __init__
    self.policy_map, self.preprocessors = self._build_policy_map(
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1323, in _build_policy_map
    policy_map[name] = cls(obs_space, act_space, merged_conf)
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/policy/policy_template.py", line 256, in __init__
    self.parent_cls.__init__(
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 156, in __init__
    self.device = self.devices[0]
IndexError: list index out of range

The script that reproduces the error is below in the reproduction section. It looks like this error is caused by ray.get_gpu_ids() returning an empty list ([]) despite there being GPUs attached to the system:

>>> import ray
>>> ray.init(num_gpus=4)
2021-06-28 12:57:14,182 INFO services.py:1330 -- View the Ray dashboard at http://127.0.0.1:8265
{'node_ip_address': '128.32.175.10', 'raylet_ip_address': '128.32.175.10', 'redis_address': '128.32.175.10:6379', 'object_store_address': '/tmp/ray/session_2021-06-28_12-57-13_130563_33105/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2021-06-28_12-57-13_130563_33105/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2021-06-28_12-57-13_130563_33105', 'metrics_export_port': 59050, 'node_id': 'b6854536c3fea8b39eac4a0723a2af43d56571adf35a750f99bc9982'}
>>> ray.get_gpu_ids()
[]
>>> import torch
>>> torch.cuda.is_available()
True

I'm not sure why this is happening; it didn't happen with RLlib 1.3. Interestingly, I can train on GPUs using rllib train (the RLlib CLI) with no issue.
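
To make the suspected failure mode concrete, here is a minimal pure-Python sketch. It is not RLlib's actual code: it assumes the policy's device list is built from the IDs that ray.get_gpu_ids() returns, and the helper names (build_devices, first_device_unsafe, first_device_with_fallback) are hypothetical. The fallback shown is only one possible way to handle an empty list, not a confirmed fix.

```python
from typing import List


def build_devices(gpu_ids: List[int]) -> List[str]:
    """Turn GPU IDs into device strings (assumed pattern, not RLlib's code)."""
    return [f"cuda:{i}" for i in gpu_ids]


def first_device_unsafe(gpu_ids: List[int]) -> str:
    """Index the device list directly, as the last traceback frame does."""
    devices = build_devices(gpu_ids)
    return devices[0]  # raises IndexError when gpu_ids is empty


def first_device_with_fallback(gpu_ids: List[int], cuda_available: bool) -> str:
    """Hypothetical guard: fall back when no GPU IDs were reported."""
    devices = build_devices(gpu_ids)
    if devices:
        return devices[0]
    # No IDs reported: use the first CUDA device if one exists, else the CPU.
    return "cuda:0" if cuda_available else "cpu"
```

With an empty ID list, first_device_unsafe([]) raises the same IndexError as in the traceback, while first_device_with_fallback([], True) returns "cuda:0".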

Ray version and other system information (Python version, TensorFlow version, OS):

  • Ubuntu 18.04.5
  • Python 3.8.10 (anaconda)
  • CUDA 11.3
Installed Python packages
Package                  Version
------------------------ -------------------
aiohttp                  3.7.4.post0
aiohttp-cors             0.7.0
aioredis                 1.3.1
async-timeout            3.0.1
attrs                    21.2.0
blessings                1.7
cachetools               4.2.2
certifi                  2021.5.30
chardet                  4.0.0
click                    8.0.1
cloudpickle              1.6.0
colorama                 0.4.4
dm-tree                  0.1.6
filelock                 3.0.12
google-api-core          1.30.0
google-auth              1.32.0
googleapis-common-protos 1.53.0
gpustat                  0.6.0
grpcio                   1.38.1
gym                      0.18.3
hiredis                  2.0.0
idna                     2.10
jsonschema               3.2.0
msgpack                  1.0.2
multidict                5.1.0
numpy                    1.21.0
nvidia-ml-py3            7.352.0
opencensus               0.7.13
opencensus-context       0.1.2
opencv-python            4.5.2.54
packaging                20.9
pandas                   1.2.5
Pillow                   8.2.0
pip                      21.1.2
prometheus-client        0.11.0
protobuf                 3.17.3
psutil                   5.8.0
py-spy                   0.3.7
pyasn1                   0.4.8
pyasn1-modules           0.2.8
pydantic                 1.8.2
pyglet                   1.5.15
pyparsing                2.4.7
pyrsistent               0.18.0
python-dateutil          2.8.1
pytz                     2021.1
PyYAML                   5.4.1
ray                      2.0.0.dev0
redis                    3.5.3
requests                 2.25.1
rsa                      4.7.2
scipy                    1.7.0
setuptools               52.0.0.post20210125
six                      1.16.0
tabulate                 0.8.9
torch                    1.9.0
typing-extensions        3.10.0.0
urllib3                  1.26.6
wheel                    0.36.2
yarl                     1.6.3

Reproduction (REQUIRED)

test.py
import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print

ray.init()
config = ppo.DEFAULT_CONFIG.copy()
config.update({
    "num_gpus": 1,
    "num_workers": 1,
    "framework": "torch",
})
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")

# Can optionally call trainer.restore(path) to load a checkpoint.

for i in range(1000):
    # Perform one iteration of training the policy with PPO
    result = trainer.train()
    print(pretty_print(result))

    if i % 100 == 0:
        checkpoint = trainer.save()
        print("checkpoint saved at", checkpoint)
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Labels

  • P1: Issue that should be fixed within a few weeks
  • bug: Something that is supposed to be working; but isn't
  • rllib: RLlib related issues