Closed
Labels: P1 (Issue that should be fixed within a few weeks), bug (Something that is supposed to be working, but isn't), rllib (RLlib related issues)
Description
What is the problem?
When running a simple RLlib training script, almost identical to the example here, I get the following error:
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 123, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 562, in __init__
    super().__init__(config, logger_creator)
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/tune/trainable.py", line 100, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 722, in setup
    self._init(self.config, self.env_creator)
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer_template.py", line 147, in _init
    self.workers = self._make_workers(
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/agents/trainer.py", line 804, in _make_workers
    return WorkerSet(
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 92, in __init__
    self._local_worker = self._make_worker(
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/worker_set.py", line 368, in _make_worker
    worker = cls(
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 545, in __init__
    self.policy_map, self.preprocessors = self._build_policy_map(
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/evaluation/rollout_worker.py", line 1323, in _build_policy_map
    policy_map[name] = cls(obs_space, act_space, merged_conf)
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/policy/policy_template.py", line 256, in __init__
    self.parent_cls.__init__(
  File "/home/cassidy/miniconda3/envs/rllib_test/lib/python3.8/site-packages/ray/rllib/policy/torch_policy.py", line 156, in __init__
    self.device = self.devices[0]
IndexError: list index out of range
The script that reproduces the error is below in the reproduction section. It looks like this error is caused by ray.get_gpu_ids() returning an empty list ([]) despite there being GPUs attached to the system:
>>> import ray
>>> ray.init(num_gpus=4)
2021-06-28 12:57:14,182 INFO services.py:1330 -- View the Ray dashboard at http://127.0.0.1:8265
{'node_ip_address': '128.32.175.10', 'raylet_ip_address': '128.32.175.10', 'redis_address': '128.32.175.10:6379', 'object_store_address': '/tmp/ray/session_2021-06-28_12-57-13_130563_33105/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2021-06-28_12-57-13_130563_33105/sockets/raylet', 'webui_url': '127.0.0.1:8265', 'session_dir': '/tmp/ray/session_2021-06-28_12-57-13_130563_33105', 'metrics_export_port': 59050, 'node_id': 'b6854536c3fea8b39eac4a0723a2af43d56571adf35a750f99bc9982'}
>>> ray.get_gpu_ids()
[]
>>> import torch
>>> torch.cuda.is_available()
True
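The failure mode itself is easy to illustrate in isolation. The sketch below is hypothetical (plain strings instead of real torch devices, and `pick_device` is a made-up helper name), but it mirrors the shape of the logic in `torch_policy.py`: when the list of visible GPU ids is empty, indexing the first device is exactly the `IndexError` in the traceback, and a CPU fallback would avoid it.

```python
def pick_device(gpu_ids):
    # Build one device name per visible GPU id, as the policy does.
    devices = [f"cuda:{i}" for i in gpu_ids]
    if not devices:
        # With an empty ray.get_gpu_ids() result, devices[0] would raise
        # IndexError; falling back to CPU is the obvious guard.
        return "cpu"
    return devices[0]

print(pick_device([]))      # empty id list, as in this bug -> cpu
print(pick_device([0, 1]))  # normal case -> cuda:0
```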
I'm not sure why this is happening; it didn't happen with RLlib 1.3. Interestingly, I can train on GPUs using rllib train (the RLlib CLI) with no issue.
Ray version and other system information (Python version, TensorFlow version, OS):
- Ubuntu 18.04.5
- Python 3.8.10 (anaconda)
- CUDA 11.3
Installed Python packages
Package Version
------------------------ -------------------
aiohttp 3.7.4.post0
aiohttp-cors 0.7.0
aioredis 1.3.1
async-timeout 3.0.1
attrs 21.2.0
blessings 1.7
cachetools 4.2.2
certifi 2021.5.30
chardet 4.0.0
click 8.0.1
cloudpickle 1.6.0
colorama 0.4.4
dm-tree 0.1.6
filelock 3.0.12
google-api-core 1.30.0
google-auth 1.32.0
googleapis-common-protos 1.53.0
gpustat 0.6.0
grpcio 1.38.1
gym 0.18.3
hiredis 2.0.0
idna 2.10
jsonschema 3.2.0
msgpack 1.0.2
multidict 5.1.0
numpy 1.21.0
nvidia-ml-py3 7.352.0
opencensus 0.7.13
opencensus-context 0.1.2
opencv-python 4.5.2.54
packaging 20.9
pandas 1.2.5
Pillow 8.2.0
pip 21.1.2
prometheus-client 0.11.0
protobuf 3.17.3
psutil 5.8.0
py-spy 0.3.7
pyasn1 0.4.8
pyasn1-modules 0.2.8
pydantic 1.8.2
pyglet 1.5.15
pyparsing 2.4.7
pyrsistent 0.18.0
python-dateutil 2.8.1
pytz 2021.1
PyYAML 5.4.1
ray 2.0.0.dev0
redis 3.5.3
requests 2.25.1
rsa 4.7.2
scipy 1.7.0
setuptools 52.0.0.post20210125
six 1.16.0
tabulate 0.8.9
torch 1.9.0
typing-extensions 3.10.0.0
urllib3 1.26.6
wheel 0.36.2
yarl 1.6.3
Reproduction (REQUIRED)
test.py
import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print

ray.init()
config = ppo.DEFAULT_CONFIG.copy()
config.update({
    "num_gpus": 1,
    "num_workers": 1,
    "framework": "torch",
})
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")

# Can optionally call trainer.restore(path) to load a checkpoint.
for i in range(1000):
    # Perform one iteration of training the policy with PPO
    result = trainer.train()
    print(pretty_print(result))

    if i % 100 == 0:
        checkpoint = trainer.save()
        print("checkpoint saved at", checkpoint)
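Since the traceback comes from an empty device list rather than from PPO itself, a cheap pre-flight check before launching training is to inspect CUDA_VISIBLE_DEVICES in the driver process. The helper below (`visible_gpu_ids` is a hypothetical name, not part of Ray) only parses the environment variable; it is a diagnostic sketch, not a fix.

```python
import os

def visible_gpu_ids():
    # CUDA_VISIBLE_DEVICES unset means CUDA may see all GPUs;
    # an empty string means it sees none.
    env = os.environ.get("CUDA_VISIBLE_DEVICES")
    if env is None:
        return None  # no restriction in place
    return [int(i) for i in env.split(",") if i.strip() != ""]

print(visible_gpu_ids())
```

Running this right before ray.init() makes it easy to tell whether the driver's CUDA visibility was already restricted before Ray assigned any GPUs.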
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.