Closed
Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 18.04
- Ray installed from (source or binary): source
- Ray version: 0.6.6
- Python version: 3.6.6
- Exact command to reproduce:
Describe the problem
Running Ray with RLlib on a cluster intermittently produces the error below.
After the error occurs, the experiment sometimes keeps running and sometimes hangs.
Source code / logs
2019-04-18 09:37:11,244 INFO sampler.py:309 -- Info return from env: {3: {'agent0': {}}}
2019-04-18 09:37:13,468 ERROR worker.py:1679 -- The monitor failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/monitor.py", line 379, in <module>
    monitor.run()
  File "/usr/local/lib/python3.6/dist-packages/ray/monitor.py", line 316, in run
    self.autoscaler.update()
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 396, in update
    raise e
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 388, in update
    self._update()
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 409, in _update
    self.log_info_string(nodes)
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 637, in log_info_string
    logger.info("StandardAutoscaler: {}".format(self.info_string(nodes)))
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 653, in info_string
    len(nodes), self.target_num_workers(), suffix)
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 521, in target_num_workers
    cur_used = self.load_metrics.approx_workers_used()
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 196, in approx_workers_used
    return self._info()["NumNodesUsed"]
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 214, in _info
    used = amount - avail_resources[resource_id]
KeyError: b'GPU'
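The last frame suggests that, for some node, the mapping of available resources has no b'GPU' entry while the mapping of total resources does. The sketch below reproduces that failure mode in isolation and shows one possible defensive lookup. The helper name `used_resources` and the sample dictionaries are hypothetical and are not Ray's actual code. Treating a missing entry as zero available is also an assumption, and it may not be the right behavior for the autoscaler.

```python
def used_resources(total, avail):
    """Return per-resource usage for one node.

    Hypothetical helper, not Ray's implementation. A resource missing
    from `avail` is assumed to have 0 available (i.e. fully used).
    """
    return {rid: amount - avail.get(rid, 0.0) for rid, amount in total.items()}


# Sample data mirroring the situation in the traceback (made-up values).
total = {b"CPU": 4.0, b"GPU": 1.0}
avail = {b"CPU": 3.0}  # no b"GPU" entry reported for this node

# Direct indexing, as in the failing line, raises KeyError.
try:
    strict = {rid: amount - avail[rid] for rid, amount in total.items()}
except KeyError as err:
    print("strict lookup failed:", err)

# The defensive lookup does not raise.
print(used_resources(total, avail))
```

If something like this applies, the monitor would keep running when a node briefly reports an incomplete resource set, at the cost of overestimating usage for that node during that update.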