Key Error happens sometimes when running on a cluster #4655

@nikola-j

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): 18.04
  • Ray installed from (source or binary): source
  • Ray version: 0.6.6
  • Python version: 3.6.6
  • Exact command to reproduce:

Describe the problem

When running Ray with RLlib on a cluster, the monitor intermittently fails with the error below.
After the error occurs, the experiment sometimes continues to run and sometimes hangs.

Source code / logs

2019-04-18 09:37:11,244       INFO sampler.py:309 -- Info return from env: {3: {'agent0': {}}}
2019-04-18 09:37:13,468 ERROR worker.py:1679 -- The monitor failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/monitor.py", line 379, in <module>
    monitor.run()
  File "/usr/local/lib/python3.6/dist-packages/ray/monitor.py", line 316, in run
    self.autoscaler.update()
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 396, in update
    raise e
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 388, in update
    self._update()
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 409, in _update
    self.log_info_string(nodes)
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 637, in log_info_string
    logger.info("StandardAutoscaler: {}".format(self.info_string(nodes)))
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 653, in info_string
    len(nodes), self.target_num_workers(), suffix)
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 521, in target_num_workers
    cur_used = self.load_metrics.approx_workers_used()
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 196, in approx_workers_used
    return self._info()["NumNodesUsed"]
  File "/usr/local/lib/python3.6/dist-packages/ray/autoscaler/autoscaler.py", line 214, in _info
    used = amount - avail_resources[resource_id]
KeyError: b'GPU'
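
The last frame suggests that the dictionary of available resources has no entry for a resource (here b'GPU') that is present in the dictionary of total resources. The snippet below is a minimal standalone sketch of that failure mode and of one tolerant lookup. It is not Ray's code: `approx_usage`, `static`, `avail`, and `missing_avail` are hypothetical names, and treating a missing entry as zero availability is an assumption, since the log does not show why the key is absent.

```python
def approx_usage(static_resources, avail_resources, missing_avail=0):
    """Return {resource_id: used_amount} for every resource in static_resources.

    Hypothetical helper, not part of Ray. `missing_avail` is the availability
    assumed for a resource that has no entry in avail_resources.
    """
    usage = {}
    for resource_id, amount in static_resources.items():
        # dict.get() avoids the KeyError raised by avail_resources[resource_id]
        # when the availability report lacks this resource.
        avail = avail_resources.get(resource_id, missing_avail)
        usage[resource_id] = amount - avail
    return usage


# Example data (made up): total capacity lists a GPU, the availability report does not.
static = {b"CPU": 8, b"GPU": 1}
avail = {b"CPU": 6}

# Direct indexing reproduces the same kind of error as in the traceback.
try:
    used_gpu = static[b"GPU"] - avail[b"GPU"]
except KeyError as err:
    print("KeyError:", err)

# The tolerant lookup returns a value for every resource instead of raising.
print(approx_usage(static, avail))
```

Whether a missing entry should count as fully used (`missing_avail=0`) or fully free depends on how the availability report is produced, which would need to be confirmed against the autoscaler source.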
