[rllib] tensorflow 1.14 doesn't work with GPUs any longer #9631

@andrew-rosenfeld-ts

What is the problem?

Using a recent nightly build of Ray/RLlib, you can't train using GPUs with TensorFlow 1.14 due to an API mismatch.

rollout_worker.py assumes that tf.config has a function list_physical_devices, but in TensorFlow 1.14 the module only exposes experimental_list_devices, so you get:

AttributeError: module 'tensorflow._api.v1.config' has no attribute 'list_physical_devices'

Here's the code in question in rollout_worker.py:

if (ray.is_initialized() and
        ray.worker._mode() != ray.worker.LOCAL_MODE):
    # Check available number of GPUs
    if not ray.get_gpu_ids():
        logger.debug(
            "Creating policy evaluation worker {}".format(
                worker_index) +
            " on CPU (please ignore any CUDA init errors)")
    elif (policy_config["framework"] in ["tf2", "tf", "tfe"] and
          not tf.config.list_physical_devices("GPU")) or \
            (policy_config["framework"] == "torch" and
             not torch.cuda.is_available()):
        raise RuntimeError(
            "GPUs were assigned to this worker by Ray, but "
            "your DL framework ({}) reports GPU acceleration is "
            "disabled. This could be due to a bad CUDA- or {} "
            "installation.".format(
                policy_config["framework"],
                policy_config["framework"]))

Compare that with what TensorFlow 1.14 actually exports in tensorflow/_api/v1/config/__init__.py:

from tensorflow.python.eager.context import list_devices as experimental_list_devices
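
One possible way to tolerate both API shapes is to probe for the newer function and fall back to the older one. The following is only a minimal sketch, not the actual Ray fix: `list_gpu_devices` is a hypothetical helper name, and the sketch assumes that `experimental_list_devices()` returns entries whose string form contains "GPU" for GPU devices. The stand-in objects exist only so the example runs without TensorFlow installed.

```python
from types import SimpleNamespace


def list_gpu_devices(tf_config):
    """Return the GPU devices visible through a `tf.config`-like module.

    Hypothetical helper for illustration; not part of Ray or TensorFlow.
    """
    # Newer TensorFlow releases expose list_physical_devices directly.
    if hasattr(tf_config, "list_physical_devices"):
        return list(tf_config.list_physical_devices("GPU"))
    # TensorFlow 1.14 only has experimental_list_devices, which takes no
    # device-type argument, so filter the results ourselves.
    # Assumption: a GPU entry's string form contains "GPU"
    # (this would also match XLA_GPU entries).
    return [d for d in tf_config.experimental_list_devices()
            if "GPU" in str(d)]


# Stand-ins for the two API shapes, so the sketch runs without TensorFlow.
new_api = SimpleNamespace(
    list_physical_devices=lambda kind: ["GPU:0"] if kind == "GPU" else [])
old_api = SimpleNamespace(
    experimental_list_devices=lambda: [
        "/job:localhost/replica:0/task:0/device:CPU:0",
        "/job:localhost/replica:0/task:0/device:GPU:0",
    ])

print(list_gpu_devices(new_api))
print(list_gpu_devices(old_api))
```

With a real installation, the check in rollout_worker.py could then call something like `list_gpu_devices(tf.config)` instead of `tf.config.list_physical_devices("GPU")`. The `hasattr` probe works with the 1.14 deprecation wrapper because its `__getattr__` raises AttributeError for missing names, as the stack trace shows.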

Here's the full stack trace:

Traceback (most recent call last):
  File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 471, in _process_trial
    result = self.trial_executor.fetch_result(trial)
  File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 430, in fetch_result
    result = ray.get(trial_future[0], DEFAULT_GET_TIMEOUT)
  File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/worker.py", line 1532, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AttributeError): ray::PPO.train() (pid=7711, ip=10.128.0.4)
  File "python/ray/_raylet.pyx", line 433, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 468, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 472, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 473, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 426, in ray._raylet.execute_task.function_executor
  File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 88, in __init__
    Trainer.__init__(self, config, env, logger_creator)
  File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 475, in __init__
    super().__init__(config, logger_creator)
  File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/tune/trainable.py", line 232, in __init__
    self.setup(copy.deepcopy(self.config))
  File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 639, in setup
    self._init(self.config, self.env_creator)
  File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/agents/trainer_template.py", line 102, in _init
    env_creator, self._policy, config, self.config["num_workers"])
  File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/agents/trainer.py", line 709, in _make_workers
    logdir=self.logdir)
  File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/evaluation/worker_set.py", line 67, in __init__
    RolloutWorker, env_creator, policy, 0, self._local_config)
  File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/evaluation/worker_set.py", line 296, in _make_worker
    extra_python_environs=extra_python_environs)
  File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/ray/rllib/evaluation/rollout_worker.py", line 415, in __init__
    not tf.config.list_physical_devices("GPU")) or \
  File "/home/andrew/miniconda3/envs/ray_nightly_tf14/lib/python3.7/site-packages/tensorflow/python/util/deprecation_wrapper.py", line 106, in __getattr__
    attr = getattr(self._dw_wrapped_module, name)
AttributeError: module 'tensorflow._api.v1.config' has no attribute 'list_physical_devices'

Reproduction (REQUIRED)

Ray: latest nightly wheel as of 2020-07-22
TensorFlow: 1.14
Python: 3.7
OS: Ubuntu 20.04

from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
tune.run(PPOTrainer,
         config={
             "env": "CartPole-v0",
             "num_workers": 4,
             "num_envs_per_worker": 2,
             "num_gpus": 0.5,
             "num_gpus_per_worker": 0.1,
         })
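
For context on the GPU settings above: in RLlib, `num_gpus` is the share requested for the trainer process and `num_gpus_per_worker` is the share requested for each rollout worker, so this config asks for 0.9 of a GPU in total and fits on a single device. A small sketch of that arithmetic (`total_gpus_requested` is a hypothetical helper, not an RLlib function):

```python
def total_gpus_requested(config):
    """Sum the GPU share of the trainer and of all rollout workers."""
    return (config["num_gpus"]
            + config["num_workers"] * config["num_gpus_per_worker"])


# Same GPU-related values as in the reproduction script.
repro_config = {
    "num_workers": 4,
    "num_gpus": 0.5,
    "num_gpus_per_worker": 0.1,
}

print(round(total_gpus_requested(repro_config), 2))
```

Because every process gets a non-zero GPU share, `ray.get_gpu_ids()` is non-empty and the TensorFlow device check is reached, which is where the AttributeError is raised.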

Labels: P0 (issues that should be fixed in short order), bug (something that is supposed to be working, but isn't)
