Trials always fail with RayGetError #3170

@llan-ml

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Ray installed from (source or binary): binary
  • Ray version: a221f55
  • Python version: 3.6.5

I run experiments with a lot of trials, but the trials fail after running for a while.

All failed trials raise ray.worker.RayGetError at different places in the code, but every failure is related to an actor call.
Here are some samples:

Traceback (most recent call last):
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 243, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 200, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2257, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000b4d20acfc3d2a0957da8f94a483252c5). It was created by remote function train which failed with:

Remote function train failed with:

Traceback (most recent call last):
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 801, in _process_task
    *arguments)
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
    method_returns = method(actor, *args)
  File "/home/lanlin/Workspaces/morrl/maml.py", line 159, in train
    return Agent.__base__.train(self)
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trainable.py", line 146, in train
    result = self._train()
  File "/home/lanlin/Workspaces/morrl/maml.py", line 150, in _train
    fetches = self.optimizer.step()
  File "/home/lanlin/Workspaces/morrl/maml_optimizer.py", line 68, in step
    self.sync_weights()
  File "/home/lanlin/Workspaces/morrl/maml_optimizer.py", line 28, in sync_weights
    e.set_weights.remote(weights) for e in self.remote_evaluators])
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2249, in get
    raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000571c3d8c18772aaa96281fd5e490352c). It was created by remote function  which failed with:

Remote function  failed with:

Invalid return value: likely worker died or was killed while executing the task.

Traceback (most recent call last):
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 243, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 200, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2257, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(0100000070a8f76287e7e97fdf04f6f3fea6ef14). It was created by remote function train which failed with:

Remote function train failed with:

Traceback (most recent call last):
  File "/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 801, in _process_task
    *arguments)
  File "/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
    method_returns = method(actor, *args)
  File "/llan/Workspaces/morrl/maml.py", line 159, in train
    return Agent.__base__.train(self)
  File "/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trainable.py", line 146, in train
    result = self._train()
  File "/llan/Workspaces/morrl/maml.py", line 150, in _train
    fetches = self.optimizer.step()
  File "/llan/Workspaces/morrl/maml_optimizer.py", line 48, in step
    for e in self.remote_evaluators]))
  File "/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2249, in get
    raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000b8ad0fd6f340492888c8cd34049722a0). It was created by remote function  which failed with:

Remote function  failed with:

Invalid return value: likely worker died or was killed while executing the task.

Traceback (most recent call last):
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 243, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 200, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2257, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000de3f51eb7835f543290efaecdf49b687). It was created by remote function train which failed with:

Remote function train failed with:

Traceback (most recent call last):
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 801, in _process_task
    *arguments)
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
    method_returns = method(actor, *args)
  File "/home/llan/Workspaces/morrl/maml.py", line 159, in train
    return Agent.__base__.train(self)
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trainable.py", line 146, in train
    result = self._train()
  File "/home/llan/Workspaces/morrl/maml.py", line 150, in _train
    fetches = self.optimizer.step()
  File "/home/llan/Workspaces/morrl/maml_optimizer.py", line 39, in step
    for e in self.remote_evaluators])
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2249, in get
    raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(0100000014d521475a4d89587da8c214381aee91). It was created by remote function inner_update which failed with:

Remote function inner_update failed with:

Traceback (most recent call last):
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 801, in _process_task
    *arguments)
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
    method_returns = method(actor, *args)
  File "/home/llan/Workspaces/morrl/maml_policy_evaluator.py", line 121, in inner_update
    inner_grad_values, inner_infos, samples = self._inner_update_once()
  File "/home/llan/Workspaces/morrl/maml_policy_evaluator.py", line 104, in _inner_update_once
    samples = self.sample()
  File "/home/llan/Workspaces/morrl/maml_policy_evaluator.py", line 100, in sample
    self.reset_sample()
  File "/home/llan/Workspaces/morrl/maml_policy_evaluator.py", line 84, in reset_sample
    async_env.new_obs = async_env.vector_env.vector_reset()
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/rllib/env/vector_env.py", line 76, in vector_reset
    return [e.reset() for e in self.envs]
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/rllib/env/vector_env.py", line 76, in <listcomp>
    return [e.reset() for e in self.envs]
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/gym/core.py", line 308, in reset
    observation = self.env.reset(**kwargs)
  File "/home/llan/Workspaces/morrl/reset_wrapper.py", line 36, in reset
    reset_args = ray.get(self.reset_args_holder.get.remote())
  File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2257, in get
    raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(010000007409f067b5eda6f654084b741f365669). It was created by remote function  which failed with:

Remote function  failed with:

Invalid return value: likely worker died or was killed while executing the task.
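
In each sample the innermost failure comes from a ray.get on an actor method, and two of them fetch a whole list of futures at once, so the trace does not say which evaluator died. A minimal debugging sketch that might help narrow this down: resolve the futures one at a time and record the index of each failure. The helper name `get_each` and its `get` parameter are hypothetical and are not part of Ray or of the code in the traces. It assumes that ray.worker.RayGetError is an ordinary Exception subclass.

```python
from typing import Any, Callable, List, Sequence, Tuple


def get_each(
    futures: Sequence[Any],
    get: Callable[[Any], Any],
) -> Tuple[List[Any], List[Tuple[int, Exception]]]:
    """Resolve futures one at a time and record which indices fail.

    `get` is the blocking fetch in use (ray.get in the traces above). It is
    passed in as an argument so this helper does not import Ray itself.
    """
    results: List[Any] = []
    failures: List[Tuple[int, Exception]] = []
    for index, future in enumerate(futures):
        try:
            results.append(get(future))
        except Exception as exc:  # assumption: RayGetError subclasses Exception
            results.append(None)  # keep positions aligned with the input list
            failures.append((index, exc))
    return results, failures


# Hypothetical use inside sync_weights (names taken from the traceback):
#   futures = [e.set_weights.remote(weights) for e in self.remote_evaluators]
#   _, failures = get_each(futures, ray.get)
#   for index, exc in failures:
#       print("evaluator", index, "failed:", exc)
```

Fetching one future at a time only changes where the driver blocks, not when the remote calls are submitted, so it should be usable as a temporary diagnostic without changing the behaviour of the evaluators.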

Labels: bug (Something that is supposed to be working; but isn't)