Trials always fail with RayGetError #3170
Closed
Labels: bug (Something that is supposed to be working; but isn't)
Description
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
- Ray installed from (source or binary): binary
- Ray version: a221f55
- Python version: 3.6.5
I run experiments with a lot of trials, but the trials fail after running for a while.
All failed trials raise ray.worker.RayGetError at different places in the code, but every failure is related to an actor call.
Here are some sample tracebacks:
Traceback (most recent call last):
File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 243, in _process_events
result = self.trial_executor.fetch_result(trial)
File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 200, in fetch_result
result = ray.get(trial_future[0])
File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2257, in get
raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000b4d20acfc3d2a0957da8f94a483252c5). It was created by remote function train which failed with:
Remote function train failed with:
Traceback (most recent call last):
File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 801, in _process_task
*arguments)
File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
method_returns = method(actor, *args)
File "/home/lanlin/Workspaces/morrl/maml.py", line 159, in train
return Agent.__base__.train(self)
File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trainable.py", line 146, in train
result = self._train()
File "/home/lanlin/Workspaces/morrl/maml.py", line 150, in _train
fetches = self.optimizer.step()
File "/home/lanlin/Workspaces/morrl/maml_optimizer.py", line 68, in step
self.sync_weights()
File "/home/lanlin/Workspaces/morrl/maml_optimizer.py", line 28, in sync_weights
e.set_weights.remote(weights) for e in self.remote_evaluators])
File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2249, in get
raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000571c3d8c18772aaa96281fd5e490352c). It was created by remote function which failed with:
Remote function failed with:
Invalid return value: likely worker died or was killed while executing the task.

Traceback (most recent call last):
File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 243, in _process_events
result = self.trial_executor.fetch_result(trial)
File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 200, in fetch_result
result = ray.get(trial_future[0])
File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2257, in get
raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(0100000070a8f76287e7e97fdf04f6f3fea6ef14). It was created by remote function train which failed with:
Remote function train failed with:
Traceback (most recent call last):
File "/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 801, in _process_task
*arguments)
File "/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
method_returns = method(actor, *args)
File "/llan/Workspaces/morrl/maml.py", line 159, in train
return Agent.__base__.train(self)
File "/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trainable.py", line 146, in train
result = self._train()
File "/llan/Workspaces/morrl/maml.py", line 150, in _train
fetches = self.optimizer.step()
File "/llan/Workspaces/morrl/maml_optimizer.py", line 48, in step
for e in self.remote_evaluators]))
File "/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2249, in get
raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000b8ad0fd6f340492888c8cd34049722a0). It was created by remote function which failed with:
Remote function failed with:
Invalid return value: likely worker died or was killed while executing the task.

Traceback (most recent call last):
File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 243, in _process_events
result = self.trial_executor.fetch_result(trial)
File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 200, in fetch_result
result = ray.get(trial_future[0])
File "/home/lanlin/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2257, in get
raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(01000000de3f51eb7835f543290efaecdf49b687). It was created by remote function train which failed with:
Remote function train failed with:
Traceback (most recent call last):
File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 801, in _process_task
*arguments)
File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
method_returns = method(actor, *args)
File "/home/llan/Workspaces/morrl/maml.py", line 159, in train
return Agent.__base__.train(self)
File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/tune/trainable.py", line 146, in train
result = self._train()
File "/home/llan/Workspaces/morrl/maml.py", line 150, in _train
fetches = self.optimizer.step()
File "/home/llan/Workspaces/morrl/maml_optimizer.py", line 39, in step
for e in self.remote_evaluators])
File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2249, in get
raise RayGetError(object_ids[i], value)
ray.worker.RayGetError: Could not get objectid ObjectID(0100000014d521475a4d89587da8c214381aee91). It was created by remote function inner_update which failed with:
Remote function inner_update failed with:
Traceback (most recent call last):
File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 801, in _process_task
*arguments)
File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/function_manager.py", line 481, in actor_method_executor
method_returns = method(actor, *args)
File "/home/llan/Workspaces/morrl/maml_policy_evaluator.py", line 121, in inner_update
inner_grad_values, inner_infos, samples = self._inner_update_once()
File "/home/llan/Workspaces/morrl/maml_policy_evaluator.py", line 104, in _inner_update_once
samples = self.sample()
File "/home/llan/Workspaces/morrl/maml_policy_evaluator.py", line 100, in sample
self.reset_sample()
File "/home/llan/Workspaces/morrl/maml_policy_evaluator.py", line 84, in reset_sample
async_env.new_obs = async_env.vector_env.vector_reset()
File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/rllib/env/vector_env.py", line 76, in vector_reset
return [e.reset() for e in self.envs]
File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/rllib/env/vector_env.py", line 76, in <listcomp>
return [e.reset() for e in self.envs]
File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/gym/core.py", line 308, in reset
observation = self.env.reset(**kwargs)
File "/home/llan/Workspaces/morrl/reset_wrapper.py", line 36, in reset
reset_args = ray.get(self.reset_args_holder.get.remote())
File "/home/llan/.pyenv/versions/anaconda3-5.2.0/lib/python3.6/site-packages/ray/worker.py", line 2257, in get
raise RayGetError(object_ids, value)
ray.worker.RayGetError: Could not get objectid ObjectID(010000007409f067b5eda6f654084b741f365669). It was created by remote function which failed with:
Remote function failed with:
Invalid return value: likely worker died or was killed while executing the task.
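
Although the outer frames differ, all three traces end with the same innermost message. As a reading aid, here is a minimal sketch of a helper that pulls out that innermost reason from a nested error message. The name `innermost_failure` and the regular expression are hypothetical and not part of Ray. The sketch assumes only the message layout shown in this report, where each nesting level starts with a line of the form "Remote function [name] failed with:".

```python
import re

# Hypothetical helper for reading the traces above; it is not part of Ray.
# It assumes the message layout shown in this issue, where each nesting level
# starts with a line of the form "Remote function [name] failed with:".
FAILED_WITH = re.compile(r"Remote function\s*(\S*)\s*failed with:")


def innermost_failure(message):
    """Return (function_name, reason) for the deepest failure, or None."""
    lines = [line.strip() for line in message.splitlines()]
    deepest = None
    for index, line in enumerate(lines):
        match = FAILED_WITH.match(line)
        if match:
            # Keep overwriting so the last marker in the text wins.
            deepest = (index, match.group(1))
    if deepest is None:
        return None
    index, name = deepest
    # The reason is the first non-empty line after the last marker.
    reason = next((line for line in lines[index + 1:] if line), "")
    return (name or None, reason)


# Excerpt copied from the first trace in this report.
sample = """\
Remote function train failed with:
Traceback (most recent call last):
File "/home/lanlin/Workspaces/morrl/maml_optimizer.py", line 68, in step
self.sync_weights()
Remote function failed with:
Invalid return value: likely worker died or was killed while executing the task.
"""

print(innermost_failure(sample))
```

Run against each of the three traces, this would return the same "worker died or was killed" reason with no function name attached. That suggests the actor process itself is going away, rather than the `train`, `set_weights`, or `inner_update` code raising an ordinary Python exception.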