Description
System information
- OS Platform and Distribution: Linux Ubuntu 16.04.6 LTS
- Ray installed from: binary
- Ray version: 0.7.3
- Python version: 3.7.3
Describe the problem
We use Ray to tune the hyperparameters of our model with PBT, running multiple workers in parallel. Occasionally, some workers crash with the following error message:
Traceback (most recent call last):
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 537, in _process_trial
    trial, force=result.get(SHOULD_CHECKPOINT, False))
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 567, in _checkpoint_trial_if_needed
    self.trial_executor.save(trial, storage=Checkpoint.DISK)
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 513, in save
    self._checkpoint_and_erase(trial)
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 538, in _checkpoint_and_erase
    ray.get(trial.runner.delete_checkpoint.remote(trial.history[-1]))
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/worker.py", line 2247, in get
    raise value
ray.exceptions.RayTaskError: ray_SupervisedTrainable:delete_checkpoint() (pid=56457, host=ip-172-31-34-33)
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/trainable.py", line 246, in delete_checkpoint
    shutil.rmtree(checkpoint_dir)
  File "/home/user/.pyenv/versions/3.7.3/lib/python3.7/shutil.py", line 482, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/home/user/.pyenv/versions/3.7.3/lib/python3.7/shutil.py", line 480, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/user/logs/model-20190824_174949/model_0_2019-08-24_17-57-57x48g5wke/checkpoint_5/checkpoint-ec1b867c-8dee-4bad-afac-705ba26bc1fa.pth.tar'
Apparently, Ray tries to delete a checkpoint that does not exist or has already been deleted. A quick fix seems to be to override delete_checkpoint of the Trainable class so that it checks whether the checkpoint path exists before deleting it:
    import os
    import shutil

    # Override in our Trainable subclass.
    def delete_checkpoint(self, checkpoint_dir):
        if os.path.exists(checkpoint_dir):
            if os.path.isfile(checkpoint_dir):
                shutil.rmtree(os.path.dirname(checkpoint_dir))
            else:
                shutil.rmtree(checkpoint_dir)

However, this only hides the symptom. It does not fix the underlying issue, and it is not clear to me where that issue stems from. The checkpoint directory is not on a remote drive.
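If the checkpoint can disappear between the existence check and the call to shutil.rmtree, the workaround above could still fail. A variant that catches FileNotFoundError instead of checking first might be more robust. The following is only a sketch under that assumption: delete_checkpoint_if_present is a hypothetical standalone helper, not part of Ray, and it would have to be called from the overridden delete_checkpoint.

```python
import os
import shutil


def delete_checkpoint_if_present(checkpoint_path):
    """Hypothetical helper (not a Ray API): delete a checkpoint, tolerating a missing path.

    Accepts either a checkpoint file or a checkpoint directory, like the
    workaround above. Returns True if the removal completed, False if the
    path (or part of it) was already gone.
    """
    # If we were handed a file, remove the directory that contains it.
    if os.path.isfile(checkpoint_path):
        target = os.path.dirname(checkpoint_path)
    else:
        target = checkpoint_path
    try:
        shutil.rmtree(target)
    except FileNotFoundError:
        # Already deleted, possibly by an earlier or concurrent cleanup.
        return False
    return True
```

Like the existence check, this only suppresses the crash. The boolean return value could be logged to find out how often, and for which trials, a checkpoint is already missing.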