[tune] FileNotFoundError when deleting checkpoint #5549

@FelixOpolka

System information

  • OS Platform and Distribution: Linux Ubuntu 16.04.6 LTS
  • Ray installed from: binary
  • Ray version: 0.7.3
  • Python version: 3.7.3

Describe the problem

We use Ray Tune to tune the hyperparameters of our model with PBT (Population Based Training), running multiple workers in parallel. Occasionally, some workers crash with the following error message:

Traceback (most recent call last):
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 537, in _process_trial
    trial, force=result.get(SHOULD_CHECKPOINT, False))
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 567, in _checkpoint_trial_if_needed
    self.trial_executor.save(trial, storage=Checkpoint.DISK)
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 513, in save
    self._checkpoint_and_erase(trial)
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 538, in _checkpoint_and_erase
    ray.get(trial.runner.delete_checkpoint.remote(trial.history[-1]))
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/worker.py", line 2247, in get
    raise value
ray.exceptions.RayTaskError: ray_SupervisedTrainable:delete_checkpoint() (pid=56457, host=ip-172-31-34-33)
  File "/home/user/.cache/pypoetry/virtualenvs/project-py3.7/lib/python3.7/site-packages/ray/tune/trainable.py", line 246, in delete_checkpoint
    shutil.rmtree(checkpoint_dir)
  File "/home/user/.pyenv/versions/3.7.3/lib/python3.7/shutil.py", line 482, in rmtree
    onerror(os.lstat, path, sys.exc_info())
  File "/home/user/.pyenv/versions/3.7.3/lib/python3.7/shutil.py", line 480, in rmtree
    orig_st = os.lstat(path)
FileNotFoundError: [Errno 2] No such file or directory: '/home/user/logs/model-20190824_174949/model_0_2019-08-24_17-57-57x48g5wke/checkpoint_5/checkpoint-ec1b867c-8dee-4bad-afac-705ba26bc1fa.pth.tar'

Apparently, Ray tries to delete a checkpoint directory that does not exist or has already been deleted. A quick fix seems to be to override delete_checkpoint of the Trainable class so that it checks whether the checkpoint path exists before deleting it:

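    import os
    import shutil

    def delete_checkpoint(self, checkpoint_dir):
        # Only delete if the checkpoint path still exists.
        if os.path.exists(checkpoint_dir):
            if os.path.isfile(checkpoint_dir):
                # A file path was passed: remove its containing directory.
                shutil.rmtree(os.path.dirname(checkpoint_dir))
            else:
                shutil.rmtree(checkpoint_dir)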

However, this does not seem to fix the underlying issue, and it is not clear to me where that issue stems from. The checkpoint directory is not on a remote drive.
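Note that the exists-then-delete check in the quick fix still leaves a small window in which the path can vanish between the check and the rmtree call. A minimal sketch of a variant that tolerates this by catching FileNotFoundError instead is shown here. The helper name remove_checkpoint_dir and its boolean return value are hypothetical and not part of Ray, and like the quick fix it only hides the symptom rather than explaining why the checkpoint is already gone:

```python
import os
import shutil
import tempfile


def remove_checkpoint_dir(checkpoint_path):
    """Remove a checkpoint directory, tolerating a path that is already gone.

    Hypothetical helper (not part of Ray). Returns True if the directory
    was removed, False if it no longer existed.
    """
    # As in the quick fix: if an existing file is passed, remove the
    # directory that contains it; otherwise treat the path as the directory.
    if os.path.isfile(checkpoint_path):
        target = os.path.dirname(checkpoint_path)
    else:
        target = checkpoint_path
    try:
        shutil.rmtree(target)
    except FileNotFoundError:
        # The path disappeared (or never existed); nothing left to delete.
        return False
    return True
```

Inside a Trainable subclass, delete_checkpoint could simply call this helper with checkpoint_dir. Calling it twice on the same checkpoint removes the directory the first time and returns False the second time instead of raising.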
