[tune] Population-based training: broken when using keep_checkpoint_num #9036

@huberl

When using population-based training, Tune stops after some time and throws the following error:

There are paused trials, but no more pending trials with sufficient resources.

This is caused by not finding the latest checkpoint:

Failure # 1 (occurred at 2020-06-19_11-26-36)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 294, in start_trial
    self._start_trial(trial, checkpoint, train=train)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 235, in _start_trial
    self.restore(trial, checkpoint)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 673, in restore
    data_dict = TrainableUtil.pickle_checkpoint(value)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 62, in pickle_checkpoint
    checkpoint_dir = TrainableUtil.find_checkpoint_dir(checkpoint_path)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 87, in find_checkpoint_dir
    raise FileNotFoundError("Path does not exist", checkpoint_path)
FileNotFoundError: [Errno Path does not exist] /content/TRASH_TUNE_PBT_oversampling_mimic_densenet121/TUNE_Model_0_2020-06-19_11-24-215xncry9c/checkpoint_6/
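
To check which checkpoints are still on disk when this happens, a small helper like the one below can list the `checkpoint_<n>` directories inside a trial directory. This is a minimal sketch in plain Python, not part of Ray Tune: the function name `list_checkpoint_iterations` is hypothetical, and the `checkpoint_<n>` naming is assumed from the path shown in the traceback.

```python
import os
import re

# Directory names of the form checkpoint_<n>, as seen in the traceback path.
_CHECKPOINT_RE = re.compile(r"^checkpoint_(\d+)$")


def list_checkpoint_iterations(trial_dir):
    """Return the sorted iteration numbers of checkpoint_<n> directories in trial_dir.

    Hypothetical diagnostic helper; it only inspects the filesystem.
    """
    found = []
    for name in os.listdir(trial_dir):
        match = _CHECKPOINT_RE.match(name)
        # Ignore plain files and unrelated directories.
        if match and os.path.isdir(os.path.join(trial_dir, name)):
            found.append(int(match.group(1)))
    return sorted(found)
```

Running it on the trial directory named in the error shows whether `checkpoint_6` is missing while newer checkpoints exist, which would suggest it was removed after being saved.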

The error appears somewhat random, since it only shows up after a fair number of iterations.
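
One way a failure like this could arise, and only after several iterations, is a checkpoint retention limit deleting a directory that something else still refers to. The sketch below simulates that in plain Python without Ray. It is an illustration of the suspected mechanism, not Tune's actual code and not a confirmed diagnosis; `KeepLastN`, `save_checkpoint`, and `simulate` are hypothetical names.

```python
import os
import shutil
import tempfile
from collections import deque


class KeepLastN:
    """Hypothetical retention policy: keep only the newest keep_num checkpoint dirs."""

    def __init__(self, keep_num):
        self.keep_num = keep_num
        self.kept = deque()

    def on_checkpoint(self, path):
        self.kept.append(path)
        # Delete the oldest directories once the limit is exceeded.
        while len(self.kept) > self.keep_num:
            shutil.rmtree(self.kept.popleft(), ignore_errors=True)


def save_checkpoint(trial_dir, iteration):
    """Create an empty checkpoint_<iteration> directory and return its path."""
    path = os.path.join(trial_dir, f"checkpoint_{iteration}")
    os.makedirs(path)
    return path


def simulate(keep_num=2, pause_at=6, total_iters=10):
    """Remember the checkpoint path taken at pause_at, keep training, then
    report whether that remembered path still exists."""
    trial_dir = tempfile.mkdtemp(prefix="trial_")
    policy = KeepLastN(keep_num)
    remembered = None
    try:
        for iteration in range(1, total_iters + 1):
            path = save_checkpoint(trial_dir, iteration)
            policy.on_checkpoint(path)
            if iteration == pause_at:
                # Stand-in for a scheduler recording a path to restore from later.
                remembered = path
        return remembered, os.path.exists(remembered)
    finally:
        shutil.rmtree(trial_dir, ignore_errors=True)
```

With the defaults, the path remembered at iteration 6 is gone by iteration 10 because only the two newest directories are kept, so a later restore from it would fail. If the reference is taken at the final iteration instead, the path still exists, which would be consistent with the error depending on timing.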

The error can be reproduced in this Colab notebook. It is not a Colab-specific issue, since the same problem occurs on our own server.

@richardliaw Is this related to #8772?

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
tune: Tune-related issues
