Closed
Labels
P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working, but isn't
tune: Tune-related issues
Description
When using population-based training, Tune stops after some time and throws the following error:
There are paused trials, but no more pending trials with sufficient resources.
This is caused by not finding the latest checkpoint:
Failure # 1 (occurred at 2020-06-19_11-26-36)
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 294, in start_trial
    self._start_trial(trial, checkpoint, train=train)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 235, in _start_trial
    self.restore(trial, checkpoint)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/ray_trial_executor.py", line 673, in restore
    data_dict = TrainableUtil.pickle_checkpoint(value)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 62, in pickle_checkpoint
    checkpoint_dir = TrainableUtil.find_checkpoint_dir(checkpoint_path)
  File "/usr/local/lib/python3.6/dist-packages/ray/tune/trainable.py", line 87, in find_checkpoint_dir
    raise FileNotFoundError("Path does not exist", checkpoint_path)
FileNotFoundError: [Errno Path does not exist] /content/TRASH_TUNE_PBT_oversampling_mimic_densenet121/TUNE_Model_0_2020-06-19_11-24-215xncry9c/checkpoint_6/
The error appears to be somewhat random, since it only occurs after quite a few iterations.
The error can be reproduced in this Colab notebook. It is not a Colab-related issue, since the same problem arises on our own server.
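To narrow down whether the checkpoint directory named in the traceback was never written or was removed before the restore, it may help to list which `checkpoint_<n>` directories actually exist in the trial directory at the time of the failure. The following is a minimal diagnostic sketch using only the standard library. The helper names `list_checkpoint_iterations` and `checkpoint_exists` are hypothetical and not part of Ray, and the `checkpoint_<n>` naming is assumed from the path in the traceback above.

```python
import os
import re
import tempfile


def list_checkpoint_iterations(trial_dir):
    """Return the sorted iteration numbers of checkpoint_<n> directories.

    Hypothetical helper (not a Ray API); assumes the checkpoint_<n>
    directory naming seen in the traceback.
    """
    pattern = re.compile(r"^checkpoint_(\d+)$")
    iterations = []
    for name in os.listdir(trial_dir):
        match = pattern.match(name)
        if match and os.path.isdir(os.path.join(trial_dir, name)):
            iterations.append(int(match.group(1)))
    return sorted(iterations)


def checkpoint_exists(trial_dir, iteration):
    """Check whether checkpoint_<iteration> is present as a directory."""
    return os.path.isdir(
        os.path.join(trial_dir, "checkpoint_{}".format(iteration)))


# Demo on a temporary directory standing in for a real trial directory.
with tempfile.TemporaryDirectory() as trial_dir:
    for i in (4, 5, 10):
        os.mkdir(os.path.join(trial_dir, "checkpoint_{}".format(i)))
    present = list_checkpoint_iterations(trial_dir)
    missing_6 = not checkpoint_exists(trial_dir, 6)
    print("present:", present, "checkpoint_6 missing:", missing_6)
```

On a real run, `trial_dir` would be the trial directory from the error message. If later checkpoints are present but the requested one is not, that would suggest the checkpoint was deleted rather than never saved.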
@richardliaw Is this related to #8772 ?