-
Notifications
You must be signed in to change notification settings - Fork 7.4k
Closed
Labels
P2Important issue, but not time-criticalImportant issue, but not time-criticalbugSomething that is supposed to be working; but isn'tSomething that is supposed to be working; but isn't
Description
What is the problem?
I am running PBT with keep_checkpoints_num=3. After some time (this time after running for a few hours), a process died with the error that the checkpoint could not be found, probably because it was deleted since it was outdated.
Ray version and other system information (Python version, TensorFlow version, OS):
- ray 0.9.0.dev0 (b95e28f)
- Python 3.7.5
- Torch 1.5.0
- Ubuntu 18.04
Failure # 1 (occurred at 2020-05-13_20-31-02)
Traceback (most recent call last):
File "[...]/ray/tune/ray_trial_executor.py", line 295, in start_trial
self._start_trial(trial, checkpoint, train=train)
File "[...]/ray/tune/ray_trial_executor.py", line 235, in _start_trial
self.restore(trial, checkpoint)
File "[...]/ray/tune/ray_trial_executor.py", line 675, in restore
data_dict = TrainableUtil.pickle_checkpoint(value)
File "[...]/ray/tune/trainable.py", line 33, in pickle_checkpoint
checkpoint_dir = TrainableUtil.find_checkpoint_dir(checkpoint_path)
File "[...]/ray/tune/trainable.py", line 57, in find_checkpoint_dir
raise FileNotFoundError("Path does not exist", checkpoint_path)
FileNotFoundError: [Errno Path does not exist] [...]/checkpoint_110/checkpoint-110
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
I can't reproduce this problem for now as I can't spend time and computation power running the PBT examples, but this is my PBT configuration:
def explore(config):
config["gamma"] = np.clip(config["gamma"], 0.5, 1.0)
config["clip_param"] = np.clip(config["clip_param"], 0.05, 0.5)
config["lambda"] = np.clip(config["lambda"], 0.0, 1.0)
config["lr"] = np.clip(config["lr"], 1e-8, 1e-2)
return config
pbt = PopulationBasedTraining(
time_attr="time_total_s",
metric="evaluation/custom_metrics/perf_100_mean",
mode="max",
perturbation_interval=60,
resample_probability=0.25,
# Specifies the mutations of these hyperparams
hyperparam_mutations={
"lr": lambda: random.uniform(1e-6, 1e-3),
"lambda": lambda: random.uniform(0.0, 1.0),
"clip_param": lambda: random.uniform(0.1, 0.4),
"num_sgd_iter": lambda: random.randint(1, 14),
"vf_loss_coeff": lambda: random.uniform(1e-6, 1.0),
"gamma": lambda: random.uniform(0.5, 1),
"vf_clip_param": lambda: random.uniform(1, 1000)
},
custom_explore_fn=explore)
tune config:
name="pbt_coverage_coop",
local_dir="[...]/ray_results",
scheduler=pbt,
num_samples=16,
checkpoint_freq=5,
keep_checkpoints_num=3,
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
P2Important issue, but not time-criticalImportant issue, but not time-criticalbugSomething that is supposed to be working; but isn'tSomething that is supposed to be working; but isn't