[tune] PBT: Path to checkpoint does not exist error after some time #8441

@janblumenkamp

What is the problem?

I am running PBT with keep_checkpoints_num=3. After some time (in this case, a few hours into the run), a process died with an error saying that the checkpoint could not be found, probably because the checkpoint had already been deleted as outdated.
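The suspected failure mode can be illustrated with a toy model. This is only a sketch of the hypothesis above, not Ray's implementation: `ToyCheckpointStore` and its methods are hypothetical names, and it assumes that only the most recent `keep_num` checkpoints are retained while something else still holds a path to an older one.

```python
from collections import deque


class ToyCheckpointStore:
    """Hypothetical stand-in for checkpoint retention (not a Ray class)."""

    def __init__(self, keep_num):
        self.keep_num = keep_num
        self.paths = deque()   # checkpoints in creation order
        self.on_disk = set()   # checkpoints that still "exist"

    def save(self, iteration):
        path = f"checkpoint_{iteration}/checkpoint-{iteration}"
        self.paths.append(path)
        self.on_disk.add(path)
        # Assumption: drop the oldest checkpoints once the limit is exceeded.
        while len(self.paths) > self.keep_num:
            self.on_disk.discard(self.paths.popleft())
        return path

    def restore(self, path):
        if path not in self.on_disk:
            raise FileNotFoundError(f"Path does not exist: {path}")
        return path


store = ToyCheckpointStore(keep_num=3)

# A scheduler-like component remembers this path for a later restore.
remembered = store.save(110)

# Three more checkpoints (every 5 iterations) push the remembered one out.
for iteration in (115, 120, 125):
    store.save(iteration)

try:
    store.restore(remembered)
    restore_failed = False
except FileNotFoundError:
    restore_failed = True

print("restore failed:", restore_failed)
```

In this model, any reference to a checkpoint that outlives the retention window produces the same kind of error as the one reported here. Whether that is what actually happens inside Tune is not confirmed by this issue.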

Ray version and other system information (Python version, TensorFlow version, OS):

  • ray 0.9.0.dev0 (b95e28f)
  • Python 3.7.5
  • Torch 1.5.0
  • Ubuntu 18.04

Failure # 1 (occurred at 2020-05-13_20-31-02)
Traceback (most recent call last):
  File "[...]/ray/tune/ray_trial_executor.py", line 295, in start_trial
    self._start_trial(trial, checkpoint, train=train)
  File "[...]/ray/tune/ray_trial_executor.py", line 235, in _start_trial
    self.restore(trial, checkpoint)
  File "[...]/ray/tune/ray_trial_executor.py", line 675, in restore
    data_dict = TrainableUtil.pickle_checkpoint(value)
  File "[...]/ray/tune/trainable.py", line 33, in pickle_checkpoint
    checkpoint_dir = TrainableUtil.find_checkpoint_dir(checkpoint_path)
  File "[...]/ray/tune/trainable.py", line 57, in find_checkpoint_dir
    raise FileNotFoundError("Path does not exist", checkpoint_path)
FileNotFoundError: [Errno Path does not exist] [...]/checkpoint_110/checkpoint-110
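A side note on the odd-looking `[Errno Path does not exist]` prefix in the last line: it follows from how Python formats `OSError` subclasses when given two positional arguments (the first is treated as `errno`, the second as `strerror`). The short sketch below reproduces the format with a made-up path. The path and variable names are illustrative only.

```python
import errno
import os

# Placeholder path for illustration; not the real path from the traceback.
path = "/tmp/checkpoint_110/checkpoint-110"

# Two positional arguments: the message lands in `errno`, the path in `strerror`.
err = FileNotFoundError("Path does not exist", path)
message = str(err)
print(message)

# The conventional form passes a real errno code and sets `filename` explicitly.
conventional = FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), path)
print(conventional)
```

So the text after `[Errno ...]` in the traceback is the checkpoint path itself, not an error description.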

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

I can't provide a reproduction script for now, as I don't have the time or compute to run the PBT examples, but this is my PBT configuration:

    import random

    import numpy as np
    from ray.tune.schedulers import PopulationBasedTraining

    def explore(config):
        # Clip perturbed values back into their valid ranges.
        config["gamma"] = np.clip(config["gamma"], 0.5, 1.0)
        config["clip_param"] = np.clip(config["clip_param"], 0.05, 0.5)
        config["lambda"] = np.clip(config["lambda"], 0.0, 1.0)
        config["lr"] = np.clip(config["lr"], 1e-8, 1e-2)
        return config

    pbt = PopulationBasedTraining(
        time_attr="time_total_s",
        metric="evaluation/custom_metrics/perf_100_mean",
        mode="max",
        perturbation_interval=60,
        resample_probability=0.25,
        # Specifies the mutations of these hyperparams
        hyperparam_mutations={
            "lr": lambda: random.uniform(1e-6, 1e-3),
            "lambda": lambda: random.uniform(0.0, 1.0),
            "clip_param": lambda: random.uniform(0.1, 0.4),
            "num_sgd_iter": lambda: random.randint(1, 14),
            "vf_loss_coeff": lambda: random.uniform(1e-6, 1.0),
            "gamma": lambda: random.uniform(0.5, 1),
            "vf_clip_param": lambda: random.uniform(1, 1000)
        },
        custom_explore_fn=explore)

tune config:

    name="pbt_coverage_coop",
    local_dir="[...]/ray_results",
    scheduler=pbt,
    num_samples=16,
    checkpoint_freq=5,
    keep_checkpoints_num=3,

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Labels: P2 (Important issue, but not time-critical), bug (Something that is supposed to be working; but isn't)