
[tune] trainable restore_from_object fails with FileNotFound #8772

@talipini


What is the problem?

When a trainable calls restore_from_object, the restore fails with FileNotFound because of a path mismatch. After unpickling the checkpoint object, the method writes each file directly under tmpdir (see the snippet below) but then passes tmpdir/checkpoint to restore. Shouldn't the files be written under tmpdir/checkpoint instead?
The current code does this:
path = os.path.join(tmpdir, relpath_name)  # <-- written to tmpdir; should this be tmpdir/checkpoint?
...
self.restore(checkpoint_path)  # <-- passes tmpdir/checkpoint to restore

Full function snippet below.

Ray version and other system information (Python version, TensorFlow version, OS):
ray - 0.9.0.dev0
python - 3.7.7
TF - 2.2.0

Reproduction (REQUIRED)

# From the Trainable class
def restore_from_object(self, obj):
    """Restores training state from a checkpoint object.

    These checkpoints are returned from calls to save_to_object().
    """
    info = pickle.loads(obj)
    data = info["data"]
    tmpdir = tempfile.mkdtemp("restore_from_object", dir=self.logdir)
    checkpoint_path = os.path.join(tmpdir, info["checkpoint_name"])

    for relpath_name, file_contents in data.items():
        path = os.path.join(tmpdir, relpath_name)  # <-- written to tmpdir

        # This may be a subdirectory, hence not just using tmpdir
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(file_contents)

    self.restore(checkpoint_path)  # <-- passes tmpdir/checkpoint to restore
    shutil.rmtree(tmpdir)
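
To make the suspected mismatch easier to see outside of Ray, here is a minimal standalone sketch of the same write-then-restore path logic. The helper names unpack_checkpoint and demo, the write_subdir parameter, and the sample payload (a file named "weights.bin" with a checkpoint name of "checkpoint") are all assumptions made for illustration; they are not part of Ray's API, and the real contents of info["data"] and info["checkpoint_name"] may differ.

```python
import os
import tempfile


def unpack_checkpoint(data, checkpoint_name, base_dir, write_subdir=""):
    """Write checkpoint files and return the path that restore() would get.

    Hypothetical helper mirroring the loop in restore_from_object; it is
    not part of Ray. Files go to base_dir/write_subdir/<relpath_name>,
    while the returned path is always base_dir/checkpoint_name.
    """
    checkpoint_path = os.path.join(base_dir, checkpoint_name)
    for relpath_name, file_contents in data.items():
        path = os.path.join(base_dir, write_subdir, relpath_name)
        # relpath_name may contain subdirectories, so create them first
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(file_contents)
    return checkpoint_path


def demo():
    """Return (exists_as_written_now, exists_with_subdir) for a made-up payload."""
    data = {"weights.bin": b"\x00\x01"}  # assumed payload, for illustration only
    name = "checkpoint"                  # assumed value of info["checkpoint_name"]

    with tempfile.TemporaryDirectory() as tmpdir:
        # Mirrors the current behaviour: files land directly in tmpdir
        current = os.path.exists(unpack_checkpoint(data, name, tmpdir))

    with tempfile.TemporaryDirectory() as tmpdir:
        # Variant asked about in this issue: write under tmpdir/checkpoint
        proposed = os.path.exists(
            unpack_checkpoint(data, name, tmpdir, write_subdir=name)
        )

    return current, proposed


if __name__ == "__main__":
    print(demo())
```

Under these assumptions, the first check reports that the path handed to restore does not exist, and the second reports that it does. Whether writing under tmpdir/checkpoint is the right fix depends on how save_to_object builds the relative paths, which this sketch does not model.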

If we cannot run your script, we cannot fix your issue.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Labels: P2 (important issue, but not time-critical); bug (something that is supposed to be working, but isn't); tune (Tune-related issues)
