[tune] trainable restore_from_object fails with FileNotFound #8772
Description
What is the problem?
When a trainable tries to restore_from_object, restoration fails because the checkpoint file ends up at the wrong path. After unpickling the checkpoint object, the method writes each file directly under tmpdir (see code snippet below) but then passes tmpdir/<checkpoint dir> to the restore function. Shouldn't the write happen under tmpdir/<checkpoint dir> instead?
The current code does this:

    path = os.path.join(tmpdir, relpath_name)  # <-- written to tmpdir; should this be tmpdir/<checkpoint dir>?
    ...
    self.restore(checkpoint_path)  # <-- passes tmpdir/<checkpoint dir> to restore
Full function snippet below.
Ray version and other system information (Python version, TensorFlow version, OS):
ray - 0.9.0.dev0
python - 3.7.7
TF - 2.2.0
Reproduction (REQUIRED)
    # From the Trainable class
    def restore_from_object(self, obj):
        """Restores training state from a checkpoint object.

        These checkpoints are returned from calls to save_to_object().
        """
        info = pickle.loads(obj)
        data = info["data"]
        tmpdir = tempfile.mkdtemp("restore_from_object", dir=self.logdir)
        checkpoint_path = os.path.join(tmpdir, info["checkpoint_name"])

        for relpath_name, file_contents in data.items():
            path = os.path.join(tmpdir, relpath_name)  # <-- written to tmpdir
            # This may be a subdirectory, hence not just using tmpdir
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as f:
                f.write(file_contents)

        self.restore(checkpoint_path)  # <-- passing tmpdir/<checkpoint dir> to restore
        shutil.rmtree(tmpdir)
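The path mismatch can be demonstrated in isolation. Below is a minimal, self-contained sketch; the checkpoint_name and the data key are hypothetical values chosen to mirror the report, and they assume the keys in info["data"] are relative to the checkpoint directory (the actual keys produced by save_to_object() may differ):

```python
import os
import tempfile

# Hypothetical values mirroring the report; actual values produced by
# save_to_object() may differ.
tmpdir = tempfile.mkdtemp("restore_from_object")
checkpoint_name = os.path.join("checkpoint_10", "checkpoint")  # assumed name
data = {"checkpoint": b"fake-checkpoint-bytes"}                # assumed key

checkpoint_path = os.path.join(tmpdir, checkpoint_name)

for relpath_name, file_contents in data.items():
    # Written directly under tmpdir, as in the current code.
    path = os.path.join(tmpdir, relpath_name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(file_contents)

# restore() would then look under tmpdir/checkpoint_10/, which was never
# created if the data keys are relative to the checkpoint directory.
print(os.path.exists(checkpoint_path))  # False
```

If the keys are instead relative to tmpdir (i.e. they already carry the checkpoint_10/ prefix), the write and restore paths line up and the current code would be correct, so the answer hinges on how save_to_object() computes the relative paths.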
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.