[tune/raysgd] Tune API for TorchTrainer + Fix State Restoration#7547

Merged
edoakes merged 198 commits into ray-project:master from richardliaw:improve_tune_trainable on Mar 30, 2020

@richardliaw richardliaw commented Mar 11, 2020

Why are these changes needed?

This PR introduces two changes:

  1. A new API for tuning TorchTrainer, built on Ray Tune.
  2. A new API for save/restore. Instead of handling serialization inside the class, the trainer now exposes its state.
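
To illustrate the second point, here is a minimal sketch of the "expose state" pattern. It uses a hypothetical `ToyTrainer` class and hypothetical `save_checkpoint` / `restore_checkpoint` helpers; none of these names come from this PR, and this is not the TorchTrainer implementation. The idea is only that the trainer returns and accepts a plain state dictionary, and the caller decides how to serialize it.

```python
import json
import os


class ToyTrainer:
    """Hypothetical stand-in for a trainer; not the RaySGD TorchTrainer."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.weight = 0.0
        self.epoch = 0

    def train(self):
        # Pretend gradient step that moves the weight toward a target of 1.0.
        self.weight += self.lr * (1.0 - self.weight)
        self.epoch += 1
        return {"epoch": self.epoch, "weight": self.weight}

    def state_dict(self):
        # Expose state as a plain dict instead of writing files from inside the class.
        return {"lr": self.lr, "weight": self.weight, "epoch": self.epoch}

    def load_state_dict(self, state):
        # Restore every field that state_dict() exposes.
        self.lr = state["lr"]
        self.weight = state["weight"]
        self.epoch = state["epoch"]


def save_checkpoint(trainer, checkpoint_dir):
    # Serialization lives outside the trainer; JSON is an arbitrary choice here.
    path = os.path.join(checkpoint_dir, "trainer_state.json")
    with open(path, "w") as f:
        json.dump(trainer.state_dict(), f)
    return path


def restore_checkpoint(trainer, path):
    # Read the saved dict back and hand it to the trainer.
    with open(path) as f:
        trainer.load_state_dict(json.load(f))
```

With this split, an outer framework (a tuning library, for example) can pick the checkpoint location and format without the trainer class knowing about either.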

Thoughts:

There's quite a bit of layering: you need to create about 6 different creator functions to get TorchTrainer to work with PBT.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23856/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23859/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23868/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23869/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23870/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23871/


.. code-block:: python

    trainer.train(max_retries=N, checkpoint="auto")
Contributor Author

This API is now removed.

Comment on lines +452 to +455
mean_train_loss1 = df.loc[0, "train_loss"]
mean_train_loss2 = df.loc[1, "train_loss"]
mean_val_loss1 = df.loc[0, "val_loss"]
mean_val_loss2 = df.loc[1, "val_loss"]
Contributor Author

The `mean_` prefix is now removed; it was too confusing imo.

Comment on lines -145 to -152
cpu_state_dicts = []
for model in self.models:
    state_dict = model.module.state_dict()
    # This is so that we create a duplicate of weights into CPU rather
    # than move the model weights out of the GPU so that we can
    # resume training while saving intermediate checkpoints.
    cpu_state_dicts += [{k: v.cpu() for k, v in state_dict.items()}]
return cpu_state_dicts
Contributor Author

We removed this because you can now use the state_dict from the Local Runner directly, without copying it to the CPU.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23873/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23874/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23886/

@richardliaw changed the title from "[tune/raysgd] Tune for TorchTrainer + Fix State Restoration" to "[tune/raysgd] Tune API for TorchTrainer + Fix State Restoration" on Mar 29, 2020

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23890/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23904/

@richardliaw added the "tests-ok" label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Mar 29, 2020

@maximsmol
Contributor

lgtm! I have no particular comments

@edoakes edoakes left a comment

Deferring to @maximsmol

@edoakes edoakes merged commit 86cff17 into ray-project:master Mar 30, 2020