[tune/raysgd] Tune API for TorchTrainer + Fix State Restoration #7547
edoakes merged 198 commits into ray-project:master from
Conversation
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
.. code-block:: python

    trainer.train(max_retries=N, checkpoint="auto")
This API is now removed.
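For context, the retry-with-checkpoint behavior that `trainer.train(max_retries=N, checkpoint="auto")` described can be sketched as a plain retry loop. This is only a sketch: `train_with_retries`, `train_step`, and `restore` are hypothetical names, not the Ray API.

```python
def train_with_retries(train_step, restore, max_retries=3):
    """Run train_step, restoring from the last checkpoint and retrying on failure."""
    for attempt in range(max_retries + 1):
        try:
            return train_step()
        except RuntimeError:
            if attempt == max_retries:
                raise  # retries exhausted; surface the failure
            restore()  # roll back to the last saved checkpoint before retrying
```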
mean_train_loss1 = df.loc[0, "train_loss"]
mean_train_loss2 = df.loc[1, "train_loss"]
mean_val_loss1 = df.loc[0, "val_loss"]
mean_val_loss2 = df.loc[1, "val_loss"]
The `mean` prefix is now removed; it was too confusing, imo.
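For context, a minimal sketch of how such a per-epoch stats DataFrame could be built and indexed. The column names match the diff above; the surrounding setup (the stats values and how they are collected) is assumed for illustration.

```python
import pandas as pd

# Hypothetical per-epoch stats, e.g. collected from successive train/validate calls.
stats = [
    {"train_loss": 0.90, "val_loss": 1.10},
    {"train_loss": 0.45, "val_loss": 0.70},
]
df = pd.DataFrame(stats)

# Row index selects the epoch, column name selects the metric.
train_loss1 = df.loc[0, "train_loss"]
val_loss2 = df.loc[1, "val_loss"]
```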
cpu_state_dicts = []
for model in self.models:
    state_dict = model.module.state_dict()
    # Copy the weights to CPU rather than moving them out of the GPU,
    # so that training can resume while intermediate checkpoints are saved.
    cpu_state_dicts += [{k: v.cpu() for k, v in state_dict.items()}]
return cpu_state_dicts
We removed this because you can now use the state_dict from the Local Runner directly, without copying it to CPU.
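The CPU-copy pattern being removed can be illustrated standalone. This is a sketch assuming a plain torch module; the `model.module` indirection from the distributed wrapper in the diff above is omitted.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
state_dict = model.state_dict()

# .cpu() copies a GPU tensor to host memory (and is a no-op for a tensor
# already on the CPU), so the resulting dict can be checkpointed without
# disturbing the live model parameters.
cpu_state_dict = {k: v.cpu() for k, v in state_dict.items()}
```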
Test PASSed.
|
lgtm! I have no particular comments.
edoakes left a comment
Deferring to @maximsmol
Why are these changes needed?
This PR introduces two new changes.
Thoughts:
There's quite a bit of layering; you need to create something like six different creator functions to get TorchTrainer to work with PBT.
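The layering being described can be sketched roughly as follows. This is a sketch only: the creator names and signatures are assumptions modeled on the raysgd-era pattern and may not match the final TorchTrainer interface.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset

# Each creator receives the (possibly PBT-mutated) config and builds one piece.
def model_creator(config):
    return nn.Linear(config.get("in_features", 4), 1)

def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=config.get("lr", 0.01))

def data_creator(config):
    x = torch.randn(8, config.get("in_features", 4))
    y = torch.randn(8, 1)
    return TensorDataset(x, y)

def loss_creator(config):
    return nn.MSELoss()

# Wiring them together by hand shows the layering the comment refers to:
config = {"lr": 0.05}
model = model_creator(config)
optimizer = optimizer_creator(model, config)
dataset = data_creator(config)
loss_fn = loss_creator(config)
```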
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.