[raysgd] Custom training operator#7211
Conversation
…o fp16_pytorch
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
Test FAILed.
maximsmol
left a comment
Suggesting some documentation clarification.
edoakes
left a comment
Looks good! Sorry for the delay on reviewing.
Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>
Test FAILed.
Test PASSed.
Why are these changes needed?
Provides an abstraction for stateful custom training loops. See the requirements doc here:
https://docs.google.com/document/d/1Cjp8Xtu9dG_vOO9yyrEIGL2MypgSsJ5wApzBa_BkUOQ/edit#
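To make the idea concrete, here is a minimal sketch of the stateful training-operator pattern this PR is after. All class and method names here are illustrative assumptions, not the actual RaySGD API; the "training" is a toy update so the example is self-contained.

```python
# Hypothetical sketch of a stateful custom training operator.
# Names (CustomTrainingOperator, train_epoch) are illustrative only.

class CustomTrainingOperator:
    """Owns model/optimizer state across epochs of a custom training loop."""

    def __init__(self):
        # Stateful setup: in a real operator this would build the model,
        # optimizer, and data loaders. Here we keep a toy scalar weight.
        self.epoch = 0
        self.weights = 0.0

    def train_epoch(self, iterator, info=None):
        # One full pass over the data; returns metrics for this epoch.
        loss = 0.0
        for batch in iterator:
            # Toy "gradient step": nudge the weight toward the batch value.
            self.weights += 0.1 * (batch - self.weights)
            loss += abs(batch - self.weights)
        self.epoch += 1
        return {"epoch": self.epoch, "loss": loss}


op = CustomTrainingOperator()
stats = op.train_epoch(iter([1.0, 2.0, 3.0]))
```

Because the operator instance holds the state, a second call to train_epoch continues from the weights and epoch counter left by the first, which is the property a plain functional training API cannot provide.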
Need to:
This PR leaves a couple of usability stories to do:
train_epoch(num_steps) should let you intercept training during an epoch for custom logic such as checkpointing. Do we continue training from the point we left off?
Related issue number
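The mid-epoch interception question above can be sketched as follows. This is a hypothetical illustration of the "resume from where we left off" behavior, not the RaySGD implementation; SteppableOperator and its members are made-up names, and the checkpoint is just a dict.

```python
# Hypothetical sketch: train_epoch(num_steps) runs part of an epoch,
# the caller checkpoints, and a later call resumes from the same
# iterator position rather than restarting the epoch.

class SteppableOperator:
    def __init__(self, data):
        # Keep the epoch's iterator on the operator so its position
        # survives between train_epoch calls.
        self._iter = iter(data)
        self.steps_taken = 0

    def train_epoch(self, num_steps=None):
        # Run at most num_steps batches; with num_steps=None, run to the
        # end of the epoch. Returns how many steps were executed.
        done = 0
        for batch in self._iter:
            self.steps_taken += 1
            done += 1
            if num_steps is not None and done >= num_steps:
                break
        return done


op = SteppableOperator(range(10))
op.train_epoch(num_steps=4)            # train part of the epoch
checkpoint = {"steps": op.steps_taken}  # e.g. save a checkpoint here
op.train_epoch()                        # resumes the remaining 6 steps
```

Under this design the answer to the question is "yes": because the iterator lives on the operator, the second train_epoch call continues from step 5 instead of restarting, at the cost of the operator having to track partial-epoch state.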
Checks
I've run scripts/format.sh to lint the changes in this PR.