[raysgd] Custom training operator #7211

Merged
richardliaw merged 67 commits into ray-project:master from richardliaw:raysgd-operator
Mar 2, 2020

Conversation

@richardliaw
Contributor

@richardliaw richardliaw commented Feb 18, 2020

Why are these changes needed?

Provide an abstraction for stateful custom training loops. See the requirements doc:

https://docs.google.com/document/d/1Cjp8Xtu9dG_vOO9yyrEIGL2MypgSsJ5wApzBa_BkUOQ/edit#

Need to:

  • write documentation
  • write tests

This PR leaves a few usability questions open:

  1. train_epoch(num_steps) should let you interrupt training partway through an epoch for custom actions such as checkpointing. Does the next call resume from the point where it left off?
  2. If each of my 100 batches returns a loss, how do I aggregate them?
  3. In the current interface, custom learning rate schedulers are unwieldy. Is there a cleaner way to handle them?
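
To make questions 1 and 2 concrete, here is a minimal sketch of one possible answer. It is not the RaySGD API: `ResumableOperator`, `train_batch`, `epochs_completed`, and the returned keys are hypothetical names, and the "loss" is a stand-in (the batch mean) rather than a real forward/backward pass. It assumes that a partial `train_epoch(num_steps)` call resumes where the previous call stopped, and that per-batch losses are combined as a sample-weighted mean.

```python
from typing import List, Optional

Batch = List[float]


class ResumableOperator:
    """Hypothetical stateful training operator (illustration only)."""

    def __init__(self, batches: List[Batch]):
        self.batches = batches
        self._position = 0          # index of the next batch in the epoch
        self.epochs_completed = 0

    def train_batch(self, batch: Batch) -> dict:
        # Stand-in for a real training step: the "loss" is the batch mean.
        return {"loss": sum(batch) / len(batch), "num_samples": len(batch)}

    def train_epoch(self, num_steps: Optional[int] = None) -> dict:
        # Run at most num_steps batches, starting where the last call stopped.
        end = len(self.batches)
        if num_steps is not None:
            end = min(self._position + num_steps, end)

        total_loss, total_samples, steps = 0.0, 0, 0
        for batch in self.batches[self._position:end]:
            stats = self.train_batch(batch)
            # Weight each batch loss by its size so small batches
            # do not skew the aggregate.
            total_loss += stats["loss"] * stats["num_samples"]
            total_samples += stats["num_samples"]
            steps += 1

        self._position = end
        if self._position == len(self.batches):
            # Epoch finished: reset so the next call starts a new epoch.
            self._position = 0
            self.epochs_completed += 1

        mean_loss = total_loss / total_samples if total_samples else float("nan")
        return {"mean_loss": mean_loss, "num_samples": total_samples, "steps": steps}


if __name__ == "__main__":
    op = ResumableOperator([[1.0, 1.0], [4.0], [2.0, 2.0, 2.0]])
    print(op.train_epoch(num_steps=2))  # first two batches only
    # A checkpoint or other custom action could run here, mid-epoch.
    print(op.train_epoch())             # resumes with the remaining batch
```

With this design, the caller decides when to interrupt, and the operator keeps the position as state. A plain unweighted average of per-batch losses would be simpler, but it gives a different number whenever batch sizes differ (here 2.5 rather than 2.0 for the first two batches).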

Related issue number

Checks

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22451/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22469/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22471/
Test FAILed.

Contributor

@maximsmol maximsmol left a comment

Suggesting some documentation clarification.

Collaborator

@edoakes edoakes left a comment

Looks good! Sorry for the delay on reviewing.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22572/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22576/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22573/
Test FAILed.

Collaborator

@edoakes edoakes left a comment

LGTM

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22594/
Test FAILed.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22602/
Test PASSed.

@richardliaw richardliaw merged commit 48cdca8 into ray-project:master Mar 2, 2020
@richardliaw richardliaw deleted the raysgd-operator branch March 2, 2020 05:22
ffbin pushed a commit to antgroup/ant-ray that referenced this pull request Mar 20, 2020
4 participants