[raysgd] Custom training operator by richardliaw · Pull Request #7211 · ray-project/ray

richardliaw · 2020-02-18T23:26:00Z

Why are these changes needed?

Provide an abstraction for stateful custom training loops. See the requirements doc here:

https://docs.google.com/document/d/1Cjp8Xtu9dG_vOO9yyrEIGL2MypgSsJ5wApzBa_BkUOQ/edit#

Need to:

write documentation
write tests

This PR leaves a few usability stories open:

train_epoch(num_steps) should let you interrupt training partway through an epoch for custom logic such as checkpointing. Should the next call continue training from the point where it left off?
If each of my 100 batches returns a loss, how do I aggregate them?
In the current interface, custom learning rate schedulers are unwieldy. Is there a clear way of handling them?
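
To make the first two questions concrete, here is a minimal sketch of one possible answer. It is not the RaySGD API and not what this PR implements: SketchOperator, aggregate, make_batches, and train_batch are hypothetical names invented for illustration. The sketch assumes that train_epoch(num_steps) keeps its data iterator between calls so that a later call resumes mid-epoch, and that per-batch losses are combined as a sample-weighted mean.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional


def aggregate(results: List[Dict]) -> Dict:
    """Combine per-batch results into a sample-weighted mean loss."""
    total = sum(r["num_samples"] for r in results)
    if total == 0:
        return {"mean_loss": None, "num_samples": 0, "steps": len(results)}
    weighted = sum(r["loss"] * r["num_samples"] for r in results)
    return {"mean_loss": weighted / total, "num_samples": total, "steps": len(results)}


class SketchOperator:
    """Hypothetical stateful operator; illustrates resumable epochs only."""

    def __init__(
        self,
        make_batches: Callable[[], Iterable],
        train_batch: Callable[[object], Dict],
    ):
        self._make_batches = make_batches  # returns a fresh iterable per epoch
        self._train_batch = train_batch    # batch -> {"loss": ..., "num_samples": ...}
        self._iterator: Optional[Iterator] = None
        self.epochs_completed = 0

    def train_epoch(self, num_steps: Optional[int] = None) -> Dict:
        """Run up to num_steps batches, resuming where the last call stopped."""
        if self._iterator is None:
            self._iterator = iter(self._make_batches())
        results: List[Dict] = []
        epoch_finished = False
        while num_steps is None or len(results) < num_steps:
            try:
                batch = next(self._iterator)
            except StopIteration:
                # Data exhausted: the next call starts a new epoch.
                self._iterator = None
                self.epochs_completed += 1
                epoch_finished = True
                break
            results.append(self._train_batch(batch))
        stats = aggregate(results)
        stats["epoch_finished"] = epoch_finished
        return stats


# Toy usage: each "batch" is a list of numbers and its loss is their mean.
op = SketchOperator(
    make_batches=lambda: [[1.0, 1.0], [2.0, 2.0], [3.0]],
    train_batch=lambda b: {"loss": sum(b) / len(b), "num_samples": len(b)},
)
first = op.train_epoch(num_steps=2)  # stop early, e.g. to write a checkpoint
rest = op.train_epoch()              # continues with the remaining batch
```

Weighting by num_samples keeps a short final batch from skewing the mean. One consequence of this design is that a call which stops exactly on the last batch only learns the epoch is over on the following call, which then reports zero steps.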

Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://ray.readthedocs.io/en/latest/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.


AmplabJenkins · 2020-02-26T23:13:40Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22451/
Test FAILed.

AmplabJenkins · 2020-02-27T02:05:12Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22469/
Test FAILed.

AmplabJenkins · 2020-02-27T02:07:17Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22471/
Test FAILed.

maximsmol

Suggesting some documentation clarification.

doc/source/raysgd/raysgd_pytorch.rst

edoakes

Looks good! Sorry for the delay on reviewing.

doc/source/raysgd/raysgd_pytorch.rst

python/ray/util/sgd/pytorch/examples/dcgan.py


AmplabJenkins · 2020-02-29T01:05:50Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22572/
Test FAILed.

AmplabJenkins · 2020-02-29T01:47:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22576/
Test FAILed.

AmplabJenkins · 2020-02-29T03:38:03Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22573/
Test FAILed.

edoakes

LGTM

AmplabJenkins · 2020-03-02T00:41:51Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22594/
Test FAILed.

AmplabJenkins · 2020-03-02T02:55:14Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/22602/
Test PASSed.

richardliaw and others added 30 commits February 5, 2020 09:24

Init fp16

df57735

fp16 and schedulers

2342e76

scheduler linking and fp16

49abded

to fp16

cc4465f

loss scaling and documentation

00f9926

more documentation

bee750a

add tests, refactor config

80e43a8

moredocs

9a89b89

more docs

1d20075

fix logo, add test mode, add fp16 flag

3e98f6d

fix tests

168b813

fix scheduler

9a77e49

fix apex

fa1eaaa

improve safety

bea67f6

fix tests

420a2f2

fix tests

f53e9e6

remove pin memory default

fce88a5

rm

39f5328

fix

f8caf4b

Update doc/examples/doc_code/raysgd_torch_signatures.py

9d76e21

fix

b7ca982

Merge branch 'fp16_pytorch' of https://github.com/richardliaw/ray int…

c278e9b

…o fp16_pytorch

migrate changes from other PR

dff9521

ok thanks

20428a7

pass

62ccf6d

signatures

0c388fa

lint'

beeae3c

Update python/ray/experimental/sgd/pytorch/utils.py

748a553

Apply suggestions from code review

acacee0

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

should address most comments

daf5794

richardliaw added 2 commits February 26, 2020 14:29

ok done

5b64b32

operator

7ecb2d3

richardliaw added 4 commits February 26, 2020 16:53

sgd test fixes

df76199

ok

0354b8f

trainer

848ef30

format

1d83e80

richardliaw assigned edoakes Feb 27, 2020

maximsmol requested changes Feb 27, 2020


edoakes reviewed Feb 28, 2020


richardliaw and others added 6 commits February 28, 2020 16:07

Apply suggestions from code review

6023dc3

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

Update doc/source/raysgd/raysgd_pytorch.rst

cc08549

Merge branch 'master' into raysgd-operator

19ef379

docstring

c9819b5

dcgan

5c1b8d6

doc

8688c02

edoakes approved these changes Mar 1, 2020


fix-test

e53f362

test

fd3d2d2

richardliaw merged commit 48cdca8 into ray-project:master Mar 2, 2020

richardliaw deleted the raysgd-operator branch March 2, 2020 05:22

ffbin pushed a commit to antgroup/ant-ray that referenced this pull request Mar 20, 2020

[raysgd] Custom training operator (ray-project#7211)

a7dc444


4 participants
