[tune/raysgd] Tune API for TorchTrainer + Fix State Restoration by richardliaw · Pull Request #7547 · ray-project/ray

richardliaw · 2020-03-11T02:00:34Z

Why are these changes needed?

This PR introduces two changes:

- A new API for tuning TorchTrainer, built on Ray Tune.
- A new API for save/restore. Instead of capturing serialization within the class, we expose the trainer's state.
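
To illustrate the second point, here is a minimal sketch of the "expose state" pattern in plain Python. `SketchTrainer`, `save`, and `restore` are hypothetical names made up for this example; they are not RaySGD's actual classes or signatures. The idea shown is only that the object hands out and accepts a plain state dictionary, while the caller decides how and where to serialize it.

```python
import pickle


class SketchTrainer:
    """Hypothetical stand-in for a trainer; NOT the real TorchTrainer."""

    def __init__(self, lr):
        self.lr = lr
        self.epoch = 0
        self.weights = [0.0, 0.0]

    def train(self):
        # Toy "training step": shift every weight by the learning rate.
        self.weights = [w - self.lr for w in self.weights]
        self.epoch += 1

    def state_dict(self):
        # Expose state as plain data instead of writing files internally.
        return {"epoch": self.epoch, "weights": list(self.weights)}

    def load_state_dict(self, state):
        # Accept state produced by state_dict(), wherever it was stored.
        self.epoch = state["epoch"]
        self.weights = list(state["weights"])


def save(trainer, path):
    # Serialization lives outside the class; pickle is just one choice.
    with open(path, "wb") as f:
        pickle.dump(trainer.state_dict(), f)


def restore(trainer, path):
    with open(path, "rb") as f:
        trainer.load_state_dict(pickle.load(f))
```

Because the state is ordinary data, an outer layer (for example a tuning framework) can checkpoint one instance and load that state into a freshly constructed one.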

Thoughts:

There's quite a bit of layering. You need to write about six different creator functions to get TorchTrainer to work with PBT.

Related issue number

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://ray.readthedocs.io/en/latest/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested (please justify below)

AmplabJenkins · 2020-03-28T05:26:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23856/
Test FAILed.

AmplabJenkins · 2020-03-28T05:27:08Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23859/
Test FAILed.

AmplabJenkins · 2020-03-28T16:58:24Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23868/
Test FAILed.

AmplabJenkins · 2020-03-28T18:38:17Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23869/
Test FAILed.

AmplabJenkins · 2020-03-28T18:39:26Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23870/
Test FAILed.

AmplabJenkins · 2020-03-28T19:19:43Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23871/
Test FAILed.

richardliaw · 2020-03-28T19:55:00Z

doc/source/raysgd/raysgd_pytorch.rst

-
-.. code-block:: python
-
-    trainer.train(max_retries=N, checkpoint="auto")


This API is now removed.

richardliaw · 2020-03-28T19:55:25Z

python/ray/util/sgd/tests/test_torch.py

+        mean_train_loss1 = df.loc[0, "train_loss"]
+        mean_train_loss2 = df.loc[1, "train_loss"]
+        mean_val_loss1 = df.loc[0, "val_loss"]
+        mean_val_loss2 = df.loc[1, "val_loss"]


The mean prefix is now removed from the metric names (e.g. "train_loss"); it was too confusing imo.

richardliaw · 2020-03-28T19:55:58Z

python/ray/util/sgd/torch/distributed_torch_runner.py

-        cpu_state_dicts = []
-        for model in self.models:
-            state_dict = model.module.state_dict()
-            # This is so that we create a duplicate of weights into CPU rather
-            # than move the model weights out of the GPU so that we can
-            # resume training while saving intermediate checkpoints.
-            cpu_state_dicts += [{k: v.cpu() for k, v in state_dict.items()}]
-        return cpu_state_dicts


We removed this because you can now just use the state_dict from the Local Runner without copying it to the CPU.
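
For context, the removed pattern can be sketched in isolation as follows. This is a standalone illustration using only public PyTorch calls; the helper name `cpu_state_dict` is hypothetical, and the original code went through `model.module` because the models were wrapped for distributed training.

```python
import torch
from torch import nn


def cpu_state_dict(model: nn.Module) -> dict:
    """Return a state_dict whose tensors all live in CPU memory."""
    # state_dict() returns detached tensors. For GPU tensors, .cpu() makes
    # a host-side copy and leaves the model's own weights on the GPU.
    # For tensors already on the CPU, .cpu() returns the same tensor
    # without copying.
    return {k: v.cpu() for k, v in model.state_dict().items()}


model = nn.Linear(4, 2)
snapshot = cpu_state_dict(model)
```

On a CPU-only machine the snapshot shares storage with the live weights, so a caller that needs an independent copy there would have to clone the tensors explicitly.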

AmplabJenkins · 2020-03-28T20:15:27Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23873/
Test FAILed.

AmplabJenkins · 2020-03-28T21:02:57Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23874/
Test FAILed.

AmplabJenkins · 2020-03-29T02:51:33Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23886/
Test FAILed.

AmplabJenkins · 2020-03-29T05:34:08Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23890/
Test PASSed.

AmplabJenkins · 2020-03-29T19:43:42Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23904/
Test PASSed.

maximsmol · 2020-03-30T17:16:53Z

lgtm! I have no particular comments

edoakes

Deferring to @maximsmol

richardliaw and others added 30 commits February 5, 2020 09:24

Init fp16

df57735

fp16 and schedulers

2342e76

scheduler linking and fp16

49abded

to fp16

cc4465f

loss scaling and documentation

00f9926

more documentation

bee750a

add tests, refactor config

80e43a8

moredocs

9a89b89

more docs

1d20075

fix logo, add test mode, add fp16 flag

3e98f6d

fix tests

168b813

fix scheduler

9a77e49

fix apex

fa1eaaa

improve safety

bea67f6

fix tests

420a2f2

fix tests

f53e9e6

remove pin memory default

fce88a5

rm

39f5328

fix

f8caf4b

Update doc/examples/doc_code/raysgd_torch_signatures.py

9d76e21

fix

b7ca982

Merge branch 'fp16_pytorch' of https://github.com/richardliaw/ray int…

c278e9b

…o fp16_pytorch

migrate changes from other PR

dff9521

ok thanks

20428a7

pass

62ccf6d

signatures

0c388fa

lint'

beeae3c

Update python/ray/experimental/sgd/pytorch/utils.py

748a553

Apply suggestions from code review

acacee0

Co-Authored-By: Edward Oakes <ed.nmi.oakes@gmail.com>

should address most comments

daf5794

lint

f359e97

richardliaw added 3 commits March 28, 2020 10:46

nice

a9e826e

done

866205c

retries

5ddae21

richardliaw added 2 commits March 28, 2020 12:43

fixes

04e0432

kill

e7b59a9

richardliaw commented Mar 28, 2020

retry

88d8537

unwrap

7c4ad3c

richardliaw changed the title ~~[tune/raysgd] Tune for TorchTrainer + Fix State Restoration~~ [tune/raysgd] Tune API for TorchTrainer + Fix State Restoration Mar 29, 2020

fixtest

387ceab

none

cf1ce07

richardliaw added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") Mar 29, 2020

richardliaw assigned edoakes Mar 29, 2020

edoakes approved these changes Mar 30, 2020

edoakes merged commit 86cff17 into ray-project:master Mar 30, 2020

