[tune] Use newest checkpoint in normal operation by ujvl · Pull Request #7563 · ray-project/ray

ujvl · 2020-03-11T16:15:25Z

Instead of choosing which trial checkpoint to use based on whether the trial is PAUSED, we should choose based on whether it is in ERROR. When the trial is not in ERROR, use the newest checkpoint (whether it is in-memory or persistent).

This fixes the bug in #7528 (the case where a trial is unpaused but we still want to use the in-memory checkpoint).
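
The selection rule described in this PR can be sketched roughly as follows. This is a minimal illustration, not the code changed here: the `Checkpoint` dataclass, the `choose_checkpoint` function, and the status and storage strings are hypothetical stand-ins for Tune's internals. Comparing by training iteration, and falling back to the persistent checkpoint on ERROR, are assumptions drawn from the commit titles and the review discussion in this thread.

```python
from dataclasses import dataclass

# Hypothetical constants for illustration; not Tune's actual identifiers.
MEMORY = "memory"
PERSISTENT = "persistent"
ERROR = "ERROR"


@dataclass
class Checkpoint:
    storage: str
    iteration: int = -1  # assumed sentinel: -1 means no result is associated yet


def choose_checkpoint(status: str, in_memory: Checkpoint,
                      persistent: Checkpoint) -> Checkpoint:
    """Pick the checkpoint a trial should restore from."""
    if status == ERROR:
        # Assumption: after a failure, restore from the persistent checkpoint.
        return persistent
    # Otherwise take the newest checkpoint, whichever storage holds it.
    # On equal iterations max() returns the first argument (in-memory).
    return max(in_memory, persistent, key=lambda c: c.iteration)


mem = Checkpoint(MEMORY, iteration=3)
disk = Checkpoint(PERSISTENT, iteration=2)
print(choose_checkpoint("PENDING", mem, disk).storage)  # memory
print(choose_checkpoint(ERROR, mem, disk).storage)      # persistent
```

Under this rule an unpaused trial (status PENDING rather than PAUSED) still restores from the in-memory checkpoint whenever that checkpoint is the newest one.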

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
Unit tests

richardliaw · 2020-03-11T17:01:54Z

Can you add this test in?

This test explicitly uses unpause, which is a closer proxy to what happens in the custom schedulers.

    def testPauseUnpause(self):
        """Tests that unpausing works for trials being processed."""
        trial = Trial("__fake")
        self.trial_executor.start_trial(trial)
        self.assertEqual(Trial.RUNNING, trial.status)
        result = self.trial_executor.fetch_result(trial)
        assert result[TRAINING_ITERATION] == 1
        self.trial_executor.pause_trial(trial)
        self.assertEqual(Trial.PAUSED, trial.status)
        self.trial_executor.unpause_trial(trial)
        self.assertEqual(Trial.PENDING, trial.status)
        self.trial_executor.start_trial(trial)
        self.assertEqual(Trial.RUNNING, trial.status)
        result = self.trial_executor.fetch_result(trial)
        assert result[TRAINING_ITERATION] == 2
        self.trial_executor.stop_trial(trial)
        self.assertEqual(Trial.TERMINATED, trial.status)

AmplabJenkins · 2020-03-11T17:17:24Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23038/

AmplabJenkins · 2020-03-11T17:33:10Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23039/

AmplabJenkins · 2020-03-11T18:02:32Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23040/

AmplabJenkins · 2020-03-11T18:12:15Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23041/

AmplabJenkins · 2020-03-11T20:05:23Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23043/

AmplabJenkins · 2020-03-12T06:19:55Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23065/

ujvl · 2020-03-12T08:49:47Z

jenkins test tune

AmplabJenkins · 2020-03-12T09:40:31Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Tune-Tests/377/
Tune tests failed.

AmplabJenkins · 2020-03-12T18:34:29Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23104/

ujvl · 2020-03-12T19:02:11Z

jenkins test tune

AmplabJenkins · 2020-03-12T19:45:00Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Tune-Tests/378/
Tune tests failed.

richardliaw · 2020-03-13T00:34:35Z

python/ray/tune/tests/test_trial_runner_2.py


-        runner.step()
+        runner.step()  # Start trial
+        runner.step()  # Process result


btw, why do we need to process results here?

We want there to be a result associated with the checkpoint; otherwise we can't tell which checkpoint (persistent or in-memory) to return, since the choice is tied to the training iteration (-1 when there is no result).
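
To make the tie concrete: before any result has been processed, both checkpoints would carry the -1 sentinel, so a comparison on training iteration alone cannot pick one. A tiny sketch, where `newest` is a hypothetical helper and not part of Tune:

```python
from typing import Optional


def newest(memory_iter: int, persistent_iter: int) -> Optional[str]:
    """Return which checkpoint is newer by training iteration, or None on a tie."""
    if memory_iter == persistent_iter:
        return None
    return "memory" if memory_iter > persistent_iter else "persistent"


# No result processed yet: both iterations are -1, so the choice is ambiguous.
print(newest(-1, -1))  # None
# After one result is processed, the in-memory checkpoint is clearly newer.
print(newest(1, -1))   # memory
```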

ujvl added 2 commits March 11, 2020 09:08

Use persistent checkpoint for failures

208a2dc

Fix test

e08516b

ujvl added 2 commits March 11, 2020 10:11

Add unpause test

dfcf02e

move test

b106c78

Fix tests

7371294

remove debug statement

5247dc2

Mark test as flaky

81460cc

richardliaw reviewed Mar 13, 2020


richardliaw approved these changes Mar 13, 2020


richardliaw merged commit 6022eb5 into ray-project:master Mar 13, 2020

ujvl deleted the tune-chkpt branch March 13, 2020 05:35

richardliaw mentioned this pull request Mar 13, 2020

[tune] Hyperband Scheduler Not working #7011

Closed

arsedler9 mentioned this pull request Mar 17, 2020

Local cluster YAML no longer working in 0.9.0.dev0 #7632

Closed

2 tasks
