Skip to content

[tune] Use newest checkpoint in normal operation#7563

Merged
richardliaw merged 7 commits intoray-project:masterfrom
ujvl:tune-chkpt
Mar 13, 2020
Merged

[tune] Use newest checkpoint in normal operation#7563
richardliaw merged 7 commits intoray-project:masterfrom
ujvl:tune-chkpt

Conversation

@ujvl
Copy link
Copy Markdown
Contributor

@ujvl ujvl commented Mar 11, 2020

Instead of determining trial checkpoint to use by PAUSED vs not PAUSED, we should determine by ERROR vs not ERROR. In the latter case use the newest checkpoint (whether it is in-memory or persistent).

This fixes bug for #7528 (case where trial is unpaused but we still want to use the in-memory checkpoint).

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Unit tests

@richardliaw
Copy link
Copy Markdown
Contributor

richardliaw commented Mar 11, 2020

Can you add this test in?

This test explicitly uses unpause, which is a closer proxy to what happens in the custom schedulers.

    def testPauseUnpause(self):
        """Tests that unpausing works for trials being processed."""
        trial = Trial("__fake")
        self.trial_executor.start_trial(trial)
        self.assertEqual(Trial.RUNNING, trial.status)
        result = self.trial_executor.fetch_result(trial)
        assert result[TRAINING_ITERATION] == 1
        self.trial_executor.pause_trial(trial)
        self.assertEqual(Trial.PAUSED, trial.status)
        self.trial_executor.unpause_trial(trial)
        self.assertEqual(Trial.PENDING, trial.status)
        self.trial_executor.start_trial(trial)
        self.assertEqual(Trial.RUNNING, trial.status)
        result = self.trial_executor.fetch_result(trial)
        assert result[TRAINING_ITERATION] == 2
        self.trial_executor.stop_trial(trial)
        self.assertEqual(Trial.TERMINATED, trial.status)

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23038/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23039/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23040/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23041/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23043/
Test FAILed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23065/
Test FAILed.

@ujvl
Copy link
Copy Markdown
Contributor Author

ujvl commented Mar 12, 2020

jenkins test tune

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Tune-Tests/377/
Tune tests failed.

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/23104/
Test FAILed.

@ujvl
Copy link
Copy Markdown
Contributor Author

ujvl commented Mar 12, 2020

jenkins test tune

@AmplabJenkins
Copy link
Copy Markdown

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Tune-Tests/378/
Tune tests failed.


runner.step()
runner.step() # Start trial
runner.step() # Process result
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, why do we need to process results here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want there to be a result associated with the checkpoint, otherwise we can't tell which checkpoint (persistent or memory) to return since its tied on training iteration (-1 for no result).

@richardliaw richardliaw merged commit 6022eb5 into ray-project:master Mar 13, 2020
@ujvl ujvl deleted the tune-chkpt branch March 13, 2020 05:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants