
[tune] fix checkpoint bookkeeping, fixing pbt errors #9470

Closed
krfricke wants to merge 4 commits into ray-project:master from krfricke:tune-pbt-checkpoint-debug

Conversation

@krfricke
Contributor

@krfricke krfricke commented Jul 14, 2020

Why are these changes needed?

Tune's CheckpointManager did not reflect checkpoint deletions in CheckpointManager.newest_persistent_checkpoint, leading to errors when trying to resume trials. In the related issue, this was the case with a PopulationBasedTraining scheduler. Introducing bookkeeping here prevents Tune from trying to restore trials where the checkpoint has already been deleted.

The error observed in the issues seems to stem from the fact that the CheckpointManager deletes checkpoints according to checkpoint_score_attr (which can be changed by the user), while PBT keeps its own internal bookkeeping, considering only the latest checkpoints. The two strategies coincide only when checkpoint_score_attr is unset. Thus, with this PR, PBT will also try to use the best available checkpoint if the latest one is not available.
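The bug pattern can be illustrated with a minimal sketch (hypothetical toy classes, not Tune's actual CheckpointManager): a manager that keeps only the best-scoring checkpoints must stop reporting an evicted checkpoint as its newest persistent one, otherwise a later trial restore tries to load a checkpoint that no longer exists on disk.

```python
class Checkpoint:
    """Toy stand-in for a persisted checkpoint (hypothetical)."""
    def __init__(self, path, score):
        self.path = path
        self.score = score
        self.deleted = False


class ToyCheckpointManager:
    """Minimal sketch (not Tune's actual class): keeps the `keep_num`
    best-scoring checkpoints and must not keep pointing at a deleted
    checkpoint via `newest_persistent_checkpoint`."""

    def __init__(self, keep_num):
        self.keep_num = keep_num
        self._checkpoints = []  # surviving checkpoints, in creation order
        self.newest_persistent_checkpoint = None

    def on_checkpoint(self, checkpoint):
        self.newest_persistent_checkpoint = checkpoint
        self._checkpoints.append(checkpoint)
        if len(self._checkpoints) > self.keep_num:
            # Evict the worst-scoring checkpoint (checkpoint_score_attr logic).
            worst = min(self._checkpoints, key=lambda c: c.score)
            self._checkpoints.remove(worst)
            worst.deleted = True
            # The bookkeeping fix: without this check, a restore would
            # later try to load the already-deleted checkpoint and fail.
            if worst is self.newest_persistent_checkpoint:
                self.newest_persistent_checkpoint = None
```

Here a new checkpoint with a bad score is immediately evicted, and the manager clears the stale reference instead of handing it to a restoring trial.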

Related issue number

Should fix #9036, #8441

Checks

@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28329/

@richardliaw
Contributor

@krfricke can you add a test?

@richardliaw richardliaw self-assigned this Jul 14, 2020
@krfricke
Contributor Author

krfricke commented Jul 15, 2020

I just created a test and gained some more insight into the problem.

Basically the checkpoint manager bookkeeping was broken. Checkpoints that should never be kept (because of bad performance) remained in CheckpointManager.newest_persistent_checkpoint. This was the main problem leading to the errors in the issues.

Setting the newest_persistent_checkpoint to None helps with this, but ignores that other persistent checkpoints are often still available. Thus, with the latest commit, we either restore the old checkpoint if the new checkpoint is not stored, or alternatively loop through the _best_checkpoint to find the latest one (e.g. if the last checkpoint got deleted and the next one also gets deleted).
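The fallback search could look roughly like this (a hypothetical sketch; `KeptCheckpoint` and `newest_available` are illustrative names, not Tune's actual API):

```python
from dataclasses import dataclass


@dataclass
class KeptCheckpoint:
    """Simplified stand-in (hypothetical) for a tracked checkpoint."""
    order: int            # monotonically increasing creation index
    deleted: bool = False


def newest_available(kept_checkpoints):
    """Walk the kept checkpoints from newest to oldest and return the
    first one that still exists, or None if all were deleted."""
    for cp in sorted(kept_checkpoints, key=lambda c: c.order, reverse=True):
        if not cp.deleted:
            return cp
    return None
```

This covers the chained-deletion case mentioned above: if the last and second-to-last checkpoints are both gone, the loop keeps walking back until it finds a surviving one.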

Lastly, the alteration in pbt.py, where we load the best checkpoint from storage if a memory checkpoint is not available, does not directly affect this problem, and we should discuss whether we want this behavior or should exclude it for now. In my opinion, it is justified. The case only occurs when a trial A was not in the top quantile when it reported its last result, but moves into the top quantile after another trial B performed worse. Then yet another trial C tries to exploit trial A, but cannot load the checkpoint as it has been unset. In this case, trial C will then load the latest persistent checkpoint from trial A.
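That exploit fallback could be sketched as follows (hypothetical helper and names; `memory_checkpoint` and `persistent_checkpoints` stand in for PBT's internal trial state, not the actual pbt.py code):

```python
from dataclasses import dataclass


@dataclass
class PersistedCheckpoint:
    """Toy stand-in for a checkpoint on disk (hypothetical)."""
    path: str
    score: float
    deleted: bool = False


def checkpoint_to_exploit(memory_checkpoint, persistent_checkpoints):
    """Prefer the in-memory checkpoint of the trial to exploit; if it
    has been unset, fall back to the best still-available persistent
    checkpoint, or None if nothing is left to exploit."""
    if memory_checkpoint is not None:
        return memory_checkpoint
    available = [c for c in persistent_checkpoints if not c.deleted]
    if not available:
        return None
    return max(available, key=lambda c: c.score)
```

In the scenario above, trial C would call something like this against trial A's state: the memory checkpoint is gone, so C receives A's best surviving persistent checkpoint instead of failing.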

@richardliaw richardliaw requested a review from ujvl July 15, 2020 17:50
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28390/

@krfricke krfricke mentioned this pull request Jul 16, 2020
@krfricke
Contributor Author

This PR introduced unwanted logic changes in PBT and the checkpoint manager. Please disregard this PR in favor of #9517.

@krfricke krfricke closed this Jul 16, 2020
@krfricke krfricke deleted the tune-pbt-checkpoint-debug branch July 16, 2020 09:17


Development

Successfully merging this pull request may close these issues.

[tune] Population-based training: broken when using keep_checkpoint_num

3 participants