
[tune] fix checkpoint bookkeeping, fixing pbt errors #9470

Closed
krfricke wants to merge 4 commits into ray-project:master from krfricke:tune-pbt-checkpoint-debug

Conversation

@krfricke
Contributor

@krfricke krfricke commented Jul 14, 2020

Why are these changes needed?

Tune's CheckpointManager did not reflect checkpoint deletions in CheckpointManager.newest_persistent_checkpoint, leading to errors when trying to resume trials. In the related issue, this was the case with a PopulationBasedTraining scheduler. Introducing bookkeeping here prevents Tune from trying to restore trials where the checkpoint has already been deleted.

The error observed in the issues seems to stem from the fact that the CheckpointManager deletes checkpoints according to checkpoint_score_attr (which can be changed by the user), while PBT keeps its own internal bookkeeping, considering only the latest checkpoints. The two strategies coincide only when checkpoint_score_attr is unset. Thus, with this PR, PBT will also try to use the best available checkpoint if the latest one is not available.
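The bug pattern can be illustrated with a minimal sketch (hypothetical toy classes, not Tune's actual CheckpointManager): a manager that keeps only the best-scoring checkpoints must stop reporting an evicted checkpoint as its newest persistent one, otherwise a later trial restore tries to load a checkpoint that no longer exists on disk.

```python
class Checkpoint:
    """Toy stand-in for a persisted checkpoint (hypothetical)."""
    def __init__(self, path, score):
        self.path = path
        self.score = score
        self.deleted = False


class ToyCheckpointManager:
    """Minimal sketch (not Tune's actual class): keeps the `keep_num`
    best-scoring checkpoints and must not keep pointing at a deleted
    checkpoint via `newest_persistent_checkpoint`."""

    def __init__(self, keep_num):
        self.keep_num = keep_num
        self._checkpoints = []  # surviving checkpoints, in creation order
        self.newest_persistent_checkpoint = None

    def on_checkpoint(self, checkpoint):
        self.newest_persistent_checkpoint = checkpoint
        self._checkpoints.append(checkpoint)
        if len(self._checkpoints) > self.keep_num:
            # Evict the worst-scoring checkpoint (checkpoint_score_attr logic).
            worst = min(self._checkpoints, key=lambda c: c.score)
            self._checkpoints.remove(worst)
            worst.deleted = True
            # The bookkeeping fix: without this check, a restore would
            # later try to load the already-deleted checkpoint and fail.
            if worst is self.newest_persistent_checkpoint:
                self.newest_persistent_checkpoint = None
```

Here a new checkpoint with a bad score is immediately evicted, and the manager clears the stale reference instead of handing it to a restoring trial.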

Related issue number

Should fix #9036, #8441

Checks

@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28329/

@richardliaw
Contributor

@krfricke can you add a test?

@richardliaw richardliaw self-assigned this Jul 14, 2020
@krfricke
Contributor Author

krfricke commented Jul 15, 2020

I just created a test and gained some more insight into the problem.

Basically the checkpoint manager bookkeeping was broken. Checkpoints that should never be kept (because of bad performance) remained in CheckpointManager.newest_persistent_checkpoint. This was the main problem leading to the errors in the issues.

Setting the newest_persistent_checkpoint to None helps with this, but ignores that other persistent checkpoints are often still available. Thus, with the latest commit, we either restore the old checkpoint if the new checkpoint is not stored, or alternatively loop through the _best_checkpoint to find the latest one (e.g. if the last checkpoint got deleted and the next one also gets deleted).
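The fallback search could look roughly like this (a hypothetical sketch; `KeptCheckpoint` and `newest_available` are illustrative names, not Tune's actual API):

```python
from dataclasses import dataclass


@dataclass
class KeptCheckpoint:
    """Simplified stand-in (hypothetical) for a tracked checkpoint."""
    order: int            # monotonically increasing creation index
    deleted: bool = False


def newest_available(kept_checkpoints):
    """Walk the kept checkpoints from newest to oldest and return the
    first one that still exists, or None if all were deleted."""
    for cp in sorted(kept_checkpoints, key=lambda c: c.order, reverse=True):
        if not cp.deleted:
            return cp
    return None
```

This covers the chained-deletion case mentioned above: if the last and second-to-last checkpoints are both gone, the loop keeps walking back until it finds a surviving one.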

Lastly, the alteration in pbt.py, where we load the best checkpoint from storage if a memory checkpoint is not available, does not directly affect this problem, and we should discuss whether we want this behavior or should exclude it for now. In my opinion, it is justified. The case only occurs when a trial A was not in the top quantile when it reported its last result, but moves into the top quantile after another trial B performed worse. Then yet another trial C tries to exploit trial A, but cannot load the checkpoint as it has been unset. In this case, trial C will then load the latest persistent checkpoint from trial A.
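That exploit fallback could be sketched as follows (hypothetical helper and names; `memory_checkpoint` and `persistent_checkpoints` stand in for PBT's internal trial state, not the actual pbt.py code):

```python
from dataclasses import dataclass


@dataclass
class PersistedCheckpoint:
    """Toy stand-in for a checkpoint on disk (hypothetical)."""
    path: str
    score: float
    deleted: bool = False


def checkpoint_to_exploit(memory_checkpoint, persistent_checkpoints):
    """Prefer the in-memory checkpoint of the trial to exploit; if it
    has been unset, fall back to the best still-available persistent
    checkpoint, or None if nothing is left to exploit."""
    if memory_checkpoint is not None:
        return memory_checkpoint
    available = [c for c in persistent_checkpoints if not c.deleted]
    if not available:
        return None
    return max(available, key=lambda c: c.score)
```

In the scenario above, trial C would call something like this against trial A's state: the memory checkpoint is gone, so C receives A's best surviving persistent checkpoint instead of failing.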

@richardliaw richardliaw requested a review from ujvl July 15, 2020 17:50
@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/28390/

@krfricke krfricke mentioned this pull request Jul 16, 2020
@krfricke
Contributor Author

This PR introduced unwanted logic changes in PBT and the checkpoint manager. Please disregard this PR in favor of #9517.

@krfricke krfricke closed this Jul 16, 2020
@krfricke krfricke deleted the tune-pbt-checkpoint-debug branch July 16, 2020 09:17


Development

Successfully merging this pull request may close these issues.

[tune] Population-based training: broken when using keep_checkpoint_num

3 participants