[tune] fix checkpoint bookkeeping, fixing pbt errors#9470
krfricke wants to merge 4 commits into ray-project:master
Conversation
|
Can one of the admins verify this patch? |
|
Test FAILed. |
|
@krfricke can you add a test? |
|
I just created a test and gained more insight into the problem. Basically, the checkpoint manager bookkeeping was broken. Checkpoints that should never be kept (because of bad performance) remained in Setting the Lastly, the alteration in |
|
Test FAILed. |
|
This PR introduced unwanted logic changes in PBT and the checkpoint manager. Please disregard this PR in favor of #9517. |
Why are these changes needed?
Tune's CheckpointManager did not reflect checkpoint deletions in CheckpointManager.newest_persistent_checkpoint, leading to errors when trying to resume trials. In the related issue, this was the case with a PopulationBasedTraining scheduler. Introducing bookkeeping here prevents Tune from trying to restore trials whose checkpoint has already been deleted.
The error observed in the issues seems to stem from the fact that the CheckpointManager deletes checkpoints according to checkpoint_score_attr (which can be changed by the user), but PBT has its own internal state bookkeeping that only considers the latest checkpoints. The two strategies coincide only when checkpoint_score_attr is unset. Thus, with this PR, PBT will also try to use the best available checkpoint if the latest is not available.
Related issue number
Should fix #9036, #8441
Checks
scripts/format.sh to lint the changes in this PR.
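To illustrate the bookkeeping described under "Why are these changes needed?", here is a minimal, self-contained sketch. It is not Tune's actual code: Checkpoint, ToyCheckpointManager, and checkpoint_to_restore are hypothetical names invented for this example, and a single numeric score stands in for checkpoint_score_attr. It only shows the idea that the "newest" pointer must follow deletions, and that a scheduler holding its own reference can fall back to the best surviving checkpoint.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Checkpoint:
    # Hypothetical record for illustration; not Tune's checkpoint class.
    iteration: int
    score: float
    deleted: bool = False


class ToyCheckpointManager:
    """Keeps the `keep_num` best-scoring checkpoints and tracks the newest survivor."""

    def __init__(self, keep_num: int):
        self.keep_num = keep_num
        self._kept: List[Checkpoint] = []
        self.newest: Optional[Checkpoint] = None

    def on_checkpoint(self, checkpoint: Checkpoint) -> None:
        self._kept.append(checkpoint)
        self.newest = checkpoint
        # Evict the worst-scoring checkpoints beyond the budget.
        self._kept.sort(key=lambda c: c.score, reverse=True)
        while len(self._kept) > self.keep_num:
            evicted = self._kept.pop()
            evicted.deleted = True
        # Bookkeeping: never leave `newest` pointing at a deleted checkpoint.
        if self.newest.deleted:
            self.newest = max(
                self._kept, key=lambda c: c.iteration, default=None
            )

    def best(self) -> Optional[Checkpoint]:
        return self._kept[0] if self._kept else None


def checkpoint_to_restore(
    latest: Optional[Checkpoint], manager: ToyCheckpointManager
) -> Optional[Checkpoint]:
    """Prefer the scheduler's latest checkpoint; fall back to the best one kept."""
    if latest is not None and not latest.deleted:
        return latest
    return manager.best()


# The scheduler remembers `second` as its latest checkpoint, but the
# manager evicts it because it scores worse than `first`.
manager = ToyCheckpointManager(keep_num=1)
first = Checkpoint(iteration=1, score=0.9)
second = Checkpoint(iteration=2, score=0.5)
manager.on_checkpoint(first)
manager.on_checkpoint(second)
restored = checkpoint_to_restore(second, manager)
print(restored.iteration)
```

In this toy setup the scheduler's latest checkpoint has already been evicted, so the restore falls back to the surviving, better-scoring one instead of failing.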