[tune] keep_checkpoints_num logic should be in trainable #5127

@richardliaw

Description

Right now, enforcing keep_checkpoints_num requires multiple actor calls that block the control loop. All of this can be implemented on the worker: the Trainable keeps track of its checkpoint history and removes checkpoints as needed. The driver should mirror the result by syncing with rsync --delete.
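
A rough sketch of what the worker-side bookkeeping could look like, to make the proposal concrete. This is not Ray's actual implementation: `CheckpointHistory`, `add`, `kept`, and `mirror_command` are hypothetical names, and keeping the most recent N checkpoints is an assumed retention policy. The only external piece relied on is rsync's real `--delete` flag, which removes files on the destination that no longer exist on the source.

```python
import shutil
from collections import deque


class CheckpointHistory:
    """Hypothetical worker-side helper (not part of Ray's API).

    Tracks checkpoint directories in creation order and deletes the
    oldest ones once more than `keep_num` exist. Keeping the most
    recent N is an assumption made for this sketch.
    """

    def __init__(self, keep_num):
        if keep_num < 1:
            raise ValueError("keep_num must be at least 1")
        self.keep_num = keep_num
        self._paths = deque()

    def add(self, checkpoint_dir):
        """Record a new checkpoint; return the list of paths removed."""
        self._paths.append(checkpoint_dir)
        removed = []
        while len(self._paths) > self.keep_num:
            old = self._paths.popleft()
            # Deletion happens locally on the worker, so the driver's
            # control loop never has to issue a blocking actor call.
            shutil.rmtree(old, ignore_errors=True)
            removed.append(old)
        return removed

    def kept(self):
        """Return the checkpoint paths currently retained, oldest first."""
        return list(self._paths)


def mirror_command(worker_dir, driver_dir):
    """Build (but do not run) an rsync command that mirrors deletions.

    The trailing slash on the source makes rsync copy the directory's
    contents rather than the directory itself.
    """
    return ["rsync", "-a", "--delete", worker_dir.rstrip("/") + "/", driver_dir]
```

With something like this, the Trainable would call `add()` after each save, and the driver would only need to run the mirrored sync; stale checkpoints disappear on the driver as a side effect of `--delete`.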

Labels

good-first-issue: Great starter issue for someone just starting to contribute to Ray
tune: Tune-related issues