[tune] Avoid scheduler blocking, add reuse_actors optimization #4218
ericl merged 17 commits into ray-project:master
Conversation
@arcelien please try this out
richardliaw left a comment
Left a couple comments.
-        if not reset_successful:
+        if reset_successful:
+            trial_executor.restore(
+                trial, Checkpoint.from_object(new_state.last_checkpoint))
Let me try it on a single GPU machine, both with time multiplexing and with a small population size.
@richardliaw @arcelien I just pushed a change that should drastically speed up time-multiplexing as well, by reusing actors across different trials. This is a bit of a scary change so I marked the PR as WIP until we're sure this is safe.
        if not error and self._cached_runner is None:
            logger.debug("Retaining trial runner {}".format(trial.runner))
            self._cached_runner = trial.runner
I suspect GPU trials may run into issues when reusing the same processes because TF doesn't give up the GPU -- unless you've observed otherwise?
There's no difference between reusing a runner that retains control of the GPU, and stopping the runner and starting a new one (releasing and reacquiring the GPU), right?
Oh, I guess my question is mainly what happens during Trainable._setup(), which actually isn't called in _setup_runner when we're using a cached runner (see above note).
Right, we call reset_trial() instead. I guess this works for PBT, but not necessarily for other algorithms. Unless we add a reset_state() function as well?
        logger.debug("Reusing cached runner {}".format(self._cached_runner))
        existing_runner = self._cached_runner
        self._cached_runner = None
If we're processing a new trial and the trial resources are different, we can't just use the _cached_runner, right?
That's right. I think we can probably assume they are the same though, if the reuse_actors flag is manually activated.
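
For illustration, here is a minimal, self-contained sketch of the caching pattern shown in the two snippets above, using plain Python objects instead of Ray actors. The names _cached_runner and reset_config mirror the diff; DummyRunner, CachingExecutor, and the fallback behavior are invented for this example.

class DummyRunner:
    def __init__(self, config):
        self.config = config  # stands in for an expensive actor to construct

    def reset_config(self, new_config):
        self.config = new_config  # cheap in-place reconfiguration
        return True  # False would force the executor to build a fresh runner


class CachingExecutor:
    def __init__(self, reuse_actors=False):
        self.reuse_actors = reuse_actors
        self._cached_runner = None

    def stop_trial(self, runner, error=False):
        # Retain the runner instead of destroying it so the next trial can
        # skip the (slow) construction step.
        if self.reuse_actors and not error and self._cached_runner is None:
            self._cached_runner = runner

    def start_trial(self, config):
        if self._cached_runner is not None:
            runner, self._cached_runner = self._cached_runner, None
            if runner.reset_config(config):
                return runner
        return DummyRunner(config)  # fall back to creating a new runner


executor = CachingExecutor(reuse_actors=True)
first = executor.start_trial({"lr": 0.1})
executor.stop_trial(first)
second = executor.start_trial({"lr": 0.01})
assert first is second  # the runner was reused rather than recreated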
python/ray/tune/trainable.py (outdated)
        return self._export_model(export_formats, export_dir)

-    def reset_config(self, new_config):
+    def reset_config(self, new_config, reset_state):
Breaking API change. Though I doubt reset_config() is widely used.
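
For reference, a sketch of what implementing this hook could look like on the user side, following the two-argument signature shown in the diff (whether reset_state survives review is exactly the open question here). MySGDTrainable and its fields are invented, and the _setup/_train hook names follow the Tune API of this era.

from ray.tune import Trainable


class MySGDTrainable(Trainable):
    """Hypothetical Trainable that supports being reused across trials."""

    def _setup(self, config):
        self.lr = config["lr"]
        self.weights = self._load_initial_weights()

    def _train(self):
        # One toy update step; only here so the class is a complete Trainable.
        self.weights = [w - self.lr * w for w in self.weights]
        return {"mean_loss": sum(abs(w) for w in self.weights)}

    def reset_config(self, new_config, reset_state):
        # Reconfigure in place so a cached actor can serve a new trial.
        self.lr = new_config["lr"]
        if reset_state:
            self.weights = self._load_initial_weights()
        return True  # returning False tells Tune to recreate the actor

    def _load_initial_weights(self):
        return [1.0] * 4  # stand-in for an expensive model build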
Updated.
        self._restore(checkpoint_path)
        self._time_since_restore = 0.0
        self._timesteps_since_restore = 0
        self._iterations_since_restore = 0
BTW, all Travis tests hang on
jenkins retest this please
Test PASSed.
What do these changes do?
The key change is removing a blocking ray.get on the trial restore path. This avoids blocking the PBT scheduler when restoring trials, which is a large performance bottleneck when restoring very large network weights.
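
The effect can be illustrated with a toy actor (Worker and restore_from_object below are made up; the only point is the presence or absence of the ray.get around the remote call):

import ray


@ray.remote
class Worker(object):
    # Toy stand-in for a trial runner restoring large network weights.
    def restore_from_object(self, weights):
        self.weights = weights
        return True


ray.init()
worker = Worker.remote()
big_weights = [0.0] * 10000000

# Blocking: the scheduler loop waits for the whole restore to finish.
ray.get(worker.restore_from_object.remote(big_weights))

# Non-blocking: the restore is queued on the actor and the loop moves on;
# later calls on the same actor are ordered after it anyway.
worker.restore_from_object.remote(big_weights)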
Also, add warnings if fast ray.get paths are ever slow, and warn the user if they haven't implemented reset_config(). Incidentally, I think we forgot to restore weights on the path where reset_config() returns True.
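
The slow-path warning amounts to timing the get, roughly like this sketch (the function name, threshold, and message are invented, not the actual implementation):

import time

import ray


def get_with_warning(object_id, name, threshold_s=0.5):
    # Wrap a ray.get that is expected to return quickly and complain when it
    # does not, so accidental blocking in the scheduler loop shows up in logs.
    start = time.time()
    result = ray.get(object_id)
    elapsed = time.time() - start
    if elapsed > threshold_s:
        print("WARNING: fetching {} took {:.2f}s; this may be a "
              "performance bottleneck".format(name, elapsed))
    return result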
Finally, add a reuse_actors flag that allows actors to be reused across trials if reset_config is implemented. This provides additional speedups if actor creation is slow.
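
Usage might look something like the sketch below; how the flag is actually exposed (tune.run vs run_experiments, and the exact argument names) may differ from this, and MySGDTrainable is the invented Trainable from the earlier sketch.

from ray import tune

# Hypothetical usage: reuse actors across trials. Only safe when the
# Trainable implements reset_config() and every trial requests the same
# resources.
tune.run(
    MySGDTrainable,
    num_samples=2,
    config={"lr": tune.grid_search([0.001, 0.01, 0.1])},
    reuse_actors=True,
)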