[tune] [rllib] Automatically determine RLlib resources and add queueing mechanism for autoscaling#1848
Conversation
Test FAILed.
Apex is failing

Should be fixed
Test FAILed.

jenkins retest this please

Test PASSed.

Test FAILed.

jenkins retest this please

Test FAILed.

jenkins retest this please
richardliaw left a comment:

Nice. A couple of points:

- Each algorithm now has a resource requirement. As a user, it's very unclear what each configuration does and how I would go about altering the resource configuration. It would be great to have documentation for each `default_resource_request`.
- I'm not sure what `queue_trials` does.
        using the Client API.
    server_port (int): Port number for launching TuneServer.
    verbose (bool): How much output should be printed for each trial.
    queue_trials (bool): Whether to queue trials when the cluster does
Ah, I read this multiple times, and I'm not sure what this means; aren't trials "queued" already?

They aren't, though, if there aren't enough resources. For example, if you have one CPU and launch a trial that needs 4 CPUs, then it won't be queued.
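The queueing behavior discussed here can be sketched as a small decision function. This is a hypothetical illustration, not the actual Tune trial-runner code: the `Resources` dataclass and `can_launch` helper are made up for this example, and the real scheduler tracks more fields (GPUs, extra_cpu/extra_gpu for remote workers).

```python
# Hypothetical sketch of the queueing decision described above: a trial that
# requests more CPUs than the cluster currently has is normally unschedulable,
# but with queue_trials=True it is kept pending so the autoscaler can add nodes.
from dataclasses import dataclass


@dataclass
class Resources:
    cpu: int
    gpu: int = 0


def can_launch(trial_resources, cluster_resources, queue_trials):
    """Return 'run', 'queue', or 'reject' for a pending trial."""
    fits = (trial_resources.cpu <= cluster_resources.cpu
            and trial_resources.gpu <= cluster_resources.gpu)
    if fits:
        return "run"
    # Without queueing, an oversized trial can never be scheduled
    # on a fixed-size cluster; with queueing it simply waits.
    return "queue" if queue_trials else "reject"


# One-CPU cluster, four-CPU trial: rejected unless queue_trials is set.
print(can_launch(Resources(cpu=4), Resources(cpu=1), queue_trials=False))  # reject
print(can_launch(Resources(cpu=4), Resources(cpu=1), queue_trials=True))   # queue
```

With `queue_trials` enabled, the "queue" outcome is what lets an autoscaling cluster eventually satisfy the request.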
if __name__ == "__main__":
    ray.init()
Hm, why is this needed?

It's because we now have to ray.get the trainable to inspect the resources.
    server_port (int): Port number for launching TuneServer.
    verbose (bool): Flag for verbosity. If False, trial results
        will not be output.
    queue_trials (bool): Whether to queue trials when the cluster does

Yeah, I'm not sure what this means...
_allow_unknown_subkeys = ["model", "optimizer", "env_config"]

@classmethod
def default_resource_request(cls, config):

This makes sense, but in some sense it adds a layer of extra complexity. I think, if anything, every agent should have some explanation of why these defaults are set.

Yeah, the idea is that the algorithm author deals with the complexity, not the users.
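A minimal sketch of what an algorithm author might write under this scheme. The `Resources` namedtuple fields here (`cpu`/`gpu` for the driver, `extra_cpu`/`extra_gpu` for remote workers) and the `MyAgent` class are assumptions modeled on the PR's description, not the exact RLlib code:

```python
# Sketch: the algorithm author declares resource defaults once, derived
# from the config, so end users never have to specify them.
from collections import namedtuple

Resources = namedtuple("Resources", ["cpu", "gpu", "extra_cpu", "extra_gpu"])


class MyAgent:
    @classmethod
    def default_resource_request(cls, config):
        # Driver uses 1 CPU (plus a GPU if configured); each remote
        # rollout worker accounts for one extra CPU.
        return Resources(
            cpu=1,
            gpu=1 if config.get("gpu") else 0,
            extra_cpu=config.get("num_workers", 2),
            extra_gpu=0)


print(MyAgent.default_resource_request({"num_workers": 8}))
```

The user-facing benefit is that Tune can read this request and size (or queue for) the cluster accordingly, while the reasoning behind the defaults lives next to the algorithm.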
Test PASSed.

Test FAILed.

So queued trials are when you start a trial but the cluster doesn't have enough extra_cpus/extra_gpus, so it hangs until the autoscaler kicks in?

That's right.

jenkins retest this please

Travis looks good. jenkins retest this please

Test PASSed.
* master:
  Handle interrupts correctly for ASIO synchronous reads and writes. (ray-project#1929)
  [DataFrame] Adding read methods and tests (ray-project#1712)
  Allow task_table_update to fail when tasks are finished. (ray-project#1927)
  [rllib] Contribute DDPG to RLlib (ray-project#1877)
  [xray] Workers blocked in a `ray.get` release their resources (ray-project#1920)
  Raylet task dispatch and throttling worker startup (ray-project#1912)
  [DataFrame] Eval fix (ray-project#1903)
  [tune] Polishing docs (ray-project#1846)
  [tune] [rllib] Automatically determine RLlib resources and add queueing mechanism for autoscaling (ray-project#1848)
  Preemptively push local arguments for actor tasks (ray-project#1901)
  [tune] Allow fetching pinned objects from trainable functions (ray-project#1895)
  Multithreading refactor for ObjectManager. (ray-project#1911)
  Add slice functionality (ray-project#1832)
  [DataFrame] Pass read_csv kwargs to _infer_column (ray-project#1894)
  Addresses missed comments from multichunk object transfer PR. (ray-project#1908)
  Allow numpy arrays to be passed by value into tasks (and inlined in the task spec). (ray-project#1816)
  [xray] Lineage cache requests notifications from the GCS about remote tasks (ray-project#1834)
  Fix UI issue for non-json-serializable task arguments. (ray-project#1892)
  Remove unnecessary calls to .hex() for object IDs. (ray-project#1910)
  Allow multiple raylets to be started on a single machine. (ray-project#1904)

# Conflicts:
#   python/ray/rllib/__init__.py
#   python/ray/rllib/dqn/dqn.py
What do these changes do?

RLlib agents now declare their resource requests to Tune, so as a user you don't have to.

Also adds a `--queue-trials` option to Tune, which will allow a Trial to be scheduled even if it somewhat exceeds the resource capacity of the cluster. This should allow RLlib to work with autoscaling clusters in many cases. A couple of caveats:

Related issue number