[tune] [rllib] Automatically determine RLlib resources and add queueing mechanism for autoscaling#1848
Conversation
Test FAILed.
Apex is failing

Should be fixed
Test FAILed.

jenkins retest this please

Test PASSed.

Test FAILed.

jenkins retest this please

Test FAILed.

jenkins retest this please
richardliaw left a comment:

Nice. A couple of points:

- Each algorithm now has a resource requirement. As a user, it's very unclear what each configuration does and how I would go about altering the resource configuration. It would be great to have documentation for each `default_resource_request`.
- I'm not sure what `queue_trials` does.
        using the Client API.
    server_port (int): Port number for launching TuneServer.
    verbose (bool): How much output should be printed for each trial.
    queue_trials (bool): Whether to queue trials when the cluster does
Ah, I read this multiple times, and I'm not sure what this means; aren't trials "queued" already?

They aren't, though, if there aren't enough resources. For example, if you have one CPU and launch a trial that needs 4 CPUs, then it won't be queued.
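The queueing behavior discussed here can be sketched as a small decision function. This is a hypothetical illustration, not the actual Tune trial-runner code: the `Resources` dataclass and `can_launch` helper are made up for this example, and the real scheduler tracks more fields (GPUs, extra_cpu/extra_gpu for remote workers).

```python
# Hypothetical sketch of the queueing decision described above: a trial that
# requests more CPUs than the cluster currently has is normally unschedulable,
# but with queue_trials=True it is kept pending so the autoscaler can add nodes.
from dataclasses import dataclass


@dataclass
class Resources:
    cpu: int
    gpu: int = 0


def can_launch(trial_resources, cluster_resources, queue_trials):
    """Return 'run', 'queue', or 'reject' for a pending trial."""
    fits = (trial_resources.cpu <= cluster_resources.cpu
            and trial_resources.gpu <= cluster_resources.gpu)
    if fits:
        return "run"
    # Without queueing, an oversized trial can never be scheduled
    # on a fixed-size cluster; with queueing it simply waits.
    return "queue" if queue_trials else "reject"


# One-CPU cluster, four-CPU trial: rejected unless queue_trials is set.
print(can_launch(Resources(cpu=4), Resources(cpu=1), queue_trials=False))  # reject
print(can_launch(Resources(cpu=4), Resources(cpu=1), queue_trials=True))   # queue
```

With `queue_trials` enabled, the "queue" outcome is what lets an autoscaling cluster eventually satisfy the request.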
if __name__ == "__main__":
    ray.init()
Hm, why is this needed?

It's because we now have to ray.get the trainable to inspect the resources.
    server_port (int): Port number for launching TuneServer.
    verbose (bool): Flag for verbosity. If False, trial results
        will not be output.
    queue_trials (bool): Whether to queue trials when the cluster does

Yeah, I'm not sure what this means...
_allow_unknown_subkeys = ["model", "optimizer", "env_config"]

@classmethod
def default_resource_request(cls, config):

This makes sense, but in some sense it adds a layer of extra complexity. I think, if anything, every agent should have some explanation of why these defaults are set.

Yeah, the idea is that the algorithm author deals with the complexity, not the users.
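A minimal sketch of what an algorithm author might write under this scheme. The `Resources` namedtuple fields here (`cpu`/`gpu` for the driver, `extra_cpu`/`extra_gpu` for remote workers) and the `MyAgent` class are assumptions modeled on the PR's description, not the exact RLlib code:

```python
# Sketch: the algorithm author declares resource defaults once, derived
# from the config, so end users never have to specify them.
from collections import namedtuple

Resources = namedtuple("Resources", ["cpu", "gpu", "extra_cpu", "extra_gpu"])


class MyAgent:
    @classmethod
    def default_resource_request(cls, config):
        # Driver uses 1 CPU (plus a GPU if configured); each remote
        # rollout worker accounts for one extra CPU.
        return Resources(
            cpu=1,
            gpu=1 if config.get("gpu") else 0,
            extra_cpu=config.get("num_workers", 2),
            extra_gpu=0)


print(MyAgent.default_resource_request({"num_workers": 8}))
```

The user-facing benefit is that Tune can read this request and size (or queue for) the cluster accordingly, while the reasoning behind the defaults lives next to the algorithm.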
Test PASSed.

Test FAILed.

So queued trials are when you start a trial but the cluster doesn't have enough extra_cpus/extra_gpus, so it hangs until the autoscaler kicks in?

That's right.

jenkins retest this please

Travis looks good. jenkins retest this please

Test PASSed.
* master:
  Handle interrupts correctly for ASIO synchronous reads and writes. (ray-project#1929)
  [DataFrame] Adding read methods and tests (ray-project#1712)
  Allow task_table_update to fail when tasks are finished. (ray-project#1927)
  [rllib] Contribute DDPG to RLlib (ray-project#1877)
  [xray] Workers blocked in a `ray.get` release their resources (ray-project#1920)
  Raylet task dispatch and throttling worker startup (ray-project#1912)
  [DataFrame] Eval fix (ray-project#1903)
  [tune] Polishing docs (ray-project#1846)
  [tune] [rllib] Automatically determine RLlib resources and add queueing mechanism for autoscaling (ray-project#1848)
  Preemptively push local arguments for actor tasks (ray-project#1901)
  [tune] Allow fetching pinned objects from trainable functions (ray-project#1895)
  Multithreading refactor for ObjectManager. (ray-project#1911)
  Add slice functionality (ray-project#1832)
  [DataFrame] Pass read_csv kwargs to _infer_column (ray-project#1894)
  Addresses missed comments from multichunk object transfer PR. (ray-project#1908)
  Allow numpy arrays to be passed by value into tasks (and inlined in the task spec). (ray-project#1816)
  [xray] Lineage cache requests notifications from the GCS about remote tasks (ray-project#1834)
  Fix UI issue for non-json-serializable task arguments. (ray-project#1892)
  Remove unnecessary calls to .hex() for object IDs. (ray-project#1910)
  Allow multiple raylets to be started on a single machine. (ray-project#1904)

# Conflicts:
#   python/ray/rllib/__init__.py
#   python/ray/rllib/dqn/dqn.py
What do these changes do?

RLlib agents now declare their resource requests to Tune, so as a user you don't have to.

Also adds a `--queue-trials` option to Tune, which will allow a Trial to be scheduled even if it somewhat exceeds the resource capacity of the cluster. This should allow RLlib to work with autoscaling clusters in many cases. A couple of caveats:

Related issue number