Dynamically grow worker pool to partially solve hanging workloads #286
Conversation
Can one of the admins verify this patch?
src/photon/photon_algorithm.c
Probably worth adding a check that we actually removed it, for now.
test/runtest.py
Doesn't the test need to call ray.init at some point?
It does (line 1072)
Oh somehow I missed that :)
python/ray/worker.py
Should this be len(unready_ids)?
Oh oops, nice catch!
src/photon/photon_scheduler.h
As @atumanov mentioned, we should call this method in kill_worker to reclaim the resources from the task that was running on that worker.
Ah, yes, there were a couple things that were missing from kill_worker, like updating the task table. I want to make that a separate PR because I think that functionality should have more testing independent of this one. I can add a TODO though.
Let's add a test starting with 0 workers, e.g.,

```python
ray.init(num_workers=0)

@ray.remote
def f():
    return 1

# Make sure we can call a remote function. This will require starting a new worker.
ray.get(f.remote())
ray.get([f.remote() for _ in range(100)])
```

Let's also add a test where a worker is blocked for a while so that new workers need to be started, e.g.,

```python
import ray
import time

ray.init(num_workers=0, num_cpus=100)

@ray.remote
def f():
    time.sleep(3)

@ray.remote
def g():
    ray.get([f.remote() for _ in range(10)])

ray.get(g.remote())
```
Force-pushed from dc3b9c2 to ed0776d.
src/photon/photon_scheduler.c
maybe give another warning here (you gave one at the beginning, but it might be lost in the logs now)
Force-pushed from 70ec4f5 to 5e6f9a7.
This is a barebones policy that implements a worker pool: whenever a task can be assigned (there are enough resources, the inputs are ready, etc.) but the pool of available workers is empty, the local scheduler replenishes the pool by starting a new worker.
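A minimal Python sketch of the replenishment policy described above (the real implementation lives in the photon C code; the class and method names here, like `LocalScheduler` and `start_worker`, are illustrative, not the actual API):

```python
class LocalScheduler:
    """Toy model of the replenish-on-empty worker pool policy."""

    def __init__(self):
        self.available_workers = []  # idle workers ready to take a task
        self.started_workers = 0     # total workers launched so far

    def start_worker(self):
        # In the real scheduler this forks a new worker process; here we
        # just record a placeholder name for the new worker.
        self.started_workers += 1
        self.available_workers.append("worker-%d" % self.started_workers)

    def assign_task(self, task):
        # If a task is runnable but no worker is idle, grow the pool.
        if not self.available_workers:
            self.start_worker()
        worker = self.available_workers.pop()
        return worker, task
```

The key design point is that the pool only grows on demand: a workload that never exhausts the idle pool never starts extra workers.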
This pull request also accounts for workers blocked on an object that isn't locally available. The local scheduler counts these workers as having temporarily returned their resources, allowing other tasks to run.
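The resource accounting for blocked workers can be sketched as follows (again a hypothetical illustration, not the photon implementation; the method names are made up):

```python
class ResourceAccounting:
    """Toy model of temporarily reclaiming resources from blocked workers."""

    def __init__(self, num_cpus):
        self.available_cpus = num_cpus

    def task_started(self, cpus=1):
        # A running task holds its CPUs.
        self.available_cpus -= cpus

    def worker_blocked(self, cpus=1):
        # A worker blocked on a non-local object temporarily returns its
        # CPUs, so other tasks (possibly the ones it is waiting on) can run.
        self.available_cpus += cpus

    def worker_unblocked(self, cpus=1):
        # When the object becomes available, the worker reacquires its CPUs.
        self.available_cpus -= cpus
```

Without this accounting, the `g` task in the test above would hold its CPU while blocked in `ray.get`, and with few enough CPUs the `f` tasks it depends on could never be scheduled, hanging the workload.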