Add min and max worker flags to adaptive cluster #1618
jacobtomlinson wants to merge 1 commit into dask:master from informatics-lab:adaptive-min-max
Conversation
|
Thanks for this, I'm glad to see people experimenting with these policies on Kubernetes. Two main comments regarding the code here:
For Kubernetes in particular I've been thinking that maybe the right thing to do here isn't to scale up or scale down, but rather to set a target number of workers given what we know about the workload. For example, we can look at
|
Hi @mrocklin (I co-wrote the PR with @jacobtomlinson),
(btw I'm presenting some of this stuff at AGU on Monday, if you're about - talk called "Jade: using on-demand cloud analysis to give scientists back their flow")
|
I've also been looking at extending Adaptive, and there are various possible strategies. Perhaps this function itself should be moved into the Adaptive class, as that seems to be the only place it's used? This would mean that future improvements in Adaptive strategies would be less likely to need to touch the scheduler.
|
I agree with @rbubley. We did start with `min_workers` and `max_workers` in the adaptive class, but found ourselves basically reimplementing `workers_to_close` as it wasn't flexible enough for what we were trying to do.
One thing I was thinking while working on this is that scaling up and down in the adaptive class is inconsistent. Scaling up tells you how many workers you should have in total, while scaling down effectively says "I've terminated these workers, please garbage collect them". While this makes sense from a practical point of view, perhaps the adaptive class could handle retiring the workers as well as removing them. And for scaling up, perhaps the request should be "please add three more", as that feels more consistent with "please remove these three".
|
I've raised a separate issue for adaptive kubernetes clusters here, just in case this conversation extends beyond the lifetime of this PR. #1619 |
|
Perhaps we can improve `workers_to_close`? What was restrictive?
|
Replying from mobile.
It was the fact that we don't want to scale down if removing the list returned from `workers_to_close` brings us below our minimum. We can check the logic fine in the `should_scale_down` method and remove some workers from the list to avoid going below the minimum. However, when retiring, `workers_to_close` is called again by `retire_workers` and gets the full list again.
…On Fri, 8 Dec 2017 at 16:30, Matthew Rocklin ***@***.***> wrote:
We did start with min_workers and max_workers in the adaptive class but
found ourselves basically reimplementing workers_to_close as it wasn't
flexible enough for what we were trying to do.
Perhaps we can improve workers_to_close? What was restrictive?
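The trimming described in this comment can be sketched as follows. The function name and call shape are illustrative only, not the actual `Adaptive`/`workers_to_close` API:

```python
def trim_to_minimum(workers_to_close, n_workers, minimum):
    """Drop entries from a proposed close list so that closing the
    remainder never brings the cluster below `minimum` workers."""
    n_closable = max(n_workers - minimum, 0)
    return workers_to_close[:n_closable]

# 5 workers, minimum of 3: only 2 of the 4 proposed closures survive
print(trim_to_minimum(["w1", "w2", "w3", "w4"], n_workers=5, minimum=3))
# → ['w1', 'w2']
```

The difficulty described above remains, though: if `retire_workers` recomputes the list itself, the trimmed list never reaches it.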
|
|
I would be fine with a |
|
A slightly more general way would be to have a list of `do_not_retire` workers. If you only care about a numeric minimum you could just select some arbitrarily, but in some cases there may be distinguished workers that you want to keep alive: perhaps you get a discount if the same worker stays alive for a long time, or perhaps these workers are using harvested time from computers doing something else, etc. This could be achieved with a function parameter, but could also be achieved by using the existing worker resources functionality, which in some ways might make more sense: they could declare themselves with a resource of `always-on`, or something.
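As a sketch of the resources idea: assuming worker metadata shaped like the dict below (the layout and the `always-on` resource name are illustrative, not the scheduler's actual internal structure), retirement candidates could be filtered like this:

```python
def retirable(workers):
    """Return the names of workers that did not declare an
    'always-on' resource, i.e. those safe to hand to retirement."""
    return [
        name
        for name, info in workers.items()
        if not info.get("resources", {}).get("always-on")
    ]

# Hypothetical worker metadata keyed by address
workers = {
    "tcp://10.0.0.1:45231": {"resources": {"always-on": 1}},
    "tcp://10.0.0.2:45232": {"resources": {}},
}
print(retirable(workers))
# → ['tcp://10.0.0.2:45232']
```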
|
Have you had concrete experience with systems that need this functionality, or is this more hypothetical?
|
I wouldn't say 'need': it's a want. Both of the examples I gave are live issues on our roadmap.
|
Would you mind looking at #1632 and checking if the modifications made in that PR could be used in the resolution of this issue? It would, at a high level, move the question of adaptive scaling behavior out of the scheduler. As @rbubley suggested, it seems a touch strange that this logic lives in the scheduler, but I can see a compelling argument for providing basic scale-down behavior in the scheduler interface, as is currently provided by `workers_to_close`.
|
It seems that one of the primary reasons this change touches the scheduler is to pass these settings through. Would a change allowing pass-through arguments to modules loaded via `--preload` address this?
|
Passing through keyword arguments to the preload script makes sense to me. In general I'm usually in favor of improving the functioning of escape hatches like preload scripts.
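A sketch of how pass-through keyword arguments might reach a preload script: `dask_setup` is the real preload hook, but the extra keyword arguments, the `adaptive_bounds` attribute, and the dummy scheduler are hypothetical illustrations of the proposal:

```python
# mypreload.py -- loaded with e.g.: dask-scheduler --preload mypreload.py
def dask_setup(scheduler, minimum=0, maximum=None):
    # Under the proposal, extra command line arguments would be
    # forwarded here as keyword arguments.
    scheduler.adaptive_bounds = (minimum, maximum)

# Stand-in for the real scheduler object, for illustration only
class DummyScheduler:
    pass

scheduler = DummyScheduler()
dask_setup(scheduler, minimum=2, maximum=10)
print(scheduler.adaptive_bounds)
# → (2, 10)
```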
|
@jacobtomlinson Just to clarify, I think this PR and #1632 are compatible, and I would be stoked to see all of this logic land somewhere in `distributed`.
I've opened #1634 as a strawman implementation of the "command line passthrough" concept. Would you mind taking a look at that pull and seeing if it could be used as a base for this functionality?
|
@asford Yes, I think that would be a good option. Let me know when it's in and I'll update this PR to make use of it.
|
I'm looking for an update on this issue. @jacobtomlinson, do you expect to revisit things here? (#1632 was merged.)
|
In general I agree that adding min/max workers to Adaptive is a good idea.
|
I've implemented min/max in Adaptive here: #1797. Note that this doesn't pass anything through the scheduler. This just focuses on the Adaptive piece in isolation.
|
I suppose this can be closed now?
|
Thanks for starting this, @jacobtomlinson!
Description
Add min and max worker flags to adaptive cluster.
We've been using adaptive clusters on Kubernetes (on EC2) and have found an initial spin-up delay before tasks start running, due to the first couple of scale-up calls taking time to complete.
This PR adds a minimum number of workers to keep alive in the adaptive cluster, meaning that tasks start processing immediately.
We've also added a maximum number of workers while we were at it.
Example usage
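A hypothetical illustration of the effect of the proposed flags, using a toy class rather than the PR's actual code, since the exact keyword names in the final API may differ:

```python
class AdaptiveDemo:
    """Toy stand-in showing the effect of the proposed min/max flags."""

    def __init__(self, minimum=0, maximum=None):
        self.minimum = minimum
        self.maximum = maximum
        # Spin the minimum up immediately, so tasks start right away
        self.n_workers = minimum

    def recommend(self, wanted):
        """Clamp a recommended worker count to the configured bounds."""
        if self.maximum is not None:
            wanted = min(wanted, self.maximum)
        return max(wanted, self.minimum)

demo = AdaptiveDemo(minimum=2, maximum=10)
print(demo.n_workers)      # → 2: warm workers before any tasks arrive
print(demo.recommend(50))  # → 10: capped at the maximum
print(demo.recommend(0))   # → 2: never scale below the minimum
```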