Scheduling policy proposal #20203

Closed
combk8s wants to merge 1 commit into kubernetes:master from combk8s:scheduling-policy

Conversation

@combk8s
Contributor

@combk8s combk8s commented Jan 27, 2016

@k8s-bot

k8s-bot commented Jan 27, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

1 similar comment

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 27, 2016
@combk8s combk8s changed the title Scheduling policy Scheduling policy proposal Jan 27, 2016
@mqliang
Contributor

mqliang commented Jan 27, 2016

@bgrant0607 @dalanlan

@davidopp davidopp assigned davidopp and unassigned smarterclayton Jan 27, 2016
@kevin-wangzefeng kevin-wangzefeng added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jan 27, 2016
@kevin-wangzefeng
Contributor

/cc @alfred-huangjian @HardySimpson

@HaiyangDING

cc @hurf

@googlebot

We found a Contributor License Agreement for you (the sender of this pull request) and all commit authors, but as best as we can tell these commits were authored by someone else. If that's the case, please add them to this pull request and have them confirm that they're okay with these commits being contributed to Google. If we're mistaken and you did author these commits, just reply here to confirm.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 27, 2016
@googlebot

CLAs look good, thanks!

@k8s-bot

k8s-bot commented Jan 28, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

2 similar comments

@davidopp
Contributor

Sorry for the delay in responding. I think you are addressing a good problem here.

Our thought for how to address this is something more like what Borg does, where each pod has a priority, and the scheduler processes pods in priority order. In Borg the scheduler queue priority is the same priority as is used to determine which pods can preempt (evict) other pods in order to get scheduled, which is important because if you don't process pods from highest to lowest preemption priority, you may schedule a (low-preemption-priority) pod and then immediately preempt it when the next pod you process from the scheduler queue has a higher preemption priority.

Within a single priority, you can do round-robin (or any other kind of fairness approach) among the pending pods with that priority.

We can get the effect of something like your deadline policy by guaranteeing some minimum fraction of the scheduler's time will be spent on each priority level (when there are pending pods at that priority level). This represents a priority inversion relative to the policy I mentioned in the first paragraph (where we process pending pods strictly in priority order) but it shouldn't be a problem as long as it isn't needed very often.

The benefit of the approach I've described above is that it doesn't require any new knobs on the scheduler or pods (other than priority, which we need for preemption anyway).
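The Borg-style queue described above (strict priority order, with FIFO as the simplest fairness rule within one priority) can be sketched in a few lines. This is a minimal Python illustration, not Kubernetes code; `priority` is the single hypothetical per-pod field the scheme would need:

```python
import heapq
import itertools

class SchedulingQueue:
    """Pending pods ordered by priority; FIFO within one priority level."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # admission counter, breaks ties FIFO

    def add(self, pod_name, priority):
        # heapq is a min-heap, so negate priority: higher priority pops first.
        heapq.heappush(self._heap, (-priority, next(self._seq), pod_name))

    def pop(self):
        """Return the next pod to attempt to schedule, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

q = SchedulingQueue()
q.add("batch-a", priority=0)
q.add("web", priority=10)
q.add("batch-b", priority=0)
print(q.pop(), q.pop(), q.pop())  # web batch-a batch-b
```

Because the same priority value drives both queue order and preemption, the scenario davidopp warns about (scheduling a low-priority pod only to preempt it for the next pod popped) cannot arise.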

@hurf
Contributor

hurf commented Feb 15, 2016

IIUC, with priority the scheduler will maintain multiple queues, and pods in a lower-priority queue only get scheduled when the higher-priority queues are empty?

Setting a fixed time that the scheduler spends on each priority is intended to keep lower-priority pods from waiting too much time. But 'too much time' is different for each pod, and a fixed time set on the scheduler makes it the same for all pods, so I'd prefer to let the pod itself describe this requirement.

@davidopp
Contributor

IIUC, with priority the scheduler will maintain multiple queues, and pods in a lower-priority queue only get scheduled when the higher-priority queues are empty?

Yes, you can think of it that way (though an actual implementation would probably have a single queue sorted by priority).

Though as I mentioned, you can add a rule that says you will occasionally allow lower-priority pods to jump the queue, to avoid starvation (and get something similar to the Deadline described in this proposal).

'too much time' is different for each pod, and a fixed time set on the scheduler makes it the same for all pods, so I'd prefer to let the pod itself describe this requirement.

In practice, anything you allow pods to request has to be protected by a quota, otherwise people will just request the "best" of everything. The way Borg handles this is to have resource quota per priority level. This prevents users from setting the highest priority (or equivalently, shortest scheduling deadline) on all of their pods.
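The per-priority quota idea can be sketched as follows. This is a hypothetical Python illustration (Borg's actual mechanism is more elaborate); the limits and names are made up for the example:

```python
class PriorityQuota:
    """Per-priority resource quota: high priority is kept scarce so users
    can't simply mark every pod highest priority (or shortest deadline)."""

    def __init__(self, limits):
        self._limits = dict(limits)              # priority level -> CPU cap
        self._used = {p: 0 for p in self._limits}

    def admit(self, priority, cpu):
        """Accept the pod only if its priority band still has quota left."""
        cap = self._limits.get(priority)
        if cap is None or self._used[priority] + cpu > cap:
            return False
        self._used[priority] += cpu
        return True

quota = PriorityQuota({10: 4, 0: 100})  # only 4 CPUs at priority 10
print(quota.admit(10, 4))   # True: fits in the high-priority band
print(quota.admit(10, 1))   # False: high-priority quota exhausted
print(quota.admit(0, 1))    # True: plenty of room at low priority
```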

'too much time' is different for each pod

Once you implement things like equivalence classes and caching (#17390), it's very hard to calculate the amount of time spent trying to schedule a particular pod. Also, I think you're over-estimating people's ability to make reasonable decisions about how long they're willing to wait for their pod to get considered by the scheduler.

On the other hand, I think the per-pod deadline concept is useful in the context of a deadline scheduler for handling batch jobs, where users say "I need this job to run within the next 12 hours, and it will take two hours to run" and then let the system figure out when to run it based on how much resources will be available at different times. But this is very different from having pods specify how often they should be examined by the scheduler when they are pending.

@mqliang
Contributor

mqliang commented Feb 16, 2016

@davidopp

Our thought for how to address this is something more like what Borg does, where each pod has a priority, and the scheduler processes pods in priority order.

Within a single priority, you can do round-robin (or any other kind of fairness approach) among the pending pods with that priority.

The benefit of the approach I've described above is that it doesn't require any new knobs on the scheduler or pods (other than priority, which we need for preemption anyway).

So we would add Priority and Deadline to PodSpec, and the scheduler would always schedule the highest-priority pod first unless a lower-priority pod's deadline has expired. If we implement it this way, the scheduler will only support priority and deadline scheduling, and the current FIFO behavior is removed. My intuition is that there are always multiple types of workload running in the same cluster, and they need to be scheduled in different ways (FIFO, RR, Highest Priority First, Deadline).

FIFO and RR may be very useful in some scenarios, for example:

  1. Users may want to ensure fair scheduling between namespaces, so they want the scheduler to cycle between per-namespace queues first, and then schedule the highest-priority or expired pod in the selected namespace.
  2. Users may want the first-created pod to be scheduled first.
  3. Users may want the highest-priority pod to be scheduled first.

And since we will support multiple schedulers in the near future, people could deploy several schedulers with different configurations to meet their different scheduling requirements.

So, personally, I think the approach in this proposal is much more flexible:

  1. If SchedulePolicy == SchedulerPolicyRR, cycle between per-namespace queues first, and then schedule the highest-priority or expired pod in the selected namespace.
  2. If SchedulePolicy == SchedulerPolicyFIFO, the first-created pod is scheduled first, ignoring pod priority; pod deadlines still work.
  3. If SchedulePolicy == SchedulerPolicyHPF (highest priority first), the highest-priority pod is scheduled first; pod deadlines still work.

And we can add an --enable-deadline-scheduling flag to the scheduler; if this flag is false, deadline scheduling is disabled. Thoughts?
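For illustration, the policy switch described above might look roughly like this; a minimal Python sketch with hypothetical field names, not actual scheduler code (deadline escalation could be layered on top of each policy):

```python
def pick_next(pending, policy, last_ns=None):
    """Pick the next pod under one of the proposed scheduler policies.

    'pending' is a list of dicts with hypothetical fields 'name',
    'namespace', 'priority', and 'created' (an admission timestamp).
    """
    if policy == "FIFO":
        # first-created pod wins, priority ignored
        return min(pending, key=lambda p: p["created"])
    if policy == "HPF":
        # highest priority wins; FIFO breaks ties
        return max(pending, key=lambda p: (p["priority"], -p["created"]))
    if policy == "RR":
        # cycle namespaces, then highest priority within the chosen one
        namespaces = sorted({p["namespace"] for p in pending})
        i = ((namespaces.index(last_ns) + 1) % len(namespaces)
             if last_ns in namespaces else 0)
        chosen = [p for p in pending if p["namespace"] == namespaces[i]]
        return max(chosen, key=lambda p: (p["priority"], -p["created"]))
    raise ValueError(f"unknown policy {policy!r}")

pending = [
    {"name": "a", "namespace": "ns1", "priority": 0, "created": 1},
    {"name": "b", "namespace": "ns2", "priority": 5, "created": 2},
    {"name": "c", "namespace": "ns1", "priority": 9, "created": 3},
]
print(pick_next(pending, "FIFO")["name"])               # a
print(pick_next(pending, "HPF")["name"])                # c
print(pick_next(pending, "RR", last_ns="ns1")["name"])  # b
```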

@lvlv
Contributor

lvlv commented Feb 16, 2016

@combk8s @mqliang

I'm wondering how deadline scheduling works, can you explain some more?
What will happen if the scheduler can't find a slot and misses the deadline? How should the scheduler decide the order based on the deadline specified in the PodSpec?

On the other hand, is scheduling already the bottleneck at this level? If yes, I believe we should put more resources into making it concurrent instead :-)

@davidopp

btw I love the concept of a priority scheduler 👍 though it needs more effort on prioritized quota.

@hurf
Contributor

hurf commented Feb 16, 2016

Once you implement things like equivalence classes and caching (#17390), it's very hard to calculate the amount of time spent trying to schedule a particular pod. Also, I think you're over-estimating people's ability to make reasonable decisions about how long they're willing to wait for their pod to get considered by the scheduler.

We don't need to evaluate the time the scheduler spends scheduling a pod, but the time a pod waits in the queue to get scheduled. Anyway, that's not the key point. I think the main divergence is whether to give the user more control or let the scheduler decide.

@mqliang

I'm positive about deadline scheduling. But I don't think FIFO is removed: when all pods have the same priority, it is FIFO.
The scenario that requires RR by namespace can be solved with multiple schedulers. IIRC we can only apply one SchedulePolicy from this proposal to one scheduler? If we want FIFO and RR at the same time to meet different requirements, we need multiple schedulers in the system anyway.
In my understanding, SchedulePolicy should describe which predicates and prioritizers are used to schedule a pod, so we could specify policies like affinity or anti-affinity. But that's another story.
Back to this issue. What I have in mind is:

  1. Add Priority and Deadline (no consensus yet) to PodSpec.
  2. Schedule by priority.
  3. If a pod with a Deadline gets in the queue, set a timer for it. If the timer fires and the pod is still in the queue, give it the highest priority.

Just my thoughts for discussion.
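The three steps above can be sketched as follows; a minimal Python illustration using an explicit logical clock in place of a real timer, with all names hypothetical:

```python
class DeadlineQueue:
    """Schedule by priority, but once a pending pod's deadline passes,
    treat it as highest priority so it is picked next."""

    def __init__(self):
        self._pods = []  # (name, priority, deadline)

    def add(self, name, priority, deadline=float("inf")):
        self._pods.append((name, priority, deadline))

    def pop(self, now):
        """Return the next pod: expired deadlines first, then by priority."""
        if not self._pods:
            return None
        def key(pod):
            _, prio, deadline = pod
            expired = deadline <= now
            # expired pods sort first (False < True); ties fall back to priority
            return (not expired, -prio)
        self._pods.sort(key=key)
        return self._pods.pop(0)[0]

q = DeadlineQueue()
q.add("low", priority=0, deadline=5)
q.add("high", priority=10)
print(q.pop(now=0))  # high: nothing expired yet, so priority wins
print(q.pop(now=6))  # low: its deadline has passed, so it jumps the queue
```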

@mqliang
Contributor

mqliang commented Feb 16, 2016

@lvlv

What will happen if the scheduler can't find a slot and misses the deadline?
How should the scheduler decide the order based on the deadline specified in the PodSpec?

Just as @hurf put:

Schedule by priority first. If a pod with a Deadline gets in the queue, set a timer for it. If the timer fires and the pod is still in the queue, give it the highest priority so that it can be scheduled ASAP.

@davidopp
Contributor

I think the main divergence is whether to give user more control or let the scheduler to decide.

Yes, I agree. I think the kinds of policies that are being discussed here (e.g. FIFO vs. RR vs. HPF) should be scheduler parameters, not pod parameters. We do need "priority" in each pod in order to implement preemption anyway, and we can use that priority as the signal to the scheduler for how to prioritize the pod in the scheduling queue if the scheduler is configured for HPF.

  1. Add Priority and Deadline (no consensus yet) to PodSpec.
  2. Schedule by priority.
  3. If a pod with a Deadline gets in the queue, set a timer for it. If the timer fires and the pod is still in the queue, give it the highest priority.

I still disagree about setting Deadline per-pod. I don't think the user will know how to set it, and also the effect is not very visible (if user's pod is pending, how can they tell whether the scheduler is re-evaluating the pod every second or every minute?). I think it makes more sense to make it something that the scheduler decides. For example, scheduler can use HPF but occasionally check lower-priority pods to avoid starvation.

And like I said before, I think a "deadline scheduler" in the batch scheduler sense (like http://research.microsoft.com/apps/pubs/default.aspx?id=192091 ) would be very useful. It's also easier to solve the "what prevents every user from asking for the soonest deadline" problem there because you can connect it to some billing mechanism that charges more money for sooner deadlines. But it's very different from the kind of deadline we're talking about here, which only controls how often the scheduler will evaluate the pod, not whether the pod will actually be able to start.

@k8s-bot

k8s-bot commented Feb 17, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@combk8s
Contributor Author

combk8s commented Feb 17, 2016

@davidopp

if user's pod is pending, how can they tell whether the scheduler is re-evaluating the pod every second or every minute?
But it's very different from the kind of deadline we're talking about here, which only controls how often the scheduler will evaluate the pod, not whether the pod will actually be able to start.

Deadline in this PR doesn't control how often the scheduler will evaluate the pod, but the deadline by which the pod must actually be scheduled. Once a pod's deadline expires, it gets the highest priority and will be scheduled soon.

@k8s-bot

k8s-bot commented Mar 10, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-bot

k8s-bot commented Apr 19, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in hack/jenkins/job-configs/kubernetes-jenkins-pull/ instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-bot

k8s-bot commented May 27, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

3 similar comments

@k8s-github-robot

This PR hasn't been active in 153 days. Feel free to reopen.

You can add the 'keep-open' label to prevent this from happening again.
