Scheduling policy proposal #20203

Closed
combk8s wants to merge 1 commit into kubernetes:master from combk8s:scheduling-policy

Conversation

@combk8s
Contributor

@combk8s combk8s commented Jan 27, 2016

@k8s-bot

k8s-bot commented Jan 27, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

1 similar comment

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 27, 2016
@combk8s combk8s changed the title Scheduling policy Scheduling policy proposal Jan 27, 2016
@mqliang
Contributor

mqliang commented Jan 27, 2016

@bgrant0607 @dalanlan

@davidopp davidopp assigned davidopp and unassigned smarterclayton Jan 27, 2016
@kevin-wangzefeng kevin-wangzefeng added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Jan 27, 2016
@kevin-wangzefeng
Contributor

/cc @alfred-huangjian @HardySimpson

@HaiyangDING

cc @hurf

@googlebot

We found a Contributor License Agreement for you (the sender of this pull request) and all commit authors, but as best as we can tell these commits were authored by someone else. If that's the case, please add them to this pull request and have them confirm that they're okay with these commits being contributed to Google. If we're mistaken and you did author these commits, just reply here to confirm.

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 27, 2016
@googlebot

CLAs look good, thanks!

@k8s-bot

k8s-bot commented Jan 28, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

2 similar comments

@davidopp
Contributor

Sorry for the delay in responding. I think you are addressing a good problem here.

Our thought for how to address this is something more like what Borg does, where each pod has a priority, and the scheduler processes pods in priority order. In Borg the scheduler queue priority is the same priority as is used to determine which pods can preempt (evict) other pods in order to get scheduled, which is important because if you don't process pods from highest to lowest preemption priority, you may schedule a (low-preemption-priority) pod and then immediately preempt it when the next pod you process from the scheduler queue has a higher preemption priority.

Within a single priority, you can do round-robin (or any other kind of fairness approach) among the pending pods with that priority.

We can get the effect of something like your deadline policy by guaranteeing some minimum fraction of the scheduler's time will be spent on each priority level (when there are pending pods at that priority level). This represents a priority inversion relative to the policy I mentioned in the first paragraph (where we process pending pods strictly in priority order) but it shouldn't be a problem as long as it isn't needed very often.

The benefit of the approach I've described above is that it doesn't require any new knobs on the scheduler or pods (other than priority, which we need for preemption anyway).
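The Borg-style queue described above (strict priority order, with FIFO as the simplest fairness rule within one priority) can be sketched in a few lines. This is a minimal Python illustration, not Kubernetes code; `priority` is the single hypothetical per-pod field the scheme would need:

```python
import heapq
import itertools

class SchedulingQueue:
    """Pending pods ordered by priority; FIFO within one priority level."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # admission counter, breaks ties FIFO

    def add(self, pod_name, priority):
        # heapq is a min-heap, so negate priority: higher priority pops first.
        heapq.heappush(self._heap, (-priority, next(self._seq), pod_name))

    def pop(self):
        """Return the next pod to attempt to schedule, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

q = SchedulingQueue()
q.add("batch-a", priority=0)
q.add("web", priority=10)
q.add("batch-b", priority=0)
print(q.pop(), q.pop(), q.pop())  # web batch-a batch-b
```

Because the same priority value drives both queue order and preemption, the scenario davidopp warns about (scheduling a low-priority pod only to preempt it for the next pod popped) cannot arise.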

@hurf
Contributor

hurf commented Feb 15, 2016

IIUC, with priority the scheduler will maintain multiple queues, and pods in a lower-priority queue only get scheduled when the higher-priority queues are empty?

Setting a fixed time that the scheduler spends on each priority is intended to keep lower-priority pods from waiting too much time. But 'too much time' is different for each pod, and a fixed time set on the scheduler makes it the same for all pods, so I'd prefer to let the pod itself describe this requirement.

@davidopp
Contributor

IIUC, with priority the scheduler will maintain multiple queues, and pods in a lower-priority queue only get scheduled when the higher-priority queues are empty?

Yes, you can think of it that way (though an actual implementation would probably have a single queue sorted by priority).

Though as I mentioned, you can add a rule that says you will occasionally allow lower-priority pods to jump the queue, to avoid starvation (and get something similar to the Deadline described in this proposal).

'too much time' is different for each pod, and a fixed time set on the scheduler makes it the same for all pods, so I'd prefer to let the pod itself describe this requirement.

In practice, anything you allow pods to request has to be protected by a quota, otherwise people will just request the "best" of everything. The way Borg handles this is to have resource quota per priority level. This prevents users from setting the highest priority (or equivalently, shortest scheduling deadline) on all of their pods.
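The per-priority quota idea can be sketched as follows. This is a hypothetical Python illustration (Borg's actual mechanism is more elaborate); the limits and names are made up for the example:

```python
class PriorityQuota:
    """Per-priority resource quota: high priority is kept scarce so users
    can't simply mark every pod highest priority (or shortest deadline)."""

    def __init__(self, limits):
        self._limits = dict(limits)              # priority level -> CPU cap
        self._used = {p: 0 for p in self._limits}

    def admit(self, priority, cpu):
        """Accept the pod only if its priority band still has quota left."""
        cap = self._limits.get(priority)
        if cap is None or self._used[priority] + cpu > cap:
            return False
        self._used[priority] += cpu
        return True

quota = PriorityQuota({10: 4, 0: 100})  # only 4 CPUs at priority 10
print(quota.admit(10, 4))   # True: fits in the high-priority band
print(quota.admit(10, 1))   # False: high-priority quota exhausted
print(quota.admit(0, 1))    # True: plenty of room at low priority
```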

'too much time' is different for each pod

Once you implement things like equivalence classes and caching (#17390), it's very hard to calculate the amount of time spent trying to schedule a particular pod. Also, I think you're over-estimating people's ability to make reasonable decisions about how long they're willing to wait for their pod to get considered by the scheduler.

On the other hand, I think the per-pod deadline concept is useful in the context of a deadline scheduler for handling batch jobs, where users say "I need this job to run within the next 12 hours, and it will take two hours to run" and then let the system figure out when to run it based on how much resources will be available at different times. But this is very different from having pods specify how often they should be examined by the scheduler when they are pending.

@mqliang
Contributor

mqliang commented Feb 16, 2016

@davidopp

Our thought for how to address this is something more like what Borg does, where each pod has a priority, and the scheduler processes pods in priority order.

Within a single priority, you can do round-robin (or any other kind of fairness approach) among the pending pods with that priority.

The benefit of the approach I've described above is that it doesn't require any new knobs on the scheduler or pods (other than priority, which we need for preemption anyway).

So we would add Priority and Deadline to PodSpec, and the scheduler would always schedule the highest-priority pod first unless a lower-priority pod's deadline has expired. If we implement it this way, the scheduler will only support priority and deadline scheduling, and the current FIFO behavior is removed. My intuition is that there are always multiple types of workload running in the same cluster, and they need to be scheduled in different ways (FIFO, RR, Highest Priority First, Deadline).

FIFO and RR may be very useful in some scenarios, for example:

  1. Users may want to ensure fair scheduling between namespaces, so they want the scheduler to cycle between per-namespace queues first, and then schedule the highest-priority or expired pod in the selected namespace.
  2. Users may want the first-created pod to be scheduled first.
  3. Users may want the highest-priority pod to be scheduled first.

And since we will support multiple schedulers in the near future, people could deploy several schedulers with different configurations to meet their different scheduling requirements.

So, personally, I think the approach in this proposal is much more flexible:

  1. If SchedulePolicy == SchedulerPolicyRR, cycle between per-namespace queues first, and then schedule the highest-priority or expired pod in the selected namespace.
  2. If SchedulePolicy == SchedulerPolicyFIFO, the first-created pod is scheduled first, ignoring pod priority; pod deadlines still work.
  3. If SchedulePolicy == SchedulerPolicyHPF (highest priority first), the highest-priority pod is scheduled first; pod deadlines still work.

And we can add an --enable-deadline-scheduling flag to the scheduler; if this flag is false, deadline scheduling is disabled. Thoughts?
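For illustration, the policy switch described above might look roughly like this; a minimal Python sketch with hypothetical field names, not actual scheduler code (deadline escalation could be layered on top of each policy):

```python
def pick_next(pending, policy, last_ns=None):
    """Pick the next pod under one of the proposed scheduler policies.

    'pending' is a list of dicts with hypothetical fields 'name',
    'namespace', 'priority', and 'created' (an admission timestamp).
    """
    if policy == "FIFO":
        # first-created pod wins, priority ignored
        return min(pending, key=lambda p: p["created"])
    if policy == "HPF":
        # highest priority wins; FIFO breaks ties
        return max(pending, key=lambda p: (p["priority"], -p["created"]))
    if policy == "RR":
        # cycle namespaces, then highest priority within the chosen one
        namespaces = sorted({p["namespace"] for p in pending})
        i = ((namespaces.index(last_ns) + 1) % len(namespaces)
             if last_ns in namespaces else 0)
        chosen = [p for p in pending if p["namespace"] == namespaces[i]]
        return max(chosen, key=lambda p: (p["priority"], -p["created"]))
    raise ValueError(f"unknown policy {policy!r}")

pending = [
    {"name": "a", "namespace": "ns1", "priority": 0, "created": 1},
    {"name": "b", "namespace": "ns2", "priority": 5, "created": 2},
    {"name": "c", "namespace": "ns1", "priority": 9, "created": 3},
]
print(pick_next(pending, "FIFO")["name"])               # a
print(pick_next(pending, "HPF")["name"])                # c
print(pick_next(pending, "RR", last_ns="ns1")["name"])  # b
```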

@lvlv
Contributor

lvlv commented Feb 16, 2016

@combk8s @mqliang

I'm wondering how deadline scheduling works, can you explain some more?
What will happen if the scheduler can't find a slot and misses the deadline? How should the scheduler decide the order based on the deadline specified in the PodSpec?

On the other hand, is scheduling already the bottleneck at this level? If yes, I believe we should put more resources into making it concurrent instead :-)

@davidopp

btw I love the concept of a priority scheduler 👍 though it needs more effort on prioritized quota.

@hurf
Contributor

hurf commented Feb 16, 2016

Once you implement things like equivalence classes and caching (#17390), it's very hard to calculate the amount of time spent trying to schedule a particular pod. Also, I think you're over-estimating people's ability to make reasonable decisions about how long they're willing to wait for their pod to get considered by the scheduler.

We don't need to evaluate the time the scheduler spends scheduling a pod, but the time a pod waits in the queue to get scheduled. Anyway, that's not the key point. I think the main divergence is whether to give the user more control or let the scheduler decide.

@mqliang

I'm positive about deadline scheduling. But I don't think FIFO is removed: when all pods have the same priority, it is FIFO.
The scenario that requires RR by namespace can be solved with multiple schedulers. IIRC we can only apply one SchedulePolicy from this proposal to one scheduler? If we want FIFO and RR at the same time to meet different requirements, we need multiple schedulers in the system anyway.
In my understanding, SchedulePolicy should describe which predicates and prioritizers are used to schedule a pod, so we could specify policies like affinity or anti-affinity. But that's another story.
Back to this issue. What I have in mind is:

  1. Add Priority and Deadline (no consensus yet) to PodSpec.
  2. Schedule by priority.
  3. If a pod with a Deadline gets in the queue, set a timer for it. If the timer fires and the pod is still in the queue, give it the highest priority.

Just my thoughts for discussion.
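The three steps above can be sketched as follows; a minimal Python illustration using an explicit logical clock in place of a real timer, with all names hypothetical:

```python
class DeadlineQueue:
    """Schedule by priority, but once a pending pod's deadline passes,
    treat it as highest priority so it is picked next."""

    def __init__(self):
        self._pods = []  # (name, priority, deadline)

    def add(self, name, priority, deadline=float("inf")):
        self._pods.append((name, priority, deadline))

    def pop(self, now):
        """Return the next pod: expired deadlines first, then by priority."""
        if not self._pods:
            return None
        def key(pod):
            _, prio, deadline = pod
            expired = deadline <= now
            # expired pods sort first (False < True); ties fall back to priority
            return (not expired, -prio)
        self._pods.sort(key=key)
        return self._pods.pop(0)[0]

q = DeadlineQueue()
q.add("low", priority=0, deadline=5)
q.add("high", priority=10)
print(q.pop(now=0))  # high: nothing expired yet, so priority wins
print(q.pop(now=6))  # low: its deadline has passed, so it jumps the queue
```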

@mqliang
Contributor

mqliang commented Feb 16, 2016

@lvlv

What will happen if the scheduler can't find a slot and misses the deadline?
How should the scheduler decide the order based on the deadline specified in the PodSpec?

Just as @hurf put:

Schedule by priority first. If a pod with a Deadline gets in the queue, set a timer for it. If the timer fires and the pod is still in the queue, give it the highest priority so that it can be scheduled ASAP.

@davidopp
Contributor

I think the main divergence is whether to give user more control or let the scheduler to decide.

Yes, I agree. I think the kinds of policies that are being discussed here (e.g. FIFO vs. RR vs. HPF) should be scheduler parameters, not pod parameters. We do need "priority" in each pod in order to implement preemption anyway, and we can use that priority as the signal to the scheduler for how to prioritize the pod in the scheduling queue if the scheduler is configured for HPF.

  1. Add Priority and Deadline (no consensus yet) to PodSpec.
  2. Schedule by priority.
  3. If a pod with a Deadline gets in the queue, set a timer for it. If the timer fires and the pod is still in the queue, give it the highest priority.

I still disagree about setting Deadline per-pod. I don't think the user will know how to set it, and also the effect is not very visible (if user's pod is pending, how can they tell whether the scheduler is re-evaluating the pod every second or every minute?). I think it makes more sense to make it something that the scheduler decides. For example, scheduler can use HPF but occasionally check lower-priority pods to avoid starvation.

And like I said before, I think a "deadline scheduler" in the batch scheduler sense (like http://research.microsoft.com/apps/pubs/default.aspx?id=192091 ) would be very useful. It's also easier to solve the "what prevents every user from asking for the soonest deadline" problem there because you can connect it to some billing mechanism that charges more money for sooner deadlines. But it's very different from the kind of deadline we're talking about here, which only controls how often the scheduler will evaluate the pod, not whether the pod will actually be able to start.

@k8s-bot

k8s-bot commented Feb 17, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

If this message is too spammy, please complain to ixdy.

@combk8s
Contributor Author

combk8s commented Feb 17, 2016

@davidopp

if user's pod is pending, how can they tell whether the scheduler is re-evaluating the pod every second or every minute?
But it's very different from the kind of deadline we're talking about here, which only controls how often the scheduler will evaluate the pod, not whether the pod will actually be able to start.

Deadline in this PR doesn't control how often the scheduler will evaluate the pod, but the deadline by which the pod must actually be scheduled. Once a pod's deadline expires, it gets the highest priority and will be scheduled soon.

@k8s-bot

k8s-bot commented Mar 10, 2016

Can one of the admins verify that this patch is reasonable to test? (reply "ok to test", or if you trust the user, reply "add to whitelist")

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-bot

k8s-bot commented Apr 19, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in hack/jenkins/job-configs/kubernetes-jenkins-pull/ instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

@k8s-bot

k8s-bot commented May 27, 2016

Can one of the admins verify that this patch is reasonable to test? If so, please reply "ok to test".
(Note: "add to whitelist" is no longer supported. Please update configurations in kubernetes/test-infra/jenkins/job-configs/kubernetes-jenkins-pull instead.)

This message may repeat a few times in short succession due to jenkinsci/ghprb-plugin#292. Sorry.

Otherwise, if this message is too spammy, please complain to ixdy.

3 similar comments

@k8s-github-robot

This PR hasn't been active in 153 days. Feel free to reopen.

You can add the 'keep-open' label to prevent this from happening again.
