
Dedicated nodes, taints, and tolerations design proposal#18263

Merged
davidopp merged 1 commit into kubernetes:master from davidopp:taints-tolerations
Jan 24, 2016

Conversation

@davidopp
Contributor

@davidopp davidopp commented Dec 6, 2015

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 6, 2015
@k8s-github-robot

Labelling this PR as size/L

@k8s-bot

k8s-bot commented Dec 6, 2015

GCE e2e test build/test passed for commit bd131c17dc115f350f369e5b3a739a846c301e25.

@k8s-bot

k8s-bot commented Dec 7, 2015

GCE e2e test build/test passed for commit 11d11e14a914be2f20022e1071a778a82b5c4c0c.

@davidopp davidopp changed the title from "Dedicated nodes, taints, and tolerations design proposal initial draft" to "Dedicated nodes, taints, and tolerations design proposal" Dec 7, 2015
Contributor


Will probably implement this mechanism in Rescheduler, or anywhere else?

Contributor Author


You could probably do it in the rescheduler, but I think it's probably better to do it in Kubelet.

Contributor


I see.

@mml
Contributor

mml commented Dec 7, 2015

Can you cover use cases? If security is one motivation, we'll need to address things like the race I mentioned in #17190 .

@bgrant0607 bgrant0607 mentioned this pull request Dec 8, 2015
@mml
Contributor

mml commented Dec 8, 2015

Is "keep dedicated users off the shared machines" a use case we care about?

@davidopp
Contributor Author

davidopp commented Dec 8, 2015

Comments from meeting with @bgrant0607 today follow. I will update the proposal to reflect them shortly:

  • Make it clear that Kubelet should enforce taints/tolerations (not just scheduler); goes into shared predicate library between kubelet and scheduler
  • String constants in API should start with capital letter (e.g. should be EvictAndNoStart, not evictAndNoStart)
  • Instead of two behaviors (evict, and no start) should have three behaviors: scheduler should not schedule onto the node, kubelet should not accept a pod onto the node, and evict. Kubelet only cares about last two. See bottom of Mark node to be decommissioned and act accordingly #3885 (comment) for more details
  • Consider relationship between tolerations and forgiveness, and correspondingly use cases that are about machine drains/failures/etc. Proposal: Forgiveness #1574. It's not clear that this requires a change to the design, except long-term we might want to add a timeframe to tolerations so you can say for example "I am willing to stay bound to a machine that is down for maintenance for up to 4 hours, but not longer." Taint auto-applied when problem happens, and has a default toleration.
  • The current proposal special-cases * as the Value of a Toleration to mean wildcard rather than the literal string asterisk. To be more consistent with the rest of the Kubernetes API, there should be no special-casing when interpreting Value; instead, add an "Operation" field to Toleration that can take on the values "Equal" and "Exists" and defaults to "Equal."
  • Should tell users not to use these features until they have been in Kubelet and master for enough binary versions that we feel comfortable we will not need to roll back either to a version that does not support them; longer-term would be good to have a way to enforce this (Cluster versioning #4855).

@davidopp
Contributor Author

davidopp commented Dec 8, 2015

> Can you cover use cases? If security is one motivation, we'll need to address things like the race I mentioned in #17190.

Yeah, it's definitely a good point. I think this falls under "In the future one can imagine an admission controller that applies taints to nodes and tolerations to pods based on a policy specifying dedicated node groups." Actually I realized the "taints to nodes" part doesn't make sense (admission controllers have nothing to do with nodes). But we could create some kind of API object that describes the dedicated node policy, and NodeController could attach the necessary taints when a machine registers based on the policy stored in that object (we can have the node register in an unschedulable state until NodeController marks it ready, after adding the taints). The "tolerations to pods" policy could use the mechanism described in #18262.

> Is "keep dedicated users off the shared machines" a use case we care about?

Probably in at least some cases yes. Putting a taint on all of the shared machines is easy; I guess the annoying part is that you'd need to put a toleration on all of the non-dedicated users' pods (unless there's something I'm not thinking of).

@davidopp
Contributor Author

davidopp commented Dec 8, 2015

(continuing from end of last comment)

I guess you could say any pod that has no tolerations is assumed to tolerate the "shared machine" taint, or something hacky like that...

@davidopp
Contributor Author

davidopp commented Dec 8, 2015

@erictune

@ncdc
Member

ncdc commented Dec 9, 2015

cc @kubernetes/rh-cluster-infra

Contributor


Is the cordon/uncordon functionality in #16698 expressible in terms of taints?

Contributor


@pmorie Yes. Discussed on #1574

@mml
Contributor

mml commented Dec 9, 2015

Do we have strong enough use cases that are not adequately addressed by using clusters to implement dedicated node pools to justify this added cognitive and implementation complexity?

Suppose the dedicated nodes share power infrastructure with the shared nodes. Managing drains across all clusters gets tricky and error-prone.

Contributor


If all nodes have a dedicated label, then all pods can have a simple equality node constraint.

For example, suppose we have dedicated machine set "gpus".
Label those machines "dedicated=gpus", and label the rest of the machines "dedicated=public".
Then all pods just need a NodeConstraint: dedicated=gpus or NodeConstraint: dedicated=public.

Contributor Author


The main benefit of taints over regular constraints is that with taints, pods that don't need to access the "special" nodes don't need to know anything about the special nodes. So for example, if you have a cluster with no GPUs and then you add some nodes to it with GPUs and you want only pods that request GPUs to be able to schedule onto those nodes, you just add a taint to the new nodes and a toleration to the pods that request GPUs. "Regular" pods (and nodes) can be oblivious. This property becomes especially valuable once you have lots of different types of special nodes (multiple dedicated groups, multiple special hardware types, different kinds of machine problems that exclude pods, etc.). With the constraint approach, you need "regular" pods to explicitly constrain away from every different type of special node.

The one place where this becomes a little less beneficial is when there are no "regular" nodes, as in the case where you not only want to keep pods off of dedicated nodes they aren't entitled to access, but also want to keep some pods (presumably the pods of dedicated users) off of non-dedicated nodes. In that case the world is essentially as you described in your comment, where every node is effectively dedicated, and taints/tolerations are not as valuable because the non-dedicated nodes must be tainted and the non-dedicated-user pods must have tolerations. But taints/tolerations are still better than constraints even in that case because as you add dedicated node groups, you don't need to change how you handle the non-dedicated pods, whereas with the constraint approach, you would need to start adding a "dedicated != " constraint to all future and currently-pending pods.

@erictune
Contributor

erictune commented Dec 9, 2015

Is there a way to use taints to keep users off nodes that I want to be restricted?



The non-commutative nature of this one makes me uneasy (because it compromises the declarative semantics that we strive for across Kubernetes). As a trivial example, if a request to Taint:noStart all nodes arrives at roughly the same time as a request to launch a pod on every node, it's not clear what the correct/desirable steady state of the cluster should be afterwards. With Taint:evictAndNoStart that's clear, irrespective of what order the requests arrive in.

Could you list some real-world use cases for noStart? I realise that there are some obvious ones (e.g. stop scheduling stuff on this node/cluster, so that I can slowly retire it), but I'm hoping that we can shoot them down as unnecessary or not useful in practice (as I believe the aforementioned to be). In which case we can remove the entire TaintEffect concept (at least for now), and then a Taint just becomes a key-value pair like a label, and life becomes a lot simpler.

Contributor Author


I don't think that the problem you're describing is a big deal. Someone who adds a noStart taint by definition doesn't care about what is running on the machine, they just don't want new stuff to start running there. So I don't think they care what the final state is in the scenario you described.

The main use case for the noStart taint is exactly the one you mentioned -- to "cordon" (#16698) a machine in preparation for draining.

@erictune
Contributor

erictune commented Dec 9, 2015

Sorry. Please disregard last comment. I see that you explain this in the doc.

@erictune
Contributor

erictune commented Dec 9, 2015

I think a requirement for dedicated machines is that "user A is allowed to run pods on dedicated machine set X but user B is not." I don't see a description of what keeps user B from putting an arbitrary Toleration on her pod. Presumably some sort of admission controller is needed to enforce which Tolerations user B is allowed to put on her pod?

Member


How about dedicated=special-user:NoScheduleNoAdmitNoExecute. (Instead of a second =)

Contributor Author


Sure. I didn't really like the syntax I was using but couldn't think of anything better.

@bgrant0607
Member

Looks pretty good. Just some minor comments.

@bgrant0607
Member

See also #17151

@davidopp
Contributor Author

> Don't remember if this is mentioned by the proposal:
> we should "deprecate" the existing NodeSpec.unschedulable field and represent that as a taint
> we should automatically apply taints for node conditions that cause the node to become unschedulable

These are good points but I'm trying to avoid expanding the scope of this doc too much. The idea was to present taints and tolerations, and give one convincing use case. I think we can do followup docs to explain how to use them in node maintenance and other scenarios. The first paragraph alludes to the fact that there are other use cases.

@davidopp
Contributor Author

@bgrant0607 I addressed your feedback; PTAL.

@k8s-bot

k8s-bot commented Jan 13, 2016

GCE e2e test build/test passed for commit 566d138164f9fbc2f7e6a8ed39189414469969df.

Member


Perhaps we could elide the extra Nos?

NoScheduleAdmitExecute

Contributor Author


I think some people might mistakenly think that the "No" binds only to "Schedule" rather than to "ScheduleAdmitExecute".

@bgrant0607
Member

Note: I don't remember why we decided not to support a richer set of operations in Toleration. In could be achieved by multiple tolerations, but NotIn cannot. I can't think of a use case for that, but this will be hard to extend.

However, we probably will implement this as an alpha annotation initially, in which case we could change it if we discovered use cases for more operators. I don't think we need to update the proposal to reflect that detail.

LGTM.

@bgrant0607 bgrant0607 added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 22, 2016
@k8s-github-robot

PR changed after LGTM, removing LGTM.

@k8s-github-robot k8s-github-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. kind/design Categorizes issue or PR as related to design. labels Jan 24, 2016
@davidopp davidopp added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 24, 2016
@k8s-bot

k8s-bot commented Jan 24, 2016

GCE e2e test build/test passed for commit 14c2763.

@davidopp
Contributor Author

Tests pass, manually merging as this is a doc-only PR with just one doc. @bgrant0607 's comment #18263 (comment) about possibly allowing a richer set of operations for Toleration is noted as a point for further consideration. (I don't have an opinion on it really; I think we just used the narrowest semantics that we knew for sure there were use cases for.)

davidopp pushed a commit that referenced this pull request Jan 24, 2016
Dedicated nodes, taints, and tolerations design proposal
@davidopp davidopp merged commit 5abfb43 into kubernetes:master Jan 24, 2016
xingzhou pushed a commit to xingzhou/kubernetes that referenced this pull request Dec 15, 2016
Dedicated nodes, taints, and tolerations design proposal