Dedicated nodes, taints, and tolerations design proposal #18263
davidopp merged 1 commit into kubernetes:master
Conversation
Labelling this PR as size/L
GCE e2e test build/test passed for commit bd131c17dc115f350f369e5b3a739a846c301e25.
Force-pushed bd131c1 to 11d11e1.
GCE e2e test build/test passed for commit 11d11e14a914be2f20022e1071a778a82b5c4c0c.
docs/proposals/dedicated-nodes.md
Will this mechanism be implemented in the Rescheduler, or somewhere else?
You could probably do it in the rescheduler, but I think it's probably better to do it in Kubelet.
Can you cover use cases? If security is one motivation, we'll need to address things like the race I mentioned in #17190.
Is "keep dedicated users off the shared machines" a use case we care about?
Comments from meeting with @bgrant0607 today follow. I will update the proposal to reflect them shortly:
Yeah, it's definitely a good point. I think this falls under "In the future one can imagine an admission controller that applies taints to nodes and tolerations to pods based on a policy specifying dedicated node groups." Actually I realized the "taints to nodes" part doesn't make sense (admission controllers have nothing to do with nodes). But we could create some kind of API object that describes the dedicated node policy, and NodeController could attach the necessary taints when a machine registers based on the policy stored in that object (we can have the node register in an unschedulable state until NodeController marks it ready, after adding the taints). The "tolerations to pods" policy could use the mechanism described in #18262.
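To make the idea above concrete, here is a minimal sketch, in Go, of how a NodeController-side step could attach taints based on a stored dedicated-node policy and then mark the node schedulable. All type and function names here (`DedicatedNodePolicy`, `applyPolicies`, and so on) are hypothetical illustrations, not API that exists in Kubernetes.

```go
package main

import "fmt"

// Hypothetical sketch only: none of these types exist in Kubernetes.
type Taint struct{ Key, Value, Effect string }

type DedicatedNodePolicy struct {
	Selector map[string]string // node labels that select a dedicated group
	Taints   []Taint           // taints to attach to matching nodes
}

type Node struct {
	Name          string
	Labels        map[string]string
	Taints        []Taint
	Unschedulable bool // node registers unschedulable until taints are attached
}

// applyPolicies mimics the proposed NodeController step: attach taints
// from every matching policy, then mark the node schedulable.
func applyPolicies(n *Node, policies []DedicatedNodePolicy) {
	for _, p := range policies {
		match := true
		for k, v := range p.Selector {
			if n.Labels[k] != v {
				match = false
				break
			}
		}
		if match {
			n.Taints = append(n.Taints, p.Taints...)
		}
	}
	n.Unschedulable = false
}

func main() {
	policies := []DedicatedNodePolicy{{
		Selector: map[string]string{"group": "gpus"},
		Taints:   []Taint{{Key: "dedicated", Value: "gpus", Effect: "NoSchedule"}},
	}}
	n := &Node{Name: "node-1", Labels: map[string]string{"group": "gpus"}, Unschedulable: true}
	applyPolicies(n, policies)
	fmt.Println(len(n.Taints), n.Unschedulable)
}
```

The point of the ordering is that a newly registered node carries no taints yet, so it must stay unschedulable until the policy has been applied; otherwise ordinary pods could slip onto a dedicated machine in the gap.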
Probably in at least some cases yes. Putting a taint on all of the shared machines is easy; I guess the annoying part is that you'd need to put a toleration on all of the non-dedicated users' pods (unless there's something I'm not thinking of).
(continuing from end of last comment) I guess you could say any pod that has no tolerations is assumed to tolerate the "shared machine" taint, or something hacky like that...
cc @kubernetes/rh-cluster-infra
docs/proposals/dedicated-nodes.md
Is the cordon/uncordon functionality in #16698 expressible in terms of taints?
Suppose the dedicated nodes share power infrastructure with the shared nodes. Managing drains across all clusters gets tricky and error-prone.
docs/proposals/dedicated-nodes.md
If all nodes have a dedicated label, then all pods can have a simple equality node constraint.
For example, suppose we have dedicated machine set "gpus".
Label those machines "dedicated=gpus", and label the rest of the machines "dedicated=public".
Then all pods just need a NodeConstraint: dedicated=gpus or NodeConstraint: dedicated=public.
The main benefit of taints over regular constraints is that with taints, pods that don't need to access the "special" nodes don't need to know anything about the special nodes. So for example, if you have a cluster with no GPUs and then you add some nodes to it with GPUs and you want only pods that request GPUs to be able to schedule onto those nodes, you just add a taint to the new nodes and a toleration to the pods that request GPUs. "Regular" pods (and nodes) can be oblivious. This property becomes especially valuable once you have lots of different types of special nodes (multiple dedicated groups, multiple special hardware types, different kinds of machine problems that exclude pods, etc.). With the constraint approach, you need "regular" pods to explicitly constrain away from every different type of special node.
The one place where this becomes a little less beneficial is when there are no "regular" nodes, as in the case where you not only want to keep pods off of dedicated nodes they aren't entitled to access, but also want to keep some pods (presumably the pods of dedicated users) off of non-dedicated nodes. In that case the world is essentially as you described in your comment, where every node is effectively dedicated, and taints/tolerations are not as valuable because the non-dedicated nodes must be tainted and the non-dedicated-user pods must have tolerations. But taints/tolerations are still better than constraints even in that case because as you add dedicated node groups, you don't need to change how you handle the non-dedicated pods, whereas with the constraint approach, you would need to start adding a "dedicated != " constraint to all future and currently-pending pods.
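The "oblivious regular pods" property described above falls out directly from the matching rule: a pod fits a node only if every taint on the node is tolerated by some toleration on the pod. A minimal Go sketch of that predicate (hypothetical types and the narrow key-and-value-equality semantics discussed in this thread; not the proposal's actual API):

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration only.
type Taint struct{ Key, Value string }
type Toleration struct{ Key, Value string }

// tolerates uses the narrow Equal semantics discussed in this thread:
// a toleration matches a taint when key and value are both equal.
func tolerates(tol Toleration, t Taint) bool {
	return tol.Key == t.Key && tol.Value == t.Value
}

// canSchedule is the core predicate: a pod fits a node only if every
// taint on the node is tolerated by some toleration on the pod. A pod
// with no tolerations therefore fits any untainted node, which is why
// "regular" pods can stay oblivious to special nodes.
func canSchedule(podTols []Toleration, nodeTaints []Taint) bool {
	for _, t := range nodeTaints {
		tolerated := false
		for _, tol := range podTols {
			if tolerates(tol, t) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	gpu := []Taint{{Key: "dedicated", Value: "gpus"}}
	fmt.Println(canSchedule(nil, nil)) // regular pod on regular node: fits
	fmt.Println(canSchedule(nil, gpu)) // regular pod on tainted node: blocked
	fmt.Println(canSchedule([]Toleration{{Key: "dedicated", Value: "gpus"}}, gpu)) // tolerating pod: fits
}
```

Note the contrast with the label-constraint approach: here, adding a new tainted group changes nothing for pods and nodes that never mention it, whereas constraints would require updating every "regular" pod.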
Is there a way to use taints to keep users off nodes that I want to be restricted?
docs/proposals/dedicated-nodes.md
The non-commutative nature of this one makes me uneasy (because it compromises the declarative semantics that we strive for across Kubernetes). As a trivial example, if a request to Taint:noStart all nodes arrives at roughly the same time as a request to launch a pod on every node, it's not clear what the correct/desirable steady state of the cluster should be afterwards. With Taint:evictAndNoStart that's clear, irrespective of what order the requests arrive in.
Could you list some real-world use cases for noStart? I realise that there are some obvious ones (e.g. stop scheduling stuff on this node/cluster, so that I can slowly retire it), but I'm hoping that we can shoot them down as unnecessary or not useful in practice (as I believe the aforementioned to be). In that case we could remove the entire TaintEffect concept (at least for now), and then a Taint just becomes a key-value pair like a label, and life becomes a lot simpler.
I don't think that the problem you're describing is a big deal. Someone who adds a noStart taint by definition doesn't care about what is running on the machine, they just don't want new stuff to start running there. So I don't think they care what the final state is in the scenario you described.
The main use case for the noStart taint is exactly the one you mentioned -- to "cordon" (#16698) a machine in preparation for draining.
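The distinction between the two effects debated above can be sketched in a few lines of Go. The effect names follow this thread's spelling; the merged proposal may name them differently, and the helper functions are purely illustrative:

```go
package main

import "fmt"

// Effect names as used in this discussion; illustrative only.
type TaintEffect string

const (
	NoStart         TaintEffect = "noStart"         // block new pods; leave running pods alone (cordon)
	EvictAndNoStart TaintEffect = "evictAndNoStart" // block new pods and evict running ones (drain)
)

// blocksNewPods: both effects keep new pods off the node.
func blocksNewPods(e TaintEffect) bool {
	return e == NoStart || e == EvictAndNoStart
}

// evictsRunningPods: only the stronger effect removes pods already on
// the node, which is what makes its steady state independent of the
// order in which the taint and any pod launches arrive.
func evictsRunningPods(e TaintEffect) bool {
	return e == EvictAndNoStart
}

func main() {
	fmt.Println(blocksNewPods(NoStart), evictsRunningPods(NoStart))
	fmt.Println(blocksNewPods(EvictAndNoStart), evictsRunningPods(EvictAndNoStart))
}
```

Under this framing, "cordon" is applying a noStart taint and "drain" is escalating it to evictAndNoStart, which matches the #16698 use case mentioned above.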
Sorry. Please disregard last comment. I see that you explain this in the doc.
I think a requirement for dedicated machines is that "user A is allowed to run pods on dedicated machine set X but user B is not able to." I don't see a description of what keeps User B from putting an arbitrary
docs/proposals/dedicated-nodes.md
How about dedicated=special-user:NoScheduleNoAdmitNoExecute. (Instead of a second =)
Sure. I didn't really like the syntax I was using but couldn't think of anything better.
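For concreteness, the colon syntax suggested above splits cleanly into key, value, and effect. A small Go sketch of such a parser (illustrative only; not the parser the proposal ships):

```go
package main

import (
	"fmt"
	"strings"
)

// parseTaint splits the "key=value:Effect" syntax suggested above,
// e.g. "dedicated=special-user:NoScheduleNoAdmitNoExecute".
// Hypothetical helper for illustration; error handling is minimal.
func parseTaint(s string) (key, value, effect string, err error) {
	kv, eff, ok := strings.Cut(s, ":")
	if !ok {
		return "", "", "", fmt.Errorf("taint %q missing ':Effect' suffix", s)
	}
	k, v, ok := strings.Cut(kv, "=")
	if !ok {
		return "", "", "", fmt.Errorf("taint %q missing '=value'", s)
	}
	return k, v, eff, nil
}

func main() {
	k, v, e, err := parseTaint("dedicated=special-user:NoScheduleNoAdmitNoExecute")
	fmt.Println(k, v, e, err)
}
```

Using `:` for the effect avoids the ambiguity of a second `=`, since the key/value pair then parses exactly like a label.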
Looks pretty good. Just some minor comments.
See also #17151
These are good points but I'm trying to avoid expanding the scope of this doc too much. The idea was to present taints and tolerations, and give one convincing use case. I think we can do followup docs to explain how to use them in node maintenance and other scenarios. The first paragraph alludes to the fact that there are other use cases.
@bgrant0607 I addressed your feedback; PTAL.
GCE e2e test build/test passed for commit 566d138164f9fbc2f7e6a8ed39189414469969df.
docs/proposals/dedicated-nodes.md
Perhaps we could elide the extra Nos?
NoScheduleAdmitExecute
There was a problem hiding this comment.
I think some people might mistakenly think that the "No" binds only to "Schedule" rather than to "ScheduleAdmitExecute"
Note: I don't remember why we decided not to support a richer set of operations in Toleration. However, we probably will implement this as an alpha annotation initially, in which case we could change it if we discovered use cases for more operators. I don't think we need to update the proposal to reflect that detail. LGTM.
Force-pushed 566d138 to 14c2763.
PR changed after LGTM, removing LGTM.
GCE e2e test build/test passed for commit 14c2763.
Tests pass, manually merging as this is a doc-only PR with just one doc. @bgrant0607's comment #18263 (comment) about possibly allowing a richer set of operations for Toleration is noted as a point for further consideration. (I don't have an opinion on it really; I think we just used the narrowest semantics that we knew for sure there were use cases for.)
Dedicated nodes, taints, and tolerations design proposal
ref #17190
It's helpful to read #18261 before this one.
@bgrant0607 @mml @kevin-wangzefeng @alfred-huangjian @mikedanese @erimatnor @timothysc