Add a KEP for graduating PodDisruptionBudget to stable #904

k8s-ci-robot merged 1 commit into kubernetes:master from
Conversation
> ## Summary
>
> [Pod Disruption Budget (PDB)](https://kubernetes.io/docs/tasks/run-application/configure-pdb/)
> is a Kubernetes API that limits the number of pods of a collection that are down simultaneously from voluntary disruptions.

> #### Mutable PDBs
>
> A mutable PDB object allows its `MinAvailable` and `MaxUnavailable` fields to be
> modified by clients. Components that use PDB must watch such modifications and
what about selector fields? is there a good reason to limit which spec fields can be modified?
Added the selector as well. I am not closely familiar with the internal logic of Disruption controller to say with confidence whether updating the selector (and/or other fields) could cause any issues there, but I don't see a concern after a quick look at the code.
> the rolling update spec and currently does not take PDB into account. We need to
> change the implementation and use
> `min(PDB.MinAvailable, RollingUpdate.MinAvailable)` and
> `max(PDB.MaxUnavailable, RollingUpdate.MaxUnavailable)` instead.
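The combination quoted above can be sketched as follows. This is a hypothetical helper, not the real deployment-controller code (which was never changed, as the discussion below concludes); it assumes both fields have already been resolved from their IntOrString forms into absolute pod counts, and it implements the formulas exactly as the quoted text states them.

```python
# Hypothetical sketch of the bound combination proposed in the quoted KEP text.
# Assumes MinAvailable/MaxUnavailable are already resolved to absolute counts.

def effective_min_available(pdb_min_available: int, ru_min_available: int) -> int:
    # Quoted proposal: min(PDB.MinAvailable, RollingUpdate.MinAvailable)
    return min(pdb_min_available, ru_min_available)

def effective_max_unavailable(pdb_max_unavailable: int, ru_max_unavailable: int) -> int:
    # Quoted proposal: max(PDB.MaxUnavailable, RollingUpdate.MaxUnavailable)
    return max(pdb_max_unavailable, ru_max_unavailable)

# With a PDB requiring 20 available and a rolling update requiring 10,
# the combined minimum is 10; with maxUnavailable 2 vs. 5, the combined
# maximum is 5.
print(effective_min_available(20, 10))   # → 10
print(effective_max_unavailable(2, 5))   # → 5
```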
Do we want deployments looking at PDB objects or using the evict subresource? Just looking at the PDB does not take into account PDB status for selectors that match across deployments, right?
Respecting PDB could easily lead to a situation in which a deployment could not make progress. Given that, is respecting PDB still desired? If so, describe how that state will be detected, communicated, and/or resolved (automatically or manually).
What I had in mind was to only look at MinAvailable and MaxUnavailable values and treat those similar to MinAvailable and MaxUnavailable of a rolling update, without making further changes and without using "evict" subresource, but given your points and reading some of the old comments, I think we should drop this requirement.
> #### Needed Tests
>
> If this is considered a non-optional feature, there should be a conformance test
> for it (this needs to be an e2e test tagged as conformance).
This seems like a good candidate for conformance testing. With an ack from @kubernetes/cncf-conformance-wg I'd make this a more definitive statement
Changed wording to make the test a requirement.
> ## Summary
>
> [Pod Disruption Budget (PDB)](https://kubernetes.io/docs/tasks/run-application/configure-pdb/)
@bsalamat @liggitt - I've been thinking a lot about this topic when working on ClusterAPI.
Because the object is very loosely coupled with Pods, and only via label selector, is there any reason why we couldn't make it more generic and remove the 'Pod' from the resource kind? The link via label selector denotes the resource that it is tied to.
As a concrete example, we would like to tie disruption budgets to other resources for cluster management (machines).
/cc @ncdc @detiber
for the other scenarios, is delete the standard definition of disruption?
Agree with tying disruption budgets to machines. There are two orthogonal dimensions and use cases: managing machine upgrades because of organizational or infrastructure policy (and having to create fake pods to represent that with PDB today), and managing impact to applications.
Would just using label selector cause confusion here? Would there be a case where we would want to match both Machines and Pods with the same selector? Or would it be on the consumer to differentiate the type to associate with the label selector?
As to the specific use case around Machines, would it make sense to have the disruption budget exist for the Machines or the underlying Nodes instead? If what we are striving for is to ensure a certain level of capacity is available during an upgrade, then it seems to me that we are more concerned with the Nodes rather than the Machines themselves.
We have protected a pool of machines by running a pod per node with anti-affinity rules set up, and then you can prevent disruption to more than N nodes using that pod as a guard. It's a bit of a hack, but it works. In general, I would prefer we have pod disruption budgets and machine disruption budgets as separate objects; the use cases may vary slightly in practice, similar to having machine sets and replica sets.
+1 to having different disruption budgets for pods and nodes for a few reasons:
- PDB is enforced through the `evict` subresource of Pods, which is obviously not applicable to nodes.
- If we used the same API for both objects, we would need to add a field to the API to specify the type of object that it applies to (node vs. pod).
- The implementation of a Node Disruption Budget shares little with that of PDB, so we won't save much on the coding side by combining the two APIs either.
> I would prefer we have pod disruption budget, and machine disruption budgets as separate objects, the use cases may vary slightly in practice. similar to having machine sets and replica sets
I'd tend to agree. The concepts are similar, but the application, audience, and scope are different.
PDBs are namespaced, and write access can be granted to users with write access in a namespace.
Machine or Node disruption budgets would be cluster-scoped, and should only be writeable by the cluster admin.
> Machine or Node disruption budgets would be cluster-scoped, and should only be writeable by the cluster admin.
That's not true, they would be namespace scoped in a mgmt cluster.
FWIW I'm ok if we decide to not rationalize now, but I do think this is an area in the api where there is overlapping concepts and having (N) of the same type is also not good.
> FWIW I'm ok if we decide to not rationalize now,
If we think we may change the API at some point, discussing it now is probably better than after the API is GA'ed.
We have work-arounds on our side.
I have no major issue about graduating PDB in its current form. Limitations that we have encountered recently could be handled via a higher-level controller writing to a PDB, so making them mutable would help a lot. An example scenario was sizing a PDB relative to the number of control-plane machines in a cluster, which is easier to do if we support mutability. This is useful if you use daemonsets for some types of workloads, but still want to minimize potential disruption across the control-plane machines.
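The sizing scenario described above could be handled by a higher-level controller along these lines. This is a hypothetical sketch: the function name and the "allow one machine's pod to be disrupted at a time" policy are assumptions for illustration, not part of the KEP.

```python
def pdb_min_available_for_control_plane(machine_count: int) -> int:
    """Hypothetical sizing rule: keep all but one of the daemonset pods
    available, so at most one control-plane machine's pod may be
    disrupted at a time. A controller watching Machines would write this
    value into the (mutable) PDB's minAvailable on each resize."""
    return max(machine_count - 1, 0)

# A 3-machine control plane yields minAvailable=2; growing to 5 machines,
# the controller would update the same PDB in place to minAvailable=4.
print(pdb_min_available_for_control_plane(3))  # → 2
print(pdb_min_available_for_control_plane(5))  # → 4
```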
> ### Goals
>
> * Plan to promote PDB API to stable version.
> * Propose changes to the deployment controller to take PDB into account.
Where is "taking it into account" described?
I removed that section as it turns out to create complexities which are not easy to address. I'll remove this goal.
> - [This test](https://github.com/kubernetes/kubernetes/blob/ac56bd502ab96696682c66ebdff94b6e52471aa3/test/integration/scheduler/preemption_test.go#L731)
>   tests effects of PDB on preemption (PDB is honored in a best-effort way)
> * Eviction integration tests
>   - [These tests](https://github.com/kubernetes/kubernetes/blob/master/test/integration/evictions/evictions_test.go) test eviction logic and its interactions with PDB.
> - [ ] Ensure that components do not have any logic that relies on immutability
>   of PDBs. For example, if a component builds a cache of pods that match various
>   PDBs, it must add logic to invalidate the cache on updates.
>   - Action Item: sweep for in-tree uses to make sure we're informer driven and
This should include all repos in kubernetes org, not just k/k. I know Cluster Autoscaler has logic to deal with PDBs (I don't think it will be impacted by this change, but it's worth double checking), I suspect VPA may also use it. Not sure about other projects under kubernetes org.
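The invalidation requirement above can be illustrated with a minimal sketch. This is a hypothetical cache, not code from any of the components mentioned (the real in-tree consumers are informer-driven), but the principle is the same: any PDB update may change the selector or the thresholds, so cached selector-to-pods matches must be dropped.

```python
class PdbPodCache:
    """Hypothetical cache mapping a PDB name to the pods its selector
    matched. With mutable PDBs, entries must be invalidated whenever
    the PDB spec changes."""

    def __init__(self):
        self._matches = {}

    def get(self, pdb_name, selector, pods):
        # Recompute matches only on a cache miss; a label selector is
        # modeled here as a dict that must be a subset of the pod's labels.
        if pdb_name not in self._matches:
            self._matches[pdb_name] = [
                p for p in pods if selector.items() <= p["labels"].items()
            ]
        return self._matches[pdb_name]

    def on_pdb_update(self, pdb_name):
        # The selector (or minAvailable/maxUnavailable) may have changed:
        # drop the stale entry so the next get() recomputes matches.
        self._matches.pop(pdb_name, None)

pods = [{"name": "a", "labels": {"app": "web"}},
        {"name": "b", "labels": {"app": "db"}}]
cache = PdbPodCache()
print(len(cache.get("pdb1", {"app": "web"}, pods)))  # → 1
cache.on_pdb_update("pdb1")  # selector changed, e.g. to app=db
print(len(cache.get("pdb1", {"app": "db"}, pods)))   # → 1
```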
I'm looking into adding support for the scale subresource with PDBs, but I'm not sure whether this would be a requirement for GA. I hope this would be ready for the next release of Kubernetes together with mutable PDBs; then both changes could soak for one release and we can push PDBs to GA in 1.16. I am interested in taking on the task of getting this to GA.
So we've discussed this as something we want to do. It passes the criteria for approval.
> The Kubernetes control plane evicts some of the existing pods, but keeps at least 10
> around. The PDB is updated and states that at least 20 replicas must be kept
> alive. It may appear to an observer that the evictions that happened before the PDB
> update were incorrect, if they don't notice the PDB update.
We already have this problem before PDBs are mutable.
Immutable PDBs can be deleted and re-created with new spec (20 replicas in this example).
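The eviction behavior in the quoted scenario follows the basic disruption-budget rule, which can be sketched as below. This is a simplified illustration, not the real admission logic (the actual check in the eviction path goes through the PDB's status, e.g. its disruptions-allowed accounting, rather than recomputing from raw counts):

```python
def eviction_allowed(healthy_pods: int, min_available: int) -> bool:
    """Simplified form of the check behind the evict subresource:
    an eviction may proceed only if it would leave at least
    minAvailable healthy pods behind."""
    return healthy_pods - 1 >= min_available

# With 15 healthy pods and minAvailable=10, evictions proceed down to 10:
print(eviction_allowed(15, 10))  # → True
print(eviction_allowed(10, 10))  # → False
# After the PDB is updated (or re-created) with minAvailable=20, further
# evictions are blocked, but the earlier ones were valid against the old spec.
print(eviction_allowed(15, 20))  # → False
```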
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, janetkuo, kow3ns

The full list of commands accepted by this bot can be found here. The pull request process is described here.
This is a placeholder PR for the docs for the PDB to GA KEP kubernetes/enhancements#904
See kubernetes/kubernetes#95083 for a potential problem with the debatable treatment of the
/sig apps
/assign @kow3ns @liggitt