There are several ongoing discussion threads related to this topic, so I am opening this one to summarize all the relevant discussions and hopefully reach a conclusion.
Currently, pod scheduling and the PV controller (PVC/PV binding and dynamic provisioning) are implemented as completely separate controllers. This separation has some benefits: it keeps the scheduler pluggable, provides better isolation, and improves performance since PV/PVC binding can be performed asynchronously, before pod scheduling. But both controllers make decisions independently, without considering the other's choice, so the node/zone selections might conflict.
The goal is to largely keep this separate-controller design and make the modifications necessary to overcome the problem of conflicting decisions. There are four scenarios we need to consider.
Single Zone, network attached storage
In this case, the PV/PVC is not tied to any node, so binding can be performed independently of pod scheduling. No change is needed to the current code.
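To illustrate why this case is unproblematic, here is a sketch of a claim against network-attached storage (the claim name is made up): nothing in it constrains node placement, so the PV controller can bind it before the scheduler runs.

```yaml
# Illustrative PVC for network-attached storage. No field here ties the
# claim to a node or zone, so binding and pod scheduling are independent.
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: web-data            # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```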
Single Zone, local storage
In this case, a PV is tied to a specific node, so once a PVC is bound to that PV, the pod using the claim is bound to that node. This decision might not be compatible with the pod scheduler's decision: the scheduler normally ranks nodes based on predefined policies and picks the node with the highest score, while the PV controller independently searches for the best-matching PV to bind to the PVC.
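A sketch of how a local PV pins node placement, using the node-affinity shape of the local volume API for illustration (the PV name, path, and node name are made up):

```yaml
# Illustrative local PV. The node affinity means any PVC bound to this
# PV -- and hence any pod using that claim -- can only run on node-1,
# regardless of what node the scheduler would otherwise prefer.
kind: PersistentVolume
apiVersion: v1
metadata:
  name: local-pv-1          # hypothetical name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  local:
    path: /mnt/disks/ssd1   # hypothetical path on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-1    # the PV is only usable from this node
```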
Multiple Zone, network attached storage
The pod scheduler gives each zone a score, which is used to rank nodes. But the PV controller itself does not consider zone information when choosing a PV for a PVC; it chooses a PV the same way as in the single-zone case. Only for dynamic provisioning can the user specify zone information, in which case the PV will be created in that zone and the pod scheduler has a predicate that picks nodes only from the zone where the PV is. For StatefulSets, there is a hacky way to make sure volumes are spread across zones at creation time, by using the PVC name as an indicator that the claim belongs to a StatefulSet. In this situation, it is possible that the PV controller picks a zone in which no node has enough resources (CPU/memory) for the pod. See more discussion at Fix StatefulSet volume provisioning "magic" #41598
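A sketch of the dynamic-provisioning path described above, where the user pins the zone through provisioner parameters (here the GCE PD provisioner's zone parameter; the class name and zone are illustrative):

```yaml
# Illustrative StorageClass. PVs provisioned from it land in the given
# zone, and the scheduler's zone predicate then restricts the pod to
# nodes in that zone -- even if no node there has enough CPU/memory.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: zonal-ssd            # hypothetical name
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
  zone: us-central1-a        # all PVs are provisioned in this zone
```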
Multiple Zone, local storage
The PV controller could find PV candidates in different zones and on different nodes. This is similar to the single-zone local storage case: the PV controller might pick a zone and node that does not have enough resources for the pod.
To solve these problems, I think it would be good for the storage, scheduling, and workload teams to get together and agree on the outcome we want to deliver.
Proposal:
[@vishh] Move the binding selection and decision from the PV controller to the scheduler. The PV controller will still be around to take care of dangling claims and/or volumes, roll back incomplete transactions (necessary when a pod requests multiple local PVs), reclaim PVs, etc.