scheduler and storage provision (PV controller) coordination #43504

@jingxu97

There are several ongoing discussion threads related to this topic, so I am opening this issue to summarize the relevant discussions in the hope of reaching a conclusion.

Currently, pod scheduling and the PV controller (PVC/PV binding and dynamic provisioning) are implemented in completely separate controllers. This separation has some benefits, such as keeping the scheduler pluggable, better isolation, and better performance, since PV/PVC binding can be performed asynchronously before pod scheduling. But each controller makes its decisions independently, without considering the other's choice, so their node/zone selections might conflict.

The goal is to largely keep this separate-controller design and make the modifications necessary to overcome the problem of conflicting decisions. There are four scenarios we need to consider.

  1. Single Zone, network attached storage
    In this case, the PV/PVC is not tied to any node selection. The binding can be performed independently. No change to the current code is needed.

  2. Single Zone, local storage
    In this case, a PV is tied to a specific node. Once a PVC is bound to a PV, it is bound to that node. That decision might not be compatible with the pod scheduler's decision. Normally the pod scheduler ranks nodes based on some predefined policies and picks the node with the highest score. The PV controller, meanwhile, searches for the best-matching PV to bind to the PVC.

  3. Multiple Zone, network attached storage
    The pod scheduler gives each zone a score, which is used to rank nodes. But the PV controller itself does not consider zone information when choosing a PV for a PVC; it chooses a PV for a given PVC the same way regardless of zone. Only for dynamic provisioning might the user specify zone information, in which case the PV is created in that zone. The pod scheduler then has a predicate that picks nodes only from the zone where the PV is. For StatefulSets, there is a hacky way to make sure volumes are spread across zones when they are created: the PVC name is used as an indicator of the StatefulSet. In this situation, it is possible that the PV controller picks a zone in which no node has enough resources (CPU/memory) for a pod. See more discussion at Fix StatefulSet volume provisioning "magic" #41598

  4. Multiple Zone, local storage
    The PV controller could find PV candidates from different zones and nodes. This is similar to case 2: the PV controller might pick a zone and node that does not have enough resources for the pod.
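
To make the conflict in cases 2 and 4 concrete, here is a minimal sketch in Python. It is a toy model, not Kubernetes code: `Node`, `LocalPV`, `pv_controller_pick`, and `scheduler_pick` are hypothetical names, the "smallest fitting PV" rule and the "most free CPU" score are assumptions standing in for the real matching and ranking logic, and the numbers are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    zone: str
    free_cpu: int  # millicores still available on the node


@dataclass(frozen=True)
class LocalPV:
    name: str
    node: str      # a local PV is only usable from this node
    size_gi: int


def pv_controller_pick(pvs, request_gi):
    """Pick the smallest PV that satisfies the claim, ignoring node resources."""
    fitting = [pv for pv in pvs if pv.size_gi >= request_gi]
    return min(fitting, key=lambda pv: pv.size_gi, default=None)


def scheduler_pick(nodes, pod_cpu, bound_pv=None):
    """Pick the feasible node with the most free CPU (a stand-in for scoring).

    If the claim is already bound to a local PV, only that PV's node is feasible.
    """
    feasible = [
        n for n in nodes
        if n.free_cpu >= pod_cpu and (bound_pv is None or n.name == bound_pv.node)
    ]
    return max(feasible, key=lambda n: n.free_cpu, default=None)


nodes = [Node("node-a", "zone-1", 500), Node("node-b", "zone-2", 4000)]
pvs = [LocalPV("pv-a", "node-a", 10), LocalPV("pv-b", "node-b", 20)]

# The PV controller binds first, on its own, and picks pv-a on node-a.
pv = pv_controller_pick(pvs, request_gi=10)
# The scheduler is then restricted to node-a, which lacks CPU for the pod.
node = scheduler_pick(nodes, pod_cpu=1000, bound_pv=pv)
print(pv.name, node)
```

In this toy setup the pod ends up unschedulable even though node-b with pv-b would have satisfied both the CPU request and the claim.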

To solve these problems, I think it would be good for the storage, scheduling, and workload teams to get together and agree on the outcome we want to deliver.

Proposal:
[@vishh] Move the binding selection and decision from the PV controller to the scheduler. The PV controller will still be around to take care of dangling claims and/or volumes, roll back incomplete transactions (necessary when a pod requests multiple local PVs), reclaim PVs, etc.
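
One way to picture this proposal is a scheduler that evaluates node resources and volume candidates in the same pass and only then decides the binding. The Python sketch below is illustrative only: `Node`, `LocalPV`, and `schedule_with_volumes` are hypothetical names, the scoring (most free CPU) and the PV choice (smallest fitting PV on the node) are assumptions, and none of this reflects an actual scheduler API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    zone: str
    free_cpu: int  # millicores still available on the node


@dataclass(frozen=True)
class LocalPV:
    name: str
    node: str      # a local PV is only usable from this node
    size_gi: int


def schedule_with_volumes(nodes, pvs, pod_cpu, request_gi):
    """Return a (node, pv) pair chosen jointly, or None if nothing fits.

    A node is a candidate only if it has enough CPU AND hosts an unbound
    local PV large enough for the claim.
    """
    candidates = []
    for n in nodes:
        if n.free_cpu < pod_cpu:
            continue  # node cannot run the pod at all
        fitting = [
            pv for pv in pvs
            if pv.node == n.name and pv.size_gi >= request_gi
        ]
        if not fitting:
            continue  # node has no usable volume for the claim
        candidates.append((n, min(fitting, key=lambda pv: pv.size_gi)))
    if not candidates:
        return None
    # Assumed scoring: prefer the candidate node with the most free CPU.
    return max(candidates, key=lambda c: c[0].free_cpu)


nodes = [Node("node-a", "zone-1", 500), Node("node-b", "zone-2", 4000)]
pvs = [LocalPV("pv-a", "node-a", 10), LocalPV("pv-b", "node-b", 20)]

choice = schedule_with_volumes(nodes, pvs, pod_cpu=1000, request_gi=10)
print(choice)
```

With these made-up inputs the joint decision selects node-b together with pv-b, because node-a is filtered out on CPU before its PV is ever considered. A real implementation would still need the PV controller for cleanup and rollback, as the proposal notes.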

Metadata

Labels

kind/feature: Categorizes issue or PR as related to a new feature.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/scheduling: Categorizes an issue or PR as relevant to SIG Scheduling.
sig/storage: Categorizes an issue or PR as relevant to SIG Storage.