MetadataPolicy and its use in choosing the scheduler in a multi-scheduler Kubernetes system #18262
Conversation
Labelling this PR as size/L
GCE e2e build/test failed for commit 9cf651d. |
docs/proposals/choosing-scheduler.md
Thanks @davidopp. The PR looks great. Just have a few questions if you don't mind.
GCE e2e test build/test passed for commit 692ce705c4557a6512e8faffff52506cc36a4edc.
docs/proposals/choosing-scheduler.md
Can you show somewhere what the YAML would look like for setting a named scheduler as default scheduler? And suggest where this "intentionally generic" object would be extended to, for example, include network policy.
What values can this take?
My intent is that it can take any string. Interpreting the string is the purview of the component that is using a PodPolicy. For example, for the puts-a-scheduler-name-annotation-on-a-pod admission controller, this string is the scheduler name to apply. In another component, the string may effectively be an enum value, which triggers arbitrary behavior based on the value.
Can you show somewhere what the YAML would look like for setting a named scheduler as default scheduler?
Empty PodSelector matches all pods (in the namespace), so you could have the last PodPolicyRule in the list have a PolicyPredicate with an empty PodSelector; since rules are evaluated in order, this would have the effect of using the corresponding Policy for any pods that don't match any of the "real" rules.
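To make the catch-all idea concrete, here is a sketch (using the field names discussed in this proposal; the exact schema was still in flux) of a rule list whose final rule has an empty podSelector and therefore acts as the default:

```yaml
kind: PodSchedulerPolicy
spec:
  policy:
    rules:
    # Rules are evaluated in order: pods labeled foo=bar
    # go to my-custom-scheduler.
    - policyPredicate:
        podSelector:
          foo: bar
      policy: my-custom-scheduler
    # An empty podSelector matches every pod in the namespace,
    # so this last rule supplies the default for anything that
    # did not match a "real" rule above.
    - policyPredicate:
        podSelector: {}
      policy: default-scheduler
```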
And suggest where this "intentionally generic" object would be extended to, for example, include network policy.
Sorry, my claims of "intentionally generic" are perhaps a bit overblown. My thought was simply that the "Policy" string could be used however the consumer wants; in the scheduler-chooser it's the name of the desired scheduler, while in another consumer it could be effectively an enum value. I didn't give any real thought to use cases outside of scheduler-chooser, I just kinda assumed a string would be good enough. I'm quite willing to believe that assumption is wrong. (BTW @bgrant0607 is the one who suggested to do something generic that could be reused by other components that need policies.)
I think the YAML ends up as

```yaml
kind: PodSchedulerPolicy
spec:
  policy:
    rules:
    - policyPredicate:
        podSelector:
          foo: bar
      policy: my-custom-scheduler
```

That's a WHOLE LOT of indent for a string. It looks like you could kill one level of nesting with no loss of info or flexibility?
```yaml
kind: PodSchedulerPolicy
spec:
  policy:
  - predicate:
      podSelector:
        foo: bar
    value: my-custom-scheduler
```
I pushed a new commit to address the comments @thockin had on the confusing wording in the description. I didn't actually change the design yet, though I do like his suggestion for an alternative design. Interested to see what others have to say.
GCE e2e test build/test passed for commit 2d76dfb45a65d5adc0c9c82eaa1f37d26f387eca.
I've revamped the proposal based on the feedback from @bgrant0607 and @thockin. PTAL.
GCE e2e build/test failed for commit ab53955d05464ab2d595f66d1a0674ed15916ca7.
GCE e2e build/test failed for commit 3dd572a50dccdb11a9dfa8fbd487214acfce0e3f.
GCE e2e build/test failed for commit 26b2c0abcf539622a98368d7c60d98a1695f61ec.
It occurred to me that, at least for the scheduler-picking case, it could be useful to allow multiple of the same type of action per PodPolicyRule. The semantics would be "pick one of these randomly." So for the scheduler use case, you could give several different annotations (all for the scheduler name key) and the consumer of the PodPolicy would interpret that as "pick one of these at random." This would be useful if you are running multiple replicas of the same scheduler for performance reasons.

Another approach would be to make the PolicyPredicate a bit more expressive, e.g. so that you could have one PolicyPredicate for "hash of the Pod is less than [midpoint of your hash range]" and another for "hash of the Pod is [greater than or equal to the midpoint of your hash range]." Then you could assign a different scheduler for each PolicyPredicate and sort of get the same random behavior (assuming the pods are different so you get different hashes).
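Both variants might look roughly like the following. This is purely a hypothetical sketch: neither the multi-value action nor the hash-range predicate syntax was ever specified in the proposal, and the field names here (notably podHashRange) are invented for illustration:

```yaml
# Hypothetical only; none of this syntax was finalized.
rules:
# Option A: multiple values for one action;
# the consumer picks one at random.
- policyPredicate:
    podSelector:
      app: web
  policy: [scheduler-replica-1, scheduler-replica-2]
# Option B: hash-range predicates splitting the pod
# hash space between two scheduler replicas.
- policyPredicate:
    podHashRange: {min: 0x00000000, max: 0x7fffffff}
  policy: scheduler-replica-1
- policyPredicate:
    podHashRange: {min: 0x80000000, max: 0xffffffff}
  policy: scheduler-replica-2
```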
Cool, I will start the implementation once the PR gets merged. I need some time to catch up.
(1) and (2) sound like a conceptual difference. From an engineering point of view, both are going to need a central controller or registry to guard against concurrency issues.
I want to point out that resharding logic should be in a separate layer instead of going into the scheduler. I also want to point out that the goal has shifted since the beginning. At the start, it was about enabling flexibility in scheduling policies. Now we are going much further and talking about fault tolerance and load balancing. To be honest, it would be better if we could separate the issues and deal with them one by one.
docs/proposals/metadata-policy.md
Incorporated reviewer comments, PTAL.
GCE e2e test build/test passed for commit a1df7444e59a1073adcc60f188756b1556549734.
docs/proposals/metadata-policy.md
Should this be called "PodMetadataPolicy"? I anticipate similar things may be needed for PersistentVolumeClaim in the future.
@derekwaynecarr I wasn't thinking of this as specific to pods.
LGTM. Please rebase, run hack/update-generated-docs.sh, squash, and apply the lgtm label.
The comments in Godoc made this seem specific to just pods.
@derekwaynecarr has a good point -- I say "pod" all over the place and never generalized it from the initial version, which was going to be just for pods. I'll fix it (haven't merged yet).
Automatic merge from submit-queue
Auto commit by PR queue bot
GCE e2e build/test failed for commit 05dcf74.
Now that this has been merged, I will send the relevant PR asap.
Now that the proposal has been accepted, has there been any discussion around actually implementing MetadataPolicy?
TBH I have not seen any demand for MetadataPolicy. I suspect that people who are using multiple schedulers just write their own admission controller that hard-codes the policy for setting schedulerName (or reads a policy from some custom configuration mechanism they set up themselves).
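For context, a pod can opt into a non-default scheduler via the spec.schedulerName field, which is what such an admission controller would set; a scheduler only picks up pods whose schedulerName matches its own. A minimal example:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  # An admission controller could inject this field according
  # to whatever policy it implements; without it, the pod is
  # handled by the default scheduler.
  schedulerName: my-custom-scheduler
  containers:
  - name: app
    image: nginx
```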
I'd like to see discussion on PodSchedulerPolicy or similar that resolves:
1. Control over tolerations (RBAC or otherwise)
2. Control over node selector
3. Whether to support small-M "logical configs" for placement that are named and usable for human-focused UIs
4. Unifying the namespace placement policy defaulters with a real resource
5. Potentially controlling (or at least discussing the intersection with) which schedulers are available to select
ref/ #11793
ref/ #17097
ref/ #17324
@thockin @HaiyangDING @bgrant0607 @cameronbrunner @timothysc @hongchaodeng @mali11 @mqliang @derekwaynecarr