Automated cherry pick of #129661: DRA CEL: add missing size estimator#129690

Merged
k8s-ci-robot merged 2 commits into kubernetes:release-1.32 from pohly:automated-cherry-pick-of-#129661-origin-release-1.32
Mar 3, 2025

Conversation

@pohly
Contributor

@pohly pohly commented Jan 17, 2025

Cherry pick of #129661 on release-1.32.
Cherry pick of #129749 on release-1.32.

#129661: DRA CEL: add missing size estimator
#129749: DRA: CEL cost estimate during validation

For details on the cherry pick process, see the cherry pick requests page.

DRA: CEL expressions using attribute strings exceeded the cost limit because their cost estimation was incomplete. Cost estimation was unnecessarily also computed in the scheduler.

Not implementing a size estimator had the effect that strings retrieved from
the attributes were treated as having "unknown size", leading to wildly
overestimated costs and validation errors even for simple expressions like this:

    device.attributes["qat.intel.com"].services.matches("[^a]?sym")

The maximum number of elements in maps and the maximum length of the driver
name string were also ignored or missing, respectively. Pre-defined types like
apiservercel.StringType must be avoided because they are defined as having a
zero maximum size.
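To illustrate what a size estimator contributes, here is a minimal, self-contained Go sketch. It deliberately does not use the real cel-go checker.CostEstimator interface; the type names, the 64-character limit, and the path handling are simplified stand-ins, not the actual implementation in staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go.

```go
package main

import "fmt"

// SizeEstimate mirrors the shape of cel-go's checker.SizeEstimate:
// a [Min, Max] range for the size of a value.
type SizeEstimate struct {
	Min, Max uint64
}

// Hypothetical bound, standing in for the real DRA API limit on
// attribute string values.
const maxAttributeValueLen = 64

// estimateStringSize returns a bounded size estimate for a string
// reached via the given attribute path. Without such an estimator,
// the checker treats the string as unbounded ("unknown size") and
// wildly overestimates the cost of operations like matches().
func estimateStringSize(path []string) *SizeEstimate {
	if len(path) == 0 {
		// Nothing known about this node; the checker falls back
		// to an unknown (worst-case) size.
		return nil
	}
	// In the real fix the bound depends on which field the path
	// refers to; here we always use the attribute-value bound.
	return &SizeEstimate{Min: 0, Max: maxAttributeValueLen}
}

func main() {
	est := estimateStringSize([]string{"device", "attributes", "services"})
	fmt.Printf("string size estimate: [%d, %d]\n", est.Min, est.Max)
	// prints: string size estimate: [0, 64]
}
```

With a finite upper bound available, the cost of a regex match over the string can be estimated proportionally to that bound instead of being treated as unbounded.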
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Jan 17, 2025
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 17, 2025
@k8s-ci-robot k8s-ci-robot requested a review from bart0sh January 17, 2025 13:28
@k8s-ci-robot k8s-ci-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Jan 17, 2025
@k8s-ci-robot k8s-ci-robot requested a review from klueska January 17, 2025 13:28
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 17, 2025
@pohly
Contributor Author

pohly commented Jan 17, 2025

This is not an obvious candidate for cherry-picking (no security issue, no regression because the specific usage never worked, and not critical because there's no data loss, etc.).

But it affects a beta feature that people wanted to use with 1.32, and the problem was reported by different driver authors (Intel for QAT, NVIDIA for GPUs) who ran into it while preparing driver releases for 1.32. So let's discuss whether it's worth cherry-picking...

  • Issue: DRA CEL: add missing size estimator #129661
  • Scope of the change: limited to DRA CEL expressions, no impact on other CEL usage
  • Risks: low, because once it was understood what needed to be done, it was possible to adapt the corresponding code from admission control - @jpbetz, agreed?
  • Tests: no changes in existing unit test cases, new cases confirm that the change works as intended
  • Key stakeholder SIG reviewers/approvers: @klueska, @SergeyKanzhelev, please confirm for SIG Node that this can and should be backported

@pohly
Contributor Author

pohly commented Jan 17, 2025

/cc @liggitt

Because he was concerned about the intent to backport and for API approval.

@klueska
Contributor

klueska commented Jan 20, 2025

One argument for backporting is that without this fix, the following (seemingly simple) CEL expression fails to evaluate:

 - cel:
     expression: device.attributes['gpu.nvidia.com'].productName.lowerAscii().matches('^.*a100.*$')

This is a common way to select a specific type of GPU by its product name.

I can't speak to the complexity of the fix itself (and therefore the risk of backporting), but from a usability point of view, not backporting this means that users of DRA are artificially limited in the types of selection they can do on devices.

/cc @jpbetz to comment on the risk of backporting

@k8s-ci-robot k8s-ci-robot requested a review from jpbetz January 20, 2025 12:11
@k8s-ci-robot
Contributor

@klueska: GitHub didn't allow me to request PR reviews from the following users: to, comment, on, the, risk, of.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

One argument for backporting is that without this fix, the following (seemingly simple) CEL expression fails to evaluate:

- cel:
           expression: |
             device.attributes['gpu.nvidia.com'].productName.lowerAscii().matches('^.*a100.*$')

This is a common way to select a specific type of GPU by its product name.

I can't speak to the complexity of the fix itself (and therefore the risk of backporting), but from a usability point of view, not backporting this means that users of DRA are artificially limited in the types of selection they can do on devices.

/cc @jpbetz to comment on the risk of backporting

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@pohly
Contributor Author

pohly commented Jan 20, 2025

Not backporting also has the negative effect that those users who could give us feedback on writing those more complex CEL expressions would have to wait for 1.33. As seen with the promotion to beta in 1.32, it's not enough to have something available in the master branch. It needs to be released or almost released (= included in a tagged alpha release) before people start taking notice. For users there's the additional complication that they often depend on cloud or Kubernetes providers.

@pohly
Contributor Author

pohly commented Jan 21, 2025

It was pointed out that this used to work in early 1.32 alpha releases (@klueska had tested at that time) and then broke later when adding the CEL cost limit, so this is in fact a regression, at least when also considering alphas. This further strengthens the arguments that this should get cherry-picked.

@jpbetz
Contributor

jpbetz commented Jan 21, 2025

+1 to backporting this. This effectively makes it impossible to use some valid CEL expressions in prior minors. This was not our intent, and it "feels buggy" to have expressions that work on 1.33+ be rejected due to a mistake on expression costing.

I consider this low risk. The way the code change is made, only DRA is impacted, and it's easy to reason about the code and the test coverage serves as a strong safety net.

@liggitt
Member

liggitt commented Jan 21, 2025

It was pointed out that this used to work in early 1.32 alpha releases (@klueska had tested at that time) and then broke later when adding the CEL cost limit

I don't really buy this... any cost limit added to the beta feature could be framed as a regression since it would make arbitrarily expensive expressions allowed in the alpha fail validation. We intend to reject some expressions that worked in alpha, the only question is which / how many.

+1 to backporting this. This effectively makes it impossible to use some valid CEL expressions in prior minors. This was not our intent, and it "feels buggy" to have expressions that work on 1.33+ be rejected due to a mistake on expression costing.

I consider this low risk. The way the code change is made, only DRA is impacted, and it's easy to reason about the code and the test coverage serves as a strong safety net.

I think we're trying to weigh these risks against each other:

  • Risk of alpha users being unable to use reasonable expressions due to overly aggressive cost enforcement. Clarifying how unusable the overly aggressive cost enforcement makes the feature would be helpful. Does "impossible to use some valid CEL expressions" mean most or more complex?
  • Risk of regressing the beta feature in 1.32 with a non-trivial cost estimator change. Clarifying the risk here would be helpful. Worst-case, if the new cost estimation code errored / panicked, what would it break? Just the individual DRA API write request that would have failed due to cost estimation rejection anyway, or could that take down a server / controller?

@johnbelamaric
Member

I don't really buy this... any cost limit added to the beta feature could be framed as a regression since it would make arbitrarily expensive expressions allowed in the alpha fail validation. We intend to reject some expressions that worked in alpha, the only question is which / how many.

The issue here, though, is that this is not an arbitrarily expensive operation: the length is enforced, and was enforced, it just was not counted properly in the cost estimation. I get your point, but I do think that's a material difference from "any expression".

  • Risk of alpha users being unable to use reasonable expressions due to overly aggressive cost enforcement. Clarifying how unusable the overly aggressive cost enforcement makes the feature would be helpful. Does "impossible to use some valid CEL expressions" mean most or more complex?

These are beta users not alpha users.

IIUC, any expression that either makes a comparison between two string attributes, or makes a function call on a string attribute (like a regex match), will fail validation here. In particular, that means no regex matching against these strings. I think the existing test cases only compared these strings to fixed-length strings, so those fixed-length strings were used in the test-case cost estimations.

I am not sure how smart the cost estimator is. It might be possible to "hack" around it by adding an additional, always-true clause that limits the attribute string length. For example, instead of `attribute.matches('.*')` you would write `attribute.length() < 100 && attribute.matches('.*')`. A human could see that the attribute can be assumed to be less than 100 characters and estimate the matches cost accordingly. I don't know if the algorithm is that smart. It's also junk that would likely creep into CEL expressions long into the future...
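Spelled out as a full selector, that hypothetical workaround would look roughly like this (the 100-character guard and the surrounding YAML layout are illustrative assumptions, not something taken from the actual fix):

```yaml
selectors:
- cel:
    # Hypothetical always-true guard: bound the string length so that a
    # sufficiently smart estimator could derive a finite matches() cost.
    expression: >-
      device.attributes['gpu.nvidia.com'].productName.length() < 100 &&
      device.attributes['gpu.nvidia.com'].productName.lowerAscii().matches('^.*a100.*$')
```

Whether the estimator actually propagates the length() bound is exactly the open question raised above; the merged fix makes such a guard unnecessary by bounding attribute strings in the estimator itself.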

@liggitt
Member

liggitt commented Jan 21, 2025

These are beta users not alpha users.

Sorry... I meant users moving from alpha → beta losing the ability to use expressions they did previously. Users who are new to the feature in beta won't have any existing expressions they would consider the current state to regress.

I'd look to folks who know the DRA domain (@pohly, @johnbelamaric, @klueska) to describe how limiting this issue is for alpha → beta users, and the folks who know the cost estimator impl / wiring (@jpbetz, @pohly) to describe the risk to the whole cluster if the new code fails / panics in unexpected ways.

I mostly want to be super clear in the backport justification why we consider this 1) severe enough to backport and 2) low-risk enough to backport.

@pohly
Contributor Author

pohly commented Feb 26, 2025

I ran an experiment to verify that the panic protection works and the blast radius really is just the apiserver.

First I built all components with this patch:

diff --git a/staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go b/staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go
index d59f7d7d4bb..b157e334383 100644
--- a/staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go
+++ b/staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go
@@ -370,6 +370,8 @@ type sizeEstimator struct {
 }
 
 func (s *sizeEstimator) EstimateSize(element checker.AstNode) *checker.SizeEstimate {
+       panic("this should not happen")
+
        path := element.Path()
        if len(path) == 0 {
                // Path() can return an empty list, early exit if it does since we can't

In a cluster with that fault, running an E2E test which uses CEL selectors ("retries pod scheduling after updating device class") fails as expected when trying to create the DeviceClass, without taking down the apiserver:

  I0226 18:09:40.474568 379941 dra.go:1674] Unexpected error: create *v1beta1.DeviceClass: 
      <*url.Error | 0xc0007a4e70>: 
      Post "https://localhost:6444/apis/resource.k8s.io/v1beta1/deviceclasses": stream error: stream ID 71; INTERNAL_ERROR; received from peer
      {
          Op: "Post",
          URL: "https://localhost:6444/apis/resource.k8s.io/v1beta1/deviceclasses",
          Err: <http2.StreamError>{
              StreamID: 71,
              Code: 2,
              Cause: <*errors.errorString | 0x72521f0>{
                  s: "received from peer",
              },
          },
      }

Next I "fixed" the kube-apiserver and restarted it (and only it!). Now the test passes, showing that the kube-scheduler doesn't call this code anymore.

Finally, to rule out that the kube-scheduler perhaps somehow tolerated the panic and worked normally despite it, I broke some other code and then the kube-scheduler really crashed.

@pohly
Contributor Author

pohly commented Feb 26, 2025

I also tested that only the writes are affected. I first created a DeviceClass, then restarted the apiserver with the "bug". Reading the class works, updating it fails, leaving the class unmodified.

@pohly
Contributor Author

pohly commented Feb 26, 2025

I chatted with @liggitt about this. He still doesn't like that we are proposing to make an exception for something that normally wouldn't qualify for backporting and said that he "won't approve, but won't block it".

@johnbelamaric: can you perhaps confirm that we want this backported by adding the LGTM? These are mechanical cherry-picks from master, so no new technical changes.

@thockin: can you approve (needed because it's touching some API files)?

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 26, 2025
@thockin
Member

thockin commented Feb 26, 2025

The only things about this that make it viable to me are that it is beta and actually reported by implementors. I am not sure what the guiding principle would be to say yes to this and no to something else, though. It's not a "bug" per se, but it kind of is, right?

@thockin thockin self-assigned this Feb 26, 2025
@liggitt
Member

liggitt commented Feb 26, 2025

Yeah, thanks for the due diligence on containing the blast radius if there's an accidental bug in this code, and for verifying the behavior in the worst case scenario we can imagine where this new code completely panics. Both of those benefit master / 1.33+ as well, regardless of what we do here.

This is at the point now where we don't think this is risky to backport.

My reluctance comes from years of past experience where we (including I) would approve what we thought were risk-free backports, which later turned out to regress patch releases. From Kubernetes 1.19-1.27, ~35% of our patch releases were regressed due to a backport and worse than the .0 of their minor in some way. That was a terrible track record, and taught users that picking up Kubernetes patch releases was somewhat of a gamble, stability-wise.

In 1.28+, when we got significantly more strict about what we allow in backports, our track record has improved to ~8% of patch releases regressing (and is trending even better in newer minors). We need patch releases to be as close to ~0 risk as we can make them, so we can release critical bugfixes and security fixes and count on fast adoption.

Do I see a specific risk with this backport? Not really, though it is ~big and difficult to eyeball. Do I think that in aggregate, making exceptions and backporting things outside our narrow guidelines makes Kubernetes patch releases worse and less reliable? Yes, and I think the data supports that.

So... yeah... I'm not so concerned about a specific risk on this PR that I'd throw myself in front of it to block area owners / approvers who were pushing for it. But I can't say I approve of fixing up features in place with backports.

@johnbelamaric
Member

I think the limited blast radius and the two independent implementors being limited by this justifies the backport.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 26, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Details: Git tree hash: 493f210cbe5b52cf0cd2bfa4d1229b55975d4cf2

@thockin
Member

thockin commented Feb 26, 2025

If I am left as the tie-breaker, I'll try to fight my risk-aversion instinct and bias toward action.

/approve
/hold last chance to disagree?

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 26, 2025
@pohly
Contributor Author

pohly commented Feb 27, 2025

It's not a "bug" per se, but it kind of is, right?

It is a bug, just not a regression. Fixes for regressions justify a backport (can't have 1.(X+1) be worse than 1.X), fixes for bugs don't (new feature in 1.X simply not usable yet, users have to wait for 1.(X+1)).

Let me get release team approval, then I'll lift the hold.

@pohly
Contributor Author

pohly commented Mar 1, 2025

/cc kubernetes/release-managers

@k8s-ci-robot k8s-ci-robot requested a review from a team March 1, 2025 08:37
@k8s-ci-robot k8s-ci-robot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Mar 3, 2025
@github-project-automation github-project-automation Bot moved this from Needs Reviewer to Needs Approver in SIG Node: code and documentation PRs Mar 3, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. label Mar 3, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly, saschagrunert, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pohly
Contributor Author

pohly commented Mar 3, 2025

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 3, 2025
@k8s-ci-robot k8s-ci-robot merged commit 8207fb4 into kubernetes:release-1.32 Mar 3, 2025
@github-project-automation github-project-automation Bot moved this from Needs Approver to Done in SIG Node: code and documentation PRs Mar 3, 2025
@pohly pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Mar 4, 2025