Automated cherry pick of #129661: DRA CEL: add missing size estimator#129690

Merged
k8s-ci-robot merged 2 commits into kubernetes:release-1.32 from pohly:automated-cherry-pick-of-#129661-origin-release-1.32
Mar 3, 2025

Conversation

@pohly
Contributor

@pohly pohly commented Jan 17, 2025

Cherry pick of #129661 on release-1.32.
Cherry pick of #129749 on release-1.32.

#129661: DRA CEL: add missing size estimator
#129749: DRA: CEL cost estimate during validation

For details on the cherry pick process, see the cherry pick requests page.

DRA: CEL expressions using attribute strings exceeded the cost limit because their cost estimation was incomplete. Cost estimation was unnecessarily also computed in the scheduler.

Not implementing a size estimator had the effect that strings retrieved from
the attributes were treated as having "unknown size", leading to wildly
overestimated costs and validation errors even for simple expressions like this:

    device.attributes["qat.intel.com"].services.matches("[^a]?sym")

The maximum number of elements in maps and the maximum length of the driver
name string were also ignored or missing, respectively. Pre-defined types like
apiservercel.StringType must be avoided because they are defined as having a
zero maximum size.
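To illustrate what a size estimator contributes, here is a minimal, self-contained Go sketch. It deliberately does not use the real cel-go checker.CostEstimator interface; the type names, the 64-character limit, and the path handling are simplified stand-ins, not the actual implementation in staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go.

```go
package main

import "fmt"

// SizeEstimate mirrors the shape of cel-go's checker.SizeEstimate:
// a [Min, Max] range for the size of a value.
type SizeEstimate struct {
	Min, Max uint64
}

// Hypothetical bound, standing in for the real DRA API limit on
// attribute string values.
const maxAttributeValueLen = 64

// estimateStringSize returns a bounded size estimate for a string
// reached via the given attribute path. Without such an estimator,
// the checker treats the string as unbounded ("unknown size") and
// wildly overestimates the cost of operations like matches().
func estimateStringSize(path []string) *SizeEstimate {
	if len(path) == 0 {
		// Nothing known about this node; the checker falls back
		// to an unknown (worst-case) size.
		return nil
	}
	// In the real fix the bound depends on which field the path
	// refers to; here we always use the attribute-value bound.
	return &SizeEstimate{Min: 0, Max: maxAttributeValueLen}
}

func main() {
	est := estimateStringSize([]string{"device", "attributes", "services"})
	fmt.Printf("string size estimate: [%d, %d]\n", est.Min, est.Max)
	// prints: string size estimate: [0, 64]
}
```

With a finite upper bound available, the cost of a regex match over the string can be estimated proportionally to that bound instead of being treated as unbounded.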
@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Jan 17, 2025
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 17, 2025
@k8s-ci-robot k8s-ci-robot requested a review from bart0sh January 17, 2025 13:28
@k8s-ci-robot k8s-ci-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Jan 17, 2025
@k8s-ci-robot k8s-ci-robot requested a review from klueska January 17, 2025 13:28
@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 17, 2025
@pohly
Contributor Author

pohly commented Jan 17, 2025

This is not an obvious candidate for cherry-picking (no security issue, no regression because the specific usage never worked, and not critical because there's no data loss, etc.).

But it affects a beta feature that people wanted to use with 1.32, and the problem was reported by different driver authors (Intel for QAT, NVIDIA for GPUs) who ran into it while preparing driver releases for 1.32. So let's discuss whether it's worth cherry-picking...

  • Issue: DRA CEL: add missing size estimator #129661
  • Scope of the change: limited to DRA CEL expressions, no impact on other CEL usage
  • Risks: low, because once it was understood what needed to be done, it was possible to adapt the corresponding code from admission control - @jpbetz, agreed?
  • Tests: no changes in existing unit test cases, new cases confirm that the change works as intended
  • Key stakeholder SIG reviewers/approvers: @klueska, @SergeyKanzhelev, please confirm for SIG Node that this can and should be backported

@pohly
Contributor Author

pohly commented Jan 17, 2025

/cc @liggitt

Because he was concerned about the intent to backport and for API approval.

@klueska
Contributor

klueska commented Jan 20, 2025

One argument for backporting is that without this fix, the following (seemingly simple) CEL expression fails to evaluate:

 - cel:
     expression: device.attributes['gpu.nvidia.com'].productName.lowerAscii().matches('^.*a100.*$')

This is a common way to select a specific type of GPU by its product name.

I can't speak to the complexity of the fix itself (and therefore the risk of backporting), but from a usability point of view, not backporting this means that users of DRA are artificially limited in the types of selection they can do on devices.

/cc @jpbetz to comment on the risk of backporting

@k8s-ci-robot k8s-ci-robot requested a review from jpbetz January 20, 2025 12:11
@k8s-ci-robot
Contributor

@klueska: GitHub didn't allow me to request PR reviews from the following users: to, comment, on, the, risk, of.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

One argument for backporting is that without this fix, the following (seemingly simple) CEL expression fails to evaluate:

- cel:
           expression: |
             device.attributes['gpu.nvidia.com'].productName.lowerAscii().matches('^.*a100.*$')

This is a common way to select a specific type of GPU by its product name.

I can't speak to the complexity of the fix itself (and therefore the risk of backporting), but from a usability point of view, not backporting this means that users of DRA are artificially limited in the types of selection they can do on devices.

/cc @jpbetz to comment on the risk of backporting

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@pohly
Contributor Author

pohly commented Jan 20, 2025

Not backporting also has the negative effect that those users who could give us feedback on writing those more complex CEL expressions would have to wait for 1.33. As seen with the promotion to beta in 1.32, it's not enough to have something available in the master branch. It needs to be released or almost released (= included in a tagged alpha release) before people start taking notice. For users there's the additional complication that they often depend on cloud or Kubernetes providers.

@pohly
Contributor Author

pohly commented Jan 21, 2025

It was pointed out that this used to work in early 1.32 alpha releases (@klueska had tested at that time) and then broke later when adding the CEL cost limit, so this is in fact a regression, at least when also considering alphas. This further strengthens the arguments that this should get cherry-picked.

@jpbetz
Contributor

jpbetz commented Jan 21, 2025

+1 to backporting this. This effectively makes it impossible to use some valid CEL expressions in prior minors. This was not our intent, and it "feels buggy" to have expressions that work on 1.33+ be rejected due to a mistake on expression costing.

I consider this low risk. The way the code change is made, only DRA is impacted, and it's easy to reason about the code and the test coverage serves as a strong safety net.

@liggitt
Member

liggitt commented Jan 21, 2025

It was pointed out that this used to work in early 1.32 alpha releases (@klueska had tested at that time) and then broke later when adding the CEL cost limit

I don't really buy this... any cost limit added to the beta feature could be framed as a regression since it would make arbitrarily expensive expressions allowed in the alpha fail validation. We intend to reject some expressions that worked in alpha, the only question is which / how many.

+1 to backporting this. This effectively makes it impossible to use some valid CEL expressions in prior minors. This was not our intent, and it "feels buggy" to have expressions that work on 1.33+ be rejected due to a mistake on expression costing.

I consider this low risk. The way the code change is made, only DRA is impacted, and it's easy to reason about the code and the test coverage serves as a strong safety net.

I think we're trying to weigh these risks against each other:

  • Risk of alpha users being unable to use reasonable expressions due to overly aggressive cost enforcement. Clarifying how unusable the overly aggressive cost enforcement makes the feature would be helpful. Does "impossible to use some valid CEL expressions" mean most or more complex?
  • Risk of regressing the beta feature in 1.32 with a non-trivial cost estimator change. Clarifying the risk here would be helpful. Worst-case, if the new cost estimation code errored / panicked, what would it break? Just the individual DRA API write request that would have failed due to cost estimation rejection anyway, or could that take down a server / controller?

@johnbelamaric
Member

I don't really buy this... any cost limit added to the beta feature could be framed as a regression since it would make arbitrarily expensive expressions allowed in the alpha fail validation. We intend to reject some expressions that worked in alpha, the only question is which / how many.

The issue here, though, is that this is not an arbitrarily expensive operation: the length is enforced, and was enforced, it just was not counted properly in the cost estimation. I get your point, but I do think that's a material difference from "any expression".

  • Risk of alpha users being unable to use reasonable expressions due to overly aggressive cost enforcement. Clarifying how unusable the overly aggressive cost enforcement makes the feature would be helpful. Does "impossible to use some valid CEL expressions" mean most or more complex?

These are beta users not alpha users.

IIUC, any expression that either makes a comparison between two string attributes, or makes a function call on a string attribute (like a regex match), will fail validation here. In particular, that means no regex matching against these strings. I think the existing test cases only compared these strings to fixed-length strings, so those fixed-length strings were used in the test-case cost estimations.

I am not sure how smart the cost estimator is. It might be possible to "hack" around it by adding an additional, always-true clause that limits the attribute string length. For example, instead of `attribute.matches('.*')` you would write `attribute.length() < 100 && attribute.matches('.*')`. A human could see that the attribute can be assumed to be less than 100 characters and estimate the matches cost accordingly. I don't know if the algorithm is that smart. It's also junk that would likely creep into CEL expressions long into the future...
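Spelled out as a full selector, that hypothetical workaround would look roughly like this (the 100-character guard and the surrounding YAML layout are illustrative assumptions, not something taken from the actual fix):

```yaml
selectors:
- cel:
    # Hypothetical always-true guard: bound the string length so that a
    # sufficiently smart estimator could derive a finite matches() cost.
    expression: >-
      device.attributes['gpu.nvidia.com'].productName.length() < 100 &&
      device.attributes['gpu.nvidia.com'].productName.lowerAscii().matches('^.*a100.*$')
```

Whether the estimator actually propagates the length() bound is exactly the open question raised above; the merged fix makes such a guard unnecessary by bounding attribute strings in the estimator itself.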

@liggitt
Member

liggitt commented Jan 21, 2025

These are beta users not alpha users.

Sorry... I meant users moving from alpha → beta losing the ability to use expressions they did previously. Users who are new to the feature in beta won't have any existing expressions they would consider the current state to regress.

I'd look to folks who know the DRA domain (@pohly, @johnbelamaric, @klueska) to describe how limiting this issue is for alpha → beta users, and the folks who know the cost estimator impl / wiring (@jpbetz, @pohly) to describe the risk to the whole cluster if the new code fails / panics in unexpected ways.

I mostly want to be super clear in the backport justification why we consider this 1) severe enough to backport and 2) low-risk enough to backport.

@pohly
Contributor Author

pohly commented Feb 26, 2025

I ran an experiment to verify that the panic protection works and the blast radius really is just the apiserver.

First I built all components with this patch:

diff --git a/staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go b/staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go
index d59f7d7d4bb..b157e334383 100644
--- a/staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go
+++ b/staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go
@@ -370,6 +370,8 @@ type sizeEstimator struct {
 }
 
 func (s *sizeEstimator) EstimateSize(element checker.AstNode) *checker.SizeEstimate {
+       panic("this should not happen")
+
        path := element.Path()
        if len(path) == 0 {
                // Path() can return an empty list, early exit if it does since we can't

In a cluster with that fault, running an E2E test which uses CEL selectors ("retries pod scheduling after updating device class") fails as expected when trying to create the DeviceClass, without taking down the apiserver:

  I0226 18:09:40.474568 379941 dra.go:1674] Unexpected error: create *v1beta1.DeviceClass: 
      <*url.Error | 0xc0007a4e70>: 
      Post "https://localhost:6444/apis/resource.k8s.io/v1beta1/deviceclasses": stream error: stream ID 71; INTERNAL_ERROR; received from peer
      {
          Op: "Post",
          URL: "https://localhost:6444/apis/resource.k8s.io/v1beta1/deviceclasses",
          Err: <http2.StreamError>{
              StreamID: 71,
              Code: 2,
              Cause: <*errors.errorString | 0x72521f0>{
                  s: "received from peer",
              },
          },
      }

Next I "fixed" the kube-apiserver and restarted it (and only it!). Now the test passes, showing that the kube-scheduler doesn't call this code anymore.

Finally, to rule out that the kube-scheduler perhaps somehow tolerated the panic and worked normally despite it, I broke some other code and then the kube-scheduler really crashed.

@pohly
Contributor Author

pohly commented Feb 26, 2025

I also tested that only the writes are affected. I first created a DeviceClass, then restarted the apiserver with the "bug". Reading the class works, updating it fails, leaving the class unmodified.

@pohly
Contributor Author

pohly commented Feb 26, 2025

I chatted with @liggitt about this. He still doesn't like that we are proposing to make an exception for something that normally wouldn't qualify for backporting and said that he "won't approve, but won't block it".

@johnbelamaric: can you perhaps confirm that we want this backported by adding the LGTM? These are mechanical cherry-picks from master, so no new technical changes.

@thockin: can you approve (needed because it's touching some API files)?

/triage accepted
/priority important-soon

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Feb 26, 2025
@thockin
Member

thockin commented Feb 26, 2025

The only things about this that make it viable to me are that it is beta and actually reported by implementors. I am not sure what the guiding principle would be to say yes to this and no to something else, though. It's not a "bug" per se, but it kind of is, right?

@thockin thockin self-assigned this Feb 26, 2025
@liggitt
Member

liggitt commented Feb 26, 2025

Yeah, thanks for the due diligence on containing the blast radius if there's an accidental bug in this code, and for verifying the behavior in the worst case scenario we can imagine where this new code completely panics. Both of those benefit master / 1.33+ as well, regardless of what we do here.

This is at the point now where we don't think this is risky to backport.

My reluctance comes from years of past experience where we (including I) would approve what we thought were risk-free backports, which later turned out to regress patch releases. From Kubernetes 1.19-1.27, ~35% of our patch releases were regressed due to a backport and worse than the .0 of their minor in some way. That was a terrible track record, and taught users that picking up Kubernetes patch releases was somewhat of a gamble, stability-wise.

In 1.28+, when we got significantly more strict about what we allow in backports, our track record has improved to ~8% of patch releases regressing (and is trending even better in newer minors). We need patch releases to be as close to ~0 risk as we can make them, so we can release critical bugfixes and security fixes and count on fast adoption.

Do I see a specific risk with this backport? Not really, though it is ~big and difficult to eyeball. Do I think that in aggregate, making exceptions and backporting things outside our narrow guidelines makes Kubernetes patch releases worse and less reliable? Yes, and I think the data supports that.

So... yeah... I'm not so concerned about a specific risk on this PR that I'd throw myself in front of it to block area owners / approvers who were pushing for it. But I can't say I approve of fixing up features in place with backports.

@johnbelamaric
Member

I think the limited blast radius and the two independent implementors being limited by this justifies the backport.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 26, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Details: Git tree hash: 493f210cbe5b52cf0cd2bfa4d1229b55975d4cf2

@thockin
Member

thockin commented Feb 26, 2025

If I am left as the tie-breaker, I'll try to fight my risk-aversion instinct and bias toward action.

/approve
/hold last chance to disagree?

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 26, 2025
@pohly
Contributor Author

pohly commented Feb 27, 2025

It's not a "bug" per se, but it kind of is, right?

It is a bug, just not a regression. Fixes for regressions justify a backport (can't have 1.(X+1) be worse than 1.X), fixes for bugs don't (new feature in 1.X simply not usable yet, users have to wait for 1.(X+1)).

Let me get release team approval, then I'll lift the hold.

@pohly
Contributor Author

pohly commented Mar 1, 2025

/cc kubernetes/release-managers

@k8s-ci-robot k8s-ci-robot requested a review from a team March 1, 2025 08:37
@k8s-ci-robot k8s-ci-robot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Mar 3, 2025
@github-project-automation github-project-automation Bot moved this from Needs Reviewer to Needs Approver in SIG Node: code and documentation PRs Mar 3, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/cherry-pick-not-approved Indicates that a PR is not yet approved to merge into a release branch. label Mar 3, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly, saschagrunert, thockin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@pohly
Contributor Author

pohly commented Mar 3, 2025

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 3, 2025
@k8s-ci-robot k8s-ci-robot merged commit 8207fb4 into kubernetes:release-1.32 Mar 3, 2025
@github-project-automation github-project-automation Bot moved this from Needs Approver to Done in SIG Node: code and documentation PRs Mar 3, 2025
@pohly pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Mar 4, 2025