Automated cherry pick of #129661: DRA CEL: add missing size estimator (#129690)
Conversation
Not implementing a size estimator had the effect that strings retrieved from
the attributes were treated as "unknown size", leading to wildly overestimated
costs and validation errors even for simple expressions like this:
device.attributes["qat.intel.com"].services.matches("[^a]?sym")
The maximum number of elements in maps and the maximum length of the driver
name string were also either ignored or missing. Pre-defined types like
apiservercel.StringType must be avoided because they are defined as having
a zero maximum size.
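For readers unfamiliar with the mechanism, here is a minimal sketch of what such a size estimator can look like, assuming cel-go's checker.CostEstimator interface; the bound constant and package layout are illustrative assumptions, not the actual code or values from #129661:

```go
// Sketch of a size estimator for DRA CEL attribute strings. The limit is an
// assumed illustrative bound, not the one used by the real fix.
package cel

import (
	"github.com/google/cel-go/checker"
)

// maxAttributeValueLen is a hypothetical upper bound on attribute strings.
const maxAttributeValueLen = 64

type sizeEstimator struct{}

// EstimateSize hands the cost checker a finite upper bound for values reached
// through the attributes map. Without such a bound, cel-go treats the strings
// as "unknown size" and the worst-case cost estimate explodes, which is what
// made even simple .matches() expressions fail validation.
func (s *sizeEstimator) EstimateSize(element checker.AstNode) *checker.SizeEstimate {
	if len(element.Path()) == 0 {
		// No path information: fall back to cel-go's default estimation.
		return nil
	}
	return &checker.SizeEstimate{Min: 0, Max: maxAttributeValueLen}
}

// EstimateCallCost leaves per-call costs to cel-go's defaults in this sketch.
func (s *sizeEstimator) EstimateCallCost(function, overloadID string, target *checker.AstNode, args []checker.AstNode) *checker.CallEstimate {
	return nil
}
```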
|
This is not an obvious candidate for cherry-picking (no security issue, no regression because this specific usage has never worked, and not critical because there is no data loss, etc.). But it affects a beta feature that people wanted to use with 1.32, and the problem was reported by different driver authors (Intel for QAT, NVIDIA for GPUs) because they ran into it when preparing driver releases for 1.32. So let's discuss whether it's worth cherry-picking...
|
|
/cc @liggitt because he was concerned about the intent to backport, and for API approval. |
|
One argument for backporting is that without this fix, a seemingly simple kind of CEL expression fails to evaluate: matching on a product-name attribute, which is a common way to select a specific type of GPU (a hypothetical sketch follows below). I can't speak to the complexity of the fix itself (and therefore the risk of backporting), but from a usability point of view, not backporting this means that users of DRA are artificially limited in the types of selection they can do on devices. /cc @jpbetz to comment on the risk of backporting |
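The actual expression did not survive in this thread; purely as a hypothetical reconstruction of its shape (the driver domain and attribute name below are assumptions, not quoted from the original comment), it may have looked something like this:

```go
// Hypothetical reconstruction only: the attribute domain and name are
// assumed for illustration, not taken from the original comment.
package cel

const selectGPUByProductName = `device.attributes["gpu.example.com"].productName.matches("A100")`
```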
|
@klueska: GitHub didn't allow me to request PR reviews from the following users: to, comment, on, the, risk, of. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Not backporting also has the negative effect that the users who could give us feedback on writing more complex CEL expressions would have to wait for 1.33. As seen with the promotion to beta in 1.32, it's not enough to have something available in the master branch: it needs to be released or almost released (i.e., included in a tagged alpha release) before people start taking notice. For users there's the additional complication that they often depend on cloud or Kubernetes providers. |
|
It was pointed out that this used to work in early 1.32 alpha releases (@klueska had tested at that time) and then broke later when adding the CEL cost limit, so this is in fact a regression, at least when also considering alphas. This further strengthens the arguments that this should get cherry-picked. |
|
+1 to backporting this. This effectively makes it impossible to use some valid CEL expressions in prior minors. This was not our intent, and it "feels buggy" to have expressions that work on 1.33+ be rejected due to a mistake on expression costing. I consider this low risk. The way the code change is made, only DRA is impacted, and it's easy to reason about the code and the test coverage serves as a strong safety net. |
I don't really buy this... any cost limit added to the beta feature could be framed as a regression, since it would make arbitrarily expensive expressions that were allowed in the alpha fail validation. We intend to reject some expressions that worked in alpha; the only question is which / how many.
I think we're trying to weigh these risks against each other:
|
The issue here, though, is that this is not an arbitrarily expensive operation: the length is and always was enforced, it just wasn't counted properly in the cost estimation. I get your point, but I do think that's a material difference from "any expression".
These are beta users, not alpha users. IIUC, any expression that either makes a comparison between two string attributes, or makes a function call on a string attribute (like a regex match), will fail validation here. In particular, that means no regex matching against these strings. I think the test cases only did comparisons of these strings against fixed-length strings, so those fixed-length strings were used in the test case cost estimations. I am not sure how smart the cost estimator is. It might be possible to "hack" around it by adding an additional, always-true clause that limits the attribute string length; for example, instead of matching the attribute directly, prepend a length check (a sketch follows below). |
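The original example was cut off in this thread; here is a sketch of what such an always-true guard clause could look like, reusing the expression from the PR description (the 128 limit is an arbitrary assumed bound, and whether cel-go's cost checker actually narrows its size estimate based on such a clause was left unverified in the discussion):

```go
// Hypothetical workaround sketch: bound the string size with an always-true
// clause before the expensive regex match.
package cel

const (
	// Fails cost validation without the size estimator fix:
	direct = `device.attributes["qat.intel.com"].services.matches("[^a]?sym")`

	// Possible hack: assert a length bound first, hoping the checker uses it.
	guarded = `device.attributes["qat.intel.com"].services.size() < 128 && ` +
		`device.attributes["qat.intel.com"].services.matches("[^a]?sym")`
)
```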
Sorry... I meant users moving from alpha → beta losing the ability to use expressions they did previously. Users new to the feature in beta won't have any existing expressions that they would consider the current state to regress. I'd look to folks who know the DRA domain (@pohly, @johnbelamaric, @klueska) to describe how limiting this issue is for alpha → beta users, and to the folks who know the cost estimator implementation / wiring (@jpbetz, @pohly) to describe the risk to the whole cluster if the new code fails / panics in unexpected ways. I mostly want to be super clear in the backport justification about why we consider this 1) severe enough to backport and 2) low-risk enough to backport. |
|
I ran an experiment to verify that the panic protection works and the blast radius really is just the apiserver. First I built all components with this patch:

```diff
diff --git a/staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go b/staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go
index d59f7d7d4bb..b157e334383 100644
--- a/staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go
+++ b/staging/src/k8s.io/dynamic-resource-allocation/cel/compile.go
@@ -370,6 +370,8 @@ type sizeEstimator struct {
 }
 
 func (s *sizeEstimator) EstimateSize(element checker.AstNode) *checker.SizeEstimate {
+	panic("this should not happen")
+
 	path := element.Path()
 	if len(path) == 0 {
 		// Path() can return an empty list, early exit if it does since we can't
```

In a cluster with that fault, running an E2E test which uses CEL selectors ("retries pod scheduling after updating device class") fails as expected when trying to create the DeviceClass, without taking down the apiserver. Next I "fixed" the kube-apiserver and restarted it (and only it!). Now the test passes, showing that the kube-scheduler doesn't call this code anymore. Finally, to rule out that the kube-scheduler perhaps somehow tolerated the panic and worked normally despite it, I broke some other code, and then the kube-scheduler really crashed. |
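For context on why a panic like this stays contained to the failing request rather than taking down the whole process, here is a generic sketch of the per-request recovery pattern that Go HTTP servers (including the apiserver's handler chain) rely on; this illustrates the mechanism and is not the apiserver's actual code:

```go
// Generic illustration of per-request panic containment in a Go HTTP server.
package main

import (
	"fmt"
	"net/http"
)

// withRecovery turns a panic in one request handler into a 500 response for
// that request only; the process keeps serving other requests.
func withRecovery(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if err := recover(); err != nil {
				http.Error(w, fmt.Sprintf("internal error: %v", err), http.StatusInternalServerError)
			}
		}()
		next(w, r)
	}
}

func main() {
	// Mirrors the injected fault above: the handler panics, the server stays up.
	http.HandleFunc("/validate", withRecovery(func(w http.ResponseWriter, r *http.Request) {
		panic("this should not happen")
	}))
	http.ListenAndServe(":8080", nil)
}
```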
|
I also tested that only writes are affected. I first created a DeviceClass, then restarted the apiserver with the "bug". Reading the class works; updating it fails, leaving the class unmodified. |
|
I chatted with @liggitt about this. He still doesn't like that we are proposing to make an exception for something that normally wouldn't qualify for backporting and said that he "won't approve, but won't block it". @johnbelamaric: can you perhaps confirm that we want this backported by adding the LGTM? These are mechanical cherry-picks from master, so no new technical changes. @thockin: can you approve (needed because it's touching some API files)? /triage accepted |
|
The only things about this that make it viable to me are that it is beta and actually reported by implementors. I am not sure what the guiding principle would be to say yes to this and no to something else, though. It's not a "bug" per se, but it kind of is, right? |
|
Yeah, thanks for the due diligence on containing the blast radius if there's an accidental bug in this code, and for verifying the behavior in the worst case scenario we can imagine, where this new code completely panics. Both of those benefit master / 1.33+ as well, regardless of what we do here. This is at the point now where we don't think this is risky to backport.

My reluctance comes from years of past experience where we (myself included) would approve what we thought were risk-free backports, which later turned out to regress patch releases. From Kubernetes 1.19-1.27, ~35% of our patch releases were regressed due to a backport and worse than the .0 of their minor in some way. That was a terrible track record, and it taught users that picking up Kubernetes patch releases was somewhat of a gamble, stability-wise. In 1.28+, when we got significantly more strict about what we allow in backports, our track record has improved to ~8% of patch releases regressing (and is trending even better in newer minors). We need patch releases to be as close to ~0 risk as we can make them, so we can release critical bugfixes and security fixes and count on fast adoption.

Do I see a specific risk with this backport? Not really, though it is ~big and difficult to eyeball. Do I think that, in aggregate, making exceptions and backporting things outside our narrow guidelines makes Kubernetes patch releases worse and less reliable? Yes, and I think the data supports that. So... yeah... I'm not so concerned about a specific risk on this PR that I'd throw myself in front of it to block area owners / approvers who were pushing for it. But I can't say I approve of fixing up features in place with backports. |
|
I think the limited blast radius and the two independent implementors being limited by this justifies the backport. /lgtm |
|
LGTM label has been added. Git tree hash: 493f210cbe5b52cf0cd2bfa4d1229b55975d4cf2 |
|
If I am left as the tie-breaker, I'll try to fight my risk-aversion instinct and bias toward action. /approve |
It is a bug, just not a regression. Fixes for regressions justify a backport (can't have 1.(X+1) be worse than 1.X), fixes for bugs don't (new feature in 1.X simply not usable yet, users have to wait for 1.(X+1)). Let me get release team approval, then I'll lift the hold. |
|
/cc kubernetes/release-managers |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: pohly, saschagrunert, thockin. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. |
|
/hold cancel |
Cherry pick of #129661 on release-1.32.
Cherry pick of #129749 on release-1.32.
#129661: DRA CEL: add missing size estimator
#129749: DRA: CEL cost estimate during validation
For details on the cherry pick process, see the cherry pick requests page.