Skip to content

DRA CEL: add missing size estimator#129661

Merged
k8s-ci-robot merged 1 commit intokubernetes:masterfrom
pohly:dra-cell-cost-limit-increase
Jan 17, 2025
Merged

DRA CEL: add missing size estimator#129661
k8s-ci-robot merged 1 commit intokubernetes:masterfrom
pohly:dra-cell-cost-limit-increase

Conversation

@pohly
Copy link
Copy Markdown
Contributor

@pohly pohly commented Jan 16, 2025

What type of PR is this?

/kind bug
/priority important-soon
/assign @jpbetz

Needs to be backported to 1.32.

What this PR does / why we need it:

Not implementing a size estimator had the effect that strings retrieved from the attributes were treated as "unknown size", leading to wildly overestimating the cost and validation errors even for even simple expressions like this:

device.attributes["qat.intel.com"].services.matches("[^a]?sym")

Maximum number of elements in maps and the maximum length of the driver name string were also ignored resp. missing. Pre-defined types like apiservercel.StringType must be avoided because they are defined as having a zero maximum size.

Which issue(s) this PR fixes:

Fixes #129523

Special notes for your reviewer:

This might change the cost estimate for some potentially stored CEL expressions involving the driver name. This is okay because the validation was implemented so that all stored expressions are assumed to be valid, i.e. a down- or upgrade will not lead to validation errors when the different versions of the apiserver implement cost estimation differently.

Does this PR introduce a user-facing change?

DRA: CEL expressions using attribute strings exceeded the cost limit because their cost estimation was incomplete.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/issues/4381

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 16, 2025
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API sig/node Categorizes an issue or PR as relevant to SIG Node. wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 16, 2025
@pohly pohly force-pushed the dra-cell-cost-limit-increase branch from 23dcfed to d979e57 Compare January 16, 2025 12:30
@pohly pohly changed the title WIP: DRA CEL: add missing size estimator DRA CEL: add missing size estimator Jan 16, 2025
@pohly pohly marked this pull request as ready for review January 16, 2025 12:31
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 16, 2025
@Adarsh-verma-14
Copy link
Copy Markdown

/test pull-kubernetes-e2e-kind
/test pull-kubernetes-unit

@k8s-triage-robot
Copy link
Copy Markdown

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@pohly pohly force-pushed the dra-cell-cost-limit-increase branch from d979e57 to e3d839f Compare January 16, 2025 14:47
@pohly
Copy link
Copy Markdown
Contributor Author

pohly commented Jan 16, 2025

@Adarsh-verma-14: the unit test failure is genuine, I'll have to update validation.

field(driverVar, apiservercel.StringType, true),
field(attributesVar, apiservercel.NewMapType(apiservercel.StringType, apiservercel.NewMapType(apiservercel.StringType, apiservercel.AnyType, resourceapi.ResourceSliceMaxAttributesAndCapacitiesPerDevice), resourceapi.ResourceSliceMaxAttributesAndCapacitiesPerDevice), true),
field(capacityVar, apiservercel.NewMapType(apiservercel.StringType, apiservercel.NewMapType(apiservercel.StringType, apiservercel.QuantityDeclType, resourceapi.ResourceSliceMaxAttributesAndCapacitiesPerDevice), resourceapi.ResourceSliceMaxAttributesAndCapacitiesPerDevice), true),
field(driverVar, driverType, true),
Copy link
Copy Markdown
Contributor

@jpbetz jpbetz Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just brainstorming on how to ensure someone doesn't accidentally forget to set a MaxElements on a type..

Option: field() takes the MaxElements as an argument, you could define consts for the values provided.
Option: field() takes a type alias to apiservercel.AnyType and this package provides a newType function that requires the MaxElements argument and returns the alias type.
Option: field() asserts that passed types (at least the string and bytes typed ones, maybe others) have MaxElements set.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't limited to fields, nor to this particular source code.

Someone else in some entirely different package might do apiservercel.NewMapType(apiservercel.StringType, apiservercel.StringType, myMaxNumberOfKeys) without realizing that they should use some custom string type with MaxElements set.

To disarm this footgun everywhere, those pre-defined types would have to be removed in favor of constructor methods which force passing a maxElements parameters.

Alternatively, the size estimator could refuse to use an unset MaxElements. But that then only fails if someone is careful with unit testing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True.. I'd be OK with opening a separate issue for the pre-defined types -> constructor method idea.

WDYT @cici37 is this something the CEL contributors would be willing to work on?

Not implementing a size estimator had the effect that strings retrieved from
the attributes were treated as "unknown size", leading to wildly overestimating
the cost and validation errors even for even simple expressions like this:

    device.attributes["qat.intel.com"].services.matches("[^a]?sym")

Maximum number of elements in maps and the maximum length of the driver name
string were also ignored resp. missing. Pre-defined types like
apiservercel.StringType must be avoided because they are defined as having
a zero maximum size.
@pohly pohly force-pushed the dra-cell-cost-limit-increase branch from e3d839f to 2cc3dbf Compare January 16, 2025 15:36
@jpbetz
Copy link
Copy Markdown
Contributor

jpbetz commented Jan 16, 2025

/lgtm
For API machinery bits and the use of MaxElements.
I'm okay with #129661 (comment) being further improved as a follow up, possibly by CEL maintainers.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 16, 2025
@k8s-ci-robot
Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: fe5a09234b007e7ae58eb414661716901e0c72ab

@pohly
Copy link
Copy Markdown
Contributor Author

pohly commented Jan 16, 2025

/retest

"Pod scheduling timeout".

@pohly
Copy link
Copy Markdown
Contributor Author

pohly commented Jan 16, 2025

/label api-review

@k8s-ci-robot k8s-ci-robot added the api-review Categorizes an issue or PR as actively needing an API review. label Jan 16, 2025

// DriverNameMaxLength is the maximum valid length of a driver name in the
// ResourceSliceSpec and other places. It's the same as for CSI driver names.
const DriverNameMaxLength = 63
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference:

if len(driverName) > 63 {
allErrs = append(allErrs, field.TooLong(fldPath, "" /*unused*/, 63))
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And

validateDriverName = corevalidation.ValidateCSIDriverName

if len(path) == 0 {
// Path() can return an empty list, early exit if it does since we can't
// provide size estimates when that happens
return nil
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does a nil return turn into the worst-case size estimate?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe so. This is copy-and-pasted from the size estimator

func (c *sizeEstimator) EstimateSize(element checker.AstNode) *checker.SizeEstimate {
if len(element.Path()) == 0 {
// Path() can return an empty list, early exit if it does since we can't
// provide size estimates when that happens
return nil
}
currentNode := c.root
// cut off "self" from path, since we always start there
for _, name := range element.Path()[1:] {
switch name {
case "@items", "@values":
if currentNode.ElemType == nil {
return nil
}
currentNode = currentNode.ElemType
case "@keys":
if currentNode.KeyType == nil {
return nil
}
currentNode = currentNode.KeyType
default:
field, ok := currentNode.Fields[name]
if !ok {
return nil
}
if field.Type == nil {
return nil
}
currentNode = field.Type
}
}
return &checker.SizeEstimate{Min: 0, Max: uint64(currentNode.MaxElements)}
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liggitt
Copy link
Copy Markdown
Member

liggitt commented Jan 16, 2025

Needs to be backported to 1.32.

What's the rationale for backporting? At first glance, this looks like a pretty complex change to backport, and this isn't fixing a regression, right?

@liggitt
Copy link
Copy Markdown
Member

liggitt commented Jan 16, 2025

/approve

for API constants

@k8s-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, pohly

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 16, 2025
@k8s-ci-robot k8s-ci-robot merged commit 488fac4 into kubernetes:master Jan 17, 2025
@k8s-ci-robot k8s-ci-robot added this to the v1.33 milestone Jan 17, 2025
@pohly
Copy link
Copy Markdown
Contributor Author

pohly commented Jan 17, 2025

What's the rationale for backporting? At first glance, this looks like a pretty complex change to backport, and this isn't fixing a regression, right?

Not a regression, but a pretty significant bug that limits what users can do with DRA in 1.32. I'll elaborate further on this in the cherry-pick PR and we can discuss there.

k8s-ci-robot added a commit that referenced this pull request Mar 3, 2025
…1-origin-release-1.32

Automated cherry pick of #129661: DRA CEL: add missing size estimator
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

api-review Categorizes an issue or PR as actively needing an API review. approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Projects

Status: API review completed, 1.33
Status: ✅ Done

Development

Successfully merging this pull request may close these issues.

DRA: Increase CEL expression complexity limit in resourceClaimTemplate

6 participants