Skip to content

DRA: start scheduler after creating binding/non-binding slicesin Basicflow#137253

Merged
k8s-ci-robot merged 2 commits into
kubernetes:masterfrom
tsj-30:fix-133435
Mar 2, 2026
Merged

DRA: start scheduler after creating binding/non-binding slicesin Basicflow#137253
k8s-ci-robot merged 2 commits into
kubernetes:masterfrom
tsj-30:fix-133435

Conversation

@tsj-30

@tsj-30 tsj-30 commented Feb 26, 2026

Copy link
Copy Markdown
Contributor

/kind flake
/kind failing-test
Fixed #133435

What this PR does / why we need it:
This PR stabilizes TestDRA/all/DeviceBindingConditions/BasicFlow in binding_conditions_test.go.

The test expects the first claim to allocate without-binding, but in CI the first allocation occasionally came back as with-binding, causing a mismatch like:

actual first allocation: worker-0-with-binding / with-binding
expected first allocation: worker-0-without-binding / without-binding

Expected
        		    <v1.DeviceAllocationResult>: 
        		        results:
        		        - bindingConditions:
        		          - attached
        		          bindingFailureConditions:
        		          - failed
        		          device: with-binding
        		          driver: testdra-all-devicebindingconditions-basicflow-28t9x.driver
        		          pool: worker-0-with-binding
        		          request: req-1
        		to equal
        		    <v1.DeviceAllocationResult>: 
        		        results:
        		        - device: without-binding
        		          driver: testdra-all-devicebindingconditions-basicflow-28t9x.driver
        		          pool: worker-0-without-binding
        		          request: req-1

To make the test deterministic, the ResourceSlice without binding conditions is created before the slice with binding conditions, so the initial allocation path consistently matches test expectations.

Which issue(s) this PR is related to:
N/A

NONE

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 26, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/test sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Feb 26, 2026
@k8s-ci-robot k8s-ci-robot added wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 26, 2026
class, driverName := createTestClass(tCtx, namespace)
startScheduler(tCtx)

// Create slice without binding conditions first so it gets allocated first

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of slice creation doesn't guarantee that, does it?

I don't think this fixes whatever the underlying race is.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I expect to see in a PR which claims to fix a flake:

  • A comparison between a "good" run and a "bad" run which explains why the bad one failed.
  • An explanation why the proposed change prevents the "bad" scenario reliably.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially thought order alone was the issue, so I moved the without-binding slice creation earlier.
After debugging further, I now think the real fix is to start the scheduler only after creating both slices, so it consistently picks the expected device first.

@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 26, 2026
@tsj-30

tsj-30 commented Feb 27, 2026

Copy link
Copy Markdown
Contributor Author

CC: @ttsuuubasa

@tsj-30

tsj-30 commented Feb 27, 2026

Copy link
Copy Markdown
Contributor Author

/test pull-kubernetes-integration

@ttsuuubasa

Copy link
Copy Markdown
Contributor

Does simply moving the startScheduler() to the current location actually guarantee that both ResourceSlices have been created?

First, let’s clarify why the test failed.

If I understand @tsj-30 correctly, the failure happened because the without-binding ResourceSlice had not been created yet (i.e., it wasn’t ready in time), and the ResourceClaim allocation ran while only the with-binding slice existed.
Is that the correct understanding?

If so, it seems necessary to wait until both ResourceSlices are created before creating the ResourceClaim. What do you think?

@pohly

pohly commented Feb 27, 2026

Copy link
Copy Markdown
Contributor

If I understand @tsj-30 correctly, the failure happened because the without-binding ResourceSlice had not been created yet (i.e., it wasn’t ready in time), and the ResourceClaim allocation ran while only the with-binding slice existed.

Log output of a failed run should be able to confirm that hypothesis. It depends a bit on the log level, though. It would be good to see this, otherwise we make changes without really knowing the root cause.

If so, it seems necessary to wait until both ResourceSlices are created before creating the ResourceClaim. What do you think?

The important part is that the scheduler has to have both slices in its informer cache when it schedules the pod. Running the scheduler and then creating them does not guarantee that. Creating the slices and then starting the scheduler is fine because the scheduler first syncs its caches, then starts scheduling pods.

@ttsuuubasa

ttsuuubasa commented Feb 28, 2026

Copy link
Copy Markdown
Contributor

Log output of a failed run should be able to confirm that hypothesis. It depends a bit on the log level, though. It would be good to see this, otherwise we make changes without really knowing the root cause.

After reviewing the log output, I can NOT say that there was definitive evidence proving the hypothesis.
Between the start of TestDRA/all and the end of DeviceBindingConditions/BasicFlow, there were no clear errors.

However, in a separate ControllerManagerMetrics test, an error occurred when scheduling a Pod, indicating that the DeviceClass did not exist.
Since this DeviceClass should have been created at the beginning of the test, the fact that it was missing suggests that the API server may likewise have been slow to create it.

14:49:34.041314: Started the scheduler for test TestDRA/all.

14:49:34.042580: [ControllerManagerMetrics] device class does not exist

14:49:35.062107: Error in this run

14:49:35.081248: Stopping the scheduler after test TestDRA/all/DeviceBindingConditions/BasicFlow...

How often does this error actually occur?
To verify the hypothesis above, it might be help to log the scheduler’s slice informer cache list right before the corresponding gomega.Expect is executed.

@ttsuuubasa

Copy link
Copy Markdown
Contributor

The important part is that the scheduler has to have both slices in its informer cache when it schedules the pod.

So this means that instead of simply checking for existence with a clientset Get against the API server, we need to check for existence using the scheduler’s informer Get?

@ttsuuubasa

Copy link
Copy Markdown
Contributor

Creating the slices and then starting the scheduler is fine because the scheduler first syncs its caches, then starts scheduling pods.

I understand the benefit of creating the ResourceSlice before starting the scheduler.
However, isn’t there still a possibility that the ResourceSlice creation on the API server may be delayed, causing the scheduler to start before the slice actually exists, leading to the same kind of issue?

Would it make sense to introduce a helper that verifies whether the scheduler’s informer cache actually contains the specified resource?

I also considered adding such cache checks to the existing helpers like createSlice or createClaim, but that would implicitly assume that the scheduler is already running, which may not hold for all test scenarios, right?

@pohly

pohly commented Mar 2, 2026

Copy link
Copy Markdown
Contributor

we need to check for existence using the scheduler’s informer Get?

Even if the informer cache has the object, the event handlers might not. The test code might have access to the informer cache (I think it provides the informer factory), but it probably doesn't have access to all event handlers. Starting the scheduler after setting up objects seems easier.

However, isn’t there still a possibility that the ResourceSlice creation on the API server may be delayed, causing the scheduler to start before the slice actually exists, leading to the same kind of issue?

No. Once a Create call returns, any Get or List call will report the new object.

@pohly pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Mar 2, 2026

@pohly pohly left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 2, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

LGTM label has been added.

DetailsGit tree hash: b0ac151e52f4d0b77b605876c32bfaf25e085353

@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly, tsj-30

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 2, 2026
@pohly

pohly commented Mar 2, 2026

Copy link
Copy Markdown
Contributor

Fixed #133435

For the future: please don't auto-close flake issues in the PR. We want to keep the issue open to verify that the flake is really gone.

@ttsuuubasa

Copy link
Copy Markdown
Contributor

Based on the discussion above, I think the proposed changes are fine.

@pohly

pohly commented Mar 2, 2026

Copy link
Copy Markdown
Contributor

/release-note-none

The "NONE" has to be in the verbatim text block.

@tsj-30: The PR uses some parts of the Kubernetes PR template, but it was modified (sections, verbatim text block). I'm curious: was that done manually or AI generated?

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 2, 2026
@pohly

pohly commented Mar 2, 2026

Copy link
Copy Markdown
Contributor

To make the test deterministic, the ResourceSlice without binding conditions is created before the slice with binding conditions, so the initial allocation path consistently matches test expectations.

This part of the PR description is out-dated. Would be nice to update it, for future reference.

@k8s-ci-robot k8s-ci-robot merged commit 2a5b284 into kubernetes:master Mar 2, 2026
13 of 21 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.36 milestone Mar 2, 2026
@github-project-automation github-project-automation Bot moved this from Triage to Done in SIG Node CI/Test Board Mar 2, 2026
@tsj-30

tsj-30 commented Mar 3, 2026

Copy link
Copy Markdown
Contributor Author

Hey @pohly, @ttsuuubasa
Apologies for the delayed response — I haven’t been feeling well the past couple of days and am still recovering.
Thanks for the LGTM 🚀 Really appreciate it!

@tsj-30

tsj-30 commented Mar 3, 2026

Copy link
Copy Markdown
Contributor Author

/release-note-none

The "NONE" has to be in the verbatim text block.

@tsj-30: The PR uses some parts of the Kubernetes PR template, but it was modified (sections, verbatim text block). I'm curious: was that done manually or AI generated?

Yaa, I used AI to help draft the PR description and missed that,
Sorry about that — I’ll fix it right away,
Thanks for pointing it out. 😄

@pohly pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Mar 3, 2026
k8s-ci-robot added a commit that referenced this pull request Mar 4, 2026
…53-upstream-release-1.35

Automated cherry pick of #137253:  DRA: start scheduler after creating binding/non-binding slicesin Basicflow
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. wg/device-management Categorizes an issue or PR as relevant to WG Device Management.

Development

Successfully merging this pull request may close these issues.

[Flaking test] [DRA]TestDRA/all/DeviceBindingConditions

4 participants