DRA: start scheduler after creating binding/non-binding slicesin Basicflow by tsj-30 · Pull Request #137253 · kubernetes/kubernetes

tsj-30 · 2026-02-26T06:13:28Z

/kind flake
/kind failing-test
Fixed #133435

What this PR does / why we need it:
This PR stabilizes TestDRA/all/DeviceBindingConditions/BasicFlow in binding_conditions_test.go.

The test expects the first claim to allocate without-binding, but in CI the first allocation occasionally came back as with-binding, causing a mismatch like:

actual first allocation: worker-0-with-binding / with-binding
expected first allocation: worker-0-without-binding / without-binding

Expected
        		    <v1.DeviceAllocationResult>: 
        		        results:
        		        - bindingConditions:
        		          - attached
        		          bindingFailureConditions:
        		          - failed
        		          device: with-binding
        		          driver: testdra-all-devicebindingconditions-basicflow-28t9x.driver
        		          pool: worker-0-with-binding
        		          request: req-1
        		to equal
        		    <v1.DeviceAllocationResult>: 
        		        results:
        		        - device: without-binding
        		          driver: testdra-all-devicebindingconditions-basicflow-28t9x.driver
        		          pool: worker-0-without-binding
        		          request: req-1

To make the test deterministic, the ResourceSlice without binding conditions is created before the slice with binding conditions, so the initial allocation path consistently matches test expectations.

Which issue(s) this PR is related to:
N/A

NONE

…binding slice first

k8s-ci-robot · 2026-02-26T06:13:36Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

pohly · 2026-02-26T07:29:54Z

 	class, driverName := createTestClass(tCtx, namespace)
 	startScheduler(tCtx)

+	// Create slice without binding conditions first so it gets allocated first


The order of slice creation doesn't guarantee that, does it?

I don't think this fixes whatever the underlying race is.

What I expect to see in a PR which claims to fix a flake:

A comparison between a "good" run and a "bad" run which explains why the bad one failed.

An explanation why the proposed change prevents the "bad" scenario reliably.

I initially thought order alone was the issue, so I moved the without-binding slice creation earlier.
After debugging further, I now think the real fix is to start the scheduler only after creating both slices, so it consistently picks the expected device first.

tsj-30 · 2026-02-27T04:20:40Z

CC: @ttsuuubasa

tsj-30 · 2026-02-27T04:31:27Z

/test pull-kubernetes-integration

ttsuuubasa · 2026-02-27T08:18:58Z

Does simply moving the startScheduler() to the current location actually guarantee that both ResourceSlices have been created?

First, let’s clarify why the test failed.

If I understand @tsj-30 correctly, the failure happened because the without-binding ResourceSlice had not been created yet (i.e., it wasn’t ready in time), and the ResourceClaim allocation ran while only the with-binding slice existed.
Is that the correct understanding?

If so, it seems necessary to wait until both ResourceSlices are created before creating the ResourceClaim. What do you think?

pohly · 2026-02-27T10:05:06Z

If I understand @tsj-30 correctly, the failure happened because the without-binding ResourceSlice had not been created yet (i.e., it wasn’t ready in time), and the ResourceClaim allocation ran while only the with-binding slice existed.

Log output of a failed run should be able to confirm that hypothesis. It depends a bit on the log level, though. It would be good to see this, otherwise we make changes without really knowing the root cause.

If so, it seems necessary to wait until both ResourceSlices are created before creating the ResourceClaim. What do you think?

The important part is that the scheduler has to have both slices in its informer cache when it schedules the pod. Running the scheduler and then creating them does not guarantee that. Creating the slices and then starting the scheduler is fine because the scheduler first syncs its caches, then starts scheduling pods.

ttsuuubasa · 2026-02-28T08:27:26Z

Log output of a failed run should be able to confirm that hypothesis. It depends a bit on the log level, though. It would be good to see this, otherwise we make changes without really knowing the root cause.

After reviewing the log output, I can NOT say that there was definitive evidence proving the hypothesis.
Between the start of TestDRA/all and the end of DeviceBindingConditions/BasicFlow, there were no clear errors.

However, in a separate ControllerManagerMetrics test, an error occurred when scheduling a Pod, indicating that the DeviceClass did not exist.
Since this DeviceClass should have been created at the beginning of the test, the fact that it was missing suggests that the API server may likewise have been slow to create it.

14:49:34.041314: Started the scheduler for test TestDRA/all.

14:49:34.042580: [ControllerManagerMetrics] device class does not exist

14:49:35.062107: Error in this run

14:49:35.081248: Stopping the scheduler after test TestDRA/all/DeviceBindingConditions/BasicFlow...

How often does this error actually occur?
To verify the hypothesis above, it might be help to log the scheduler’s slice informer cache list right before the corresponding gomega.Expect is executed.

ttsuuubasa · 2026-02-28T08:41:01Z

The important part is that the scheduler has to have both slices in its informer cache when it schedules the pod.

So this means that instead of simply checking for existence with a clientset Get against the API server, we need to check for existence using the scheduler’s informer Get?

ttsuuubasa · 2026-02-28T08:53:41Z

Creating the slices and then starting the scheduler is fine because the scheduler first syncs its caches, then starts scheduling pods.

I understand the benefit of creating the ResourceSlice before starting the scheduler.
However, isn’t there still a possibility that the ResourceSlice creation on the API server may be delayed, causing the scheduler to start before the slice actually exists, leading to the same kind of issue?

Would it make sense to introduce a helper that verifies whether the scheduler’s informer cache actually contains the specified resource?

I also considered adding such cache checks to the existing helpers like createSlice or createClaim, but that would implicitly assume that the scheduler is already running, which may not hold for all test scenarios, right?

pohly · 2026-03-02T07:00:00Z

we need to check for existence using the scheduler’s informer Get?

Even if the informer cache has the object, the event handlers might not. The test code might have access to the informer cache (I think it provides the informer factory), but it probably doesn't have access to all event handlers. Starting the scheduler after setting up objects seems easier.

However, isn’t there still a possibility that the ResourceSlice creation on the API server may be delayed, causing the scheduler to start before the slice actually exists, leading to the same kind of issue?

No. Once a Create call returns, any Get or List call will report the new object.

pohly

/lgtm
/approve

k8s-ci-robot · 2026-03-02T08:46:16Z

LGTM label has been added.

Details

Git tree hash: b0ac151e52f4d0b77b605876c32bfaf25e085353

k8s-ci-robot · 2026-03-02T08:46:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly, tsj-30

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~test/integration/dra/OWNERS~~ [pohly]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

pohly · 2026-03-02T08:47:20Z

Fixed #133435

For the future: please don't auto-close flake issues in the PR. We want to keep the issue open to verify that the flake is really gone.

ttsuuubasa · 2026-03-02T09:54:22Z

Based on the discussion above, I think the proposed changes are fine.

pohly · 2026-03-02T10:49:05Z

/release-note-none

The "NONE" has to be in the verbatim text block.

@tsj-30: The PR uses some parts of the Kubernetes PR template, but it was modified (sections, verbatim text block). I'm curious: was that done manually or AI generated?

pohly · 2026-03-02T10:50:03Z

To make the test deterministic, the ResourceSlice without binding conditions is created before the slice with binding conditions, so the initial allocation path consistently matches test expectations.

This part of the PR description is out-dated. Would be nice to update it, for future reference.

tsj-30 · 2026-03-03T04:22:52Z

Hey @pohly, @ttsuuubasa
Apologies for the delayed response — I haven’t been feeling well the past couple of days and am still recovering.
Thanks for the LGTM 🚀 Really appreciate it!

tsj-30 · 2026-03-03T04:24:41Z

/release-note-none

The "NONE" has to be in the verbatim text block.

@tsj-30: The PR uses some parts of the Kubernetes PR template, but it was modified (sections, verbatim text block). I'm curious: was that done manually or AI generated?

Yaa, I used AI to help draft the PR description and missed that,
Sorry about that — I’ll fix it right away,
Thanks for pointing it out. 😄

…53-upstream-release-1.35 Automated cherry pick of #137253: DRA: start scheduler after creating binding/non-binding slicesin Basicflow

dra: stabilize DeviceBindingConditions BasicFlow by creating without-…

8a7e586

…binding slice first

github-project-automation Bot added this to SIG Node CI/Test Board Feb 26, 2026

k8s-ci-robot added wg/device-management Categorizes an issue or PR as relevant to WG Device Management. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 26, 2026

github-project-automation Bot moved this to Triage in SIG Node CI/Test Board Feb 26, 2026

github-project-automation Bot added this to Dynamic Resource Allocation and SIG Node: code and documentation PRs Feb 26, 2026

github-project-automation Bot moved this to 🆕 New in Dynamic Resource Allocation Feb 26, 2026

github-project-automation Bot moved this to Triage in SIG Node: code and documentation PRs Feb 26, 2026

k8s-ci-robot requested review from bart0sh and pohly February 26, 2026 06:13

pohly reviewed Feb 26, 2026

View reviewed changes

pohly mentioned this pull request Feb 26, 2026

Ktesting: contexthelper panic + test fix #137255

Merged

k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 26, 2026

start scheduler after creating binding/non-binding slices

d7e8ee8

tsj-30 force-pushed the fix-133435 branch from e562940 to d7e8ee8 Compare February 26, 2026 09:17

pohly moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation Mar 2, 2026

pohly approved these changes Mar 2, 2026

View reviewed changes

k8s-ci-robot assigned pohly Mar 2, 2026

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 2, 2026

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 2, 2026

k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 2, 2026

k8s-ci-robot merged commit 2a5b284 into kubernetes:master Mar 2, 2026
13 of 21 checks passed

k8s-ci-robot added this to the v1.36 milestone Mar 2, 2026

github-project-automation Bot moved this from Triage to Done in SIG Node CI/Test Board Mar 2, 2026

github-project-automation Bot moved this from Triage to Done in SIG Node: code and documentation PRs Mar 2, 2026

tsj-30 mentioned this pull request Mar 3, 2026

Automated cherry pick of #137253: DRA: start scheduler after creating binding/non-binding slicesin Basicflow #137366

Merged

pohly moved this from 👀 In review to ✅ Done in Dynamic Resource Allocation Mar 3, 2026

k8s-ci-robot added a commit that referenced this pull request Mar 4, 2026

Merge pull request #137366 from tsj-30/automated-cherry-pick-of-#1372…

60ebb8a

…53-upstream-release-1.35 Automated cherry pick of #137253: DRA: start scheduler after creating binding/non-binding slicesin Basicflow

Conversation

tsj-30 commented Feb 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

k8s-ci-robot commented Feb 26, 2026

Uh oh!

pohly Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

pohly Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

tsj-30 Feb 26, 2026

Choose a reason for hiding this comment

Uh oh!

tsj-30 commented Feb 27, 2026

Uh oh!

tsj-30 commented Feb 27, 2026

Uh oh!

ttsuuubasa commented Feb 27, 2026

Uh oh!

pohly commented Feb 27, 2026

Uh oh!

ttsuuubasa commented Feb 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ttsuuubasa commented Feb 28, 2026

Uh oh!

ttsuuubasa commented Feb 28, 2026

Uh oh!

pohly commented Mar 2, 2026

Uh oh!

pohly left a comment

Choose a reason for hiding this comment

Uh oh!

k8s-ci-robot commented Mar 2, 2026

Uh oh!

k8s-ci-robot commented Mar 2, 2026

Uh oh!

pohly commented Mar 2, 2026

Uh oh!

ttsuuubasa commented Mar 2, 2026

Uh oh!

pohly commented Mar 2, 2026

Uh oh!

pohly commented Mar 2, 2026

Uh oh!

Uh oh!

tsj-30 commented Mar 3, 2026

Uh oh!

tsj-30 commented Mar 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

tsj-30 commented Feb 26, 2026 •

edited

Loading

ttsuuubasa commented Feb 28, 2026 •

edited

Loading