DRA: start scheduler after creating binding/non-binding slicesin Basicflow#137253
Conversation
…binding slice first
|
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
| class, driverName := createTestClass(tCtx, namespace) | ||
| startScheduler(tCtx) | ||
|
|
||
| // Create slice without binding conditions first so it gets allocated first |
There was a problem hiding this comment.
The order of slice creation doesn't guarantee that, does it?
I don't think this fixes whatever the underlying race is.
There was a problem hiding this comment.
What I expect to see in a PR which claims to fix a flake:
- A comparison between a "good" run and a "bad" run which explains why the bad one failed.
- An explanation why the proposed change prevents the "bad" scenario reliably.
There was a problem hiding this comment.
I initially thought order alone was the issue, so I moved the without-binding slice creation earlier.
After debugging further, I now think the real fix is to start the scheduler only after creating both slices, so it consistently picks the expected device first.
|
CC: @ttsuuubasa |
|
/test pull-kubernetes-integration |
|
Does simply moving the First, let’s clarify why the test failed. If I understand @tsj-30 correctly, the failure happened because the If so, it seems necessary to wait until both ResourceSlices are created before creating the ResourceClaim. What do you think? |
Log output of a failed run should be able to confirm that hypothesis. It depends a bit on the log level, though. It would be good to see this, otherwise we make changes without really knowing the root cause.
The important part is that the scheduler has to have both slices in its informer cache when it schedules the pod. Running the scheduler and then creating them does not guarantee that. Creating the slices and then starting the scheduler is fine because the scheduler first syncs its caches, then starts scheduling pods. |
After reviewing the log output, I can NOT say that there was definitive evidence proving the hypothesis. However, in a separate How often does this error actually occur? |
So this means that instead of simply checking for existence with a clientset Get against the API server, we need to check for existence using the scheduler’s informer Get? |
I understand the benefit of creating the ResourceSlice before starting the scheduler. Would it make sense to introduce a helper that verifies whether the scheduler’s informer cache actually contains the specified resource? I also considered adding such cache checks to the existing helpers like |
Even if the informer cache has the object, the event handlers might not. The test code might have access to the informer cache (I think it provides the informer factory), but it probably doesn't have access to all event handlers. Starting the scheduler after setting up objects seems easier.
No. Once a Create call returns, any Get or List call will report the new object. |
|
LGTM label has been added. DetailsGit tree hash: b0ac151e52f4d0b77b605876c32bfaf25e085353 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: pohly, tsj-30 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
For the future: please don't auto-close flake issues in the PR. We want to keep the issue open to verify that the flake is really gone. |
|
Based on the discussion above, I think the proposed changes are fine. |
|
/release-note-none The "NONE" has to be in the verbatim text block. @tsj-30: The PR uses some parts of the Kubernetes PR template, but it was modified (sections, verbatim text block). I'm curious: was that done manually or AI generated? |
This part of the PR description is out-dated. Would be nice to update it, for future reference. |
|
Hey @pohly, @ttsuuubasa |
Yaa, I used AI to help draft the PR description and missed that, |
…53-upstream-release-1.35 Automated cherry pick of #137253: DRA: start scheduler after creating binding/non-binding slicesin Basicflow
/kind flake
/kind failing-test
Fixed #133435
What this PR does / why we need it:
This PR stabilizes TestDRA/all/DeviceBindingConditions/BasicFlow in binding_conditions_test.go.
The test expects the first claim to allocate without-binding, but in CI the first allocation occasionally came back as with-binding, causing a mismatch like:
actual first allocation: worker-0-with-binding / with-binding
expected first allocation: worker-0-without-binding / without-binding
To make the test deterministic, the ResourceSlice without binding conditions is created before the slice with binding conditions, so the initial allocation path consistently matches test expectations.
Which issue(s) this PR is related to:
N/A