-
Notifications
You must be signed in to change notification settings - Fork 507
[release-0.14] pkg/cache/scheduler: add exclusion stats to TAS failure messages #8169
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[release-0.14] pkg/cache/scheduler: add exclusion stats to TAS failure messages #8169
Conversation
✅ Deploy Preview for kubernetes-sigs-kueue canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR enhances TAS (Topology Aware Scheduling) failure messages by adding detailed exclusion statistics showing why nodes were excluded during scheduling. It's a manual cherry-pick of #8043 to the release-0.14 branch.
Key changes:
- Adds
ExclusionStatsstruct to track why nodes are excluded (taints, nodeSelector, affinity, topology domain, resources) - Extends
CountInWithLimitingResourcefunction to identify which resource is most constraining when pods can't fit - Updates TAS failure messages to include total node count and breakdown of exclusion reasons
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/cache/scheduler/tas_flavor_snapshot.go | Adds ExclusionStats tracking, updates fillInCounts to record exclusions, enhances notFitMessage to include exclusion details |
| pkg/resources/requests.go | Adds CountInWithLimitingResource function returning both count and limiting resource with alphabetical tie-breaking; refactors CountIn to use it |
| pkg/resources/requests_test.go | Adds comprehensive tests for CountInWithLimitingResource covering tie-breaking, missing resources, and various limiting scenarios |
| pkg/scheduler/scheduler_tas_test.go | Updates test expectations to include exclusion stats in TAS failure messages and admission event messages |
| pkg/cache/scheduler/tas_cache_test.go | Updates test expectations for failure messages and adds new tests for exclusion stats formatting with multiple exclusion reasons |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| count = int32(cap / rValue) | ||
| } | ||
| if result == nil || count < *result { | ||
| // Tie-break between CPU and memory counts to ensure deterministic results. |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] The comment mentions "CPU and memory" specifically, but the tie-breaking logic applies to all resource types, not just CPU and memory. Consider updating the comment to be more general, e.g., "Tie-break between resources with equal counts to ensure deterministic results."
| // Tie-break between CPU and memory counts to ensure deterministic results. | |
| // Tie-break between resources with equal counts to ensure deterministic results. |
| } | ||
|
|
||
| func (s *TASFlavorSnapshot) findLevelWithFitDomains(levelIdx int, required bool, podSetSize int32, leaderPodSetSize int32, sliceSize int32, unconstrained bool) (int, []*domain, string) { | ||
| func (s *TASFlavorSnapshot) findLevelWithFitDomains(levelIdx int, required bool, podSetSize int32, leaderPodSetSize int32, sliceSize int32, unconstrained bool, stats *ExclusionStats) (int, []*domain, string) { |
Copilot
AI
Dec 10, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function signature has been updated to include a stats *ExclusionStats parameter, but the caller of this function at line 799 has not been updated to pass this parameter. This will cause a compilation error. The call at line 799 should be updated to:
fitLevelIdx, currFitDomain, reason := s.findLevelWithFitDomains(levelIdx, required, count, leaderCount, sliceSize, unconstrained, stats)Include detailed node exclusion reasons (taints, nodeSelector, affinity, resources) in TAS scheduling failure messages to improve debuggability. Fixes: kubernetes-sigs#7854 Signed-off-by: Sohan Kunkerkar <sohank2602@gmail.com>
391b378 to
e393ef1
Compare
|
/release-note-edit |
|
/kind feature |
|
LGTM label has been added. DetailsGit tree hash: d53215f73b3b0d645761062efdce26beb3ec2aa1 |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mimowo, sohankunkerkar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This is a manual cherry-pick of #8043
/assign @sohankunkerkar