fix: add PCS topology constraints to scaled PodGangs#347
Merged
Ronkahn21 merged 3 commits intoJan 19, 2026
Conversation
Scaled PodGangs (PCSG replicas above MinAvailable) were missing PCSG-level topology constraints in TopologyConstraintGroupConfigs. This caused pods in scaled PCSG replicas to be scheduled without proper topology grouping constraints. Changes: - Collect pclqFQNs while building scaled PodGang - Create TopologyConstraintGroupConfig for PCSG when TAS enabled - Set pcsgTopologyConstraints in podGangInfo for scaled PodGangs - Now mirrors the pattern used in base PodGang creation Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Updated test expectations in TestComputeExpectedPodGangsWithTopologyConstraints to correctly validate the 3-level constraint hierarchy for scaled PodGangs. Changes: - Fixed topologyLevel from rack to zone (PCS-level constraint) - Added missing pcsgConstraints field for PCSG-level constraints Both base and scaled PodGangs have unified 3-level structure with PCS-level at top, PCLQ-level for PodGroups, and PCSG-level in TopologyConstraintGroupConfigs. Signed-off-by: Ron Kahn <rkahn@nvidia.com>
shayasoolin
approved these changes
Jan 19, 2026
danbar2
approved these changes
Jan 19, 2026
Collaborator
|
For scaled podgangs, which is essentially pcsg replicas above min available, I thought we need to apply the topology constraints on the entire PodGang level and not the @unmarshall can you please review this change as well? |
Collaborator
|
@Ronkahn21 you will have to rollback this PR as the change that this PR introduces is incorrect. For a scaled podgang there is no |
unmarshall
pushed a commit
that referenced
this pull request
Jan 20, 2026
) * Revert "fix: add PCSG topology constraints to scaled PodGangs (#347)" This reverts commit a08e5af. Signed-off-by: Ron Kahn <rkahn@nvidia.com> * fix: update topology constraint handling for scaled PodGangs Signed-off-by: Ron Kahn <rkahn@nvidia.com> * fix: only create PCSG topology constraints when defined Add nil check before creating PCSG topology constraint groups in base PodGang building. This prevents creating empty topology constraints when PCSG has no TopologyConstraint specified, ensuring scaled PodGangs properly fall back to PCS-level constraints. Signed-off-by: Ron Kahn <rkahn@nvidia.com> * test: add case for PCS with nil PCSG topology constraints fallback to pcs Signed-off-by: Ron Kahn <rkahn@nvidia.com> * test: clean up formatting in syncflow_test.go Signed-off-by: Ron Kahn <rkahn@nvidia.com> * test: clean up formatting in syncflow_test.go Signed-off-by: Ron Kahn <rkahn@nvidia.com> * test: remove redundant comments regarding pcsgConstraints in syncflow_test.go Signed-off-by: Ron Kahn <rkahn@nvidia.com> * test: improve formatting and remove redundant comment in syncflow_test.go Signed-off-by: Ron Kahn <rkahn@nvidia.com> * test: clean up formatting in syncflow_test.go Signed-off-by: Ron Kahn <rkahn@nvidia.com> --------- Signed-off-by: Ron Kahn <rkahn@nvidia.com>
danbar2
pushed a commit
to danbar2/grove
that referenced
this pull request
Jan 21, 2026
…i-dynamo#357) * Revert "fix: add PCSG topology constraints to scaled PodGangs (ai-dynamo#347)" This reverts commit a08e5af. Signed-off-by: Ron Kahn <rkahn@nvidia.com> * fix: update topology constraint handling for scaled PodGangs Signed-off-by: Ron Kahn <rkahn@nvidia.com> * fix: only create PCSG topology constraints when defined Add nil check before creating PCSG topology constraint groups in base PodGang building. This prevents creating empty topology constraints when PCSG has no TopologyConstraint specified, ensuring scaled PodGangs properly fall back to PCS-level constraints. Signed-off-by: Ron Kahn <rkahn@nvidia.com> * test: add case for PCS with nil PCSG topology constraints fallback to pcs Signed-off-by: Ron Kahn <rkahn@nvidia.com> * test: clean up formatting in syncflow_test.go Signed-off-by: Ron Kahn <rkahn@nvidia.com> * test: clean up formatting in syncflow_test.go Signed-off-by: Ron Kahn <rkahn@nvidia.com> * test: remove redundant comments regarding pcsgConstraints in syncflow_test.go Signed-off-by: Ron Kahn <rkahn@nvidia.com> * test: improve formatting and remove redundant comment in syncflow_test.go Signed-off-by: Ron Kahn <rkahn@nvidia.com> * test: clean up formatting in syncflow_test.go Signed-off-by: Ron Kahn <rkahn@nvidia.com> --------- Signed-off-by: Ron Kahn <rkahn@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR fixes a bug where topology constraints were not properly created for PodCliqueScalingGroup (PCSG) replicas when Topology Aware Scheduling (TAS) is enabled.
The fix ensures that:
Without this fix, multi-replica PodCliqueScalingGroups would not have topology constraints applied at the correct granularity, leading to incorrect pod placement.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
This PR contains only the production code fixes. The comprehensive e2e test infrastructure for TAS will be submitted in a separate PR to keep the reviews focused.
Changes:
operator/internal/controller/podcliqueset/components/podgang/syncflow.go: Added PCSG topology constraint creation logicoperator/internal/controller/podcliqueset/components/podgang/syncflow_test.go: Updated test expectations to reflect proper constraint hierarchyAll unit tests pass and linter reports 0 issues.
Does this PR introduce a API change?
Additional documentation e.g., enhancement proposals, usage docs, etc.: