Skip to content

fix: add PCS topology constraints to scaled PodGangs#347

Merged
Ronkahn21 merged 3 commits into
ai-dynamo:mainfrom
Ronkahn21:fix/pcsg-topology-constraints
Jan 19, 2026
Merged

fix: add PCS topology constraints to scaled PodGangs#347
Ronkahn21 merged 3 commits into
ai-dynamo:mainfrom
Ronkahn21:fix/pcsg-topology-constraints

Conversation

@Ronkahn21

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes a bug where topology constraints were not properly created for PodCliqueScalingGroup (PCSG) replicas when Topology Aware Scheduling (TAS) is enabled.

The fix ensures that:

  • PCS-level topology constraints are created for the entire PodGangSet
  • PCSG-level topology constraints are created for each ScalingGroup replica
  • PodClique FQNs are properly tracked and grouped within each PCSG replica

Without this fix, multi-replica PodCliqueScalingGroups would not have topology constraints applied at the correct granularity, leading to incorrect pod placement.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

This PR contains only the production code fixes. The comprehensive e2e test infrastructure for TAS will be submitted in a separate PR to keep the reviews focused.

Changes:

  • operator/internal/controller/podcliqueset/components/podgang/syncflow.go: Added PCSG topology constraint creation logic
  • operator/internal/controller/podcliqueset/components/podgang/syncflow_test.go: Updated test expectations to reflect proper constraint hierarchy

All unit tests pass and linter reports 0 issues.

Does this PR introduce a API change?

NONE

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

Scaled PodGangs (PCSG replicas above MinAvailable) were missing
PCSG-level topology constraints in TopologyConstraintGroupConfigs.
This caused pods in scaled PCSG replicas to be scheduled without
proper topology grouping constraints.

Changes:
- Collect pclqFQNs while building scaled PodGang
- Create TopologyConstraintGroupConfig for PCSG when TAS enabled
- Set pcsgTopologyConstraints in podGangInfo for scaled PodGangs
- Now mirrors the pattern used in base PodGang creation

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Updated test expectations in TestComputeExpectedPodGangsWithTopologyConstraints
to correctly validate the 3-level constraint hierarchy for scaled PodGangs.

Changes:
- Fixed topologyLevel from rack to zone (PCS-level constraint)
- Added missing pcsgConstraints field for PCSG-level constraints

Both base and scaled PodGangs have unified 3-level structure with
PCS-level at top, PCLQ-level for PodGroups, and PCSG-level in
TopologyConstraintGroupConfigs.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 merged commit a08e5af into ai-dynamo:main Jan 19, 2026
3 of 6 checks passed
@Ronkahn21 Ronkahn21 changed the title fix: add PCSG topology constraints to scaled PodGangs fix: add PCS topology constraints to scaled PodGangs Jan 19, 2026
@sanjaychatterjee

Copy link
Copy Markdown
Collaborator

For scaled podgangs, which is essentially pcsg replicas above min available, I thought we need to apply the topology constraints on the entire PodGang level and not the TopologyConstraintGroupConfigs.

@unmarshall can you please review this change as well?

@unmarshall

unmarshall commented Jan 20, 2026

Copy link
Copy Markdown
Collaborator

@Ronkahn21 you will have to rollback this PR as the change that this PR introduces is incorrect. For a scaled podgang there is no TopologyConstraintGroupConfigs. Please refer to #339 for more details. A PR #340 was raised to fix that.

Ronkahn21 added a commit to Ronkahn21/grove that referenced this pull request Jan 20, 2026
…amo#347)"

This reverts commit a08e5af.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
unmarshall pushed a commit that referenced this pull request Jan 20, 2026
)

* Revert "fix: add PCSG topology constraints to scaled PodGangs (#347)"

This reverts commit a08e5af.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* fix: update topology constraint handling for scaled PodGangs

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* fix: only create PCSG topology constraints when defined

Add nil check before creating PCSG topology constraint groups in
base PodGang building. This prevents creating empty topology
constraints when PCSG has no TopologyConstraint specified,
ensuring scaled PodGangs properly fall back to PCS-level
constraints.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* test: add case for PCS with nil PCSG topology constraints fallback to pcs

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* test: clean up formatting in syncflow_test.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* test: clean up formatting in syncflow_test.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* test: remove redundant comments regarding pcsgConstraints in syncflow_test.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* test: improve formatting and remove redundant comment in syncflow_test.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* test: clean up formatting in syncflow_test.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

---------

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
danbar2 pushed a commit to danbar2/grove that referenced this pull request Jan 21, 2026
…i-dynamo#357)

* Revert "fix: add PCSG topology constraints to scaled PodGangs (ai-dynamo#347)"

This reverts commit a08e5af.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* fix: update topology constraint handling for scaled PodGangs

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* fix: only create PCSG topology constraints when defined

Add nil check before creating PCSG topology constraint groups in
base PodGang building. This prevents creating empty topology
constraints when PCSG has no TopologyConstraint specified,
ensuring scaled PodGangs properly fall back to PCS-level
constraints.

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* test: add case for PCS with nil PCSG topology constraints fallback to pcs

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* test: clean up formatting in syncflow_test.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* test: clean up formatting in syncflow_test.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* test: remove redundant comments regarding pcsgConstraints in syncflow_test.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* test: improve formatting and remove redundant comment in syncflow_test.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

* test: clean up formatting in syncflow_test.go

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

---------

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants