Skip to content

Reconcile PodClique TopologyConstraints#302

Merged
unmarshall merged 19 commits into
ai-dynamo:mainfrom
unmarshall:tas
Jan 11, 2026
Merged

Reconcile PodClique TopologyConstraints#302
unmarshall merged 19 commits into
ai-dynamo:mainfrom
unmarshall:tas

Conversation

@unmarshall

@unmarshall unmarshall commented Jan 4, 2026

Copy link
Copy Markdown
Collaborator

What type of PR is this?

What this PR does / why we need it:

This PR introduces the following changes to ensure that the TopologyConstraints for topology aware scheduling defined on the PodCliqueSet are reconciled into PodGang.

  • Renamed ClusterTopologyConfiguration to
    TopologyAwareSchedulingConfiguration in operator config.
  • Introduced a new condition TopologyLevelsUnavailable on PCS
  • PackDomain field in corev1alpha1 TopologyConstraint is now required.
  • When creating ClusterTopology, if host topology level is not defined
    in TopologyAwareSchedulingConfiguration then the operator will set
    this level in ClusterTopology as this is a required level.
  • Adapted PodGang component to set pack constraints at all hierarchy levels.
  • Introduced a new condition TopologyLevelsUnavailable in PCS status.
  • Added reconciliation code to update the PCS status condition.
  • Added missing cluster role to delete KAI topology CR.

Which issue(s) this PR fixes:

Fixes #246, #301

Special notes for your reviewer:

Does this PR introduce a API change?

Rename Grove OperatorConfiguration.ClusterTopology to OperatorConfiguration.TopologyAwareSchedulingConfig
Adds support to reconcile TopologyConstraints defined on PodCliqueSet and propagate them to all the PodGangs that are created for the PodCliqueSet

Additional documentation e.g., enhancement proposals, usage docs, etc.:


@unmarshall unmarshall force-pushed the tas branch 2 times, most recently from 1a0b687 to 8b60471 Compare January 7, 2026 07:41
@unmarshall unmarshall added API Updates to API enhancement New feature or request labels Jan 7, 2026
Comment thread operator/charts/templates/clusterrole.yaml Outdated

@renormalize renormalize left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1/n. I am going through the code currently, don't want to post all my comments prematurely since the rest of the code might fix my confusion.

Comment thread operator/samples/tas/single-node-disaggregated.yaml Outdated
Comment thread operator/cmd/main.go Outdated
Comment thread operator/internal/clustertopology/clustertopology.go Outdated
Comment thread operator/e2e/setup/k8s_clusters.go Outdated
Comment thread operator/internal/clustertopology/clustertopology.go Outdated
Comment thread operator/internal/controller/common/component/utils/clustertopology.go Outdated
Comment thread operator/internal/controller/podcliqueset/reconcilestatus.go
Comment thread operator/internal/clustertopology/clustertopology.go Outdated
renormalize
renormalize previously approved these changes Jan 8, 2026

@renormalize renormalize left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for this PR! (and apologies for the slow review)

Ronkahn21
Ronkahn21 previously approved these changes Jan 8, 2026
@gflarity

gflarity commented Jan 8, 2026

Copy link
Copy Markdown
Contributor

@unmarshall I see there's a timeout in the E2E tests. The tests can be tricky to debug so I made sure to include a lot of Debug logging. Take a look at setup.go in the E2E test dir. You can enable debug log level there:

	// increase logger verbosity for debugging
	logger = utils.NewTestLogger(utils.InfoLevel)

That should at least give an idea of why it's timing out. Also, I'm curious if it's timing out in the same spot, or it changes?

@unmarshall

Copy link
Copy Markdown
Collaborator Author

When I run it locally it passes sometimes. At other times a random test fails. On GHA it often fails but at different places.

@gflarity gflarity left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly just suggestions except for the GPU Operator comments/depedencies.yaml comments.

Comment thread docs/api-reference/scheduler-api.md Outdated
Comment thread operator/e2e/setup/k8s_clusters.go
Comment thread operator/e2e/tests/setup.go
Comment thread operator/e2e/setup/shared_cluster.go
Comment thread operator/internal/controller/podcliqueset/reconcilestatus.go
Comment thread operator/internal/controller/podcliqueset/reconcilestatus.go
Comment thread operator/internal/controller/podcliqueset/reconcilestatus.go
Comment thread docs/api-reference/scheduler-api.md Outdated
Comment thread scheduler/api/core/v1alpha1/podgang.go Outdated
Comment thread scheduler/api/core/v1alpha1/podgang.go Outdated
Comment thread scheduler/api/core/v1alpha1/podgang.go Outdated
Comment thread scheduler/api/core/v1alpha1/podgang.go Outdated
Comment thread scheduler/api/core/v1alpha1/podgang.go Outdated
@unmarshall unmarshall dismissed stale reviews from renormalize and Ronkahn21 via d13db93 January 9, 2026 05:09
@unmarshall unmarshall requested a review from gflarity January 9, 2026 05:13
Comment thread operator/samples/tas/single-node-disaggregated.yaml Outdated
@unmarshall unmarshall requested a review from renormalize January 9, 2026 11:44
Comment thread operator/internal/controller/podcliqueset/components/podgang/syncflow.go Outdated
Comment thread operator/internal/controller/podcliqueset/components/podgang/syncflow.go Outdated
Comment thread operator/internal/controller/podcliqueset/reconcilestatus.go
Comment thread operator/internal/controller/podcliqueset/reconcilestatus.go
Comment thread operator/internal/controller/podcliqueset/components/podgang/syncflow.go Outdated
@Ronkahn21 Ronkahn21 dismissed their stale review January 10, 2026 11:42

Want to validate some point

Comment thread operator/internal/controller/podcliqueset/reconcilestatus.go

@Ronkahn21 Ronkahn21 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last point about removing the topology type condition when the feature is disabled

@gflarity gflarity left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

unmarshall and others added 19 commits January 11, 2026 12:05
* Renamed ClusterTopologyConfiguration to
  TopologyAwareSchedulingConfiguration in operator config.
* Introduced a new condition TopologyLevelsUnavailable on PCS
* PackDomain field in corev1alpha1 TopologyConstraint is now required.
* When creating ClusterTopology, if host topology level is not defined
  in TopologyAwareSchedulingConfiguration then the operator will set
  this level in ClusterTopology as this is a required level.
* Adapted PodGang component to set pack constraints at all hierarchy levels.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Introduced a new condition TopologyLevelsUnavailable in PCS status.
* Added reconciliation code to update the PCS status condition.
* Added missing cluster role to delete KAI topology CR.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
…Gangs

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…uration

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
* Added code to update or remove the condition on PCS.
* Create utility function for cluster topology with unit test

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Upgraded KAI scheduler version dependency for e2e test to v0.12.0
* Changed polling timeout for e2e tests to 2 mins.
* Removed NVIDIA GPU operator to be installed as its not required.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
…trol plane servers for e2e from 3 to 1

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Moved `synchronizeTopology` in main to clustertopology package.
* Adjusted unit tests for clustertopology.go
* Removed the previously added delete cluster role for KAI Topology
  resource.
* Removed the code to setup NVIDIA GPU Operator from e2e tests as its
  not required.
* Increased the poll timeout to 4 mins.
* Added restartPolicy to Always for Grove operator deployment.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
  will be set later after requirements are clear.
* Added unit tests for computeExpectedPodGangs function.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Moved GetClusterTopologyLevels to clustertopology package.
* Added docstring for buildClusterTopology function.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Added docstrings to computeTopologyLevelsUnavailableCondition and
  mutateTopologyLevelUnavailableConditions functions.
* Changed the docstring for TopologyPackConstraint Preferred and
  Required fields.
* Removed all references to GPU operator from e2e test.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
* Removed tas and clustertopology samples. These will be added later.
* Reworded docstring for PodGang TopologyConstraint.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
…unctions

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
@unmarshall

unmarshall commented Jan 11, 2026

Copy link
Copy Markdown
Collaborator Author

I have run the e2e tests locally and they are passing.
Unfortunately e2e tests are not getting scheduled when using GHA since m5.2xlarge runners are not available for more than 10hrs. I have also attached the screen shots. This is a massive problem and we need to seriously think on how we wish to do e2e tests going forward.

I have spoken to @sanjaychatterjee and we have decided to merge this PR and not wait for run of e2e tests to complete on github.

Screenshot 2026-01-11 at 10 31 14 PM Screenshot 2026-01-11 at 10 31 37 PM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

API Updates to API enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Translate TopologyConstraints from Grove workload API to scheduler API

5 participants