
api: add Topology aware support#235

Merged
Ronkahn21 merged 23 commits into
ai-dynamo:mainfrom
Ronkahn21:api/topology-support
Nov 2, 2025

@Ronkahn21 Ronkahn21 commented Oct 27, 2025

Contributor

What type of PR is this?

/kind api

What this PR does / why we need it

This PR introduces topology-aware scheduling APIs for network locality optimization in multi-node AI inference workloads.

Which issue(s) this PR fixes:

Fixes #245

Key additions

  • ClusterTopology CRD: A cluster-scoped singleton (with a user-chosen name) that defines the topology hierarchy by mapping level names (region/zone/datacenter/block/rack/host/numa) to node labels. It is immutable and protected from deletion by a finalizer.

  • TopologyConstraint fields: packDomain is added to PodCliqueSet, PodCliqueScalingGroup, and PodClique so that each replica is placed within a single topology domain.

  • PodGang API: Constraints at three levels (gang/scaling-group/clique) with required (user-specified) and preferred (auto-generated) semantics, plus an annotation-based reference to the topology.
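To make the mapping concrete, here is a minimal Go sketch of how a packDomain might resolve to a node label key through the ClusterTopology levels. It is not the code in this PR. The names `TopologyDomain`, `TopologyLevel` (`Domain`, `Key`), and `PackDomain` follow the rename commits listed further down; everything else is an assumption made for illustration: the `Levels` field, the constant names, the helpers `LabelKeyFor` and `PackedWithin`, and the `example.com/rack` label key.

```go
package main

import "fmt"

// TopologyDomain names one level of the hierarchy (type name from the PR's
// rename commits; the constant names below are hypothetical).
type TopologyDomain string

const (
	DomainZone TopologyDomain = "zone"
	DomainRack TopologyDomain = "rack"
	DomainHost TopologyDomain = "host"
)

// TopologyLevel maps a domain to the node label key that identifies it.
type TopologyLevel struct {
	Domain TopologyDomain
	Key    string
}

// ClusterTopologySpec holds the levels. The field name Levels is an assumption.
type ClusterTopologySpec struct {
	Levels []TopologyLevel
}

// TopologyConstraint carries the user-requested pack domain.
type TopologyConstraint struct {
	PackDomain TopologyDomain
}

// LabelKeyFor (hypothetical helper) returns the node label key for a domain.
func LabelKeyFor(spec ClusterTopologySpec, d TopologyDomain) (string, bool) {
	for _, l := range spec.Levels {
		if l.Domain == d {
			return l.Key, true
		}
	}
	return "", false
}

// PackedWithin (hypothetical helper) reports whether every node carries the
// same value for the given label key, i.e. all nodes sit in one domain.
func PackedWithin(key string, nodes []map[string]string) bool {
	if len(nodes) == 0 {
		return true
	}
	want, ok := nodes[0][key]
	if !ok {
		return false
	}
	for _, n := range nodes[1:] {
		if v, ok := n[key]; !ok || v != want {
			return false
		}
	}
	return true
}

// exampleSpec builds a small hierarchy; "example.com/rack" is a made-up key.
func exampleSpec() ClusterTopologySpec {
	return ClusterTopologySpec{Levels: []TopologyLevel{
		{Domain: DomainZone, Key: "topology.kubernetes.io/zone"},
		{Domain: DomainRack, Key: "example.com/rack"},
		{Domain: DomainHost, Key: "kubernetes.io/hostname"},
	}}
}

func main() {
	spec := exampleSpec()
	c := TopologyConstraint{PackDomain: DomainRack}
	key, ok := LabelKeyFor(spec, c.PackDomain)
	if !ok {
		fmt.Println("pack domain is not defined in the cluster topology")
		return
	}
	nodes := []map[string]string{{key: "rack-a"}, {key: "rack-a"}}
	fmt.Println(key, PackedWithin(key, nodes))
}
```

Keeping the domain-to-label mapping in one cluster-scoped object means workloads only name a domain such as `rack`, and the label key can differ per cluster without touching the workload spec.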

Removed deprecated fields

  • PodCliqueSetSpec.replicaSpreadConstraints
  • PodCliqueSetTemplateSpec.schedulingPolicyConfig and NetworkPackGroupConfig
  • PodGangSpec.spreadConstraints and networkPackGroupConfigs

Note: This is an API-only PR. Controller logic, validation webhooks, and the translation implementation will follow in separate PRs.

Does this PR introduce an API change?

action required: Removed deprecated scheduling fields. Added ClusterTopology CRD and topology constraints. Existing workloads using replicaSpreadConstraints or schedulingPolicyConfig must migrate. Admins must create ClusterTopology and aligned Kueue Topology before enabling topology features.
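As a rough illustration of what "aligned" could mean here, the sketch below compares the label keys of the ClusterTopology levels against the node labels of a Kueue Topology, in order. This reading of alignment (same keys, same order) is an assumption, not something this PR specifies. `AlignedWithKueue` is a hypothetical helper, the Kueue side is reduced to a plain slice of node label strings, and `example.com/rack` is a made-up label key.

```go
package main

import "fmt"

// TopologyLevel mirrors the Domain/Key pair named in this PR's rename commits.
type TopologyLevel struct {
	Domain string
	Key    string
}

// AlignedWithKueue (hypothetical helper) returns nil when both topologies list
// the same node label keys in the same order, and a descriptive error otherwise.
func AlignedWithKueue(levels []TopologyLevel, kueueNodeLabels []string) error {
	if len(levels) != len(kueueNodeLabels) {
		return fmt.Errorf("level count differs: ClusterTopology has %d, Kueue Topology has %d",
			len(levels), len(kueueNodeLabels))
	}
	for i, l := range levels {
		if l.Key != kueueNodeLabels[i] {
			return fmt.Errorf("level %d (%s): key %q does not match Kueue node label %q",
				i, l.Domain, l.Key, kueueNodeLabels[i])
		}
	}
	return nil
}

func main() {
	levels := []TopologyLevel{
		{Domain: "zone", Key: "topology.kubernetes.io/zone"},
		{Domain: "rack", Key: "example.com/rack"},
	}
	kueue := []string{"topology.kubernetes.io/zone", "example.com/rack"}
	if err := AlignedWithKueue(levels, kueue); err != nil {
		fmt.Println("not aligned:", err)
		return
	}
	fmt.Println("aligned")
}
```

A check of this kind could run as a pre-flight step before enabling topology features, so a mismatch surfaces before any workload is scheduled.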

Additional documentation

Design document: #224

@Ronkahn21 Ronkahn21 changed the title feat: add TopologyDomain and related constraints for enhanced topolog… api: add Topology aware support Oct 27, 2025
…y management

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…d constraints

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 force-pushed the api/topology-support branch from 7eb1c65 to d667527 Compare October 27, 2025 20:54
…d constraints

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

@sanjaychatterjee sanjaychatterjee left a comment

Collaborator


I think we are missing Grove's predefined topology level constants?

Comment thread operator/api/core/v1alpha1/podcliqueset.go Outdated
Comment thread scheduler/api/core/v1alpha1/podgang.go Outdated
Comment thread scheduler/api/core/v1alpha1/podgang.go Outdated
Comment thread scheduler/api/core/v1alpha1/podgang.go Outdated
Comment thread operator/api/core/v1alpha1/topologydomain.go Outdated
Comment thread operator/api/core/v1alpha1/topologydomain.go Outdated
Comment thread operator/api/core/v1alpha1/topologydomain.go Outdated
Comment thread scheduler/api/core/v1alpha1/podgang.go
Comment thread scheduler/api/core/v1alpha1/podgang.go Outdated
Comment thread operator/api/core/v1alpha1/crds/grove.io_podcliquesets.yaml
…d constraints

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…and constraints

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…d validation

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Remove redundant comment for TopologyDomainList.Items field
- Change 'highest' to 'broadest' for consistency with scope terminology

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…mentation files

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 marked this pull request as ready for review October 29, 2025 21:14
@Ronkahn21 Ronkahn21 requested a review from unmarshall as a code owner October 29, 2025 21:14
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Comment thread operator/api/core/v1alpha1/topologydomain.go Outdated
Comment thread operator/api/core/v1alpha1/topologydomain.go Outdated
Comment thread operator/api/core/v1alpha1/topologydomain.go Outdated
Comment thread operator/api/core/v1alpha1/podcliqueset.go Outdated
Ronkahn21 and others added 10 commits October 30, 2025 18:53
- Rename TopologyDomain -> ClusterTopology with shortName ct
- Rename TopologyLevelName type -> TopologyDomain type
- Rename PackLevel field -> PackDomain
- Rename TopologyLevel.Name -> Domain, TopologyKey -> Key
- Update all generated code and CRDs

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… levels constants

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…Topology

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…opologyDomain

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Update operator-api.md with simpler markdown table syntax

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>

@sanjaychatterjee sanjaychatterjee left a comment

Collaborator


Looks good to me.

@Ronkahn21 Ronkahn21 merged commit 141155e into ai-dynamo:main Nov 2, 2025
3 checks passed


Development

Successfully merging this pull request may close these issues.

Create ClusterTopology CRD and TopologyConstraints API

3 participants