Skip to content

multi topology implementation#496

Merged
enoodle merged 40 commits into
ai-dynamo:mainfrom
enoodle:erez/multi-topology-implementation
May 1, 2026
Merged

multi topology implementation#496
enoodle merged 40 commits into
ai-dynamo:mainfrom
enoodle:erez/multi-topology-implementation

Conversation

@enoodle

@enoodle enoodle commented Mar 23, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind feature
/kind api

What this PR does / why we need it:

Implementation of Multiple Topology GREP to enhance TAS support (#413 )

Which issue(s) this PR fixes:

Fixes #369

Special notes for your reviewer:

Does this PR introduce a API change?

ClusterTopology API extended with multi-topology support:

ClusterTopologySpec:

  • spec.levels[].domain — removed fixed enum constraint (was region|zone|datacenter|block|rack|host|numa); domains are now free-form strings validated by pattern ^[a-z][a-z0-9-]*$ with max 63 characters. This enables heterogeneous GPU cluster topologies where hardware segments define their own domain names.
  • spec.schedulerReferences (new, optional) — list of {schedulerName, reference} entries. Controls per-backend topology resource lifecycle: absent = operator auto-manages the backend topology resource; present = resource is externally managed and the operator performs drift detection only.

ClusterTopologyStatus (new):

  • status.observedGeneration — generation last reconciled by the controller.
  • status.conditions — standard condition list; includes SchedulerTopologyDrift condition (True/Drift when any backend is out of sync, False/InSync when all match).
  • status.schedulerTopologyStatuses — per-backend sync state reporting {schedulerName, reference, inSync, message, schedulerBackendTopologyObservedGeneration}.

PodCliqueSet API:

  • spec.template.topologyConstraint type changed from TopologyConstraint to PodCliqueSetTopologyConstraint, adding topologyName field — the name of the ClusterTopology resource to use. Required when packDomain is specified. Immutable after creation.
  • spec.template.topologyConstraint.packDomain enum constraint removed; must now reference a domain defined in the named ClusterTopology's levels (validated by webhook).

Removed:

  • ClusterTopology.spec.levels immutability constraint and MaxItems=7 cap.
  • DefaultClusterTopologyName ("grove-topology") constant — the operator no longer manages a single implicit topology; all topologies are now admin-created ClusterTopology resources.
  • TopologyDomain.IsTopologyDomainNarrower, SupportedTopologyDomains, SortTopologyLevels helper functions — ordering was predicated on the fixed enum which no longer exists.

Conditions added (constants):

  • SchedulerTopologyDrift / InSync / Drift — on ClusterTopology
  • TopologyNameMissing / TopologyAwareSchedulingDisabled — on PodCliqueSet
> [!WARNING]
> **Upgrade Notice — Topology-Aware Scheduling**
>
> Affects users who have deployed `PodCliqueSet` resources with a `topologyConstraint` using the previous single-topology design.

### What changed

The operator now supports multiple named `ClusterTopology` resources. Each `PodCliqueSet.spec.template.topologyConstraint` now requires a `topologyName` field that explicitly names the `ClusterTopology` to use. Without it, topology placement constraints are not enforced.

### Action required

Add `topologyName: grove-topology` to each existing `PodCliqueSet` that has a `topologyConstraint`:

spec:
  template:
    topologyConstraint:
      topologyName: grove-topology   # add this
      packDomain: rack

The existing `grove-topology` ClusterTopology resource created by the previous operator version remains in the cluster and is valid — no changes to it are needed.

### What happens if you don't act immediately

The operator degrades gracefully but **topology placement constraints will not be enforced** until `topologyName` is set. This is visible via the `TopologyLevelsUnavailable` condition on the PCS:

| Condition status | Reason | Meaning |
|---|---|---|
| `Unknown` | `TopologyNameMissing` | PCS has `topologyConstraint` but no `topologyName`. Constraints are not enforced. Add `topologyName` to restore them. |
| `True` | `ClusterTopologyNotFound` | The named `ClusterTopology` does not exist. Create it or correct the name. |
| `False` | `AllClusterTopologyLevelsAvailable` | All good — constraints are being enforced. |

Check for affected PodCliqueSets after upgrade:

kubectl get podcliquesets -A -o json | jq -r '
  .items[] | select(.status.conditions != null) |
  .status.conditions[] | select(.type == "TopologyLevelsUnavailable" and .status != "False") |
  "\(.type): \(.reason) — \(.message)"'


Additional documentation e.g., enhancement proposals, usage docs, etc.:

TBA



Open with Devin

@copy-pr-bot

copy-pr-bot Bot commented Mar 23, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@enoodle enoodle force-pushed the erez/multi-topology-implementation branch 3 times, most recently from 113525c to 46c4f22 Compare March 24, 2026 22:32
@enoodle enoodle force-pushed the erez/multi-topology-implementation branch 3 times, most recently from 9cd571d to 133d33b Compare April 15, 2026 21:45
@enoodle enoodle marked this pull request as ready for review April 15, 2026 22:25
@enoodle enoodle force-pushed the erez/multi-topology-implementation branch 2 times, most recently from 5f5c4a3 to 5d51aa2 Compare April 16, 2026 10:49
Comment thread operator/e2e/grove/topology/topology.go Outdated
Comment thread operator/internal/clustertopology/clustertopology.go Outdated
Comment thread operator/internal/controller/clustertopology/register.go
Comment thread operator/internal/controller/podcliqueset/reconcilestatus.go
Comment thread operator/internal/webhook/admission/clustertopology/validation/validation.go Outdated
@enoodle enoodle force-pushed the erez/multi-topology-implementation branch 4 times, most recently from 8dfb298 to 43c36e7 Compare April 17, 2026 20:36

@devin-ai-integration devin-ai-integration Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 potential issues.

View 6 additional findings in Devin Review.

Open in Devin Review

Comment thread operator/internal/controller/podcliqueset/reconcilestatus.go
Comment thread operator/internal/controller/clustertopology/register.go
danbar2
danbar2 previously approved these changes Apr 19, 2026
@enoodle enoodle force-pushed the erez/multi-topology-implementation branch 2 times, most recently from 73bd7ac to 0674588 Compare April 21, 2026 12:17
Comment thread operator/internal/controller/clustertopology/reconciler.go Outdated
Comment thread operator/internal/controller/clustertopology/reconciler.go Outdated
Comment thread operator/internal/controller/clustertopology/register.go Outdated
Comment thread operator/internal/webhook/admission/pcs/validation/topologyconstraints.go Outdated
Comment thread operator/internal/webhook/admission/pcs/validation/topologyconstraints.go Outdated
enoodle added 22 commits May 1, 2026 09:00
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…/remove

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
But only in the API - the validation webhook will still reject cross
topology PCS

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
@enoodle enoodle force-pushed the erez/multi-topology-implementation branch from 4e7af31 to 78e6d13 Compare May 1, 2026 07:00
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>

@sanjaychatterjee sanjaychatterjee left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the PR!

@enoodle enoodle merged commit 4342e65 into ai-dynamo:main May 1, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support multiple topologies

4 participants