multi topology implementation #496
Merged
enoodle merged 40 commits on May 1, 2026
danbar2
reviewed
Apr 16, 2026
danbar2
previously approved these changes
Apr 19, 2026
gflarity
reviewed
Apr 22, 2026
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…/remove Signed-off-by: Erez Freiberger <enoodle@gmail.com>
But only in the API - the validation webhook will still reject cross topology PCS Signed-off-by: Erez Freiberger <enoodle@gmail.com>
gflarity
approved these changes
May 1, 2026
sanjaychatterjee
approved these changes
May 1, 2026
sanjaychatterjee
left a comment
Collaborator
LGTM! Thanks for the PR!
What type of PR is this?
/kind feature
/kind api
What this PR does / why we need it:
Implementation of the Multiple Topology GREP to enhance TAS support (#413)
Which issue(s) this PR fixes:
Fixes #369
Special notes for your reviewer:
Does this PR introduce an API change?
ClusterTopology API extended with multi-topology support:

ClusterTopologySpec:
- `spec.levels[].domain`: removed fixed enum constraint (was `region|zone|datacenter|block|rack|host|numa`); domains are now free-form strings validated by pattern `^[a-z][a-z0-9-]*$` with max 63 characters. This enables heterogeneous GPU cluster topologies where hardware segments define their own domain names.
- `spec.schedulerReferences` (new, optional): list of `{schedulerName, reference}` entries. Controls per-backend topology resource lifecycle: absent = operator auto-manages the backend topology resource; present = resource is externally managed and the operator performs drift detection only.

ClusterTopologyStatus (new):
- `status.observedGeneration`: generation last reconciled by the controller.
- `status.conditions`: standard condition list; includes `SchedulerTopologyDrift` condition (`True`/`Drift` when any backend is out of sync, `False`/`InSync` when all match).
- `status.schedulerTopologyStatuses`: per-backend sync state reporting `{schedulerName, reference, inSync, message, schedulerBackendTopologyObservedGeneration}`.

PodCliqueSet API:
- `spec.template.topologyConstraint` type changed from `TopologyConstraint` to `PodCliqueSetTopologyConstraint`, adding `topologyName` field: the name of the `ClusterTopology` resource to use. Required when `packDomain` is specified. Immutable after creation.
- `spec.template.topologyConstraint.packDomain` enum constraint removed; must now reference a domain defined in the named `ClusterTopology`'s levels (validated by webhook).

Removed:
- `ClusterTopology.spec.levels` immutability constraint and `MaxItems=7` cap.
- `DefaultClusterTopologyName` (`"grove-topology"`) constant: the operator no longer manages a single implicit topology; all topologies are now admin-created `ClusterTopology` resources.
- `TopologyDomain.IsTopologyDomainNarrower`, `SupportedTopologyDomains`, `SortTopologyLevels` helper functions: ordering was predicated on the fixed enum which no longer exists.

Conditions added (constants):
- `SchedulerTopologyDrift` / `InSync` / `Drift`: on `ClusterTopology`
- `TopologyNameMissing` / `TopologyAwareSchedulingDisabled`: on `PodCliqueSet`

Additional documentation e.g., enhancement proposals, usage docs, etc.:
TBA
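
For reviewers who want to sanity-check domain names against the new free-form rule for `spec.levels[].domain`, here is a minimal sketch in Go. The pattern and the 63-character limit are taken from the release note above; the helper name `validateDomain` and the error wording are hypothetical and are not the operator's actual validation code.

```go
package main

import (
	"fmt"
	"regexp"
)

// domainPattern is the pattern quoted in the release note for spec.levels[].domain.
var domainPattern = regexp.MustCompile(`^[a-z][a-z0-9-]*$`)

// maxDomainLen is the maximum length quoted in the release note.
const maxDomainLen = 63

// validateDomain is a hypothetical helper (not part of this PR) that applies
// the length limit first and then the pattern.
func validateDomain(d string) error {
	if len(d) == 0 || len(d) > maxDomainLen {
		return fmt.Errorf("domain %q must be between 1 and %d characters", d, maxDomainLen)
	}
	if !domainPattern.MatchString(d) {
		return fmt.Errorf("domain %q does not match %s", d, domainPattern.String())
	}
	return nil
}

func main() {
	// Illustrative inputs only; "nvl-domain" is a made-up custom domain name.
	for _, d := range []string{"rack", "nvl-domain", "Rack", "9block", ""} {
		fmt.Printf("%q -> %v\n", d, validateDomain(d))
	}
}
```

Names from the old enum such as `rack` or `host` still pass, since they already satisfy the pattern.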
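
The `SchedulerTopologyDrift` rule described in the release note (`True`/`Drift` when any backend is out of sync, `False`/`InSync` when all match) can be sketched as follows. The `BackendStatus` type and the `driftCondition` function are hypothetical stand-ins for illustration; they only model the `schedulerName` and `inSync` fields of `status.schedulerTopologyStatuses`. Treating an empty backend list as in sync is an assumption, not something the PR states.

```go
package main

import "fmt"

// BackendStatus is a hypothetical, reduced model of one entry in
// status.schedulerTopologyStatuses.
type BackendStatus struct {
	SchedulerName string
	InSync        bool
}

// driftCondition returns the (status, reason) pair for the
// SchedulerTopologyDrift condition: True/Drift if any backend is out of
// sync, otherwise False/InSync. An empty list yields False/InSync, which
// is an assumption made for this sketch.
func driftCondition(backends []BackendStatus) (status, reason string) {
	for _, b := range backends {
		if !b.InSync {
			return "True", "Drift"
		}
	}
	return "False", "InSync"
}

func main() {
	// Scheduler names below are placeholders.
	backends := []BackendStatus{
		{SchedulerName: "scheduler-a", InSync: true},
		{SchedulerName: "scheduler-b", InSync: false},
	}
	status, reason := driftCondition(backends)
	fmt.Println(status, reason)
}
```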