GREP 369 - support multiple ClusterTopology#413
Merged
danbar2 merged 38 commits intoMar 31, 2026
Conversation
Collaborator
a94af4a to
4bd11ac
Compare
ebdc1e5 to
aade80a
Compare
shayasoolin
reviewed
Mar 8, 2026
shayasoolin
reviewed
Mar 16, 2026
shayasoolin
reviewed
Mar 16, 2026
shayasoolin
reviewed
Mar 16, 2026
shayasoolin
left a comment
Contributor
There was a problem hiding this comment.
LGTM, added a few minor comments. Plus waiting to see the next changes following the discussion today about topologyName immutability as well as placement under the PCS's topologyConstraints field.
5991cd8 to
77ea868
Compare
danbar2
previously approved these changes
Mar 18, 2026
danbar2
previously approved these changes
Mar 23, 2026
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Weave GREP-369 multi-topology design into the existing GREP-244 topology-aware scheduling proposal as a unified document. Signed-off-by: Erez Freiberger <enoodle@gmail.com>
* Add missing validation rule: reject TopologyConstraint when TAS is disabled cluster-wide (symmetric with clusterTopologyName check). * Update Dependencies to reference KaiSchedulerConfig.CreateTopologyResources with default-true semantics and link to GREP-375. * Add webhook bypass note to ClusterTopology Lifecycle for the default topology, cross-referencing GREP-244 Topology Configuration Drift. * Update Change Summary to include the new validation case. Signed-off-by: Erez Freiberger <enoodle@gmail.com>
GREP-369 (multi-topology support) has been fully merged into GREP-244 (topology-aware scheduling). The content is preserved in git history and can be restored if needed. Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Replace the Excalidraw diagram with an updated architecture overview showing: multiple ClusterTopology resources with immutable levels, CT Controller with finalizer management, two scheduler backend topology modes (auto-managed and externally-managed via schedulerTopologyRef), and PCS clusterTopologyName reference. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…Configuration Replace single topology levels with named TopologyProfiles array. Scheduler backend topology references move into each SchedulerProfile's backend-specific config (aligned with GREP-375). Update validation, operator startup, topology configuration updates, and limitations/risks sections. Remove "Two Management Paths" risk section — all ClusterTopologies are now operator-managed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
All ClusterTopology resources are created by the operator from configured topology profiles. Remove admin-created CT path, update lifecycle section with operator-only mermaid diagram, invert Alternatives section (hybrid model is now the alternative). Clarify auto-managed vs referenced scheduler backend topology: profiles NOT in topologyReferences get auto-created CRs, profiles IN topologyReferences use externally managed resources with drift detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Scheduler backend topology mapping is now configured via topologyReferences in each scheduler profile's backend-specific config (OperatorConfiguration), not on the ClusterTopology CRD itself. Remove SchedulerTopologyRef field and SchedulerTopologyReference type. Update status condition and Dependencies section to reference the new topologyReferences mechanism. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Rename clusterTopologyName -> topologyProfileName throughout. Make the field required when any TopologyConstraint is set — no implicit default topology. Remove TopologyLevelsUnavailable PCS status condition: immutable levels + finalizers make the condition unreachable in normal operation. Update Rule-3, PodGang resolution, backward compatibility, monitoring kubectl example, and test plan to reflect the new model. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Fix collateral damage from replace_all (topologyProfiles JSON tag and YAML key incorrectly renamed to topologyProfileNames), update story-4 to use topologyProfileName field, remove stale "default topology" references, and unify "operator fails to start" language throughout. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Instead of blocking operator startup with finalizers when ClusterTopology resources need to be deleted or recreated, the operator now proceeds with deletion and the PCS reconciler sets a TopologyLevelsUnavailable condition on affected PodCliqueSets. Invalid topology constraints are removed from PodGang resources (graceful degradation). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Regenerate the architecture diagram to reflect the current design: named topologyProfiles, all operator-managed CTs, topologyProfileName on PCS, no finalizers, scheduler topologyReferences in config. Promote PodCliqueSet Status Conditions from bold text to heading for proper anchor linking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
The CEL immutability constraint (self == oldSelf) on spec.levels is unnecessary since the TopologyLevelsUnavailable condition already handles domain removal gracefully — stripping invalid PodGang constraints without evicting running pods. This avoids forcing administrators through delete+recreate workflows when updating topology levels. For auto-managed scheduler backend topologies, the CT controller still deletes and recreates the downstream resource when levels change (since KAI Topology has its own immutability constraint). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Follow Kubernetes condition naming convention: conditions should use negative polarity (True = problem) except Ready. Flipped polarity so True = drift detected, False = all backends in sync. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…rvedVersion Rename the observedGeneration field in SchedulerTopologyStatus to schedulerBackendTopologyObservedVersion to clearly distinguish it from the ClusterTopology-level observedGeneration field. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
topologyName on PodCliqueSet is now immutable after creation rather than only after scheduling. Users must delete and recreate the PCS to change topology. This may be relaxed in the future when the Grove scheduler backend is implemented, enabling safe re-resolution of topology references while pods are pending. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Introduce PodCliqueSetTopologyConstraint that extends TopologyConstraint
with the topologyName field. This groups the topology reference with
the constraint at the PCS level. PodClique and PodCliqueScalingGroup
continue to use the base TopologyConstraint (no topologyName).
YAML changes from:
template:
topologyName: h100-topology
topologyConstraint:
packDomain: zone
to:
template:
topologyConstraint:
topologyName: h100-topology
packDomain: zone
Also makes topologyName fully immutable after creation (previously
mutable while pending). Story 5 (topology retry) is deferred until
the Grove scheduler backend is implemented.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
- Use "provided" instead of "populated" for schedulerReferences and clarify that drift detection compares domain/key pairs and their order - Remove "not by a fixed global order" from hierarchy note - Simplify the hierarchy strictness example to a single sentence Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…yName Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
The CEL uniqueness validation rules on ClusterTopology.spec.levels use self.all(x, self.filter(...)) which has O(n²) cost estimation. Without a MaxItems bound, the Kubernetes API server rejects the CRD because the estimated rule cost exceeds the validation budget by >100x. MaxItems=16 provides a generous upper bound for real-world topology hierarchies (most have 3-6 levels) while keeping the CEL cost within the Kubernetes validation budget. Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Describes how the ClusterTopology controller integrates with the Scheduler Backend Framework (GREP-375) for topology CRD management: - TopologyAwareSchedBackend optional interface with TopologyGVR(), SyncTopology(), OnTopologyDelete(), and CheckTopologyDrift() - CT controller registers dynamic watches at startup by querying all registered TopologyAwareSchedBackend implementations for their GVR - On every reconcile, all enabled TopologyAwareSchedBackends are iterated: auto-managed (not in schedulerReferences) calls SyncTopology(); externally-managed (in schedulerReferences) calls CheckTopologyDrift() - Updated Dependencies section to reference the new interface Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…Unknown to True If the referenced ClusterTopology no longer exists, the topology levels are definitively unavailable — True is more accurate than Unknown, which implies indeterminate state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Erez Freiberger <enoodle@gmail.com>
4fa3722 to
133c5df
Compare
shayasoolin
approved these changes
Mar 31, 2026
danbar2
approved these changes
Mar 31, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What type of PR is this?
/kind feature
/kind api
What this PR does / why we need it:
Introduces GREP-369 (#369)
This is a proposal to extend Grove's topology API to support multiple named ClusterTopology resources within a single cluster. This allows heterogeneous clusters with different GPU architectures or multi-cloud environments to define separate topologies, and lets PodCliqueSets reference a specific topology via a new
clusterTopologyNamefield.