
GREP 369 - support multiple ClusterTopology #413

Merged
danbar2 merged 38 commits into ai-dynamo:main from enoodle:proposal/multi-topology-support on Mar 31, 2026

@enoodle enoodle commented Feb 9, 2026

Contributor

What type of PR is this?

/kind feature
/kind api

What this PR does / why we need it:

Introduces GREP-369 (#369).
This is a proposal to extend Grove's topology API to support multiple named ClusterTopology resources within a single cluster. This allows heterogeneous clusters with different GPU architectures or multi-cloud environments to define separate topologies, and lets PodCliqueSets reference a specific topology via a new clusterTopologyName field.

@copy-pr-bot

copy-pr-bot Bot commented Feb 9, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread docs/proposals/369-multi-topology-support/README.md Outdated
Comment thread docs/proposals/369-multi-topology-support/README.md Outdated
@sanjaychatterjee

Collaborator

@enoodle Can we edit the current TAS GREP file (link) instead of creating a new one? This will help keep the flow aligned.

@enoodle enoodle force-pushed the proposal/multi-topology-support branch from a94af4a to 4bd11ac on February 25, 2026 10:37
@enoodle enoodle changed the title from "GREP 369 - multi topology support" to "GREP 369 - support multiple ClusterTopology" on Feb 26, 2026
@enoodle enoodle force-pushed the proposal/multi-topology-support branch from ebdc1e5 to aade80a on March 3, 2026 21:14

@shayasoolin shayasoolin left a comment

Contributor


LGTM

Comment thread docs/proposals/244-topology-aware-scheduling/README.md
Comment thread docs/proposals/244-topology-aware-scheduling/README.md Outdated
Comment thread docs/proposals/244-topology-aware-scheduling/README.md Outdated
Comment thread docs/proposals/244-topology-aware-scheduling/README.md

@shayasoolin shayasoolin left a comment

Contributor


LGTM, added a few minor comments. Also waiting to see the next changes following today's discussion about topologyName immutability and its placement under the PCS's topologyConstraints field.

Comment thread docs/proposals/244-topology-aware-scheduling/README.md Outdated
Comment thread docs/proposals/244-topology-aware-scheduling/README.md Outdated
@enoodle enoodle force-pushed the proposal/multi-topology-support branch from 5991cd8 to 77ea868 on March 17, 2026 21:59
danbar2 previously approved these changes Mar 18, 2026
danbar2 previously approved these changes Mar 23, 2026
enoodle added 10 commits March 30, 2026 13:55
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Weave GREP-369 multi-topology design into the existing GREP-244
topology-aware scheduling proposal as a unified document.

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
* Add missing validation rule: reject TopologyConstraint when TAS is
  disabled cluster-wide (symmetric with clusterTopologyName check).
* Update Dependencies to reference KaiSchedulerConfig.CreateTopologyResources
  with default-true semantics and link to GREP-375.
* Add webhook bypass note to ClusterTopology Lifecycle for the default
  topology, cross-referencing GREP-244 Topology Configuration Drift.
* Update Change Summary to include the new validation case.

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
GREP-369 (multi-topology support) has been fully merged into GREP-244
(topology-aware scheduling). The content is preserved in git history
and can be restored if needed.

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
enoodle added 25 commits March 30, 2026 13:55
Replace the Excalidraw diagram with an updated architecture overview
showing: multiple ClusterTopology resources with immutable levels,
CT Controller with finalizer management, two scheduler backend topology
modes (auto-managed and externally-managed via schedulerTopologyRef),
and PCS clusterTopologyName reference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…Configuration

Replace single topology levels with named TopologyProfiles array. Scheduler
backend topology references move into each SchedulerProfile's backend-specific
config (aligned with GREP-375). Update validation, operator startup, topology
configuration updates, and limitations/risks sections. Remove "Two Management
Paths" risk section — all ClusterTopologies are now operator-managed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
All ClusterTopology resources are created by the operator from configured
topology profiles. Remove admin-created CT path, update lifecycle section
with operator-only mermaid diagram, invert Alternatives section (hybrid model
is now the alternative). Clarify auto-managed vs referenced scheduler backend
topology: profiles NOT in topologyReferences get auto-created CRs, profiles
IN topologyReferences use externally managed resources with drift detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Scheduler backend topology mapping is now configured via topologyReferences
in each scheduler profile's backend-specific config (OperatorConfiguration),
not on the ClusterTopology CRD itself. Remove SchedulerTopologyRef field and
SchedulerTopologyReference type. Update status condition and Dependencies
section to reference the new topologyReferences mechanism.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Rename clusterTopologyName -> topologyProfileName throughout. Make the field
required when any TopologyConstraint is set — no implicit default topology.
Remove TopologyLevelsUnavailable PCS status condition: immutable levels +
finalizers make the condition unreachable in normal operation. Update Rule-3,
PodGang resolution, backward compatibility, monitoring kubectl example, and
test plan to reflect the new model.
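
The "required when any TopologyConstraint is set" rule in this commit could be sketched as a validation helper along the following lines. This is only an illustration, not the webhook code from the proposal: the struct shapes, the `Cliques` field, and the function name `validateTopologyProfileName` are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// TopologyConstraint is a simplified, hypothetical stand-in for the Grove type.
type TopologyConstraint struct {
	PackDomain string
}

// CliqueTemplate is a hypothetical per-clique template that may carry its own constraint.
type CliqueTemplate struct {
	Name               string
	TopologyConstraint *TopologyConstraint
}

// PodCliqueSetTemplate is a hypothetical, reduced view of the PCS template.
type PodCliqueSetTemplate struct {
	TopologyProfileName string
	TopologyConstraint  *TopologyConstraint
	Cliques             []CliqueTemplate
}

var errMissingProfile = errors.New("topologyProfileName is required when a topologyConstraint is set")

// validateTopologyProfileName rejects a template that sets a constraint at any
// level without naming a topology profile (no implicit default topology).
func validateTopologyProfileName(t PodCliqueSetTemplate) error {
	if t.TopologyProfileName != "" {
		return nil
	}
	if t.TopologyConstraint != nil {
		return errMissingProfile
	}
	for _, c := range t.Cliques {
		if c.TopologyConstraint != nil {
			return fmt.Errorf("clique %q: %w", c.Name, errMissingProfile)
		}
	}
	return nil
}

func main() {
	t := PodCliqueSetTemplate{
		Cliques: []CliqueTemplate{
			{Name: "prefill", TopologyConstraint: &TopologyConstraint{PackDomain: "zone"}},
		},
	}
	fmt.Println(validateTopologyProfileName(t))
}
```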

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Fix collateral damage from replace_all (topologyProfiles JSON tag and
YAML key incorrectly renamed to topologyProfileNames), update story-4
to use topologyProfileName field, remove stale "default topology"
references, and unify "operator fails to start" language throughout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Instead of blocking operator startup with finalizers when ClusterTopology
resources need to be deleted or recreated, the operator now proceeds with
deletion and the PCS reconciler sets a TopologyLevelsUnavailable condition
on affected PodCliqueSets. Invalid topology constraints are removed from
PodGang resources (graceful degradation).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Regenerate the architecture diagram to reflect the current design:
named topologyProfiles, all operator-managed CTs, topologyProfileName
on PCS, no finalizers, scheduler topologyReferences in config.
Promote PodCliqueSet Status Conditions from bold text to heading
for proper anchor linking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
The CEL immutability constraint (self == oldSelf) on spec.levels is
unnecessary since the TopologyLevelsUnavailable condition already
handles domain removal gracefully — stripping invalid PodGang
constraints without evicting running pods. This avoids forcing
administrators through delete+recreate workflows when updating
topology levels.

For auto-managed scheduler backend topologies, the CT controller
still deletes and recreates the downstream resource when levels
change (since KAI Topology has its own immutability constraint).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Follow Kubernetes condition naming convention: conditions should use
negative polarity (True = problem) except Ready. Flipped polarity so
True = drift detected, False = all backends in sync.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…rvedVersion

Rename the observedGeneration field in SchedulerTopologyStatus to
schedulerBackendTopologyObservedVersion to clearly distinguish it
from the ClusterTopology-level observedGeneration field.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
topologyName on PodCliqueSet is now immutable after creation rather
than only after scheduling. Users must delete and recreate the PCS to
change topology. This may be relaxed in the future when the Grove
scheduler backend is implemented, enabling safe re-resolution of
topology references while pods are pending.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Introduce PodCliqueSetTopologyConstraint that extends TopologyConstraint
with the topologyName field. This groups the topology reference with
the constraint at the PCS level. PodClique and PodCliqueScalingGroup
continue to use the base TopologyConstraint (no topologyName).

YAML changes from:
  template:
    topologyName: h100-topology
    topologyConstraint:
      packDomain: zone

to:
  template:
    topologyConstraint:
      topologyName: h100-topology
      packDomain: zone

Also makes topologyName fully immutable after creation (previously
mutable while pending). Story 5 (topology retry) is deferred until
the Grove scheduler backend is implemented.
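
As a rough Go sketch of the type relationship described in this commit, the PCS-level constraint can embed the base constraint and add the topology reference. The field set, JSON tags, and the helper `validateTopologyNameImmutable` are assumptions for illustration; only the type names `PodCliqueSetTopologyConstraint` and `TopologyConstraint` and the `topologyName` and `packDomain` keys come from the commit message. The name `a100-topology` is made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TopologyConstraint is the base constraint used by PodClique and
// PodCliqueScalingGroup (reduced to one field for this sketch).
type TopologyConstraint struct {
	PackDomain string `json:"packDomain,omitempty"`
}

// PodCliqueSetTopologyConstraint extends the base constraint with topologyName,
// so the topology reference sits next to the constraint at the PCS level.
type PodCliqueSetTopologyConstraint struct {
	TopologyName       string `json:"topologyName,omitempty"`
	TopologyConstraint `json:",inline"`
}

// topologyNameOf treats a missing constraint as an empty topology name.
func topologyNameOf(tc *PodCliqueSetTopologyConstraint) string {
	if tc == nil {
		return ""
	}
	return tc.TopologyName
}

// validateTopologyNameImmutable is a hypothetical update check: the name may
// not change after creation, so the PCS must be deleted and recreated.
func validateTopologyNameImmutable(oldTC, newTC *PodCliqueSetTopologyConstraint) error {
	if topologyNameOf(oldTC) != topologyNameOf(newTC) {
		return fmt.Errorf("topologyConstraint.topologyName is immutable: %q -> %q",
			topologyNameOf(oldTC), topologyNameOf(newTC))
	}
	return nil
}

// render serializes the constraint to show the flattened layout.
func render(tc PodCliqueSetTopologyConstraint) string {
	b, err := json.Marshal(tc)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	tc := PodCliqueSetTopologyConstraint{
		TopologyName:       "h100-topology",
		TopologyConstraint: TopologyConstraint{PackDomain: "zone"},
	}
	fmt.Println(render(tc))

	renamed := tc
	renamed.TopologyName = "a100-topology" // hypothetical second topology
	fmt.Println(validateTopologyNameImmutable(&tc, &renamed))
}
```

Embedding keeps `packDomain` at the same nesting level as `topologyName`, which matches the "to:" YAML layout shown above.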

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
- Use "provided" instead of "populated" for schedulerReferences and
  clarify that drift detection compares domain/key pairs and their order
- Remove "not by a fixed global order" from hierarchy note
- Simplify the hierarchy strictness example to a single sentence
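
One way to read "drift detection compares domain/key pairs and their order" is as an element-wise comparison of two ordered level lists. The sketch below assumes a `TopologyLevel` struct with `Domain` and `Key` fields and a helper named `hasDrift`; both are illustrative, and the label keys in `main` are examples only.

```go
package main

import "fmt"

// TopologyLevel pairs a topology domain with the node label key that identifies it.
// The shape is an assumption for this sketch.
type TopologyLevel struct {
	Domain string
	Key    string
}

// hasDrift reports whether the externally managed topology differs from the
// expected one. Both the domain/key pairs and their order must match.
func hasDrift(expected, actual []TopologyLevel) bool {
	if len(expected) != len(actual) {
		return true
	}
	for i := range expected {
		if expected[i] != actual[i] {
			return true
		}
	}
	return false
}

func main() {
	expected := []TopologyLevel{
		{Domain: "zone", Key: "topology.kubernetes.io/zone"},
		{Domain: "host", Key: "kubernetes.io/hostname"},
	}
	// Same pairs, different order: still counts as drift.
	reordered := []TopologyLevel{expected[1], expected[0]}

	fmt.Println(hasDrift(expected, expected))
	fmt.Println(hasDrift(expected, reordered))
}
```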

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…yName

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Signed-off-by: Erez Freiberger <enoodle@gmail.com>
The CEL uniqueness validation rules on ClusterTopology.spec.levels use
self.all(x, self.filter(...)) which has O(n²) cost estimation. Without
a MaxItems bound, the Kubernetes API server rejects the CRD because the
estimated rule cost exceeds the validation budget by >100x.

MaxItems=16 provides a generous upper bound for real-world topology
hierarchies (most have 3-6 levels) while keeping the CEL cost within
the Kubernetes validation budget.
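
To make the quadratic cost concrete, the Go sketch below mirrors the general shape of an `all` over a `filter`: every element triggers a scan of the whole list. The inner predicate (matching on `Domain`) is an assumption, since the commit message elides the actual filter expression, and `uniqueDomains` and `makeLevels` are hypothetical helpers. In a kubebuilder-based API the bound itself is typically declared with the `+kubebuilder:validation:MaxItems=16` marker on the field.

```go
package main

import "fmt"

// maxLevels is the upper bound on spec.levels named in the commit message.
const maxLevels = 16

// TopologyLevel is a simplified stand-in for one entry of spec.levels.
type TopologyLevel struct {
	Domain string
	Key    string
}

// uniqueDomains checks that every domain appears exactly once, using the same
// nested-scan shape as an all-over-filter rule. It also returns the number of
// element comparisons performed, which grows as n*n for a valid list.
func uniqueDomains(levels []TopologyLevel) (bool, int) {
	comparisons := 0
	for _, x := range levels {
		matches := 0
		for _, y := range levels {
			comparisons++
			if y.Domain == x.Domain {
				matches++
			}
		}
		if matches != 1 {
			return false, comparisons
		}
	}
	return true, comparisons
}

// makeLevels builds n levels with distinct, made-up domain names.
func makeLevels(n int) []TopologyLevel {
	levels := make([]TopologyLevel, 0, n)
	for i := 0; i < n; i++ {
		levels = append(levels, TopologyLevel{
			Domain: fmt.Sprintf("domain-%d", i),
			Key:    fmt.Sprintf("example.invalid/level-%d", i),
		})
	}
	return levels
}

func main() {
	ok, comparisons := uniqueDomains(makeLevels(maxLevels))
	fmt.Println(ok, comparisons)
}
```

With the list capped at 16 entries, the worst case is 256 comparisons per rule, which is the kind of fixed bound the API server's cost estimator needs.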

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
Describes how the ClusterTopology controller integrates with the
Scheduler Backend Framework (GREP-375) for topology CRD management:

- TopologyAwareSchedBackend optional interface with TopologyGVR(),
  SyncTopology(), OnTopologyDelete(), and CheckTopologyDrift()
- CT controller registers dynamic watches at startup by querying all
  registered TopologyAwareSchedBackend implementations for their GVR
- On every reconcile, all enabled TopologyAwareSchedBackends are iterated:
  auto-managed (not in schedulerReferences) calls SyncTopology();
  externally-managed (in schedulerReferences) calls CheckTopologyDrift()
- Updated Dependencies section to reference the new interface
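
The reconcile dispatch in the list above might look roughly like the following sketch. Only the interface name and the four method names come from the commit message; the method signatures, the extra `Name()` method, the `GVR` stand-in type, the map-shaped `SchedulerReferences` field, and `fakeBackend` are all assumptions made so the example is self-contained.

```go
package main

import "fmt"

// GVR is a stand-in for a Kubernetes GroupVersionResource.
type GVR struct{ Group, Version, Resource string }

// ClusterTopology is a reduced, hypothetical view of the CR.
type ClusterTopology struct {
	Name   string
	Levels []string
	// SchedulerReferences maps a backend name to an externally managed
	// topology resource. The map shape is an assumption.
	SchedulerReferences map[string]string
}

// TopologyAwareSchedBackend is the optional interface from the commit message.
// Signatures and Name() are illustrative assumptions.
type TopologyAwareSchedBackend interface {
	Name() string
	TopologyGVR() GVR
	SyncTopology(ct ClusterTopology) error
	OnTopologyDelete(ct ClusterTopology) error
	CheckTopologyDrift(ct ClusterTopology, ref string) (bool, error)
}

// reconcileTopology iterates all enabled backends: auto-managed ones are
// synced, externally managed ones are only checked for drift.
func reconcileTopology(ct ClusterTopology, backends []TopologyAwareSchedBackend) (map[string]bool, error) {
	drift := make(map[string]bool)
	for _, b := range backends {
		ref, external := ct.SchedulerReferences[b.Name()]
		if !external {
			if err := b.SyncTopology(ct); err != nil {
				return nil, fmt.Errorf("sync %s: %w", b.Name(), err)
			}
			drift[b.Name()] = false
			continue
		}
		drifted, err := b.CheckTopologyDrift(ct, ref)
		if err != nil {
			return nil, fmt.Errorf("drift check %s: %w", b.Name(), err)
		}
		drift[b.Name()] = drifted
	}
	return drift, nil
}

// fakeBackend is a test double that records sync calls.
type fakeBackend struct {
	name    string
	synced  int
	drifted bool
}

func (f *fakeBackend) Name() string { return f.name }
func (f *fakeBackend) TopologyGVR() GVR {
	return GVR{Group: "example.invalid", Version: "v1", Resource: f.name + "-topologies"}
}
func (f *fakeBackend) SyncTopology(ClusterTopology) error     { f.synced++; return nil }
func (f *fakeBackend) OnTopologyDelete(ClusterTopology) error { return nil }
func (f *fakeBackend) CheckTopologyDrift(ClusterTopology, string) (bool, error) {
	return f.drifted, nil
}

func main() {
	auto := &fakeBackend{name: "auto"}
	ext := &fakeBackend{name: "external", drifted: true}
	ct := ClusterTopology{
		Name:                "h100-topology",
		Levels:              []string{"zone", "rack", "host"},
		SchedulerReferences: map[string]string{"external": "preexisting-topology"},
	}
	drift, err := reconcileTopology(ct, []TopologyAwareSchedBackend{auto, ext})
	fmt.Println(drift, err, auto.synced)
}
```

Watch registration at startup would call `TopologyGVR()` on each registered backend; that part is omitted here because it depends on the controller runtime in use.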

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
…Unknown to True

If the referenced ClusterTopology no longer exists, the topology levels
are definitively unavailable — True is more accurate than Unknown, which
implies indeterminate state.
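
A minimal sketch of the resulting status computation, assuming the PCS reconciler has already looked up the referenced ClusterTopology and passes `nil` when it does not exist. The function name, the reduced `ClusterTopology` shape, and the reason strings are hypothetical; only the `TopologyLevelsUnavailable` condition and the choice of True over Unknown come from the commit messages.

```go
package main

import "fmt"

// ConditionStatus mirrors the usual Kubernetes condition status values.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

// ClusterTopology is a reduced, hypothetical view listing the defined domains.
type ClusterTopology struct {
	Name    string
	Domains []string
}

// topologyLevelsUnavailable computes the status and a reason for the
// TopologyLevelsUnavailable condition. A nil ct means the referenced
// ClusterTopology no longer exists, which is reported as True, not Unknown.
func topologyLevelsUnavailable(ct *ClusterTopology, required []string) (ConditionStatus, string) {
	if ct == nil {
		return ConditionTrue, "ClusterTopologyNotFound" // hypothetical reason
	}
	known := make(map[string]struct{}, len(ct.Domains))
	for _, d := range ct.Domains {
		known[d] = struct{}{}
	}
	for _, d := range required {
		if _, ok := known[d]; !ok {
			return ConditionTrue, "TopologyDomainNotFound" // hypothetical reason
		}
	}
	return ConditionFalse, "AllTopologyLevelsAvailable" // hypothetical reason
}

func main() {
	ct := &ClusterTopology{Name: "h100-topology", Domains: []string{"zone", "rack", "host"}}

	fmt.Println(topologyLevelsUnavailable(ct, []string{"zone"}))
	fmt.Println(topologyLevelsUnavailable(ct, []string{"block"}))
	fmt.Println(topologyLevelsUnavailable(nil, []string{"zone"}))
}
```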

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Erez Freiberger <enoodle@gmail.com>
@enoodle enoodle force-pushed the proposal/multi-topology-support branch from 4fa3722 to 133c5df on March 30, 2026 12:03
@danbar2 danbar2 merged commit f41db79 into ai-dynamo:main Mar 31, 2026
12 checks passed

4 participants