Feat/create cluster topology and KAI topology#298

Merged
unmarshall merged 23 commits into ai-dynamo:main from Ronkahn21:feat/configuraion-api-changes on Dec 30, 2025

@Ronkahn21

Contributor

What type of PR is this?

/kind feature
/kind api

What this PR does / why we need it:

This PR enables automatic ClusterTopology and KAI Topology resource creation from operator configuration at startup.

Key Changes:

  • Configuration API: Replaced clusterTopology.name field with clusterTopology.levels array to directly define topology hierarchy
  • Topology Manager: New package (internal/topology) that creates/updates both ClusterTopology and KAI Topology CRs at operator startup
  • Validation Migration: Moved ClusterTopology validation from webhook to CRD using CEL validation rules (domain/key uniqueness)
  • RBAC: Added permissions for kai.scheduler/topologies resource
  • Ownership Management: KAI Topology is now owned by ClusterTopology with proper lifecycle management
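The domain/key uniqueness rule mentioned above can be illustrated with a small Go check. This is only a sketch of the rule itself, not the CEL expression or the validation code from this PR; the `TopologyLevel` struct, the `validateLevels` function, and the sample domains and label keys are assumptions made for illustration.

```go
package main

import "fmt"

// TopologyLevel is a hypothetical mirror of one clusterTopology.levels entry.
type TopologyLevel struct {
	Domain string // topology domain name (the sample values below are assumed)
	Key    string // node label key that identifies the domain
}

// validateLevels returns an error if any domain or key appears more than once.
func validateLevels(levels []TopologyLevel) error {
	domains := make(map[string]int, len(levels))
	keys := make(map[string]int, len(levels))
	for i, lvl := range levels {
		if j, dup := domains[lvl.Domain]; dup {
			return fmt.Errorf("levels[%d].domain %q duplicates levels[%d]", i, lvl.Domain, j)
		}
		if j, dup := keys[lvl.Key]; dup {
			return fmt.Errorf("levels[%d].key %q duplicates levels[%d]", i, lvl.Key, j)
		}
		domains[lvl.Domain] = i
		keys[lvl.Key] = i
	}
	return nil
}

func main() {
	levels := []TopologyLevel{
		{Domain: "zone", Key: "topology.kubernetes.io/zone"},
		{Domain: "host", Key: "kubernetes.io/hostname"},
	}
	if err := validateLevels(levels); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println("levels are valid")
}
```

Tracking the index of the first occurrence lets the error message point at both offending entries, which is usually more helpful than a bare "duplicate" message.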

Why we need this:

  • Eliminates manual ClusterTopology CR creation step
  • Ensures ClusterTopology and KAI Topology stay in sync
  • Validates topology configuration before operator starts
  • Simplifies deployment and reduces operational complexity

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Key Review Points:

  • Topology manager creates ClusterTopology with fixed name (grove-topology)
  • KAI Topology is deleted and recreated when levels change, because levels is an immutable field (it will be created by the cluster topology operator later)
  • Operator fails to start if topology creation fails when enabled
  • ClusterTopology validating webhook removed (validation now in CRD)
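The review points above describe a startup flow that can be sketched roughly as follows. This is not the PR's implementation: the in-memory `Store` stands in for the Kubernetes API, and `syncTopologies`, `startOperator`, `FailWrites`, and `Recreations` are hypothetical names introduced only to make the behaviour observable. Only the fixed name `grove-topology` comes from the PR description.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// clusterTopologyName is the fixed name stated in the PR description.
const clusterTopologyName = "grove-topology"

type Level struct {
	Domain string
	Key    string
}

type ClusterTopology struct {
	Name   string
	Levels []Level
}

type KAITopology struct {
	Owner  string // name of the owning ClusterTopology
	Levels []Level
}

// Store is a hypothetical in-memory stand-in for the Kubernetes API server.
type Store struct {
	CT          *ClusterTopology
	KAI         *KAITopology
	Recreations int  // how many times the KAI Topology was deleted and recreated
	FailWrites  bool // when true, every write fails (simulated API error)
}

func (s *Store) checkWrite() error {
	if s.FailWrites {
		return errors.New("simulated API error")
	}
	return nil
}

// syncTopologies creates or updates the ClusterTopology, then creates the
// KAI Topology, recreating it if its (immutable) levels differ.
func syncTopologies(s *Store, levels []Level) error {
	if err := s.checkWrite(); err != nil {
		return fmt.Errorf("ensure ClusterTopology %q: %w", clusterTopologyName, err)
	}
	want := append([]Level(nil), levels...) // copy so callers cannot mutate stored state
	s.CT = &ClusterTopology{Name: clusterTopologyName, Levels: want}

	if s.KAI != nil && reflect.DeepEqual(s.KAI.Levels, want) {
		return nil // already in sync
	}
	if s.KAI != nil {
		s.KAI = nil // delete: levels cannot be updated in place
		s.Recreations++
	}
	s.KAI = &KAITopology{Owner: clusterTopologyName, Levels: want}
	return nil
}

// startOperator aborts startup when the feature is enabled and syncing fails.
func startOperator(s *Store, topologyEnabled bool, levels []Level) error {
	if !topologyEnabled {
		return nil
	}
	if err := syncTopologies(s, levels); err != nil {
		return fmt.Errorf("operator startup aborted: %w", err)
	}
	return nil
}

func main() {
	s := &Store{}
	levels := []Level{{Domain: "zone", Key: "topology.kubernetes.io/zone"}}
	fmt.Println("first start:", startOperator(s, true, levels))
	levels = append(levels, Level{Domain: "host", Key: "kubernetes.io/hostname"})
	fmt.Println("restart with new levels:", startOperator(s, true, levels))
	fmt.Println("KAI recreations:", s.Recreations)
}
```

In a real operator the owner link would presumably be an owner reference on the KAI Topology object, so that deleting the ClusterTopology garbage-collects it; the `Owner` string here only models that relationship.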

Does this PR introduce an API change?

Yes

action required: The operator configuration API for cluster topology has changed. Replace clusterTopology.name with a clusterTopology.levels array that defines the topology hierarchy. The operator now automatically creates and manages ClusterTopology and KAI Topology resources from this configuration.
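To make the migration concrete, the sketch below decodes a configuration fragment in the new shape and rejects one that still uses the removed `name` field. The struct layout, the `enabled`, `domain`, and `key` JSON field names, the sample values, and the use of strict JSON decoding are all assumptions for illustration; the real operator configuration format and decoder may differ.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// TopologyLevel is an assumed shape for one clusterTopology.levels entry.
type TopologyLevel struct {
	Domain string `json:"domain"`
	Key    string `json:"key"`
}

// ClusterTopologyConfiguration is an assumed shape for the clusterTopology section.
type ClusterTopologyConfiguration struct {
	Enabled bool            `json:"enabled"`
	Levels  []TopologyLevel `json:"levels,omitempty"`
}

type operatorConfig struct {
	ClusterTopology ClusterTopologyConfiguration `json:"clusterTopology"`
}

// parseClusterTopology decodes strictly, so leftover fields such as the
// removed clusterTopology.name produce an error instead of being ignored.
func parseClusterTopology(data []byte) (ClusterTopologyConfiguration, error) {
	var cfg operatorConfig
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&cfg); err != nil {
		return ClusterTopologyConfiguration{}, fmt.Errorf("decode operator config: %w", err)
	}
	return cfg.ClusterTopology, nil
}

func main() {
	oldStyle := []byte(`{"clusterTopology": {"enabled": true, "name": "my-topology"}}`)
	newStyle := []byte(`{"clusterTopology": {"enabled": true, "levels": [
		{"domain": "zone", "key": "topology.kubernetes.io/zone"},
		{"domain": "host", "key": "kubernetes.io/hostname"}
	]}}`)

	if _, err := parseClusterTopology(oldStyle); err != nil {
		fmt.Println("old config rejected:", err)
	}
	cfg, err := parseClusterTopology(newStyle)
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Printf("new config accepted with %d levels\n", len(cfg.Levels))
}
```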

@Ronkahn21 Ronkahn21 changed the title from "Feat/configuraion api changes" to "Feat/create cluster topology and KAI topology" on Dec 24, 2025
…-driven CR creation

  Move validation logic from admission webhook to struct-level kubebuilder
  markers and CEL rules. Operator now creates/updates ClusterTopology CR
  from configuration at startup, eliminating need for pre-created resources.

  - Add CEL validation for domain/key uniqueness to ClusterTopology CRD
  - Add Pattern validation for Kubernetes label key format
  - Add Levels field to OperatorConfiguration.ClusterTopologyConfiguration
  - Remove admission webhook validation code and registration
  - Implement ensureClusterTopology() to manage CR lifecycle at startup
  - Update configuration validation with domain/key uniqueness checks
  - Remove ClusterTopology webhook from cert management
  - Update tests to reflect webhook removal
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…tterns

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…le permissions

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…alidation

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…istency

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@Ronkahn21 Ronkahn21 force-pushed the feat/configuraion-api-changes branch from 5d29c83 to 9c3353b Compare December 24, 2025 21:44
Ronkahn21 and others added 5 commits December 25, 2025 10:36
… logic for topology management

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
disable the feature by default
* Removed duplicate definition of TopologyDomain and TopologyLevel
  across config and core API.
* Simplified sorting and removed unnecessary functions.
* Added sample values for levels in values.yaml
* Improved validations for OperatorConfiguration.ClusterTopology
* Added missing JSON tag for LeaderElection in OperatorConfiguration
* Refactored operator/cmd
* Removed unused install-helm-charts script
* Removed Makefile target added for running a specific int test.
* Renamed and refactored internal/topology to internal/clustertopology.
* Refactored application version handling.

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Comment thread operator/api/core/v1alpha1/clustertopology.go
Comment thread operator/api/config/v1alpha1/types.go
Comment thread operator/cmd/cli/cli.go
Comment thread operator/api/core/v1alpha1/clustertopology.go
Comment thread operator/internal/utils/ioutil/ioutil.go
Comment thread operator/internal/version/version.go
unmarshall and others added 6 commits December 29, 2025 10:45
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
tag. Fixing that uncovered a missing role and role binding. This commit
adds the missing role and role binding, allowing Grove to operate on leases

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
…e test run, to be changed later

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
…o longer set

Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Madhav Bhargava <madhav.bhargava@sap.com>
…binding in values.yaml

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@unmarshall unmarshall merged commit f8852f0 into ai-dynamo:main Dec 30, 2025
4 checks passed

3 participants