Skip to content

Feat/Topology Configuration Infrastructure#247

Merged
Ronkahn21 merged 26 commits into
ai-dynamo:mainfrom
Ronkahn21:feat/operator-topology-config
Nov 12, 2025
Merged

Feat/Topology Configuration Infrastructure#247
Ronkahn21 merged 26 commits into
ai-dynamo:mainfrom
Ronkahn21:feat/operator-topology-config

Conversation

@Ronkahn21

Copy link
Copy Markdown
Contributor

PR Description: Topology Configuration Infrastructure

What type of PR is this?

/kind feature
/kind api

What this PR does / why we need it

This PR implements the foundational infrastructure for topology-aware scheduling in Grove by adding operator-level configuration and validation for topology support.

Key Changes:

  1. Operator Configuration: Adds TopologyConfiguration to operator config with:

    • enabled field to enable/disable topology features (defaults to false)
    • name field to specify ClusterTopology resource name (defaults to "grove-topology")
    • Validation that topology name is required when enabled
  2. Startup Validation: Operator validates topology configuration at startup:

    • When topology.enabled=true, verifies ClusterTopology resource exists
    • Exits with clear error if enabled but ClusterTopology not found
    • Prevents operator from running in invalid topology state
  3. RBAC Permissions: Updates cluster role to grant Grove operator read access to ClusterTopology resources

  4. Documentation: Adds TopologyConfiguration section to operator API reference

  5. Sample Configuration: Provides example topology configuration YAML

Why This Matters:

Topology-aware scheduling is critical for Grove's multi-node inference workloads because:

  • Network locality improves high-bandwidth communication between leaders and workers
  • Coordinated placement of related components (model shards) improves performance
  • Minimizing network hops between inference components reduces latency

This PR establishes the operator-side foundation that will enable workloads to specify topology constraints in future PRs.

Special notes for your reviewer

Implementation Scope:

This is Phase 1 of topology-aware scheduling implementation, focusing on operator configuration infrastructure. It includes:

  • Operator config schema and defaults
  • Validation logic for topology settings
  • Startup-time ClusterTopology existence checks
  • RBAC permissions

What's NOT in this PR:

Future PRs will add:

  • ClusterTopology CRD definition and controller
  • TopologyConstraint fields in PodCliqueSet/PodClique APIs
  • Webhook validation for topology constraint hierarchy
  • PodGang topology constraint translation

Testing Notes:

  • Added unit tests for config defaults and validation
  • Tests verify proper error messages and default values
  • Startup validation tested via config validation tests

Migration Path:

  • Default topology.enabled=false ensures backward compatibility
  • Existing workloads unaffected until topology explicitly enabled
  • No breaking changes to existing APIs

Does this PR introduce an API change?

Add topology configuration to Grove operator config, enabling foundational support for topology-aware scheduling. Topology support is disabled by default and requires explicit configuration.

Additional documentation

See design document: docs/designs/topology.md

Operator configuration reference updated with TopologyConfiguration section showing:
- How to enable topology support
- How to configure ClusterTopology resource name
- Validation requirements and startup behav

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…script

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ion script

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ion script

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ation

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…rror messages

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Comment thread operator/charts/templates/clusterrole.yaml Outdated
shayasoolin
shayasoolin previously approved these changes Nov 5, 2025
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Comment thread operator/api/config/v1alpha1/defaults.go Outdated
Comment thread operator/api/core/v1alpha1/crds/grove.io_topologydomains.yaml
Comment thread operator/api/config/validation/validation_test.go
Comment thread operator/cmd/main.go Outdated
Comment thread operator/api/config/validation/validation.go Outdated

@gflarity gflarity left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly just questions, but I think the validation one is work looking into.

Comment thread operator/api/config/validation/validation.go
- Add function documentation for validateTopologyConfiguration
- Validate topology name is valid K8s DNS subdomain when enabled
- Add test cases for invalid characters and DNS violations

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Comment thread operator/api/config/v1alpha1/types.go Outdated
Comment thread operator/api/config/validation/validation.go Outdated
Comment thread operator/api/config/validation/validation.go Outdated
Comment thread operator/api/config/v1alpha1/types.go Outdated
Comment thread operator/api/config/validation/validation.go
Comment thread operator/charts/templates/clusterrole.yaml Outdated
Comment thread operator/charts/values.yaml Outdated
Comment thread operator/cmd/main.go Outdated
Comment thread operator/cmd/opts/options.go Outdated
Comment thread operator/samples/clustertopology/cluster-topology-host-only.yaml

@Ronkahn21 Ronkahn21 left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Comment thread operator/api/config/validation/validation.go Outdated
Comment thread operator/cmd/main.go Outdated
Comment thread operator/samples/clustertopology/cluster-topology-host-only.yaml
Ronkahn21 and others added 6 commits November 12, 2025 15:18
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@unmarshall

Copy link
Copy Markdown
Collaborator

@Ronkahn21 Your last commit (e213b73) is a merge commit. Can we avoid merge commits and prefer rebasing your fork branch over main as this will create a more linear and clearer commit history?

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>

@unmarshall unmarshall left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

@Ronkahn21 Ronkahn21 merged commit 9280705 into ai-dynamo:main Nov 12, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants