Feat/Topology Configuration Infrastructure #247
Merged
Ronkahn21 merged 26 commits on Nov 12, 2025
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…script Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ion script Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ion script Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ation Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…rror messages Signed-off-by: Ron Kahn <rkahn@nvidia.com>
shayasoolin
reviewed
Nov 5, 2025
shayasoolin
previously approved these changes
Nov 5, 2025
gflarity
reviewed
Nov 6, 2025
gflarity
reviewed
Nov 6, 2025
gflarity
reviewed
Nov 6, 2025
gflarity
requested changes
Nov 6, 2025
gflarity
left a comment
Contributor
Mostly just questions, but I think the validation one is worth looking into.
gflarity
reviewed
Nov 6, 2025
- Add function documentation for validateTopologyConfiguration
- Validate topology name is valid K8s DNS subdomain when enabled
- Add test cases for invalid characters and DNS violations
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
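The DNS subdomain check described in this commit could look roughly like the sketch below. It is not the PR's code: `validateTopologyName` is a hypothetical helper, and the regular expression is a standard-library restatement of the RFC 1123 subdomain rule (lowercase alphanumerics, `-` and `.`, each label starting and ending with an alphanumeric, at most 253 characters). A real operator would more likely call `validation.IsDNS1123Subdomain` from `k8s.io/apimachinery/pkg/util/validation`; that is an assumption about the implementation, not something the PR text confirms.

```go
package main

import (
	"fmt"
	"regexp"
)

// maxSubdomainLen is the RFC 1123 subdomain length limit used for Kubernetes names.
const maxSubdomainLen = 253

// subdomainRE matches dot-separated labels made of lowercase alphanumerics
// and '-', where every label starts and ends with an alphanumeric character.
var subdomainRE = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// validateTopologyName is a hypothetical helper (not taken from the PR).
// It only checks the name when topology support is enabled.
func validateTopologyName(enabled bool, name string) error {
	if !enabled {
		// Disabled topology: the name is never used, so skip validation.
		return nil
	}
	if name == "" {
		return fmt.Errorf("topology name must not be empty when topology is enabled")
	}
	if len(name) > maxSubdomainLen {
		return fmt.Errorf("topology name %q exceeds %d characters", name, maxSubdomainLen)
	}
	if !subdomainRE.MatchString(name) {
		return fmt.Errorf("topology name %q is not a valid DNS subdomain", name)
	}
	return nil
}

func main() {
	// Print the validation result for a valid name and two invalid ones.
	for _, name := range []string{"grove-topology", "Grove_Topology", "-bad"} {
		fmt.Printf("%q: %v\n", name, validateTopologyName(true, name))
	}
}
```

Skipping the check when topology is disabled keeps an unused or placeholder name from blocking operator startup.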
unmarshall
requested changes
Nov 12, 2025
Ronkahn21
commented
Nov 12, 2025
Collaborator
@Ronkahn21 Your last commit (e213b73) is a merge commit. Can we avoid merge commits and instead rebase your fork branch over main? That would give a more linear and clearer commit history.
gflarity
approved these changes
Nov 12, 2025
PR Description: Topology Configuration Infrastructure
What type of PR is this?
/kind feature
/kind api
What this PR does / why we need it
This PR implements the foundational infrastructure for topology-aware scheduling in Grove by adding operator-level configuration and validation for topology support.
Key Changes:
- Operator Configuration: Adds `TopologyConfiguration` to operator config with:
  - `enabled` field to enable/disable topology features (defaults to `false`)
  - `name` field to specify ClusterTopology resource name (defaults to `"grove-topology"`)
- Startup Validation: Operator validates topology configuration at startup:
  - When `topology.enabled=true`, verifies ClusterTopology resource exists
- RBAC Permissions: Updates cluster role to grant Grove operator read access to ClusterTopology resources
- Documentation: Adds TopologyConfiguration section to operator API reference
- Sample Configuration: Provides example topology configuration YAML
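As a rough illustration of how the defaults and the startup check fit together, here is a minimal Go sketch. Only the `enabled` and `name` fields and their defaults come from the description above; the struct layout, `setDefaults`, `validateAtStartup`, the `topologyGetter` interface, and the in-memory `fakeCluster` are hypothetical stand-ins for the operator's real config types and Kubernetes client.

```go
package main

import "fmt"

// defaultTopologyName is the documented default for the name field.
const defaultTopologyName = "grove-topology"

// TopologyConfiguration mirrors the two fields named in the PR description.
// The Go layout is an assumption made for illustration.
type TopologyConfiguration struct {
	Enabled bool   // defaults to false (Go zero value)
	Name    string // defaults to "grove-topology"
}

// setDefaults fills in the documented default name when none is given.
func setDefaults(cfg *TopologyConfiguration) {
	if cfg.Name == "" {
		cfg.Name = defaultTopologyName
	}
}

// topologyGetter is a hypothetical stand-in for a client that can
// look up ClusterTopology resources by name.
type topologyGetter interface {
	Exists(name string) (bool, error)
}

// validateAtStartup checks that the ClusterTopology exists, but only
// when topology support is enabled.
func validateAtStartup(cfg TopologyConfiguration, g topologyGetter) error {
	if !cfg.Enabled {
		return nil
	}
	found, err := g.Exists(cfg.Name)
	if err != nil {
		return fmt.Errorf("looking up ClusterTopology %q: %w", cfg.Name, err)
	}
	if !found {
		return fmt.Errorf("topology is enabled but ClusterTopology %q does not exist", cfg.Name)
	}
	return nil
}

// fakeCluster is an in-memory topologyGetter used for the demo below.
type fakeCluster map[string]bool

func (f fakeCluster) Exists(name string) (bool, error) { return f[name], nil }

func main() {
	cfg := TopologyConfiguration{Enabled: true}
	setDefaults(&cfg)

	// The first call fails (resource missing); the second succeeds.
	fmt.Println(validateAtStartup(cfg, fakeCluster{}))
	fmt.Println(validateAtStartup(cfg, fakeCluster{defaultTopologyName: true}))
}
```

Because the check returns early when `Enabled` is false, an operator running with the default configuration never needs a ClusterTopology resource to be present.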
Why This Matters:
Topology-aware scheduling is critical for Grove's multi-node inference workloads because:
This PR establishes the operator-side foundation that will enable workloads to specify topology constraints in future PRs.
Special notes for your reviewer
Implementation Scope:
This is Phase 1 of topology-aware scheduling implementation, focusing on operator configuration infrastructure. It includes:
What's NOT in this PR:
Future PRs will add:
Testing Notes:
Migration Path:
`topology.enabled=false` (the default) ensures backward compatibility
Does this PR introduce an API change?
Additional documentation