Grove proposal/topology #224

Merged
Ronkahn21 merged 19 commits into ai-dynamo:main from Ronkahn21:grove-proposal/topology
Nov 2, 2025

@Ronkahn21

Contributor

What type of PR is this?

/kind documentation
/kind feature

What this PR does / why we need it:

This PR adds a comprehensive design document for topology-aware scheduling in the Grove operator. The design introduces a flexible topology system that enables optimal placement of multinode inference workloads based on cluster network topology.

Key components:

  • TopologyDomain CRD: Admin-configured cluster topology hierarchy mapping friendly names to node labels
  • Operator Configuration: Selects active topology via --topology-domain-name argument
  • TopologyConstraint API: User-specified packing requirements in workloads (PodCliqueSet, PodCliqueScalingGroup, PodClique)
  • Automatic Optimization: Out-of-the-box topology optimization via auto-generated preferred constraints
  • KAI Scheduler Integration: Automatic Kueue Topology generation and three-level constraint translation

The design addresses critical requirements for multinode inference workloads including network locality, coordinated placement, and latency optimization.
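
As a rough illustration of the first and third components above, the sketch below models a topology hierarchy that maps friendly level names to node label keys, then checks whether a set of nodes satisfies a packing requirement at a given level. It is not taken from the design document: `TopologyLevel`, `LEVELS`, `label_for`, `is_packed`, the level names, and the `example.com/rack` label key are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopologyLevel:
    """One level of a hypothetical admin-configured hierarchy."""
    name: str        # friendly name a workload would reference
    node_label: str  # node label key the admin maps it to


# Ordered from broadest to narrowest. Names and label keys are
# illustrative assumptions, not the ones defined by the design.
LEVELS = [
    TopologyLevel("zone", "topology.kubernetes.io/zone"),
    TopologyLevel("rack", "example.com/rack"),
    TopologyLevel("host", "kubernetes.io/hostname"),
]


def label_for(level_name: str) -> str:
    """Resolve a friendly level name to its node label key."""
    for level in LEVELS:
        if level.name == level_name:
            return level.node_label
    raise KeyError(f"unknown topology level: {level_name}")


def is_packed(nodes: list[dict[str, str]], level_name: str) -> bool:
    """Return True if every node carries the same value for the level's label."""
    key = label_for(level_name)
    values = {labels.get(key) for labels in nodes}
    # Exactly one distinct value, and no node missing the label.
    return len(values) == 1 and None not in values
```

With two nodes in the same rack but on different hosts, `is_packed(nodes, "rack")` holds while `is_packed(nodes, "host")` does not, which is the kind of distinction a packing requirement expresses.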

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

  • This is a design document only - no implementation changes
  • Design includes detailed API specifications, validation rules, and scheduler contract
  • Addresses review feedback including RBAC, runtime behavior, and edge cases
  • Immutability enforced at multiple levels to ensure scheduling consistency
  • Explicit failure modes prevent silent degradation of topology features
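
The last two notes can be pictured with a minimal validation sketch. It assumes, purely for illustration, that a workload references a topology level by name, that an unknown level is rejected at creation rather than ignored, and that the level cannot change on update. `TopologyValidationError`, `validate_create`, and `validate_update` are hypothetical names, not part of the Grove API.

```python
class TopologyValidationError(ValueError):
    """Raised instead of silently dropping a topology setting."""


def validate_create(known_levels: set[str], level: str) -> None:
    """Reject a constraint that names a level missing from the active topology."""
    if level not in known_levels:
        raise TopologyValidationError(
            f"level {level!r} is not defined in the active topology"
        )


def validate_update(old_level: str, new_level: str) -> None:
    """Reject any change to an existing constraint (immutability)."""
    if new_level != old_level:
        raise TopologyValidationError(
            "topology constraint is immutable after creation"
        )
```

Failing loudly in both paths is what keeps a misconfigured workload from being scheduled without the locality it asked for.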

Does this PR introduce an API change?

suggest change

@sanjaychatterjee sanjaychatterjee left a comment

Collaborator

Thanks for the PR. Added some comments to simplify the implementation.

@unmarshall unmarshall left a comment

Collaborator

I have only partly done the review. Posting comments in batches.

@renormalize renormalize left a comment

Contributor

thanks for the proposal!

1/n as I have yet to go through most of the document.

@kangclzjc

Contributor

For GB200, I'm not sure how to integrate the NVLink domain with this topology. Also, if I use nvidia-gpu-driver-plugin for IMEX, would it conflict with this topology?

gflarity previously approved these changes Oct 28, 2025

@gflarity gflarity left a comment

Contributor

Thanks @Ronkahn21! IMO we should simplify things and just go with the Kueue CR since it's required regardless. Having both just seems like more work for the Admin for very little benefit. I don't want to waste any time with a debate though if I can't convince you. 🚢 it :)

@sanjaychatterjee sanjaychatterjee left a comment

Collaborator

Thanks for the PR update. Mostly looks good to me. Once you fix the OperatorConfiguration CRD issue, I can approve.

@sanjaychatterjee sanjaychatterjee left a comment

Collaborator

Looks good to me. Thanks!

@Ronkahn21 Ronkahn21 left a comment

Contributor Author

.

Ronkahn21 and others added 3 commits November 2, 2025 08:54
…ator

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
typo fix

Co-authored-by: Roman Baron <91824211+romanbaron@users.noreply.github.com>
Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com>
Ronkahn21 and others added 16 commits November 2, 2025 08:54
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Co-authored-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com>
Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com>
…pologyDomain

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… for TopologyDomain

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…and constraints

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…and clarify level definitions

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…nfig and annotations

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Co-authored-by: Madhav Bhargava <madhav.bhargava@sap.com>
Signed-off-by: Sanjay Chatterjee <sanjay.chatterjee@gmail.com>
…umentation

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…usterTopology and clarify resource definitions

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… protection, and enable/disable behavior

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ming and clarify setup instructions

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…d clarify resource management

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
@sanjaychatterjee sanjaychatterjee dismissed unmarshall’s stale review November 2, 2025 17:03

All prior comments have been addressed.

@Ronkahn21 Ronkahn21 merged commit e3696f3 into ai-dynamo:main Nov 2, 2025
3 checks passed
9 participants