Grove proposal/topology#224
Conversation
sanjaychatterjee
left a comment
Thanks for the PR. Added some comments to simplify the implementation.
unmarshall
left a comment
I have only partly done the review. Posting comments in batches.
renormalize
left a comment
thanks for the proposal!
1/n as I have yet to go through most of the document.
And for GB200, I'm not sure how to integrate the NVLink domain with this topology. And if I use nvidia-gpu-driver-plugin for IMEX, would it conflict with this topology?
gflarity
left a comment
Thanks @Ronkahn21! IMO we should simplify things and just go with the Kueue CR since it's required regardless. Having both just seems like more work for the Admin for very little benefit. I don't want to waste any time with a debate though if I can't convince you. 🚢 it :)
sanjaychatterjee
left a comment
Thanks for the PR update. Mostly looks good to me. Once you fix the OperatorConfiguration CRD issue, I can approve.
sanjaychatterjee
left a comment
Looks good to me. Thanks!
4068734 to b683fb1
…ator Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
typo fix Co-authored-by: Roman Baron <91824211+romanbaron@users.noreply.github.com> Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Co-authored-by: Saketh Kalaga <51327242+renormalize@users.noreply.github.com> Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com>
…pologyDomain Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… for TopologyDomain Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…and constraints Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…and clarify level definitions Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…nfig and annotations Signed-off-by: Ron Kahn <rkahn@nvidia.com>
Co-authored-by: Madhav Bhargava <madhav.bhargava@sap.com> Signed-off-by: Sanjay Chatterjee <sanjay.chatterjee@gmail.com>
…umentation Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…usterTopology and clarify resource definitions Signed-off-by: Ron Kahn <rkahn@nvidia.com>
… protection, and enable/disable behavior Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ming and clarify setup instructions Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…d clarify resource management Signed-off-by: Ron Kahn <rkahn@nvidia.com>
d6f7b9d to 7e7878f
All prior comments have been addressed.
What type of PR is this?
/kind documentation
/kind feature
What this PR does / why we need it:
This PR adds a comprehensive design document for topology-aware scheduling in the Grove operator. The design introduces a flexible topology system that enables optimal placement of multinode inference workloads based on cluster network topology.
Key components:
--topology-domain-name argument
The design addresses critical requirements for multinode inference workloads, including network locality, coordinated placement, and latency optimization.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce an API change?