New TAS Design #288
Merged
shayasoolin
reviewed
Dec 10, 2025
…tion and lifecycle management Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…and flexible ordering Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…and flexible ordering Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…and flexible ordering Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…uration changes Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ype and terminology Signed-off-by: Ron Kahn <rkahn@nvidia.com>
56ba252 to
9446813
- Rewrite Goals and Non-Goals to focus on capabilities
- Update Proposal to emphasize admin defines topology → users reference
- Clarify diagram annotation: KAI Topology used by KAI Scheduler
- Clarify Ready condition: only downstream topology failures
- Split configuration steps to distinguish CR failure impacts
- Clarify ClusterTopology remains when topology disabled
- Specify invalid constraints removal from PodGang with status update
- Add context for "three levels" in scheduler API
- Fix topology name reference to point to KAI Topology CR
- Add decoupling rationale for topology discovery annotation

Addresses review comments from shayasoolin on PR ai-dynamo#288

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
gflarity
requested changes
Dec 15, 2025
gflarity
left a comment
Contributor
Nothing major, just a bunch of clarifications requested along with some suggestions.
I still need to review from the Security and RBAC section and onward, but I've run out of time today. So I'll submit this review now and finish it off tomorrow.
sanjaychatterjee
left a comment
Collaborator
Looks good. One clarification needed.
shayasoolin
reviewed
Dec 16, 2025
shayasoolin
reviewed
Dec 16, 2025
shayasoolin
reviewed
Dec 16, 2025
shayasoolin
reviewed
Dec 16, 2025
shayasoolin
left a comment
Contributor
Looks good. I added a few minor comments and one item for discussion: the necessity of the topology name annotation for KAI.
Co-authored-by: shayasoolin <128282919+shayasoolin@users.noreply.github.com> Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com>
Co-authored-by: Geoff Flarity <geoff.flarity@gmail.com> Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com>
Co-authored-by: Geoff Flarity <geoff.flarity@gmail.com> Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com>
- Add inline clarification for root domain constraints vs packDomain
- Update Order-Independent Configuration wording per review
- Fix authorization typo (Manged → Managed) and add resource name
- Remove duplicate mutability statement from webhook validation
- Move TopologyDomain Definitions before Characteristics
- Document ClusterTopology CR deletion when topology disabled

Addresses review comments from gflarity and shayasoolin on PR ai-dynamo#288

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…y.md Signed-off-by: Ron Kahn <rkahn@nvidia.com>
gflarity
approved these changes
Dec 16, 2025
shayasoolin
approved these changes
Dec 17, 2025
What type of PR is this?
/kind documentation
/kind feature
What this PR does / why we need it:
This PR adds a comprehensive design document for topology-aware scheduling in the Grove operator. The design introduces a flexible topology system that enables optimal placement of multinode inference workloads based on cluster network topology.
Which issue(s) this PR fixes:
Fixes #
Does this PR introduce an API change?
suggest change