New TAS Design #288

Merged
Ronkahn21 merged 12 commits into ai-dynamo:main from Ronkahn21:new-topology-design
Dec 17, 2025

Conversation

@Ronkahn21

Contributor

What type of PR is this?

/kind documentation
/kind feature

What this PR does / why we need it:

This PR adds a comprehensive design document for topology-aware scheduling in the Grove operator. The design introduces a flexible topology system that enables optimal placement of multinode inference workloads based on cluster network topology.

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce an API change?

suggest change

@shayasoolin shayasoolin left a comment

Contributor

Looks good!

Comment threads on docs/designs/topology.md: 15 (10 outdated)

…tion and lifecycle management

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…and flexible ordering

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…and flexible ordering

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…and flexible ordering

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…uration changes

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…ype and terminology

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
- Rewrite Goals and Non-Goals to focus on capabilities
- Update Proposal to emphasize admin defines topology → users reference
- Clarify diagram annotation: KAI Topology used by KAI Scheduler
- Clarify Ready condition: only downstream topology failures
- Split configuration steps to distinguish CR failure impacts
- Clarify ClusterTopology remains when topology disabled
- Specify invalid constraints removal from PodGang with status update
- Add context for "three levels" in scheduler API
- Fix topology name reference to point to KAI Topology CR
- Add decoupling rationale for topology discovery annotation

Addresses review comments from shayasoolin on PR ai-dynamo#288

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

@gflarity gflarity left a comment

Contributor

Nothing major, just a bunch of clarifications requested along with some suggestions.

I still need to review the Security and RBAC section onward, but I've run out of time today, so I'll submit this review now and finish it off tomorrow.

Comment threads on docs/designs/topology.md: 10 (6 outdated)

@sanjaychatterjee sanjaychatterjee left a comment

Collaborator

Looks good. One clarification needed.

Comment threads on docs/designs/topology.md: 4 (2 outdated)

@shayasoolin shayasoolin left a comment

Contributor

Looks good. I added a few minor comments and one item for discussion: the necessity of the topology name annotation for KAI.

Ronkahn21 and others added 3 commits December 16, 2025 13:32
Co-authored-by: shayasoolin <128282919+shayasoolin@users.noreply.github.com>
Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com>
Co-authored-by: Geoff Flarity <geoff.flarity@gmail.com>
Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com>
Co-authored-by: Geoff Flarity <geoff.flarity@gmail.com>
Signed-off-by: Ron Kahn <122778260+Ronkahn21@users.noreply.github.com>
@ai-dynamo ai-dynamo deleted a comment from sanjaychatterjee Dec 16, 2025
- Add inline clarification for root domain constraints vs packDomain
- Update Order-Independent Configuration wording per review
- Fix authorization typo (Manged → Managed) and add resource name
- Remove duplicate mutability statement from webhook validation
- Move TopologyDomain Definitions before Characteristics
- Document ClusterTopology CR deletion when topology disabled

Addresses review comments from gflarity and shayasoolin on PR ai-dynamo#288

Signed-off-by: Ron Kahn <rkahn@nvidia.com>
…y.md

Signed-off-by: Ron Kahn <rkahn@nvidia.com>

@sanjaychatterjee sanjaychatterjee left a comment

Collaborator

LGTM. Thanks!

@Ronkahn21 Ronkahn21 merged commit 4c3ac6a into ai-dynamo:main Dec 17, 2025
4 checks passed
5 participants