Auto-MNNVL: add ComputeDomain component for PCS controller#363

Merged
shmuel-runai merged 4 commits into ai-dynamo:main from shmuel-runai:RUN-35354/cd-component
Jan 29, 2026

@shmuel-runai

Contributor

/kind feature

ref: GREP-270

Implement ComputeDomain lifecycle management in the PodCliqueSet controller to enable automatic Multi-Node NVLink support for GPU workloads.

  • Add ComputeDomain component with Sync/Delete operations
  • Handle scale-out (create CDs) and scale-in (remove CDs)
  • Use finalizers to protect CDs from accidental deletion
  • Register the component in PCS reconcile order before PodClique
  • Add comprehensive unit tests with fake client support

danbar2 previously approved these changes Jan 22, 2026
@gflarity

gflarity commented Jan 22, 2026

Contributor

I'm taking a look at why these E2E tests are failing. Something already stands out though. Early tests are failing to clean up resources and causing the remaining tests to be skipped. Try rebasing from main, I added a PR to clean up better which should help.

@gflarity

Contributor

From the logs, you can see the controller continuously retrying:

Error deleting managed resources...
FailedTasks: [delete-ComputeDomain]
error: no matches for kind "ComputeDomain" in version "resource.nvidia.com/v1beta1"

Comment thread operator/internal/mnnvl/constants.go Outdated
Comment thread operator/internal/mnnvl/computedomain/computedomain.go Outdated
Comment thread operator/internal/mnnvl/computedomain/computedomain.go
Comment thread operator/internal/mnnvl/computedomain/computedomain.go
Comment thread operator/internal/mnnvl/computedomain/computedomain.go Outdated
Comment thread operator/internal/mnnvl/computedomain/computedomain.go Outdated
@shmuel-runai shmuel-runai force-pushed the RUN-35354/cd-component branch 2 times, most recently from 011abf2 to af54d26 Compare January 28, 2026 09:53
danbar2 previously approved these changes Jan 28, 2026
shayasoolin previously approved these changes Jan 28, 2026
@shmuel-runai shmuel-runai dismissed stale reviews from shayasoolin and danbar2 via 94673eb January 28, 2026 12:25
@shmuel-runai shmuel-runai force-pushed the RUN-35354/cd-component branch from 94673eb to 3c0e5ca Compare January 28, 2026 14:29
danbar2 previously approved these changes Jan 28, 2026
shayasoolin previously approved these changes Jan 28, 2026
@shmuel-runai shmuel-runai dismissed stale reviews from shayasoolin and danbar2 via c1c556b January 28, 2026 18:08
@shmuel-runai shmuel-runai merged commit 2f7524e into ai-dynamo:main Jan 29, 2026
13 of 14 checks passed

4 participants