Skip to content

feat: add hierarchical resource sharing#507

Merged
julienmancuso merged 32 commits into
ai-dynamo:mainfrom
julienmancuso:jsm/hierarchical-resource-sharing
Apr 16, 2026
Merged

feat: add hierarchical resource sharing#507
julienmancuso merged 32 commits into
ai-dynamo:mainfrom
julienmancuso:jsm/hierarchical-resource-sharing

Conversation

@julienmancuso

@julienmancuso julienmancuso commented Apr 1, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind feature
/kind api

What this PR does / why we need it:

Implements GREP-390: Hierarchical Resource Sharing for Grove PodCliqueSets using Kubernetes Dynamic Resource Allocation (DRA).

This PR introduces a flexible, multi-level resource sharing model that allows ResourceClaim objects to be created and shared at the PCS, PCSG, and PCLQ levels with two scoping modes:

  • AllReplicas: A single ResourceClaim shared across all replicas of the owner (named <ownerName>-all-<rctName>).
  • PerReplica: A dedicated ResourceClaim per replica (named <ownerName>-<replicaIndex>-<rctName>).

Key changes:

1. API types (podcliqueset.go):

  • ResourceClaimTemplateConfig for defining inline ResourceClaimTemplate specs at the PCS level, with a TemplateSpec field holding the ResourceClaimTemplateSpec.
  • ResourceSharingSpecBase with Name, Scope, and Namespace fields — used directly for PCLQ-level sharing (no filter needed).
  • PCSResourceSharingSpec embeds ResourceSharingSpecBase and adds an optional PCSResourceSharingFilter with ChildCliqueNames and ChildScalingGroupNames for targeting specific children.
  • PCSGResourceSharingSpec embeds ResourceSharingSpecBase and adds an optional PCSGResourceSharingFilter with only ChildCliqueNames (PCSGs cannot target sibling scaling groups).
  • This level-specific type design enforces valid states at the API level — PCLQ entries have no filter field, and PCSG entries cannot specify childScalingGroupNames — eliminating the need for webhook checks on these constraints.
  • ResourceSharing field added at PCS ([]PCSResourceSharingSpec), PCSG ([]PCSGResourceSharingSpec), and PCLQ ([]ResourceSharingSpecBase) levels.

2. ResourceClaim lifecycle (internal/resourceclaim/):

  • naming.go: Deterministic naming via AllReplicasRCName and PerReplicaRCName.
  • resolve.go: Resolves ResourceSharingSpecBase to a ResourceClaimTemplateSpec (internal or external). Includes a defensive namespace check: when base.Namespace is set, internal resolution is skipped to ensure external resolution is used even if webhook validation is bypassed.
  • reconcile.go: ResourceSharer interface with GetBase() and FilterMatches() methods, plus ResourceSharersFromPCS, ResourceSharersFromPCSG, ResourceSharersFromPCLQ adapter functions. EnsureResourceClaim uses a Get/Create flow — new RCs are created with the full spec, while existing RCs only receive metadata updates (labels, ownerReference) since ResourceClaim.spec is immutable in Kubernetes. InjectResourceClaimRefs injects both pod-level spec.resourceClaims and container-level resources.claims; DeletePerReplicaRCs handles explicit per-name deletion; CleanupStalePerReplicaRCs handles scale-in cleanup using a unified label selector with server-side DeleteCollection.
  • injectAllResourceClaimRefs returns errors instead of silently swallowing them, with failures propagated through buildResource.

3. Ownership model:

Each level owns its own ResourceClaims via SetControllerReference:

  • PCS controller: Creates and owns PCS-level AllReplicas and PerReplica ResourceClaims.
  • PCSG controller: Creates and owns PCSG-level ResourceClaims.
  • PCLQ controller: Creates and owns PCLQ-level AllReplicas and PerReplica ResourceClaims.

This aligns RC lifecycle with the owning resource — deleting a PCLQ, PCSG, or PCS automatically garbage-collects its RCs. Parent controllers (PCS, PCSG) inject claim references into child PodSpecs using filter matching.

4. Scale-in cleanup:

Stale PerReplica RCs are cleaned up using label-based selectors scoped to the owning resource. Each RC is stamped with labels identifying its owner and replica index. The cleanup builds a unified labels.Selector combining equality, Exists, and NotIn requirements to target only stale replicas without affecting other owners.

5. Validation webhooks:

  • No duplicate RCT names within a single resourceSharing list.
  • Filter must specify at least one name.
  • childCliqueNames/childScalingGroupNames in filters must reference actual children defined in the PCS spec.
  • Scope validation: rejects invalid scope values as defense-in-depth (in addition to CRD-level +kubebuilder:validation:Enum).
  • Immutability enforcement: resourceClaimTemplates and resourceSharing are validated as immutable on update at PCS, PCSG, and PCLQ levels — consistent with the guarantees documented in the README.
  • Note: childScalingGroupNames at the PCSG level and Filter at the PCLQ level are prevented by the type system (those types simply don't have those fields), so no webhook checks are needed for them.

6. E2E test (e2e/tests/resource_sharing_test.go):

  • 14-step hierarchical resource sharing test (Test_RS1_HierarchicalResourceSharing) covering all filter paths:

    • PCS → childCliqueNames (int-tpl targets worker-a only)
    • PCS → childScalingGroupNames (int-tpl7 targets sga only)
    • PCSG → childCliqueNames (int-tpl8 targets worker-b only)
    • Cross-namespace external RCT (ext-ns-tpl in rs-shared namespace, referenced from PCLQ worker-a)

    Steps:

    1. Create cross-namespace ResourceClaimTemplate (rs-shared/ext-ns-tpl)
    2. Deploy workload with PCS=1, PCSG sga=2 replicas, standalone PCLQ worker-a=1 replica
    3. Verify initial 12 ResourceClaims with correct names, labels, and pod-to-RC references
    4. Verify ownerReferences correctness (PCS→PodCliqueSet, PCSG→PodCliqueScalingGroup, PCLQ→PodClique)
    5. PCSG scale-in (2→1): verify 2 RCs cleaned up (10 remaining)
    6. PCSG scale-out (1→3): verify 4 new RCs created (14 total)
    7. PCLQ scale-out (1→2): verify 1 new PerReplica RC created (15 total), verify per-pod RC wiring (common + unique PerReplica refs)
    8. PCLQ scale-in (2→1): verify 1 PerReplica RC cleaned up (14 remaining)
    9. PCS scale-out (1→2): verify 10 new RCs created (24 total, asymmetric state: rep 0 sga=3, rep 1 sga=2 from template)
    10. PCS scale-in (2→1): verify cleanup back to post-PCSG-scale-out state (14 remaining)
    11. Immutability rejection: attempt to modify resourceSharing scope on a deployed PCS, assert webhook rejects with "field is immutable"
  • Added resource_sharing entry to the CI E2E test matrix in .github/workflows/build-check-test.yaml

7. RBAC: Added resourceclaims and resourceclaimtemplates permissions to the operator ClusterRole.

Which issue(s) this PR fixes:

Fixes #390

Special notes for your reviewer:

  • The ResourceClaim naming convention uses the RCT name (not an index) for readability and determinism. all replaces the replica index for AllReplicas scope to avoid naming collisions with PerReplica claims.
    • Theoretical RC name collision: Because Kubernetes names only allow [a-z0-9-], there is no reserved delimiter character. The all keyword in <pcs>-all-<tpl> could theoretically collide if a PCS name contains -all- and another PCS+template combination produces the same concatenated string. In practice this requires two PCS instances in the same namespace with overlapping name/template segments, which is extremely unlikely. Random suffixes were considered but rejected because deterministic naming is essential for idempotent reconciliation (Get-by-name), pure in-memory pod injection, and stale RC cleanup. Documented as a known limitation in naming.go.
  • InjectResourceClaimRefs injects both pod.spec.resourceClaims (pod-level) and containers[].resources.claims (container-level) for all containers and init containers, which is required by DRA for devices to actually be accessible.
  • The API uses level-specific types (PCSResourceSharingSpec, PCSGResourceSharingSpec, ResourceSharingSpecBase) to enforce valid filter combinations at the type level rather than relying solely on webhook validation.
  • EnsureResourceClaim uses a Get/Create pattern: new RCs are created with the full spec; existing RCs only get metadata updates (labels, ownerReference). This respects the Kubernetes immutability constraint on ResourceClaim.spec.
  • resolveInternalRef includes a defensive namespace check — when base.Namespace is non-empty, internal template resolution is skipped entirely, ensuring external resolution even if webhook validation is bypassed.
  • The E2E test exercises all scaling operations at every level (PCSG, standalone PCLQ, PCS) and verifies RC creation/cleanup, pod-to-RC references (including per-pod PerReplica wiring after PCLQ scale-out), ownerReferences, cross-namespace RCT resolution, and immutability webhook rejection — covering all three filter paths (PCS→clique, PCS→scalingGroup, PCSG→clique).

Does this PR introduce an API change?

Added hierarchical resource sharing support (GREP-390) for PodCliqueSets using Kubernetes DRA. New API fields: resourceClaimTemplates and resourceSharing on PodCliqueSetTemplateSpec, resourceSharing on PodCliqueScalingGroupConfig and PodCliqueTemplateSpec. ResourceClaims can be scoped as AllReplicas (shared) or PerReplica (dedicated). Level-specific types enforce valid API states: PCSResourceSharingSpec supports filtering by childCliqueNames and childScalingGroupNames; PCSGResourceSharingSpec supports filtering by childCliqueNames only; ResourceSharingSpecBase (used at PCLQ level) has no filter field. The resourceClaimTemplates and resourceSharing fields are immutable on update at all levels.

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
…urce-sharing

# Conflicts:
#	operator/internal/controller/podclique/components/pod/pod.go
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@julienmancuso julienmancuso marked this pull request as ready for review April 4, 2026 03:23
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Comment thread operator/e2e/yaml/workload-resource-sharing.yaml
Comment thread operator/internal/resourceclaim/reconcile.go Outdated
Comment thread operator/e2e/yaml/workload-resource-sharing.yaml Outdated
Comment thread operator/e2e/yaml/workload-resource-sharing.yaml
Comment thread operator/internal/webhook/admission/pcs/validation/podcliqueset.go Outdated
Comment thread operator/internal/webhook/admission/pcs/validation/podcliqueset.go Outdated
Comment thread operator/internal/webhook/admission/pcs/validation/podcliqueset.go
…ove E2E tests coverage

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
…Client infrastructure

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 38 out of 38 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread operator/internal/controller/podclique/components/pod/pod.go Outdated
Comment thread operator/api/core/v1alpha1/podcliqueset.go
…Replica doc comment

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 38 out of 38 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…Replica doc comment

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>

@Ronkahn21 Ronkahn21 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — Hierarchical Resource Sharing

Nice work on this — the hierarchical design is clean, the ResourceSharer interface is a good abstraction, and the 14-step E2E test is thorough. A few things worth looking at:

Should Fix

1. RC name collision risk in naming.go

AllReplicasRCName produces <owner>-all-<tpl> and PerReplicaRCName produces <owner>-<idx>-<tpl>. Since dashes are valid in owner and template names, segment boundaries are ambiguous:

ownerName rctName Result
my-pcs all-gpu my-pcs-all-all-gpu
my-pcs-all gpu my-pcs-all-all-gpu

Low probability in practice, but worth either documenting as a known limitation or using a unique delimiter.

2. RCName panics on nil replicaIndex with PerReplica scope (reconcile.go:174)

Callers guard this correctly, but RCName is exported. A defensive nil-check or a doc comment stating the precondition would prevent a confusing panic for future callers.

3. Silent skip when FindPCSGConfigByName returns nil (pod.go:213-215)

If a PCLQ has a PCSG label but FindPCSGConfigByName returns nil, PCSG-level resource sharing injection is silently skipped. This indicates a data inconsistency and should at minimum log a warning.

4. Test constants violate project convention

resourceclaim_test.go files in both podclique and podcliqueset define constants like pcsName = "my-pcs" and namespace = "default". Project convention is to use string literals in tests, not constants.

Consider

5. EnsureResourceClaim overwrites all labels on update (reconcile.go:105-109)

rc.Labels = labels replaces the entire label map. If this is intentional (operator-owned resources shouldn't have external labels), a comment would clarify. Otherwise, consider merging.

6. CleanupStalePerReplicaRCs builds a NotIn selector with all valid indices (reconcile.go:284-317)

Works fine at current scale, but at 1000+ replicas the selector gets large. Worth a comment noting the scaling characteristic for future reference.

7. Variable clients shadows the imported package in shared_cluster.go:192 and kai_scheduler.go:101. Quick rename to bundledClients or k8sClients would fix it.

8. No E2E test for full PCS deletion verifying ResourceClaim garbage collection

The test covers scale-in/out thoroughly but doesn't verify that deleting the PCS cascades to all owned RCs. Worth adding as a final step.

@Ronkahn21 Ronkahn21 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2/n review

Comment thread operator/internal/resourceclaim/reconcile.go
Comment thread operator/internal/resourceclaim/reconcile.go Outdated
Comment thread operator/internal/resourceclaim/reconcile.go
Comment thread operator/internal/resourceclaim/reconcile.go Outdated
Comment thread operator/internal/resourceclaim/reconcile.go
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@julienmancuso

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review. Here is the status of each point:

Should Fix

1. RC name collision risk in naming.go
Documented as a known limitation with a comment in naming.go explaining the theoretical collision scenario and why it is extremely unlikely in practice.

2. RCName panics on nil replicaIndex with PerReplica scope
Added a doc comment on RCName stating the precondition that callers must ensure replicaIndex is non-nil for PerReplica scope.

3. Silent skip when FindPCSGConfigByName returns nil
Now returns an error when a PCLQ has a PCSG label but FindPCSGConfigByName returns nil, surfacing the data inconsistency instead of silently skipping PCSG-level injection.

4. Test constants violate project convention
The other E2E test files (cert_management_test.go, scale_test.go, auto-mnnvl/testutils.go) all use named constants in the same pattern. Keeping as-is for consistency with the existing project convention.

Consider

5. EnsureResourceClaim overwrites all labels on update
Changed to rc.Labels = lo.Assign(rc.Labels, labels) to merge rather than overwrite, matching the annotation pattern in PR #500.

6. CleanupStalePerReplicaRCs NotIn selector scaling
Added a comment noting the NotIn requirement grows linearly with replica count and is acceptable at current scale.

7. Variable clients shadows imported package
These files (shared_cluster.go, kai_scheduler.go) are from main's PR #502 refactoring, not part of this PR's changes. Deferring to a separate PR.

8. No E2E test for full PCS deletion
Added step 16 to the E2E test: deletes the PCS and polls until all ResourceClaims are garbage-collected (count drops to 0).

Also addressed the stale cache issue raised inline: EnsureResourceClaim now catches apierrors.IsAlreadyExists on Create and falls through to the update path, matching the pattern in cert.go and podclique.go.

@julienmancuso julienmancuso requested a review from Ronkahn21 April 14, 2026 14:41
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
shayasoolin
shayasoolin previously approved these changes Apr 16, 2026

@shayasoolin shayasoolin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sanjaychatterjee

Copy link
Copy Markdown
Collaborator

Looks pretty complete to me. Great job @julienmancuso!

Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
@julienmancuso julienmancuso merged commit 611224c into ai-dynamo:main Apr 16, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support GPU Sharing Between Pods in a PodCliqueScalingGroup

8 participants