feat: add hierarchical resource sharing#507
Conversation
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
…urce-sharing # Conflicts: # operator/internal/controller/podclique/components/pod/pod.go
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
…ove E2E tests coverage Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
…Client infrastructure Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 38 out of 38 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…Replica doc comment Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 38 out of 38 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…Replica doc comment Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Ronkahn21
left a comment
There was a problem hiding this comment.
Review — Hierarchical Resource Sharing
Nice work on this — the hierarchical design is clean, the ResourceSharer interface is a good abstraction, and the 14-step E2E test is thorough. A few things worth looking at:
Should Fix
1. RC name collision risk in naming.go
AllReplicasRCName produces <owner>-all-<tpl> and PerReplicaRCName produces <owner>-<idx>-<tpl>. Since dashes are valid in owner and template names, segment boundaries are ambiguous:
| ownerName | rctName | Result |
|---|---|---|
my-pcs |
all-gpu |
my-pcs-all-all-gpu |
my-pcs-all |
gpu |
my-pcs-all-all-gpu |
Low probability in practice, but worth either documenting as a known limitation or using a unique delimiter.
2. RCName panics on nil replicaIndex with PerReplica scope (reconcile.go:174)
Callers guard this correctly, but RCName is exported. A defensive nil-check or a doc comment stating the precondition would prevent a confusing panic for future callers.
3. Silent skip when FindPCSGConfigByName returns nil (pod.go:213-215)
If a PCLQ has a PCSG label but FindPCSGConfigByName returns nil, PCSG-level resource sharing injection is silently skipped. This indicates a data inconsistency and should at minimum log a warning.
4. Test constants violate project convention
resourceclaim_test.go files in both podclique and podcliqueset define constants like pcsName = "my-pcs" and namespace = "default". Project convention is to use string literals in tests, not constants.
Consider
5. EnsureResourceClaim overwrites all labels on update (reconcile.go:105-109)
rc.Labels = labels replaces the entire label map. If this is intentional (operator-owned resources shouldn't have external labels), a comment would clarify. Otherwise, consider merging.
6. CleanupStalePerReplicaRCs builds a NotIn selector with all valid indices (reconcile.go:284-317)
Works fine at current scale, but at 1000+ replicas the selector gets large. Worth a comment noting the scaling characteristic for future reference.
7. Variable clients shadows the imported package in shared_cluster.go:192 and kai_scheduler.go:101. Quick rename to bundledClients or k8sClients would fix it.
8. No E2E test for full PCS deletion verifying ResourceClaim garbage collection
The test covers scale-in/out thoroughly but doesn't verify that deleting the PCS cascades to all owned RCs. Worth adding as a final step.
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
|
Thanks for the thorough review. Here is the status of each point: Should Fix1. RC name collision risk in 2. 3. Silent skip when 4. Test constants violate project convention Consider5. 6. 7. Variable 8. No E2E test for full PCS deletion Also addressed the stale cache issue raised inline: |
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
|
Looks pretty complete to me. Great job @julienmancuso! |
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
Signed-off-by: Julien Mancuso <jmancuso@nvidia.com>
What type of PR is this?
/kind feature
/kind api
What this PR does / why we need it:
Implements GREP-390: Hierarchical Resource Sharing for Grove PodCliqueSets using Kubernetes Dynamic Resource Allocation (DRA).
This PR introduces a flexible, multi-level resource sharing model that allows
ResourceClaimobjects to be created and shared at the PCS, PCSG, and PCLQ levels with two scoping modes:AllReplicas: A single ResourceClaim shared across all replicas of the owner (named<ownerName>-all-<rctName>).PerReplica: A dedicated ResourceClaim per replica (named<ownerName>-<replicaIndex>-<rctName>).Key changes:
1. API types (
podcliqueset.go):ResourceClaimTemplateConfigfor defining inline ResourceClaimTemplate specs at the PCS level, with aTemplateSpecfield holding theResourceClaimTemplateSpec.ResourceSharingSpecBasewithName,Scope, andNamespacefields — used directly for PCLQ-level sharing (no filter needed).PCSResourceSharingSpecembedsResourceSharingSpecBaseand adds an optionalPCSResourceSharingFilterwithChildCliqueNamesandChildScalingGroupNamesfor targeting specific children.PCSGResourceSharingSpecembedsResourceSharingSpecBaseand adds an optionalPCSGResourceSharingFilterwith onlyChildCliqueNames(PCSGs cannot target sibling scaling groups).childScalingGroupNames— eliminating the need for webhook checks on these constraints.ResourceSharingfield added at PCS ([]PCSResourceSharingSpec), PCSG ([]PCSGResourceSharingSpec), and PCLQ ([]ResourceSharingSpecBase) levels.2. ResourceClaim lifecycle (
internal/resourceclaim/):naming.go: Deterministic naming viaAllReplicasRCNameandPerReplicaRCName.resolve.go: ResolvesResourceSharingSpecBaseto aResourceClaimTemplateSpec(internal or external). Includes a defensive namespace check: whenbase.Namespaceis set, internal resolution is skipped to ensure external resolution is used even if webhook validation is bypassed.reconcile.go:ResourceSharerinterface withGetBase()andFilterMatches()methods, plusResourceSharersFromPCS,ResourceSharersFromPCSG,ResourceSharersFromPCLQadapter functions.EnsureResourceClaimuses a Get/Create flow — new RCs are created with the full spec, while existing RCs only receive metadata updates (labels, ownerReference) sinceResourceClaim.specis immutable in Kubernetes.InjectResourceClaimRefsinjects both pod-levelspec.resourceClaimsand container-levelresources.claims;DeletePerReplicaRCshandles explicit per-name deletion;CleanupStalePerReplicaRCshandles scale-in cleanup using a unified label selector with server-sideDeleteCollection.injectAllResourceClaimRefsreturns errors instead of silently swallowing them, with failures propagated throughbuildResource.3. Ownership model:
Each level owns its own ResourceClaims via
SetControllerReference:This aligns RC lifecycle with the owning resource — deleting a PCLQ, PCSG, or PCS automatically garbage-collects its RCs. Parent controllers (PCS, PCSG) inject claim references into child PodSpecs using filter matching.
4. Scale-in cleanup:
Stale PerReplica RCs are cleaned up using label-based selectors scoped to the owning resource. Each RC is stamped with labels identifying its owner and replica index. The cleanup builds a unified
labels.Selectorcombining equality, Exists, and NotIn requirements to target only stale replicas without affecting other owners.5. Validation webhooks:
resourceSharinglist.childCliqueNames/childScalingGroupNamesin filters must reference actual children defined in the PCS spec.+kubebuilder:validation:Enum).resourceClaimTemplatesandresourceSharingare validated as immutable on update at PCS, PCSG, and PCLQ levels — consistent with the guarantees documented in the README.childScalingGroupNamesat the PCSG level andFilterat the PCLQ level are prevented by the type system (those types simply don't have those fields), so no webhook checks are needed for them.6. E2E test (
e2e/tests/resource_sharing_test.go):14-step hierarchical resource sharing test (
Test_RS1_HierarchicalResourceSharing) covering all filter paths:childCliqueNames(int-tpl targets worker-a only)childScalingGroupNames(int-tpl7 targets sga only)childCliqueNames(int-tpl8 targets worker-b only)rs-sharednamespace, referenced from PCLQ worker-a)Steps:
rs-shared/ext-ns-tpl)resourceSharingscope on a deployed PCS, assert webhook rejects with "field is immutable"Added
resource_sharingentry to the CI E2E test matrix in.github/workflows/build-check-test.yaml7. RBAC: Added
resourceclaimsandresourceclaimtemplatespermissions to the operator ClusterRole.Which issue(s) this PR fixes:
Fixes #390
Special notes for your reviewer:
allreplaces the replica index forAllReplicasscope to avoid naming collisions withPerReplicaclaims.[a-z0-9-], there is no reserved delimiter character. Theallkeyword in<pcs>-all-<tpl>could theoretically collide if a PCS name contains-all-and another PCS+template combination produces the same concatenated string. In practice this requires two PCS instances in the same namespace with overlapping name/template segments, which is extremely unlikely. Random suffixes were considered but rejected because deterministic naming is essential for idempotent reconciliation (Get-by-name), pure in-memory pod injection, and stale RC cleanup. Documented as a known limitation innaming.go.InjectResourceClaimRefsinjects bothpod.spec.resourceClaims(pod-level) andcontainers[].resources.claims(container-level) for all containers and init containers, which is required by DRA for devices to actually be accessible.PCSResourceSharingSpec,PCSGResourceSharingSpec,ResourceSharingSpecBase) to enforce valid filter combinations at the type level rather than relying solely on webhook validation.EnsureResourceClaimuses a Get/Create pattern: new RCs are created with the full spec; existing RCs only get metadata updates (labels, ownerReference). This respects the Kubernetes immutability constraint onResourceClaim.spec.resolveInternalRefincludes a defensive namespace check — whenbase.Namespaceis non-empty, internal template resolution is skipped entirely, ensuring external resolution even if webhook validation is bypassed.Does this PR introduce an API change?
Added hierarchical resource sharing support (GREP-390) for PodCliqueSets using Kubernetes DRA. New API fields:
resourceClaimTemplatesandresourceSharingon PodCliqueSetTemplateSpec,resourceSharingon PodCliqueScalingGroupConfig and PodCliqueTemplateSpec. ResourceClaims can be scoped asAllReplicas(shared) orPerReplica(dedicated). Level-specific types enforce valid API states:PCSResourceSharingSpecsupports filtering bychildCliqueNamesandchildScalingGroupNames;PCSGResourceSharingSpecsupports filtering bychildCliqueNamesonly;ResourceSharingSpecBase(used at PCLQ level) has no filter field. TheresourceClaimTemplatesandresourceSharingfields are immutable on update at all levels.