What happened?
The UpdateProgress.UpdatedPodCliques field on PodCliqueSetStatus (https://github.com/ai-dynamo/grove/blob/main/operator/api/core/v1alpha1/podcliqueset.go#L159-L160) and PodCliqueScalingGroupStatus (https://github.com/ai-dynamo/grove/blob/main/operator/api/core/v1alpha1/scalinggroup.go#L153-L157) is an unbounded []string carrying the fully-qualified name (FQN) of every PodClique (PCLQ) that has been updated to the desired generation hash. The PodCliqueSet (PCS) additionally carries UpdatedPodCliqueScalingGroups []string with the same shape, and the deprecated RollingUpdateProgress.UpdatedPodCliques is kept in sync. As a result, each FQN is persisted twice on the PCS, and a PCLQ owned by a PodCliqueScalingGroup (PCSG) can be recorded up to four times across the cluster (PCS + PCS mirror + PCSG + PCSG mirror).
Cost is driven by the total number of PodCliques in the PCS. Two failure modes appear at the scale Grove is targeting, well before any hard etcd limit:
- Storage / wire cost grows linearly. A realistic FQN is ~50–60 characters (~60 bytes JSON-encoded, ~120 bytes per PCLQ once the deprecated mirror is included):
  - 1 000 PCLQs → ~120 KiB of status payload.
  - 2 000 PCLQs → ~240 KiB; kubectl get pcs -o yaml is visibly slow.
  - 4 000 PCLQs → ~480 KiB; API-server / watch-cache pressure.
  - ~13 000 PCLQs → ~1.5 MiB, exceeding etcd's default per-object limit. Status writes are rejected and the controller cannot persist progress.

  Past ~100 KiB of status payload (reached at fewer than 1 000 PodCliques), serialization overhead and list/watch latency become noticeable. Every status write sends the full object to every watcher (operator caches, dashboards, kubectl watches), because Kubernetes watch events carry the whole object rather than field-level deltas.
- Reconciler CPU is O(N²) per reconcile step.
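The storage arithmetic above can be sanity-checked with a short Go sketch. It is only an illustration: the `status` and `progress` types are hypothetical stand-ins for the real Grove API types, the JSON field names are assumed, and the FQNs are synthetic names chosen to fall in the 50–60 character range quoted above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// progress is a hypothetical stand-in for a status fragment holding the FQN list.
type progress struct {
	UpdatedPodCliques []string `json:"updatedPodCliques"`
}

// status models the current field plus the deprecated mirror kept in sync.
// Field and JSON names are assumptions made for this sketch.
type status struct {
	UpdateProgress        progress `json:"updateProgress"`
	RollingUpdateProgress progress `json:"rollingUpdateProgress"`
}

// syntheticFQNs returns n made-up, fixed-width names of roughly 60 characters.
func syntheticFQNs(n int) []string {
	names := make([]string, n)
	for i := range names {
		names[i] = fmt.Sprintf("llm-serving-pcs-0-prefill-workers-pcsg-%05d-prefill-leader", i)
	}
	return names
}

// payloadBytes JSON-encodes a status carrying n FQNs in both lists
// and returns the encoded size in bytes.
func payloadBytes(n int) (int, error) {
	names := syntheticFQNs(n)
	s := status{
		UpdateProgress:        progress{UpdatedPodCliques: names},
		RollingUpdateProgress: progress{UpdatedPodCliques: names},
	}
	raw, err := json.Marshal(s)
	if err != nil {
		return 0, err
	}
	return len(raw), nil
}

func main() {
	for _, n := range []int{1000, 2000, 4000, 13000} {
		size, err := payloadBytes(n)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%6d PCLQs -> %8d bytes (%.0f KiB)\n", n, size, float64(size)/1024)
	}
}
```

The sketch measures only the two FQN lists; a real PCS object also carries spec, metadata, and managed-fields overhead, so the actual thresholds would likely be reached somewhat earlier.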
What did you expect to happen?
Status fields that summarize update progress should have bounded size and bounded per-reconcile cost, regardless of how many PodCliques the PCS contains.
Concretely, I would expect:
- The PCS and PCSG status to carry UpdatedPodCliquesCount / TotalPodCliquesCount (and the equivalent pair for PCSGs on the PCS) as fixed-width int32 fields, plus a small derived display string (e.g. PodCliquesUpdateProgress = "344/600") for kubectl printer columns, instead of the unbounded []string of FQNs.
- Counts to be derived on each reconcile from live child generation-hash labels (which the controller already loads into the informer cache for computeUpdateProgress), not accumulated. This is idempotent by construction: there is no drift on scale-in, re-queue, status-write retry, or controller restart. It also replaces today's O(N²) cleanup block with an O(N) walk over data already in memory, with zero new API-server reads.
- The deprecated RollingUpdateProgress.UpdatedPodCliques / UpdatedPodCliqueScalingGroups mirrors to be retired rather than kept in sync.
- kubectl get pcs and kubectl get pcsg to surface progress as a single PCLQS-UPDATED 344/600 printer column instead of a list of FQNs that is unreadable at scale.
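As a rough illustration of the expectation above, the counts and the display string could be derived in a single pass over the children already held in memory. This is a minimal sketch, not Grove's implementation: the `child` type, the `generationHashLabel` key, and the helper names are all hypothetical; only the count field names and the "344/600" display format come from this report.

```go
package main

import "fmt"

// generationHashLabel is an assumed label key; the key Grove actually uses may differ.
const generationHashLabel = "example.com/generation-hash"

// child is a hypothetical, trimmed-down view of a PodClique from the informer cache.
type child struct {
	Name   string
	Labels map[string]string
}

// updateCounts mirrors the proposed fixed-width status fields.
type updateCounts struct {
	UpdatedPodCliquesCount int32
	TotalPodCliquesCount   int32
}

// deriveCounts walks the children once (O(N)) and counts those whose
// generation-hash label equals the desired hash. It keeps no state between
// calls, so repeated reconciles over the same input give the same result.
func deriveCounts(children []child, desiredHash string) updateCounts {
	var c updateCounts
	for _, ch := range children {
		c.TotalPodCliquesCount++
		// A missing label never counts as updated, even if desiredHash is empty.
		if hash, ok := ch.Labels[generationHashLabel]; ok && hash == desiredHash {
			c.UpdatedPodCliquesCount++
		}
	}
	return c
}

// progressString renders the value for a printer column, e.g. "344/600".
func progressString(c updateCounts) string {
	return fmt.Sprintf("%d/%d", c.UpdatedPodCliquesCount, c.TotalPodCliquesCount)
}

func main() {
	children := []child{
		{Name: "pclq-a", Labels: map[string]string{generationHashLabel: "hash-b"}},
		{Name: "pclq-b", Labels: map[string]string{generationHashLabel: "hash-a"}},
		{Name: "pclq-c", Labels: map[string]string{generationHashLabel: "hash-b"}},
	}
	fmt.Println(progressString(deriveCounts(children, "hash-b"))) // prints 2/3
}
```

Because the result depends only on the current children and the desired hash, a scale-in or a controller restart simply produces a fresh, correct count on the next reconcile, with nothing to clean up.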
End result: status payload drops from O(N · 60 B) ×2 to a small constant, the reconciler hot path drops from O(N²) to O(N), and the operator works correctly at the fleet sizes Grove is designed for.
issue-updated-podcliques-scale.md
Environment
The bug is design-level and reproducible in any environment that runs a PCS with enough PodCliques (the painful threshold begins below 1 000 PCLQs); the symptoms do not depend on the specific Kubernetes version, scheduler, or hardware. Fill in your local details below for the report.
- Grove version - v0.1.0-alpha.8