Status field UpdatedPodCliques []string in PCS/PCSG does not scale: unbounded status payload at thousands of PodCliques and O(N²) reconciler CPU

### What happened?

                                                                                                                                                                                                                                                                                            
  The UpdateProgress.UpdatedPodCliques field on PodCliqueSetStatus (https://github.com/ai-dynamo/grove/blob/main/operator/api/core/v1alpha1/podcliqueset.go#L159-L160) and PodCliqueScalingGroupStatus (https://github.com/ai-dynamo/grove/blob/main/operator/api/core/v1alpha1/scalinggroup.go#L153-L157) is an unbounded []string carrying the fully-qualified name (FQN) of every PodClique that has been updated to the desired generation hash. The PCS additionally carries UpdatedPodCliqueScalingGroups []string with the same shape, and the deprecated RollingUpdateProgress.UpdatedPodCliques is kept in sync. Each FQN is therefore persisted twice on the PCS, and a PCSG-owned PCLQ can be recorded up to four times across the cluster (PCS, PCS mirror, PCSG, PCSG mirror).

  Cost is driven by the total number of PodCliques in the PCS. Two failure modes appear at the scale Grove is targeting, well before any hard etcd limit:                                                                                                                                   
   
  1. Storage / wire cost grows linearly. A realistic FQN is ~50–60 characters (~60 bytes JSON-encoded, ~120 bytes per PCLQ once the deprecated mirror is included):

  - 1 000 PCLQs → ~120 KiB of status payload.
  - 2 000 PCLQs → ~240 KiB; kubectl get pcs -o yaml is visibly slow.
  - 4 000 PCLQs → ~480 KiB; API-server and watch-cache pressure.
  - ~13 000 PCLQs → ~1.5 MiB, the point at which the object exceeds etcd's default request size limit (1.5 MiB). Status writes are rejected and the controller cannot persist progress.
                                                                                                                                                                                                                                                                                            
  Past ~100 KiB of status payload (reached at fewer than 1 000 PodCliques), serialization overhead and list/watch latency become noticeable. Every status write sends the full object to every watcher (operator caches, dashboards, kubectl watches), because a Kubernetes watch event carries the whole object rather than a field-level diff.
                                                                                                                                                                                                                                                                                            
  2. Reconciler CPU is O(N²) per reconcile step.
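
The linear growth described in item 1 can be checked with a rough measurement. The sketch below is not Grove code: the struct, its JSON keys, and the 58-character synthetic names are assumptions chosen only to approximate "~60 bytes per encoded FQN, stored twice". It marshals a status-like object and prints its size for the PodClique counts listed above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// statusSketch is a hypothetical stand-in for the status described above:
// one list of updated names plus a deprecated mirror holding the same names.
// The JSON keys are illustrative, not Grove's actual field names.
type statusSketch struct {
	UpdatedPodCliques []string `json:"updatedPodCliques"`
	DeprecatedMirror  []string `json:"deprecatedMirror"`
}

// syntheticFQN builds a made-up name that is exactly nameLen characters long,
// padding or truncating a short numbered prefix as needed.
func syntheticFQN(i, nameLen int) string {
	base := fmt.Sprintf("pclq-%d-", i)
	if len(base) >= nameLen {
		return base[:nameLen]
	}
	return base + strings.Repeat("x", nameLen-len(base))
}

// encodedStatusBytes returns the JSON size of a status that lists n names twice.
func encodedStatusBytes(n, nameLen int) int {
	names := make([]string, n)
	for i := range names {
		names[i] = syntheticFQN(i, nameLen)
	}
	raw, err := json.Marshal(statusSketch{UpdatedPodCliques: names, DeprecatedMirror: names})
	if err != nil {
		panic(err) // not expected for plain string slices
	}
	return len(raw)
}

func main() {
	for _, n := range []int{1000, 2000, 4000, 13000} {
		b := encodedStatusBytes(n, 58)
		fmt.Printf("%6d PCLQs -> %8d bytes (%.0f KiB)\n", n, b, float64(b)/1024)
	}
}
```

The measured sizes cover only the two name lists; a real object also carries metadata, spec, and other status fields, so the actual payload would be somewhat larger.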

### What did you expect to happen?

   
  Status fields that summarize update progress should have bounded size and bounded per-reconcile cost, regardless of how many PodCliques the PCS contains.                                                                                                                                 
                  
  Concretely, I would expect:                                                                                                                                                                                                                                                               
                  
  - The PCS and PCSG status to carry UpdatedPodCliquesCount / TotalPodCliquesCount (and the equivalent pair for PCSGs on the PCS) as fixed-width int32 fields, plus a small derived display string (e.g. PodCliquesUpdateProgress = "344/600") for kubectl printer columns, instead of the unbounded []string of FQNs.
  - Counts to be derived on each reconcile from live child generation-hash labels (which the controller already loads into the informer cache for computeUpdateProgress), not accumulated. This is idempotent by construction (no drift on scale-in, re-queue, status-write retry, or controller restart) and replaces today's O(N²) cleanup block with an O(N) walk over data already in memory, with zero new API-server reads.
  - The deprecated RollingUpdateProgress.UpdatedPodCliques / UpdatedPodCliqueScalingGroups mirrors to be retired rather than kept in sync.
  - kubectl get pcs and kubectl get pcsg to surface progress as a single PCLQS-UPDATED printer column (e.g. 344/600) instead of a list of FQNs that is unreadable at scale.
                                                                                                                                                                                                                                                                                            
  End result: status payload drops from roughly 2 × N × 60 B to a small constant, the reconciler hot path drops from O(N²) to O(N), and the operator works correctly at the fleet sizes Grove is designed for.
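
A minimal sketch of the count-based approach proposed above, under stated assumptions: the label key, the child type, and the function name are hypothetical stand-ins (the real label key and cached object types are not given in this issue), and only the three status field names come from the proposal. It shows one O(N) pass over already-loaded children that recomputes the counts from scratch, which is why repeated calls cannot drift.

```go
package main

import "fmt"

// hashLabelKey is a hypothetical label key used only for this illustration.
const hashLabelKey = "example.com/generation-hash"

// child is a minimal stand-in for a cached PodClique: only its labels matter here.
type child struct {
	Name   string
	Labels map[string]string
}

// updateCounts mirrors the bounded status shape proposed above.
type updateCounts struct {
	UpdatedPodCliquesCount   int32
	TotalPodCliquesCount     int32
	PodCliquesUpdateProgress string // display string such as "344/600"
}

// deriveUpdateCounts walks the children once and counts those whose
// generation-hash label equals desiredHash. Nothing is accumulated between
// calls, so the result depends only on the current inputs.
func deriveUpdateCounts(children []child, desiredHash string) updateCounts {
	var updated int32
	for _, c := range children {
		// A child without the label is never counted as updated.
		if h, ok := c.Labels[hashLabelKey]; ok && h == desiredHash {
			updated++
		}
	}
	total := int32(len(children))
	return updateCounts{
		UpdatedPodCliquesCount:   updated,
		TotalPodCliquesCount:     total,
		PodCliquesUpdateProgress: fmt.Sprintf("%d/%d", updated, total),
	}
}

func main() {
	children := []child{
		{Name: "pclq-a", Labels: map[string]string{hashLabelKey: "new"}},
		{Name: "pclq-b", Labels: map[string]string{hashLabelKey: "old"}},
		{Name: "pclq-c", Labels: map[string]string{hashLabelKey: "new"}},
	}
	fmt.Println(deriveUpdateCounts(children, "new").PodCliquesUpdateProgress)
}
```

Because the status holds two int32 values and one short string, its size stays constant regardless of how many children are walked.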
            
[issue-updated-podcliques-scale.md](https://github.com/user-attachments/files/27234806/issue-updated-podcliques-scale.md)

### Environment

The bug is design-level and reproducible in any environment that runs a PCS with enough PodCliques (the painful threshold begins below 1 000 PCLQs); the specific Kubernetes version, scheduler, and hardware do not gate the symptoms.

- Grove version: v0.1.0-alpha.8

