Skip to content

bound UpdateProgress status payload#576

Merged
oleg-kushniriov merged 7 commits into
ai-dynamo:mainfrom
oleg-kushniriov:fix/567-bounded-update-progress
May 11, 2026
Merged

bound UpdateProgress status payload#576
oleg-kushniriov merged 7 commits into
ai-dynamo:mainfrom
oleg-kushniriov:fix/567-bounded-update-progress

Conversation

@oleg-kushniriov

Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind bug
/kind api

What this PR does / why we need it:

The UpdateProgress.UpdatedPodCliques and UpdateProgress.UpdatedPodCliqueScalingGroups fields on PodCliqueSet/PodCliqueScalingGroup status were unbounded []string slices that grew linearly with the total number of PodCliques/PCSGs in the cluster. At the scale Grove
targets, this produced two failure modes well before the etcd 1.5 MiB per-object limit:

  1. Status payload growth. Each reconcile broadcast the full FQN list to every watcher (operator caches, dashboards, kubectl watch). At ~1 000 PodCliques the payload was ~120 KiB; at ~13 000 it would exceed etcd's per-object limit. The deprecated RollingUpdateProgress mirror
    doubled the cost.
  2. Reconciler O(N²) hot path. The cleanup block in rollingupdate.go did lo.Uniq(append(...)) followed by slices.DeleteFunc over the expected-FQN list every reconcile. A second O(M·E) pattern in reconcileStatus re-flattened lo.Values(map) per filtered element.

This PR replaces the unbounded slices with bounded int32 count + total fields:

  • PodCliqueSetUpdateProgress: UpdatedPodCliquesCount, TotalPodCliquesCount, UpdatedPodCliqueScalingGroupsCount, TotalPodCliqueScalingGroupsCount.
  • PodCliqueScalingGroupUpdateProgress: UpdatedPodCliquesCount, TotalPodCliquesCount.

Counts are recomputed each reconcile from child CurrentPodCliqueSetGenerationHash labels already in the informer cache — idempotent by construction (no drift on scale-in, requeue, status-write retry, or controller restart) and O(N) instead of O(N²). The stray-resource filter is
hoisted to a map[string]struct{} lookup with slices.DeleteFunc, making it O(M+E) instead of O(M·E).

The deprecated RollingUpdateProgress mirror is preserved with the new bounded shape so existing consumers keep working at the field-presence level. Slice sub-fields are gone from the mirror as well.

Adds four kubectl get pcs printer columns and two kubectl get pcsg printer columns surfacing the new counts.
2. Reconciler O(N²) hot path. The cleanup block in rollingupdate.go did lo.Uniq(append(...)) followed by slices.DeleteFunc over the expected-FQN list every reconcile. A second O(M·E) pattern in reconcileStatus re-flattened lo.Values(map) per filtered element.

This PR replaces the unbounded slices with bounded int32 count + total fields:

  • PodCliqueSetUpdateProgress: UpdatedPodCliquesCount, TotalPodCliquesCount, UpdatedPodCliqueScalingGroupsCount, TotalPodCliqueScalingGroupsCount.
  • PodCliqueScalingGroupUpdateProgress: UpdatedPodCliquesCount, TotalPodCliquesCount.

Counts are recomputed each reconcile from child CurrentPodCliqueSetGenerationHash labels already in the informer cache — idempotent by construction (no drift on scale-in, requeue, status-write retry, or controller restart) and O(N) instead of O(N²). The stray-resource filter is
hoisted to a map[string]struct{} lookup with slices.DeleteFunc, making it O(M+E) instead of O(M·E).

The deprecated RollingUpdateProgress mirror is preserved with the new bounded shape so existing consumers keep working at the field-presence level. Slice sub-fields are gone from the mirror as well.

Adds four kubectl get pcs printer columns and two kubectl get pcsg printer columns surfacing the new counts.

Which issue(s) this PR fixes:

Fixes #567

Special notes for your reviewer:

  • Breaking change to status fields. Any consumer that read pcs.Status.UpdateProgress.UpdatedPodCliques (or the deprecated mirror's slice copy) will now find the field absent. The replacement is UpdatedPodCliquesCount / TotalPodCliquesCount. Justified at v0.1.0-alpha.8;
    called out in release notes.
  • Semantic preserved from the original. PCS-level counts cover standalone PCLQs only (matching the old slice's semantics — pri.pclqs was always loaded with KindPodCliqueSet owner). PCSG-owned PCLQs are tracked on their owning PCSG via
    PodCliqueScalingGroupUpdateProgress.UpdatedPodCliquesCount. We discussed during review whether to aggregate at the PCS level; current implementation keeps existing semantics.
  • Mirror change. RollingUpdateProgress is not removed (already deprecated; deprecation contract honored). Its inner slice fields are gone but timestamps + CurrentlyUpdating + the new counts are mirrored. Consumers iterating the old slices will see nil/empty and should
    migrate.
  • Tests. New unit tests cover computeAvailableAndUpdatedReplicas count outputs, mutateReplicas write paths (PCS + PCSG), and the count helpers (countUpdatedPCLQs, countUpdatedPCSGs, countPCSGReplicaUpdatedPCLQs, flattenNamesToSet). All target the new canonical
    fields, not the deprecated mirror.
  • Generated artifacts regenerated. zz_generated.deepcopy.go, both CRD YAMLs (api/core/v1alpha1/crds/ + charts/crds/), and docs/api-reference/operator-api.md are regenerated via make generate and hack/generate-api-docs.sh.
  • make lint and go vet (default + -tags=e2e) clean. Full unit-test suite green except for one pre-existing TestDecodeOperatorConfig failure in api/config/v1alpha1 that is unrelated to this change and reproduces on main.

Does this PR introduce a API change?

action required: PodCliqueSet.Status.UpdateProgress.UpdatedPodCliques and UpdatedPodCliqueScalingGroups (and PodCliqueScalingGroup.Status.UpdateProgress.UpdatedPodCliques) are removed. Use the new bounded int32 counts instead: UpdatedPodCliquesCount/TotalPodCliquesCount and
UpdatedPodCliqueScalingGroupsCount/TotalPodCliqueScalingGroupsCount on the PodCliqueSet, and UpdatedPodCliquesCount/TotalPodCliquesCount on the PodCliqueScalingGroup. The deprecated RollingUpdateProgress mirror is preserved with the new bounded shape; its inner slice fields are
removed.

Additional documentation e.g., enhancement proposals, usage docs, etc.:

docs/api-reference/operator-api.md regenerated to reflect the new count fields and the trimmed RollingUpdateProgress mirror. Four new printer columns on PodCliqueSet (`Replicas`, `Available`, `Updated`, `PCLQS-Updated`, `PCLQS-Total`, `PCSGS-Updated`, `PCSGS-Total`) and two on
PodCliqueScalingGroup (`PCLQS-Updated`, `PCLQS-Total`) surface progress at scale where the previous FQN slices were unreadable.

@copy-pr-bot

copy-pr-bot Bot commented May 3, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread operator/api/core/v1alpha1/podcliqueset.go Outdated
Comment thread operator/api/core/v1alpha1/crds/grove.io_podcliquescalinggroups.yaml Outdated
Comment thread docs/api-reference/operator-api.md Outdated
Comment thread operator/e2e/tests/update/utils.go
Comment thread operator/internal/controller/podcliqueset/reconcilestatus.go
Comment thread operator/internal/controller/podcliqueset/reconcilestatus.go

@gflarity gflarity left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit, otherwise LGTM.

shayasoolin
shayasoolin previously approved these changes May 7, 2026

@gflarity gflarity left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed the counts being added to deprecated status fields (via structs). Let's chat about this briefly during the standup or design sync. Maybe I'm missing something.

Comment thread operator/api/core/v1alpha1/podcliqueset.go Outdated
Comment thread operator/api/core/v1alpha1/scalinggroup.go Outdated
@oleg-kushniriov oleg-kushniriov force-pushed the fix/567-bounded-update-progress branch 2 times, most recently from 94c7cdc to 7b3d718 Compare May 10, 2026 09:03
danbar2
danbar2 previously approved these changes May 10, 2026
@oleg-kushniriov oleg-kushniriov force-pushed the fix/567-bounded-update-progress branch from 7b3d718 to f3f36f6 Compare May 10, 2026 11:48
Comment thread operator/internal/controller/podcliquescalinggroup/reconcilestatus.go Outdated
@oleg-kushniriov oleg-kushniriov merged commit 66b4239 into ai-dynamo:main May 11, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Status field UpdatedPodCliques []string in PCS/PCSG does not scale — unbounded status payload at thousands of PodCliques and O(N²) reconciler CPU

4 participants