Skip to content

Eliminate O(N×M) reconcile hotspots across the operator #577

@oleg-kushniriov

Description

@oleg-kushniriov

What you would like to be added?

Convert nine known O(N×M) loops in the Grove operator's reconcile path to O(N+M) by replacing per-element slices.Contains / slices.IndexFunc / lo.Filter lookups with map[string]struct{} (or map[string]T) sets built once per sync context. The patterns are all the same shape: an inner linear scan over a slice that could be a single map lookup.

The nine call sites, ordered by likely impact (Pod-cardinality first, since Pods have the highest cardinality at scale):

# File Function Current complexity Target complexity
1 operator/internal/controller/podcliqueset/components/podgang/syncflow.go initializeAssignedAndUnassignedPodsForPCS O(totalPods × expectedPodGangs) O(totalPods + expectedPodGangs)
2 operator/internal/controller/podclique/components/pod/syncflow.go checkAndRemovePodSchedulingGates O(podsInPCLQ × podRefsInPodGangForPCLQ) O(podsInPCLQ + podRefsInPodGangForPCLQ)
3 operator/internal/controller/podclique/components/pod/rollingupdate.go getPodNamesPendingUpdate O(oldPods × deletionExpectedPodUIDs) O(oldPods + deletionExpectedPodUIDs)
4 operator/internal/controller/podcliquescalinggroup/components/podclique/sync.go createExpectedPCLQs, createOrUpdatePCLQs O(expectedPCLQs × existingPCLQs) O(expectedPCLQs + existingPCLQs)
5 operator/internal/controller/podcliqueset/components/podclique/podclique.go createOrUpdatePCLQs O(expectedStandalonePCLQs × existingStandalonePCLQs) O(expectedStandalonePCLQs + existingStandalonePCLQs)
6 operator/internal/controller/podcliqueset/components/podcliquescalinggroup/podcliquescalinggroup.go syncPodCliqueScalingGroups O(expectedPCSGs × existingPCSGs) O(expectedPCSGs + existingPCSGs)
7 operator/internal/controller/podcliqueset/components/podgang/syncflow.go getPodsForPodCliquesPendingCreation O(podGangPCLQs × existingPCLQs) O(podGangPCLQs + existingPCLQs)
8 operator/internal/controller/podcliqueset/components/podgang/syncflow.go getPodGangNamesPendingCreation, getExcessPodGangNames O(expectedPodGangs × existingPodGangs) O(expectedPodGangs + existingPodGangs)
9 operator/internal/controller/podcliqueset/components/podgang/syncflow.go getPodCliques O(podGangPCLQs × existingPCLQs) O(podGangPCLQs + existingPCLQs)

The fix shape is identical for all of them: hoist the lookup target into a map[string]struct{} (or map[string]T where the value is needed) once at the start of the sync context, then replace each slices.Contains / slices.IndexFunc with an O(1) map lookup. Where the slice is mutable, prefer slices.DeleteFunc over lo.Filter — it is in-place and stdlib.

A representative example of the same pattern, fixed recently as part of #567, lives in operator/internal/controller/podcliqueset/reconcilestatus.go (computeAvailableAndUpdatedReplicas): see flattenNamesToSet + slices.DeleteFunc for a working template.

Why is this needed?

Grove aims to support almost any Pods amount per PodCliqueSet, distributable in any shape (one PodClique with 10 000 Pods, 5 000 PodCliques with 2 Pods, anything in between). The point of this issue is not that the operator falls over at 10 000 — it is that the quadratic shape of these loops makes the cost balloon long before the upper bound, and the wins from linearizing them are already meaningful at modest scale (~1 000 Pods).

Per-reconcile comparison counts

For two representative fleet shapes, here is the worst per-PCS-reconcile work for the affected hotspots:

Shape N=1 000 Pods N=2 000 Pods N=10 000 Pods
Pod-heavy (1 PCLQ × N Pods) — hits hotspots 2, 3 ~10⁶ per PCLQ sync ~4×10⁶ ~10⁸
PCLQ-heavy (N/2 PCLQs × 2 Pods) — hits hotspots 1, 4, 5, 7, 8, 9 each ~2.5×10⁵; aggregate ~1.5×10⁶ each ~10⁶; aggregate ~6×10⁶ each ~2.5×10⁷; aggregate ~1.5×10⁸

A few observations:

  • Even at N=1 000 — well below the upper target — the PCLQ-heavy shape produces low-millions of comparisons per reconcile, plus per-element slice allocations (lo.Map, lo.FilterMap, lambda captures) on every hot path. With multi-second requeue cadences across multiple PCSes, this is already a measurable source of CPU and GC pressure; it just doesn't appear in CI/dev because dev clusters typically run at N≈10.
  • N=10 000 is where it gets ugly. The same patterns produce 10⁸-class per-reconcile work — that is the regime where reconcile latencies climb into seconds and watch-event storms cascade across observers.
  • The fix gets cheaper as scale grows. Each O(N²) loop becomes an O(N+M) map lookup, taking total work from ~10⁸ to ~10⁴ at N=10 000 — four orders of magnitude — for a one-shot map build per sync context.

Metadata

Metadata

Labels

enhancementNew feature or request

Type

No fields configured for Task.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions