Skip to content

perf: reduce reconcile CPU on the 1000-pod scale test#545

Merged
danbar2 merged 1 commit into
ai-dynamo:mainfrom
danbar2:perf/reduce-reconcile-cpu
Apr 28, 2026
Merged

perf: reduce reconcile CPU on the 1000-pod scale test#545
danbar2 merged 1 commit into
ai-dynamo:mainfrom
danbar2:perf/reduce-reconcile-cpu

Conversation

@danbar2

@danbar2 danbar2 commented Apr 20, 2026

Copy link
Copy Markdown
Contributor

Summary

Solves #408

Six optimizations in the PCS / PCSG / PodClique reconcile paths. Measured on the 500-replica / 1000-pod scale test (k3d + 100 KWOK nodes, operator concurrency=20).

An earlier revision of this PR included a per-PodClique / per-PCSG spec-hash short-circuit (annotation grove.io/spec-hash). It was dropped after CR feedback flagged the hash as fragile (hand-picked field selection, MNNVL exclusion that depends on a webhook invariant, drift risk if buildResource changes silently). The remaining changes capture the same wall-clock improvement on this benchmark; steady-state CPU regresses ~10 ms/sec versus the short-circuited variant but stays ~50% under baseline.

Wall-clock

Phase Baseline This branch Δ
Total 128.0s 110.9s −13%
pcs-deleted 50.0s 37.9s −24%
pods-ready 47.8s 42.8s −10%

CPU — deploy phase (full pprof, 500 PodCliques + 500 PodGangs being created)

Function Baseline This branch Δ
Total CPU samples 14.98s ≈10s ≈−30%
GetPCLQPods (list + label-match) 4.58s 1.54s −66%
podclique reconcileSpec 3.68s 2.73s −26%
podclique reconcileStatus 3.57s below top-10 ≈−100% (no-op skip)
syncPCLQResources 2.99s 1.88s −37%
GetPodCliqueSet ≈1–2s 60ms −97%

CPU — steady-state phase (30s window, single no-op reconcile triggered by annotation patch)

The common production case: a watch event fires but nothing actually changed.

Function Baseline This branch Δ
Total CPU samples 680ms 370ms −46%
controllerutil.CreateOrPatch (PCS path) 330ms 180ms −45%
PCS PodClique doCreateOrUpdate 260ms 130ms −50%
PCS Reconcile 140ms below top-10 ≈−100%

A note on wall-clock vs. CPU

The wall-clock reduction is somewhat smaller than the CPU reduction because this test is not CPU-bound: the operator spends most of its wall-clock time waiting on kube-apiserver latency, informer cache syncs, and KWOK's own pod-scheduling delays. Those don't shrink just because our client-side code got faster. Where the CPU savings pay off in production:

  • Operator headroom: lower baseline CPU means less resource pressure under bursts and tighter container limits
  • Multi-tenant density: at N concurrent PCSes the per-reconcile savings scale with N
  • Event storms: steady-state −46% on a no-op reconcile is exactly the common case where an unrelated watch event wakes the controller
  • Larger clusters: at 10k+ PodCliques the CPU-per-reconcile would start gating wall-clock

All unit tests pass. No semantic changes to reconcile behavior (behavior is asserted via equality.Semantic.DeepEqual before skipping status writes, ownership still validated by metav1.IsControlledBy, etc.).

What's in the PR

Parallel component sync — three dependency-ordered groups in PCS reconcilespec.go:

  1. RBAC + infra (ServiceAccount, Role, RoleBinding, SATokenSecret, HeadlessService, HPA, PCSReplica, ComputeDomain, ResourceClaim)
  2. PodClique
  3. PCSG + PodGang

Instead of 12 sequential Syncs. Mostly benefits the delete phase (delete wall-clock −24%) where cleanup tasks overlap.

GetPCLQPods narrowed selectorMatchingLabels{LabelPodClique} only. That label is unique cluster-wide, so the parent managed-by/part-of labels only added per-pod labels.Set.Lookup work for no filtering benefit. Ownership is still validated by metav1.IsControlledBy.

ctx-scoped memoizationWithPCLQPodsCache and WithPodCliqueSetCache in internal/controller/common/component/utils/. PodClique's reconcileSpec and reconcileStatus share one pod list; GetPodCliqueSet drops from 4 calls per PodClique reconcile (and 3 per PCSG reconcile) to 1.

No-op status skip — in podclique/pcs/pcsg reconcileStatus, snapshot status before mutations, skip the API round-trip when equality.Semantic.DeepEqual says nothing changed.

Test additions

To make the steady-state CPU measurable in a scale test:

  • operator/e2e/measurement/condition/timer.goTimerCondition holds the measurement window open for a fixed duration after a trigger phase fires.
  • WorkloadManager.TriggerPCSReconcile in operator/e2e/grove/workload/workload.go — bumps grove.io/reconcile-trigger annotation to force a no-op reconcile cycle without touching the spec.
  • New steady-state-reconcile phase in operator/e2e/tests/scale/scale_test.go — patches the annotation, waits 30s, and pprof captured during that window isolates the cache-hit reconcile cost.

🤖 Generated with Claude Code

@copy-pr-bot

copy-pr-bot Bot commented Apr 20, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@danbar2 danbar2 force-pushed the perf/reduce-reconcile-cpu branch 3 times, most recently from 6ada630 to f0dd6da Compare April 21, 2026 05:33
@danbar2 danbar2 marked this pull request as ready for review April 21, 2026 07:22
@danbar2 danbar2 force-pushed the perf/reduce-reconcile-cpu branch 2 times, most recently from 6c80bca to 8368b09 Compare April 26, 2026 08:07
Comment thread operator/internal/controller/podcliqueset/components/podclique/podclique.go Outdated
Comment thread operator/internal/controller/podcliqueset/components/podclique/podclique.go Outdated
@danbar2 danbar2 force-pushed the perf/reduce-reconcile-cpu branch 2 times, most recently from 8ce635d to 3cd6db8 Compare April 27, 2026 14:05
shayasoolin
shayasoolin previously approved these changes Apr 27, 2026
@danbar2 danbar2 force-pushed the perf/reduce-reconcile-cpu branch from 3cd6db8 to 3638551 Compare April 28, 2026 06:09
Comment thread operator/e2e/measurement/condition/timer.go
Comment thread operator/e2e/measurement/condition/timer.go
Comment thread operator/e2e/tests/scale/scale_test.go
Comment thread operator/internal/controller/common/component/utils/pod.go
Comment thread operator/internal/controller/common/component/utils/podcliqueset.go
Comment thread operator/internal/controller/podcliqueset/reconcilestatus.go
@danbar2 danbar2 force-pushed the perf/reduce-reconcile-cpu branch from a772930 to 48fb64d Compare April 28, 2026 07:56
Comment thread operator/internal/controller/podcliqueset/reconcilespec.go
Comment thread operator/internal/controller/podcliqueset/reconcilespec.go Outdated
Comment thread operator/internal/controller/podcliqueset/reconcilespec.go Outdated
Comment thread operator/internal/controller/podcliqueset/reconcilespec.go Outdated

@shmuel-runai shmuel-runai left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work
Fix it, Ship it

Measured on the 500-replica / 1000-pod scale test (k3d + 100 KWOK nodes,
operator concurrency 20). Wall-clock −10-13% on a fresh run, deploy CPU
−21% on prior runs, GetPodCliqueSet −97%, GetPCLQPods −66%. Per-call
cost in the PodClique update path roughly halved by the cache + label
changes.

Changes:
* Parallel component sync in PCS reconcileSpec: three dependency-ordered
  groups (RBAC+infra, PodClique, PCSG+PodGang) instead of twelve
  sequential Syncs. Benefits the delete phase where component cleanups
  can run in parallel.
* GetPCLQPods narrowed to MatchingLabels{LabelPodClique} — that label is
  unique cluster-wide, so the parent managed-by / part-of labels only
  added per-pod labels.Set.Lookup work for no filtering benefit.
  Ownership is still validated by metav1.IsControlledBy.
* GetPCLQPods and GetPodCliqueSet ctx-memoized for the duration of one
  reconcile (WithPCLQPodsCache, WithPodCliqueSetCache). reconcileSpec
  and reconcileStatus share one informer list; GetPodCliqueSet drops
  from four calls per PodClique reconcile (and three per PCSG reconcile)
  to one.
* No-op status skip in podclique/pcs/pcsg reconcileStatus: snapshot the
  status before mutate* calls, skip the API round-trip when
  equality.Semantic.DeepEqual says nothing changed.

Test additions:
* TimerCondition in operator/e2e/measurement/condition/timer.go —
  holds the measurement window open for a fixed duration after a trigger
  phase fires.
* WorkloadManager.TriggerPCSReconcile — bumps grove.io/reconcile-trigger
  annotation to force a no-op reconcile cycle without changing the spec.
* steady-state-reconcile phase in operator/e2e/tests/scale/scale_test.go
  exercises the cache-hit path so the reconcile cost can be measured in
  isolation.

Scale test (1 PCS × 500 replicas × 2 pods = 1000 pods):

  Baseline total:   128.0s   This branch: 110.9s   (−13%)
  pcs-deleted:       50.0s              37.9s     (−24%)
  deploy CPU total: 14.98s             ≈10s        (−30%)
  GetPCLQPods:       4.57s              1.54s     (−66%)
  GetPodCliqueSet:  ≈1–2s               60ms      (−97%)
  PCS PodClique.doCreateOrUpdate (steady-state): 260ms → 130ms (−50%)

Earlier revisions of this branch added a per-PodClique / per-PCSG
spec-hash short-circuit guarded by a grove.io/spec-hash annotation. That
optimization was dropped after CR feedback flagged the hash as fragile
(MNNVL exclusion, drift risk if buildResource changes silently). The
remaining changes alone capture roughly the same wall-clock improvement
on this benchmark; the steady-state CPU regresses ~10ms/sec versus the
short-circuited variant but stays ~50% under baseline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@danbar2 danbar2 force-pushed the perf/reduce-reconcile-cpu branch from 48fb64d to 9275b81 Compare April 28, 2026 08:25
@danbar2 danbar2 merged commit 6b854c6 into ai-dynamo:main Apr 28, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants