perf: reduce reconcile CPU on the 1000-pod scale test by danbar2 · Pull Request #545 · ai-dynamo/grove

danbar2 · 2026-04-20T09:58:45Z

Summary

Solves #408

Six optimizations in the PCS / PCSG / PodClique reconcile paths. Measured on the 500-replica / 1000-pod scale test (k3d + 100 KWOK nodes, operator concurrency=20).

An earlier revision of this PR included a per-PodClique / per-PCSG spec-hash short-circuit (annotation grove.io/spec-hash). It was dropped after CR feedback flagged the hash as fragile (hand-picked field selection, MNNVL exclusion that depends on a webhook invariant, drift risk if buildResource changes silently). The remaining changes capture the same wall-clock improvement on this benchmark; steady-state CPU regresses ~10 ms/sec versus the short-circuited variant but stays ~50% under baseline.

Wall-clock

Phase	Baseline	This branch	Δ
Total	128.0s	110.9s	−13%
pcs-deleted	50.0s	37.9s	−24%
pods-ready	47.8s	42.8s	−10%

CPU — deploy phase (full pprof, 500 PodCliques + 500 PodGangs being created)

Function	Baseline	This branch	Δ
Total CPU samples	14.98s	≈10s	≈−30%
`GetPCLQPods` (list + label-match)	4.58s	1.54s	−66%
podclique `reconcileSpec`	3.68s	2.73s	−26%
podclique `reconcileStatus`	3.57s	below top-10	≈−100% (no-op skip)
`syncPCLQResources`	2.99s	1.88s	−37%
`GetPodCliqueSet`	≈1–2s	60ms	−97%

CPU — steady-state phase (30s window, single no-op reconcile triggered by annotation patch)

The common production case: a watch event fires but nothing actually changed.

Function	Baseline	This branch	Δ
Total CPU samples	680ms	370ms	−46%
`controllerutil.CreateOrPatch` (PCS path)	330ms	180ms	−45%
PCS PodClique `doCreateOrUpdate`	260ms	130ms	−50%
PCS `Reconcile`	140ms	below top-10	≈−100%

A note on wall-clock vs. CPU

The wall-clock reduction is somewhat smaller than the CPU reduction because this test is not CPU-bound: the operator spends most of its wall-clock time waiting on kube-apiserver latency, informer cache syncs, and KWOK's own pod-scheduling delays. Those don't shrink just because our client-side code got faster. Where the CPU savings pay off in production:

Operator headroom: lower baseline CPU means less resource pressure under bursts and tighter container limits
Multi-tenant density: at N concurrent PCSes the per-reconcile savings scale with N
Event storms: steady-state −46% on a no-op reconcile is exactly the common case where an unrelated watch event wakes the controller
Larger clusters: at 10k+ PodCliques the CPU-per-reconcile would start gating wall-clock

All unit tests pass. No semantic changes to reconcile behavior (behavior is asserted via equality.Semantic.DeepEqual before skipping status writes, ownership still validated by metav1.IsControlledBy, etc.).

What's in the PR

Parallel component sync — three dependency-ordered groups in PCS reconcilespec.go:

RBAC + infra (ServiceAccount, Role, RoleBinding, SATokenSecret, HeadlessService, HPA, PCSReplica, ComputeDomain, ResourceClaim)
PodClique
PCSG + PodGang

Instead of 12 sequential Syncs. Mostly benefits the delete phase (delete wall-clock −24%) where cleanup tasks overlap.

GetPCLQPods narrowed selector — MatchingLabels{LabelPodClique} only. That label is unique cluster-wide, so the parent managed-by/part-of labels only added per-pod labels.Set.Lookup work for no filtering benefit. Ownership is still validated by metav1.IsControlledBy.

ctx-scoped memoization — WithPCLQPodsCache and WithPodCliqueSetCache in internal/controller/common/component/utils/. PodClique's reconcileSpec and reconcileStatus share one pod list; GetPodCliqueSet drops from 4 calls per PodClique reconcile (and 3 per PCSG reconcile) to 1.

No-op status skip — in podclique/pcs/pcsg reconcileStatus, snapshot status before mutations, skip the API round-trip when equality.Semantic.DeepEqual says nothing changed.

Test additions

To make the steady-state CPU measurable in a scale test:

operator/e2e/measurement/condition/timer.go — TimerCondition holds the measurement window open for a fixed duration after a trigger phase fires.
WorkloadManager.TriggerPCSReconcile in operator/e2e/grove/workload/workload.go — bumps grove.io/reconcile-trigger annotation to force a no-op reconcile cycle without touching the spec.
New steady-state-reconcile phase in operator/e2e/tests/scale/scale_test.go — patches the annotation, waits 30s, and pprof captured during that window isolates the cache-hit reconcile cost.

🤖 Generated with Claude Code

copy-pr-bot · 2026-04-20T09:58:49Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

shmuel-runai

Great work
Fix it, Ship it

Measured on the 500-replica / 1000-pod scale test (k3d + 100 KWOK nodes, operator concurrency 20). Wall-clock −10-13% on a fresh run, deploy CPU −21% on prior runs, GetPodCliqueSet −97%, GetPCLQPods −66%. Per-call cost in the PodClique update path roughly halved by the cache + label changes. Changes: * Parallel component sync in PCS reconcileSpec: three dependency-ordered groups (RBAC+infra, PodClique, PCSG+PodGang) instead of twelve sequential Syncs. Benefits the delete phase where component cleanups can run in parallel. * GetPCLQPods narrowed to MatchingLabels{LabelPodClique} — that label is unique cluster-wide, so the parent managed-by / part-of labels only added per-pod labels.Set.Lookup work for no filtering benefit. Ownership is still validated by metav1.IsControlledBy. * GetPCLQPods and GetPodCliqueSet ctx-memoized for the duration of one reconcile (WithPCLQPodsCache, WithPodCliqueSetCache). reconcileSpec and reconcileStatus share one informer list; GetPodCliqueSet drops from four calls per PodClique reconcile (and three per PCSG reconcile) to one. * No-op status skip in podclique/pcs/pcsg reconcileStatus: snapshot the status before mutate* calls, skip the API round-trip when equality.Semantic.DeepEqual says nothing changed. Test additions: * TimerCondition in operator/e2e/measurement/condition/timer.go — holds the measurement window open for a fixed duration after a trigger phase fires. * WorkloadManager.TriggerPCSReconcile — bumps grove.io/reconcile-trigger annotation to force a no-op reconcile cycle without changing the spec. * steady-state-reconcile phase in operator/e2e/tests/scale/scale_test.go exercises the cache-hit path so the reconcile cost can be measured in isolation. Scale test (1 PCS × 500 replicas × 2 pods = 1000 pods): Baseline total: 128.0s This branch: 110.9s (−13%) pcs-deleted: 50.0s 37.9s (−24%) deploy CPU total: 14.98s ≈10s (−30%) GetPCLQPods: 4.57s 1.54s (−66%) GetPodCliqueSet: ≈1–2s 60ms (−97%) PCS PodClique.doCreateOrUpdate (steady-state): 260ms → 130ms (−50%) Earlier revisions of this branch added a per-PodClique / per-PCSG spec-hash short-circuit guarded by a grove.io/spec-hash annotation. That optimization was dropped after CR feedback flagged the hash as fragile (MNNVL exclusion, drift risk if buildResource changes silently). The remaining changes alone capture roughly the same wall-clock improvement on this benchmark; the steady-state CPU regresses ~10ms/sec versus the short-circuited variant but stays ~50% under baseline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

danbar2 added the run-e2e label Apr 20, 2026

danbar2 force-pushed the perf/reduce-reconcile-cpu branch 3 times, most recently from 6ada630 to f0dd6da Compare April 21, 2026 05:33

danbar2 marked this pull request as ready for review April 21, 2026 07:22

danbar2 requested review from Ronkahn21, gflarity, sanjaychatterjee, shayasoolin and unmarshall as code owners April 21, 2026 07:22

danbar2 force-pushed the perf/reduce-reconcile-cpu branch 2 times, most recently from 6c80bca to 8368b09 Compare April 26, 2026 08:07

shayasoolin reviewed Apr 27, 2026

View reviewed changes

danbar2 force-pushed the perf/reduce-reconcile-cpu branch 2 times, most recently from 8ce635d to 3cd6db8 Compare April 27, 2026 14:05

shayasoolin previously approved these changes Apr 27, 2026

View reviewed changes

danbar2 force-pushed the perf/reduce-reconcile-cpu branch from 3cd6db8 to 3638551 Compare April 28, 2026 06:09