add cache options for pods managed by grove only #533

Merged: oleg-kushniriov merged 5 commits into ai-dynamo:main from oleg-kushniriov:fix/cache-managed-pods-for-reconciler on Apr 16, 2026

@oleg-kushniriov

Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a label selector to the Pod informer cache so only pods with app.kubernetes.io/managed-by: grove-operator are listed/watched. On large clusters with tens of thousands of pods, the operator was caching every pod cluster-wide even though only a tiny fraction are grove-managed.
This caused excessive memory usage and OOMKill risk.

The fix adds a cache.ByObject configuration for corev1.Pod in createManagerOptions() that pushes filtering to the API server. Grove CRDs are not filtered because all instances are grove-managed by definition.
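
Pushing the filter to the API server means the informer's list and watch requests carry a label selector, so unmanaged pods are never sent to the operator. The following is a minimal sketch, using only the Go standard library, of what such a request path looks like. `podListURL` is a hypothetical helper for illustration and is not code from this PR, which sets the selector through `cache.ByObject` instead.

```go
package main

import (
	"fmt"
	"net/url"
)

// The label key and value named in this PR's description.
const (
	managedByLabel = "app.kubernetes.io/managed-by"
	managedByValue = "grove-operator"
)

// podListURL is a hypothetical helper (not part of the PR). It builds the
// path and query string of a cluster-wide pod list or watch request that is
// restricted by an equality-based label selector.
func podListURL(watch bool) string {
	q := url.Values{}
	q.Set("labelSelector", managedByLabel+"="+managedByValue)
	if watch {
		q.Set("watch", "true")
	}
	// Encode sorts keys and percent-escapes "/" and "=" inside the value.
	return "/api/v1/pods?" + q.Encode()
}

func main() {
	fmt.Println(podListURL(false))
	fmt.Println(podListURL(true))
}
```

Because the selector travels with the request, the server does the matching and only the matching pods are serialized and returned.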

Which issue(s) this PR fixes:

Fixes #530

Special notes for your reviewer:

  • Event predicates (e.g. isManagedPod()) were already filtering which cached events trigger reconciliation, but the underlying informer still stored all pods. This change moves filtering server-side.
  • TestCreateManager now uses envtest.Environment instead of a fake REST host because cache.ByObject triggers API discovery during manager creation. It skips gracefully when kubebuilder binaries are unavailable, consistent with TestRegisterControllers.
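
The distinction in the first note, between filtering events and filtering what the informer stores, can be illustrated with a small simulation. This is a sketch under simplifying assumptions: the `pod` type, `makeCluster`, `hasManagedByLabel`, `predicateOnly`, and `selectorAtSource` are all hypothetical names, and a plain map stands in for the informer store. It is not the operator's code.

```go
package main

import "fmt"

// pod is a simplified stand-in for a Kubernetes Pod (assumption).
type pod struct {
	name   string
	labels map[string]string
}

// hasManagedByLabel is a hypothetical stand-in for an event predicate.
func hasManagedByLabel(p pod) bool {
	return p.labels["app.kubernetes.io/managed-by"] == "grove-operator"
}

// makeCluster builds total pods, labelling every nth one as grove-managed.
func makeCluster(total, every int) []pod {
	cluster := make([]pod, 0, total)
	for i := 0; i < total; i++ {
		p := pod{name: fmt.Sprintf("pod-%d", i), labels: map[string]string{}}
		if i%every == 0 {
			p.labels["app.kubernetes.io/managed-by"] = "grove-operator"
		}
		cluster = append(cluster, p)
	}
	return cluster
}

// predicateOnly models the earlier behaviour: every pod is stored, and the
// predicate only decides which pods trigger reconciliation.
func predicateOnly(cluster []pod) (cached, reconciled int) {
	store := make(map[string]pod)
	for _, p := range cluster {
		store[p.name] = p
		if hasManagedByLabel(p) {
			reconciled++
		}
	}
	return len(store), reconciled
}

// selectorAtSource models the behaviour after this change: unmanaged pods
// are dropped before they reach the store.
func selectorAtSource(cluster []pod) (cached, reconciled int) {
	store := make(map[string]pod)
	for _, p := range cluster {
		if !hasManagedByLabel(p) {
			continue
		}
		store[p.name] = p
		reconciled++
	}
	return len(store), reconciled
}

func main() {
	cluster := makeCluster(1000, 100)
	c1, r1 := predicateOnly(cluster)
	c2, r2 := selectorAtSource(cluster)
	fmt.Printf("predicate only:     cached=%d reconciled=%d\n", c1, r1)
	fmt.Printf("selector at source: cached=%d reconciled=%d\n", c2, r2)
}
```

Both approaches reconcile the same pods; only the number of objects held in memory differs, which is the effect the PR targets.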

Does this PR introduce an API change?

release-note

The operator's Pod informer now uses a label-selector cache (`app.kubernetes.io/managed-by: grove-operator`) 
so only grove-managed pods are listed and watched. 
On clusters with tens of thousands of pods, this reduces the operator's memory footprint by orders of magnitude. 

Additional documentation e.g., enhancement proposals, usage docs, etc.:

NONE

@copy-pr-bot

copy-pr-bot Bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

enoodle previously approved these changes Apr 15, 2026
Comment thread operator/internal/controller/manager_test.go Outdated
Comment thread operator/internal/controller/manager.go
danbar2 previously approved these changes Apr 15, 2026
enoodle previously approved these changes Apr 15, 2026
danbar2 previously approved these changes Apr 16, 2026
oleg-kushniriov force-pushed the fix/cache-managed-pods-for-reconciler branch from e8d22a8 to 1b141b1 April 16, 2026 07:36
Comment thread operator/e2e/tests/operator_infra_test.go Outdated
Comment thread operator/internal/controller/manager_test.go
oleg-kushniriov dismissed stale reviews from enoodle and danbar2 via 45be5be April 16, 2026 08:57
oleg-kushniriov force-pushed the fix/cache-managed-pods-for-reconciler branch from 1b141b1 to 45be5be April 16, 2026 08:57
oleg-kushniriov merged commit 1fd8a9a into ai-dynamo:main Apr 16, 2026
60 of 66 checks passed
Development

Successfully merging this pull request may close these issues.

Update grove-operator to use label-filtered cache for the pod informer

5 participants