What happened?
We have a large k8s cluster that has many tens of thousands of pods at all times. Only a very small number of these belong to Dynamo/Grove PodCliques. Currently, the Grove operator creates an unfiltered, cluster-wide pod informer that caches all pods in memory, meaning that it either needs an excessively large memory reservation or risks getting OOMKilled.
What did you expect to happen?
The PodClique controller watches v1.Pod via .Owns(&corev1.Pod{}) (internal/controller/podclique/register.go:66). While event predicates correctly filter reconciliation to only grove-managed pods, the underlying controller-runtime cache still lists/watches all pods cluster-wide because no Cache config is set in createManagerOptions() (internal/controller/manager.go:87).
Instead, Grove could add a cache configuration with a label selector for the Pod GVK in createManagerOptions():
opts.Cache = cache.Options{
	ByObject: map[client.Object]cache.ByObject{
		&corev1.Pod{}: {
			Label: labels.SelectorFromSet(labels.Set{
				"app.kubernetes.io/managed-by": "grove-operator",
			}),
		},
	},
}
This limits the Pod informer to pods carrying the app.kubernetes.io/managed-by=grove-operator label, which in a cluster like ours would shrink the Pod cache by several orders of magnitude.
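To make the selector's effect concrete, here is a minimal, dependency-free sketch of equality-based label matching. It is only a model, not controller-runtime code: the Pod struct, matches, cachedPods, and the pod names are all hypothetical stand-ins. As far as I understand, the real cache passes the selector to its list/watch calls, so the filtering happens on the API server rather than in the operator.

```go
package main

import "fmt"

// Pod is a hypothetical stand-in for corev1.Pod that keeps only the
// fields this sketch needs.
type Pod struct {
	Name   string
	Labels map[string]string
}

// matches reports whether labels contains every key/value pair in
// selector, which is how an equality-based label selector behaves.
func matches(labels, selector map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// cachedPods returns the pods that a selector-restricted cache would hold.
func cachedPods(all []Pod, selector map[string]string) []Pod {
	var kept []Pod
	for _, p := range all {
		if matches(p.Labels, selector) {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	selector := map[string]string{"app.kubernetes.io/managed-by": "grove-operator"}

	// Illustrative pods only; the names are made up for this example.
	pods := []Pod{
		{Name: "clique-worker-0", Labels: map[string]string{"app.kubernetes.io/managed-by": "grove-operator"}},
		{Name: "unrelated-web-1", Labels: map[string]string{"app": "web"}},
		{Name: "unlabeled-job-2"}, // no labels at all
	}

	// Only pods that carry the managed-by label survive the filter.
	for _, p := range cachedPods(pods, selector) {
		fmt.Println(p.Name)
	}
}
```

One consequence worth checking before adopting this: a pod that lacks the label never reaches the cache, so the change assumes every pod Grove creates carries app.kubernetes.io/managed-by=grove-operator.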
Environment
- Kubernetes version
- Grove version: v0.1.0-alpha.7
- Scheduler details
- Cloud provider or hardware configuration
- Tools that you are using Grove together with
- Anything else that is relevant