[CONTINT-5186] Use in-place pod resizing in the vertical controller (#47998)
### What does this PR do?
This PR implements in-place vertical pod resizing (IPVPA) in the autoscaling vertical controller, following the [RFC](https://datadoghq.atlassian.net/wiki/spaces/CONT/pages/6246498427/In-Place+Vertical+Pod+Resizing+for+Workload+Autoscaling).
See the RFC for the full specification. The key components are:
- In-place resize via the pods/resize subresource, with eviction (PDB-aware) and rollout as fallbacks
- API server feature gate check (pods/resize discovery, cached for 15 minutes)
- ResizeSuccessful event emitted once
### Motivation
https://datadoghq.atlassian.net/browse/CONTINT-5126
### Describe how you validated your changes
Deployed several workloads and DPAs on an EKS cluster to dddev.
1. Happy path (in-place resize with no restarts): ResizeSuccessful event emitted exactly once and restartCount=0.
2. Trigger rollout (setting `mode:TriggerRollout` on the DPA forces the legacy rollout path): works as expected.
3. Memory restart policy (container has a resizePolicy requiring a restart on memory limit/request changes): verified restartCount > 0 on pods after a memory recommendation change.
4. Sidecar (DPA with `constraints.containers: [{name: server}]`): only the server container is resized.
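
The sidecar case in item 4 amounts to filtering recommendations by the container names listed in the DPA constraints. A minimal sketch of that selection, assuming hypothetical names (`containerRec`, `selectTargets`), illustrative resource values, and the assumption that an empty constraint list leaves every container eligible:

```go
package main

import "fmt"

// containerRec is a hypothetical per-container recommendation.
type containerRec struct {
	Name   string
	CPU    string
	Memory string
}

// selectTargets keeps only recommendations for containers named in the constraints.
// Assumption: an empty constraint list means every container is eligible.
func selectTargets(recs []containerRec, constrained []string) []containerRec {
	if len(constrained) == 0 {
		return recs
	}
	allowed := make(map[string]struct{}, len(constrained))
	for _, name := range constrained {
		allowed[name] = struct{}{}
	}
	var out []containerRec
	for _, r := range recs {
		if _, ok := allowed[r.Name]; ok {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	// Illustrative values only; "sidecar" is a made-up container name.
	recs := []containerRec{
		{Name: "server", CPU: "500m", Memory: "256Mi"},
		{Name: "sidecar", CPU: "100m", Memory: "64Mi"},
	}
	for _, r := range selectTargets(recs, []string{"server"}) {
		fmt.Println("resize", r.Name, r.CPU, r.Memory)
	}
}
```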
Cluster/workloads are still available for inspection: https://dddev.datadoghq.com/orchestration/scaling/workload?query=kube_cluster_name%3Ajrosario-ipvpa-final%20-kube_cluster_name%3Ajrosario-ipvpa3-mar18&workload_scaling_tab=optimized-workloads
### Additional Notes
This change is also related to/relies on:
- [datadog-operator](DataDog/datadog-operator#2743). For local testing, I used a `go.work` entry to point to a local copy of the operator.
- helm-charts [RBAC for pods/resize](DataDog/helm-charts#2493) (patch verb on the pods/resize subresource).
Co-authored-by: cedric.lamoriniere <cedric.lamoriniere@datadoghq.com>