fix(probe,readiness): improve resilience to transient API server connectivity issues#9148
❗ By default, the pull request is configured to backport to all release branches.
c787ca7 to efff778
local cache when the apiserver is unavailable
/test limit=local
@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19298987488
efff778 to b0ee34b
local cache when the apiserver is unavailable
/test limit=local
@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19328577759
892bce2 to cd78c40
/test limit=local
@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19337811527
/test limit=local
@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19424631451
1e8ea81 to daa0555
…ilable Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
893fbfa to b8c0b68
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
c08e607 to e511e2c
/test
@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19742489229
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
f5ae1d3 to 5322108
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
/test
@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19765080284
…ectivity issues (#9148)

This change enhances the resilience of all probe types (liveness, readiness, and startup) when facing transient Kubernetes API server connectivity issues. Previously, readiness and startup probes would fail immediately if unable to reach the API server, potentially causing unnecessary pod restarts or preventing pods from becoming ready.

The improvement introduces a unified cluster caching mechanism that:

- Creates a **single shared cache** instance used across all three probe types (liveness, readiness, startup) to reduce memory usage and ensure consistency
- Implements **thread-safe** cache operations with proper mutex locking to support concurrent probe execution
- Attempts to fetch the cluster definition with a **500ms timeout** to avoid blocking the probe for too long
- **Falls back to a cached cluster definition** if the API server is temporarily unreachable
- **Falls back to default probe configuration** if no cached cluster is found
- Maintains probe functionality during brief network interruptions or API server unavailability
- Uses optimized memory allocation patterns to avoid unnecessary `DeepCopy` operations

This ensures consistent behavior across all probe types and reduces false positives during transient network issues, while also improving performance through shared resources and optimized memory usage.

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

(cherry picked from commit 1f11235)
Suddenly I'm flooded with `readiness probe using cached cluster definition due to API server connectivity issue`, but my tests barely reach 200ms max.
This change enhances the resilience of all probe types (liveness, readiness, and startup) when facing transient Kubernetes API server connectivity issues. Previously, readiness and startup probes would fail immediately if unable to reach the API server, potentially causing unnecessary pod restarts or preventing pods from becoming ready.
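The improvement introduces a unified cluster caching mechanism that:

- Creates a **single shared cache** instance used across all three probe types (liveness, readiness, startup) to reduce memory usage and ensure consistency
- Implements **thread-safe** cache operations with proper mutex locking to support concurrent probe execution
- Attempts to fetch the cluster definition with a **500ms timeout** to avoid blocking the probe for too long
- **Falls back to a cached cluster definition** if the API server is temporarily unreachable
- **Falls back to default probe configuration** if no cached cluster is found
- Maintains probe functionality during brief network interruptions or API server unavailability
- Uses optimized memory allocation patterns to avoid unnecessary `DeepCopy` operations

This ensures consistent behavior across all probe types and reduces false positives during transient network issues, while also improving performance through shared resources and optimized memory usage.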