fix(probe,readiness): improve resilience to transient API server connectivity issues by armru · Pull Request #9148 · cloudnative-pg/cloudnative-pg

armru · 2025-11-12T13:22:44Z

This change enhances the resilience of all probe types (liveness, readiness, and startup) when facing transient Kubernetes API server connectivity issues. Previously, readiness and startup probes would fail immediately if unable to reach the API server, potentially causing unnecessary pod restarts or preventing pods from becoming ready.

The improvement introduces a unified cluster caching mechanism that:

- Creates a **single shared cache** instance used across all three probe types (liveness, readiness, startup) to reduce memory usage and ensure consistency
- Implements **thread-safe** cache operations with proper mutex locking to support concurrent probe execution
- Attempts to fetch the cluster definition with a **500ms timeout** to avoid blocking the probe for too long
- **Falls back to a cached cluster definition** if the API server is temporarily unreachable
- **Falls back to default probe configuration** if no cached cluster is found
- Maintains probe functionality during brief network interruptions or API server unavailability
- Uses optimized memory allocation patterns to avoid unnecessary `DeepCopy` operations

This ensures consistent behavior across all probe types and reduces false positives during transient network issues, while also improving performance through shared resources and optimized memory usage.

github-actions · 2025-11-12T13:22:56Z

❗ By default, the pull request is configured to backport to all release branches.

To stop backporting this PR, remove the label: backport-requested ◀️ or add the label 'do not backport'
To stop backporting this PR to a certain release branch, remove the specific branch label: release-x.y

armru · 2025-11-12T13:25:23Z

/test limit=local

github-actions · 2025-11-12T13:25:32Z

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19298987488

armru · 2025-11-13T10:33:23Z

/test limit=local

github-actions · 2025-11-13T10:33:33Z

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19328577759

armru · 2025-11-13T16:05:58Z

/test limit=local

github-actions · 2025-11-13T16:06:10Z

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19337811527

armru · 2025-11-17T09:25:04Z

/test limit=local

github-actions · 2025-11-17T09:25:16Z

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19424631451

mnencia · 2025-11-27T16:20:10Z

/test

github-actions · 2025-11-27T16:20:21Z

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19742489229

mnencia · 2025-11-28T13:22:45Z

/test

github-actions · 2025-11-28T13:22:56Z

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19765080284

…ectivity issues (#9148)

This change enhances the resilience of all probe types (liveness, readiness, and startup) when facing transient Kubernetes API server connectivity issues. Previously, readiness and startup probes would fail immediately if unable to reach the API server, potentially causing unnecessary pod restarts or preventing pods from becoming ready.

The improvement introduces a unified cluster caching mechanism that:

- Creates a **single shared cache** instance used across all three probe types (liveness, readiness, startup) to reduce memory usage and ensure consistency
- Implements **thread-safe** cache operations with proper mutex locking to support concurrent probe execution
- Attempts to fetch the cluster definition with a **500ms timeout** to avoid blocking the probe for too long
- **Falls back to a cached cluster definition** if the API server is temporarily unreachable
- **Falls back to default probe configuration** if no cached cluster is found
- Maintains probe functionality during brief network interruptions or API server unavailability
- Uses optimized memory allocation patterns to avoid unnecessary `DeepCopy` operations

This ensures consistent behavior across all probe types and reduces false positives during transient network issues, while also improving performance through shared resources and optimized memory usage.

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
(cherry picked from commit 1f11235)

kamikaze · 2026-01-19T22:15:34Z

Suddenly I'm flooded with:

readiness probe using cached cluster definition due to API server connectivity issue

even though my tests barely reach 200ms at most.

armru requested a review from a team as a code owner November 12, 2025 13:22

armru added the no-issue label Nov 12, 2025

dosubot bot added the size:M label (This PR changes 30-99 lines, ignoring generated files.) Nov 12, 2025

armru changed the title ~~fix(probe,healthy): use the cluster cache when the api-server is unavailable~~ fix(probe,healthy): use the cluster cache when the apiserver is unavailable Nov 12, 2025

cnpg-bot added the backport-requested ◀️ (This pull request should be backported to all supported releases), release-1.25, release-1.26, and release-1.27 labels Nov 12, 2025

dosubot bot added the bug 🐛 label (Something isn't working) Nov 12, 2025

armru force-pushed the dev/readiness-probe branch from c787ca7 to efff778 Compare November 12, 2025 13:23

armru changed the title ~~fix(probe,healthy): use the cluster cache when the apiserver is unavailable~~ fix(probe,healthy): use the local cache when the apiserver is unavailable Nov 12, 2025

cnpg-bot added the ok to merge 👌 label (This PR can be merged) Nov 12, 2025

armru force-pushed the dev/readiness-probe branch from efff778 to b0ee34b Compare November 13, 2025 10:30

dosubot bot added the size:L label (This PR changes 100-499 lines, ignoring generated files.) and removed the size:M label (This PR changes 30-99 lines, ignoring generated files.) Nov 13, 2025

armru changed the title ~~fix(probe,healthy): use the local cache when the apiserver is unavailable~~ fix(probe,readiness): improve probe resilience to transient API server connectivity issues Nov 13, 2025

armru changed the title ~~fix(probe,readiness): improve probe resilience to transient API server connectivity issues~~ fix(probe,readiness): improve resilience to transient API server connectivity issues Nov 13, 2025

armru force-pushed the dev/readiness-probe branch 2 times, most recently from 892bce2 to cd78c40 Compare November 13, 2025 10:36

armru removed the release-1.25 and release-1.26 labels Nov 13, 2025

mnencia force-pushed the dev/readiness-probe branch from 1e8ea81 to daa0555 Compare November 26, 2025 16:57

armru and others added 5 commits November 27, 2025 14:49

fix(probe,healthy): use the cluster cache when the apiserver is unavailable

c4d3440

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

fix: handle first start

cd577fb

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

fix(e2e): log checking

29331fb

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

chore: lint

356ab6d

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>

chore: add comment about thread safety

7ae392b

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

mnencia force-pushed the dev/readiness-probe branch 2 times, most recently from 893fbfa to b8c0b68 Compare November 27, 2025 15:05

armru commented Nov 27, 2025

pkg/management/postgres/webserver/probes/cache.go (outdated, resolved)

mnencia added 7 commits November 27, 2025 16:44

fix: make clusterCache thread-safe

6c9346a

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

fix: eliminate nil pointer dereference risk in cache API

379a80d

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

refactor: remove unnecessary DeepCopy operations

e16d456

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

fix: add logging for cluster refresh errors

6fb2d36

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

fix: use consistent trace level for probe success logs

c40c449

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

fix: enable cache persistence for readiness and startup probes

91046ab

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

refactor: use shared cache across all probe types

e511e2c

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

mnencia force-pushed the dev/readiness-probe branch from c08e607 to e511e2c Compare November 27, 2025 15:44

mnencia added 2 commits November 28, 2025 11:28

perf: avoid double memory allocation

adb5de2

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

perf: avoid double memory allocation in cluster cache operations

5322108

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

mnencia force-pushed the dev/readiness-probe branch from f5ae1d3 to 5322108 Compare November 28, 2025 13:08

refactor: flip liveness check if-statement to reduce diff

2380ec8

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>

mnencia approved these changes Nov 28, 2025

mnencia merged commit 1f11235 into main Nov 28, 2025
35 checks passed

mnencia deleted the dev/readiness-probe branch November 28, 2025 15:51

5 participants
