fix(probe,readiness): improve resilience to transient API server connectivity issues #9148

Merged: mnencia merged 15 commits into main from dev/readiness-probe on Nov 28, 2025

Conversation

@armru
Member

@armru armru commented Nov 12, 2025

This change enhances the resilience of all probe types (liveness, readiness, and startup) when facing transient Kubernetes API server connectivity issues. Previously, readiness and startup probes would fail immediately if unable to reach the API server, potentially causing unnecessary pod restarts or preventing pods from becoming ready.

The improvement introduces a unified cluster caching mechanism that:

  • Creates a single shared cache instance used across all three probe types (liveness, readiness, startup) to reduce memory usage and ensure consistency
  • Implements thread-safe cache operations with proper mutex locking to support concurrent probe execution
  • Attempts to fetch the cluster definition with a 500ms timeout to avoid blocking the probe for too long
  • Falls back to a cached cluster definition if the API server is temporarily unreachable
  • Falls back to default probe configuration if no cached cluster is found
  • Maintains probe functionality during brief network interruptions or API server unavailability
  • Uses optimized memory allocation patterns to avoid unnecessary DeepCopy operations

This ensures consistent behavior across all probe types and reduces false positives during transient network issues, while also improving performance through shared resources and optimized memory usage.
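The fallback order described above (live fetch with a 500ms timeout, then the cached definition, then defaults) can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `Cluster`, `clusterCache`, `fetchFunc`, `resolveCluster`, `probeOnce`, and the default values are all hypothetical names and numbers chosen for the example.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// Cluster is a hypothetical, trimmed-down stand-in for the cluster definition.
type Cluster struct {
	Name             string
	FailureThreshold int
}

// fetchFunc is a hypothetical abstraction over the API server lookup.
type fetchFunc func(ctx context.Context) (*Cluster, error)

// fetchTimeout mirrors the 500ms budget mentioned in the description.
const fetchTimeout = 500 * time.Millisecond

// clusterCache keeps the last successfully fetched definition.
// A single instance is meant to be shared by all probe handlers.
type clusterCache struct {
	mu      sync.RWMutex
	cluster *Cluster
}

// get returns the cached pointer without copying it;
// callers must treat the result as read-only.
func (c *clusterCache) get() *Cluster {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.cluster
}

// set replaces the cached definition under the write lock.
func (c *clusterCache) set(cluster *Cluster) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.cluster = cluster
}

// defaultCluster returns assumed default probe settings (values are illustrative).
func defaultCluster() *Cluster {
	return &Cluster{Name: "defaults", FailureThreshold: 3}
}

// resolveCluster tries the API server first, then the cache, then defaults.
// The second return value names the source that was used.
func resolveCluster(ctx context.Context, cache *clusterCache, fetch fetchFunc, timeout time.Duration) (*Cluster, string) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	cluster, err := fetch(ctx)
	if err == nil {
		cache.set(cluster)
		return cluster, "live"
	}
	if cached := cache.get(); cached != nil {
		return cached, "cached"
	}
	return defaultCluster(), "default"
}

// probeOnce runs one resolution with the 500ms budget.
func probeOnce(cache *clusterCache, fetch fetchFunc) (*Cluster, string) {
	return resolveCluster(context.Background(), cache, fetch, fetchTimeout)
}

// okFetch simulates a reachable API server.
func okFetch(ctx context.Context) (*Cluster, error) {
	return &Cluster{Name: "pg", FailureThreshold: 5}, nil
}

// failingFetch simulates an unreachable API server.
func failingFetch(ctx context.Context) (*Cluster, error) {
	return nil, errors.New("api server unreachable")
}

func main() {
	cache := &clusterCache{}
	for _, fetch := range []fetchFunc{failingFetch, okFetch, failingFetch} {
		cluster, source := probeOnce(cache, fetch)
		fmt.Printf("%s: %s\n", source, cluster.Name)
	}
}
```

In this sketch the three calls resolve from the defaults, the live fetch, and the cache, in that order. Returning the shared pointer instead of a copy is one way to avoid extra `DeepCopy` calls, at the cost of requiring callers not to mutate the result.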

@armru armru requested a review from a team as a code owner November 12, 2025 13:22
@armru armru added the no-issue label Nov 12, 2025
@dosubot dosubot bot added the size:M (This PR changes 30-99 lines, ignoring generated files.) label Nov 12, 2025
@armru armru changed the title from "fix(probe,healthy): use the cluster cache when the api-server is unavailable" to "fix(probe,healthy): use the cluster cache when the apiserver is unavailable" Nov 12, 2025
@cnpg-bot cnpg-bot added the backport-requested ◀️ (This pull request should be backported to all supported releases), release-1.25, release-1.26, and release-1.27 labels Nov 12, 2025
@github-actions
Contributor

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this pr, remove the label: backport-requested ◀️ or add the label 'do not backport'
  • To stop backporting this pr to a certain release branch, remove the specific branch label: release-x.y

@dosubot dosubot bot added the bug 🐛 (Something isn't working) label Nov 12, 2025
@armru armru force-pushed the dev/readiness-probe branch from c787ca7 to efff778 Compare November 12, 2025 13:23
@armru armru changed the title from "fix(probe,healthy): use the cluster cache when the apiserver is unavailable" to "fix(probe,healthy): use the local cache when the apiserver is unavailable" Nov 12, 2025
@armru
Member Author

armru commented Nov 12, 2025

/test limit=local

@github-actions
Contributor

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19298987488

@cnpg-bot cnpg-bot added the ok to merge 👌 (This PR can be merged) label Nov 12, 2025
@armru armru force-pushed the dev/readiness-probe branch from efff778 to b0ee34b Compare November 13, 2025 10:30
@dosubot dosubot bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) label and removed the size:M (This PR changes 30-99 lines, ignoring generated files.) label Nov 13, 2025
@armru armru changed the title from "fix(probe,healthy): use the local cache when the apiserver is unavailable" to "fix(probe,readiness): improve probe resilience to transient API server connectivity issues" Nov 13, 2025
@armru armru changed the title from "fix(probe,readiness): improve probe resilience to transient API server connectivity issues" to "fix(probe,readiness): improve resilience to transient API server connectivity issues" Nov 13, 2025
@armru
Member Author

armru commented Nov 13, 2025

/test limit=local

@github-actions
Contributor

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19328577759

@armru armru force-pushed the dev/readiness-probe branch 2 times, most recently from 892bce2 to cd78c40 Compare November 13, 2025 10:36
@armru
Member Author

armru commented Nov 13, 2025

/test limit=local

@github-actions
Contributor

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19337811527

@armru
Member Author

armru commented Nov 17, 2025

/test limit=local

@github-actions
Contributor

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19424631451

@mnencia mnencia force-pushed the dev/readiness-probe branch from 1e8ea81 to daa0555 Compare November 26, 2025 16:57
armru and others added 5 commits November 27, 2025 14:49
…ilable

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
@mnencia mnencia force-pushed the dev/readiness-probe branch 2 times, most recently from 893fbfa to b8c0b68 Compare November 27, 2025 15:05
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
@mnencia mnencia force-pushed the dev/readiness-probe branch from c08e607 to e511e2c Compare November 27, 2025 15:44
@mnencia
Member

mnencia commented Nov 27, 2025

/test

@github-actions
Contributor

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19742489229

Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
@mnencia mnencia force-pushed the dev/readiness-probe branch from f5ae1d3 to 5322108 Compare November 28, 2025 13:08
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
@mnencia
Member

mnencia commented Nov 28, 2025

/test

@github-actions
Contributor

@mnencia, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19765080284

@mnencia mnencia merged commit 1f11235 into main Nov 28, 2025
35 checks passed
@mnencia mnencia deleted the dev/readiness-probe branch November 28, 2025 15:51
cnpg-bot pushed a commit that referenced this pull request Nov 28, 2025
…ectivity issues (#9148)

This change enhances the resilience of all probe types (liveness,
readiness, and startup) when facing transient Kubernetes API server
connectivity issues. Previously, readiness and startup probes would fail
immediately if unable to reach the API server, potentially causing
unnecessary pod restarts or preventing pods from becoming ready.

The improvement introduces a unified cluster caching mechanism that:

- Creates a **single shared cache** instance used across all three probe
  types (liveness, readiness, startup) to reduce memory usage and ensure
  consistency
- Implements **thread-safe** cache operations with proper mutex locking
  to support concurrent probe execution
- Attempts to fetch the cluster definition with a **500ms timeout** to
  avoid blocking the probe for too long
- **Falls back to a cached cluster definition** if the API server is
  temporarily unreachable
- **Falls back to default probe configuration** if no cached cluster is
  found
- Maintains probe functionality during brief network interruptions or
  API server unavailability
- Uses optimized memory allocation patterns to avoid unnecessary
  `DeepCopy` operations

This ensures consistent behavior across all probe types and reduces
false positives during transient network issues, while also improving
performance through shared resources and optimized memory usage.

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
Signed-off-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
Co-authored-by: Marco Nenciarini <marco.nenciarini@enterprisedb.com>
(cherry picked from commit 1f11235)
@kamikaze

Suddenly I'm flooded with this log message:

readiness probe using cached cluster definition due to API server connectivity issue

but my tests barely reach 200ms max

Labels

  • backport-requested ◀️: This pull request should be backported to all supported releases
  • bug 🐛: Something isn't working
  • lgtm: This PR has been approved by a maintainer
  • no-issue
  • ok to merge 👌: This PR can be merged
  • release-1.27
  • size:L: This PR changes 100-499 lines, ignoring generated files.

5 participants