feat: set default tcp_user_timeout to 5 seconds for replicas#9317

Merged
gbartolini merged 2 commits into main from dev/9229 on Nov 26, 2025

armru (Member) commented Nov 26, 2025

The default tcp_user_timeout for standby replication connections has been changed from the system default to 5000ms (5 seconds) for all replicas.

This new default enhances the robustness of CloudNativePG clusters by enabling standby instances to detect and recover from network issues more quickly. Previously, silent network drops could cause standbys to wait up to ~127 seconds (due to TCP SYN retries) before detecting a failure. With the new 5-second timeout, standbys will close unresponsive connections sooner and promptly retry connecting to the primary.
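
The ~127-second figure can be reproduced with a little arithmetic. The sketch below assumes the usual Linux defaults, which this PR does not state: an initial retransmission timeout of 1 second and `tcp_syn_retries = 6`, with the wait doubling after each unanswered SYN. The function name is hypothetical.

```python
# Assumed Linux defaults (not stated in this PR): 1 s initial RTO, tcp_syn_retries = 6.
INITIAL_RTO_SECONDS = 1
SYN_RETRIES = 6


def syn_connect_timeout(initial_rto: int = INITIAL_RTO_SECONDS,
                        retries: int = SYN_RETRIES) -> int:
    """Total seconds before an unanswered connection attempt gives up."""
    # The initial SYN and each retry wait initial_rto * 2**attempt seconds for a reply.
    return sum(initial_rto * 2**attempt for attempt in range(retries + 1))


print(syn_connect_timeout())  # 1 + 2 + 4 + 8 + 16 + 32 + 64 = 127
```

Under those assumptions, a 5-second tcp_user_timeout cuts the worst-case detection time by roughly a factor of 25.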

If this default does not meet your requirements, you can override it for all standbys managed by the operator using the STANDBY_TCP_USER_TIMEOUT configuration option.

PRESERVATION GUIDE FOR EXISTING INSTALLATIONS:
If you have an existing CloudNativePG installation where STANDBY_TCP_USER_TIMEOUT was not explicitly set (thus defaulting to 0), and you wish to preserve that behaviour after upgrading, you must now explicitly set it to 0.

Example using a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cnpg-controller-manager-config
  namespace: cnpg-system
data:
  STANDBY_TCP_USER_TIMEOUT: "0"

If the variable is not explicitly configured, the new default of 5 seconds will automatically apply after the next operator upgrade or pod restart.
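
The fallback rule described above can be illustrated with a minimal sketch. This is not the operator's actual code: the function name and the dictionary-based lookup are assumptions made for illustration, and only the key name and the 5000 ms default come from this PR.

```python
DEFAULT_STANDBY_TCP_USER_TIMEOUT_MS = 5000  # new default described in this PR


def effective_standby_timeout_ms(config_data: dict[str, str]) -> int:
    """Return the timeout implied by a ConfigMap `data` mapping (illustrative only)."""
    raw = config_data.get("STANDBY_TCP_USER_TIMEOUT")
    if raw is None:
        # Key absent: the new 5-second default applies.
        return DEFAULT_STANDBY_TCP_USER_TIMEOUT_MS
    # Key present: the explicit value wins, including "0" (system default).
    return int(raw)


print(effective_standby_timeout_ms({}))
print(effective_standby_timeout_ms({"STANDBY_TCP_USER_TIMEOUT": "0"}))
```

The first call models an installation that never set the option and now gets 5000; the second models one that pins the previous behaviour by setting 0 explicitly.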

For more information on tcp_user_timeout, see the PostgreSQL documentation:
https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-TCP-USER-TIMEOUT

Closes #9229

armru requested review from a team and jsilvela as code owners November 26, 2025 10:32
dosubot bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) label Nov 26, 2025
cnpg-bot added the backport-requested ◀️ (This pull request should be backported to all supported releases), release-1.25, release-1.26 and release-1.27 labels Nov 26, 2025
github-actions (Contributor) commented

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this pr, remove the label: backport-requested ◀️ or add the label 'do not backport'
  • To stop backporting this pr to a certain release branch, remove the specific branch label: release-x.y

dosubot bot added the documentation 📖 (Improvements or additions to documentation) and enhancement 🪄 (New feature or request) labels Nov 26, 2025
armru added the do not backport (This PR must not be backported - it will be in the next minor release) label and removed the backport-requested ◀️ (This pull request should be backported to all supported releases), release-1.25, release-1.26 and release-1.27 labels Nov 26, 2025

The default value for TCP user timeout on standby
replication connections has changed from 0 (system default) to 5000ms
(5 seconds).

This change improves the default behavior of CloudNativePG installations
by ensuring standby instances can detect and recover from network issues
more quickly. Previously, when the network silently dropped packets,
standby instances could take up to 127 seconds (due to TCP SYN retries)
to detect a connection failure. With the new 5-second default, standby
instances will close unresponsive connections much faster and retry
connecting to the primary.

MIGRATION GUIDE FOR EXISTING INSTALLATIONS:
If you have an existing CloudNativePG installation where STANDBY_TCP_USER_TIMEOUT was not
explicitly set (defaulting to 0), and you want to preserve that behavior after upgrading,
you must now explicitly set STANDBY_TCP_USER_TIMEOUT to 0 in the cnpg-controller-manager-config
ConfigMap or Secret.

Example with ConfigMap:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: cnpg-controller-manager-config
    namespace: cnpg-system
  data:
    STANDBY_TCP_USER_TIMEOUT: "0"

Note: If you do NOT have STANDBY_TCP_USER_TIMEOUT explicitly configured, the new default
of 5 seconds will be automatically applied on your next operator upgrade or pod restart.

For more details on TCP_USER_TIMEOUT and its behavior, see:
https://www.postgresql.org/docs/current/runtime-config-connection.html#GUC-TCP-USER-TIMEOUT

Closes #9229

Signed-off-by: Armando Ruocco <armando.ruocco@enterprisedb.com>
armru (Member, Author) commented Nov 26, 2025

/test

github-actions (Contributor) commented

@armru, here's the link to the E2E on CNPG workflow run: https://github.com/cloudnative-pg/cloudnative-pg/actions/runs/19700876458

Signed-off-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
dosubot bot added the lgtm (This PR has been approved by a maintainer) label Nov 26, 2025
cnpg-bot added the ok to merge 👌 (This PR can be merged) label Nov 26, 2025
gbartolini changed the title from "feat: change default STANDBY_TCP_USER_TIMEOUT to 5 seconds" to "feat: set default tcp_user_timeout to 5 seconds for replicas" Nov 26, 2025
gbartolini merged commit e29c97d into main Nov 26, 2025
36 of 39 checks passed
gbartolini deleted the dev/9229 branch November 26, 2025 16:29
Labels

  • do not backport: This PR must not be backported - it will be in the next minor release
  • documentation 📖: Improvements or additions to documentation
  • enhancement 🪄: New feature or request
  • lgtm: This PR has been approved by a maintainer
  • ok to merge 👌: This PR can be merged
  • size:L: This PR changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[Feature]: Set STANDBY_TCP_USER_TIMEOUT default

4 participants