Fix flaky test_short_disconnection backup/restore tests #96838

Merged
alexey-milovidov merged 1 commit into master from fix-flaky-short-disconnection-test on Feb 14, 2026

@alexey-milovidov
Member

Summary

  • Fix test_short_disconnection_doesnt_stop_backup and test_short_disconnection_doesnt_stop_restore by limiting ZK connection drop duration when faster_zk_disconnect_detect.xml is used
  • When this config is active (session_timeout_ms=5000), an iptables drop of up to 3-4 seconds can expire the ZK session, because the total heartbeat silence (time since last heartbeat + drop duration + reconnection overhead) can exceed 5 seconds
  • The fix limits drop duration to 1 second with faster detection, keeping original timing otherwise

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=96758&sha=d29b41fbe684f8c90ace4fd71828ce0d4ac8b88f&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%204%2F4%29

Closes #80359

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

🤖 Generated with Claude Code

…rt_disconnection_doesnt_stop_restore`

When `faster_zk_disconnect_detect.xml` is randomly chosen (which sets
`session_timeout_ms=5000`), the ZK connection drop via iptables must be
short enough to avoid session expiry. Previously, the drop duration was
up to 3-4 seconds via `random_sleep`. Combined with the time since the
last heartbeat (~1.7s for a 5s session timeout) and reconnection
overhead, the total silence could exceed 5 seconds, causing the ZK
session to expire and the backup/restore to fail regardless of the
30-second `failure_after_host_disconnected_for_seconds` threshold.
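
The timing budget described above can be sketched as simple arithmetic. This is only an illustration: the heartbeat gap is taken as one third of the session timeout (which matches the ~1.7s figure quoted here), and the 0.5s reconnection overhead is a hypothetical value chosen for the example, not a measured one.

```python
# Sketch of the worst-case heartbeat silence during a connection drop.
SESSION_TIMEOUT_S = 5.0                   # session_timeout_ms=5000 from the config
HEARTBEAT_GAP_S = SESSION_TIMEOUT_S / 3   # ~1.7 s since the last heartbeat (assumed ratio)


def worst_case_silence(drop_s, reconnect_overhead_s):
    """Total time the ZK server may hear nothing from the client."""
    return HEARTBEAT_GAP_S + drop_s + reconnect_overhead_s


def session_may_expire(drop_s, reconnect_overhead_s=0.5):
    """True if the silence can exceed the session timeout.

    reconnect_overhead_s=0.5 is a hypothetical placeholder.
    """
    return worst_case_silence(drop_s, reconnect_overhead_s) > SESSION_TIMEOUT_S


if __name__ == "__main__":
    for drop in (1.0, 3.0, 4.0):
        print(drop, round(worst_case_silence(drop, 0.5), 2), session_may_expire(drop))
```

Under these assumptions a 3-4 second drop leaves no margin against the 5-second session timeout, while a 1-second drop stays well below it.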

The fix limits the drop duration to 1 second when using the faster ZK
disconnect detection config, while keeping the original duration when
using default ZK settings.
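
A minimal sketch of how such a conditional limit might look in a test helper. The function name `pick_drop_duration`, the flag name, and the 3-second default upper bound are assumptions made for illustration; the actual test uses `random_sleep` and its exact bounds are not shown here.

```python
import random

# Upper bounds on how long the ZK connection stays dropped, in seconds.
FAST_DETECT_MAX_DROP_S = 1.0   # limit from this fix when faster detection is on
DEFAULT_MAX_DROP_S = 3.0       # assumed stand-in for the original "up to 3-4 seconds"


def pick_drop_duration(faster_zk_disconnect_detect, rng=random):
    """Return a random drop duration bounded by the active ZK config.

    Hypothetical helper: the name and signature are not from the test suite.
    """
    if faster_zk_disconnect_detect:
        max_drop = FAST_DETECT_MAX_DROP_S
    else:
        max_drop = DEFAULT_MAX_DROP_S
    return rng.uniform(0.0, max_drop)


if __name__ == "__main__":
    rng = random.Random(0)
    print(pick_drop_duration(True, rng))
    print(pick_drop_duration(False, rng))
```

Passing the random generator in explicitly keeps the helper deterministic under a seeded `random.Random`, which is convenient when reproducing a flaky run.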

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=96758&sha=d29b41fbe684f8c90ace4fd71828ce0d4ac8b88f&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%204%2F4%29

Closes: #80359

Co-Authored-By: Claude Sonnet 4.5 (1M context) <noreply@anthropic.com>
@clickhouse-gh
Contributor

clickhouse-gh bot commented Feb 13, 2026

Workflow [PR], commit [c79630b]

@clickhouse-gh clickhouse-gh bot added the pr-ci label Feb 13, 2026
@alexey-milovidov alexey-milovidov self-assigned this Feb 14, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Feb 14, 2026
Merged via the queue into master with commit 23ce725 Feb 14, 2026
135 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-flaky-short-disconnection-test branch February 14, 2026 01:23
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 14, 2026

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Development

Successfully merging this pull request may close these issues.

Several test_backup_restore_on_cluster tests are broken

2 participants