Skip to content

Fix flaky test_ttl_move #94853

Merged
alexey-milovidov merged 2 commits intomasterfrom
arsenmuk/i91845_flaky_check
Feb 10, 2026
Merged

Fix flaky test_ttl_move #94853
alexey-milovidov merged 2 commits intomasterfrom
arsenmuk/i91845_flaky_check

Conversation

@arsenmuk
Copy link
Copy Markdown
Member

@arsenmuk arsenmuk commented Jan 22, 2026

Changelog category (leave one):

  • Not for changelog

Closes #85011
Closes #93340

Motivation:
It was flaky due to several reasons:

  1. timing issues on trivial logic check, e.g.: we do something to trigger a move, then we sleep for move to kick in, then we wait for it to finish and attempting to assert the result. The problem here was that the move could've finished before we entered waiting, or it could've started after we entered waiting - both cases would lead to false negative.
  2. timing issues on complex multi state checks, e.g.: we set a tired policy, and the data is supposed to move to diskX after START+X1 seconds, and then to DISKY after START+X1+Y1 seconds. The reality (especially under pressure as a part of 'targeted', or 'parallel' jobs), the test or the server could've remained paused in between [START, START+X1], and then the test diverges with the server, what leads to a false negative.
  3. TestCancelBackgroundMoving was prone to timing issues - move happens too fast (even with the throttling) before we get to the checking, and the test flaps.

Modifications:

  • [fixes (1)] Modified checking to not only wait for the move to finish, but also to assert whether the expected target state has been achieved. This covers for the variety of timing issues after the move has been triggered and before we entered the check.
  • [fixes (2)] Eliminated multi-state situations in the tests - now an action is performed, leading to either an expected successful state, or either unexpected failed one. When a sequence of actions/states is to be checked, then the execution is controled manually via STOP/START MOVES to eliminate possibility of skipping stages.
  • [addresses (3)] removed the case TestCancelBackgroundMoving entirely, as couldn't find a deterministic way to perform such check under pressure and timing issues. Left a TODO that we might want to re-implement it using FailPoints
  • prettified test parametrization to compact and dedup things
  • merged logical scenarios together (there were several different cases checking for similar scenarios)

@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jan 22, 2026

Workflow [PR], commit [10a06df]

Summary:

job_name test_name status info comment
Integration tests (arm_binary, distributed plan, 3/4) failure
test_totp_auth/test_totp.py::test_interactive_totp_authentication FAIL cidb
Integration tests (amd_asan, targeted) error

@clickhouse-gh clickhouse-gh bot added the pr-improvement Pull request with some product improvements label Jan 22, 2026
@arsenmuk arsenmuk force-pushed the arsenmuk/i91845_flaky_check branch 2 times, most recently from bce4f37 to 7a241ad Compare February 1, 2026 19:30
@arsenmuk arsenmuk changed the title [WIP] Checking if test_ttl_move is flaky [WIP] Fix flaky test_ttl_move Feb 1, 2026
@arsenmuk arsenmuk force-pushed the arsenmuk/i91845_flaky_check branch from c1f9d3c to 2057156 Compare February 4, 2026 13:09
@clickhouse-gh clickhouse-gh bot added pr-not-for-changelog This PR should not be mentioned in the changelog and removed pr-improvement Pull request with some product improvements labels Feb 4, 2026
@arsenmuk arsenmuk changed the title [WIP] Fix flaky test_ttl_move Fix flaky test_ttl_move Feb 4, 2026
@arsenmuk arsenmuk marked this pull request as ready for review February 4, 2026 13:11
@kssenii kssenii requested a review from Copilot February 4, 2026 13:19
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses flakiness in the test_ttl_move integration test caused by timing issues and race conditions in the test logic. The changes focus on making the test more deterministic by fixing wait conditions, eliminating multi-state checks that could skip intermediate states under load, and removing an inherently flaky test case.

Changes:

  • Fixed timing issues in move completion checks by asserting expected target state
  • Eliminated multi-state test scenarios that could skip stages under pressure
  • Removed TestCancelBackgroundMoving test case due to inherent timing issues
  • Refactored test parametrization and merged similar test scenarios

@kssenii kssenii self-assigned this Feb 4, 2026
@arsenmuk arsenmuk force-pushed the arsenmuk/i91845_flaky_check branch from 2057156 to c2f17c9 Compare February 5, 2026 15:14
.strip()
.split("\n")
)
def check_if_should_skip_this_then_skip(node):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def check_if_should_skip_this_then_skip(node):
def skip_test_if_needed(node):

Co-authored-by: Kseniia Sumarokova <kssenii@clickhouse.com>
@alexey-milovidov alexey-milovidov merged commit 34f7a68 into master Feb 10, 2026
129 of 134 checks passed
@alexey-milovidov alexey-milovidov deleted the arsenmuk/i91845_flaky_check branch February 10, 2026 16:16
alexey-milovidov added a commit that referenced this pull request Feb 10, 2026
Cherry-pick of c2f17c9 ("Fix flaky test_ttl_move") by Arsen Muk.

Replaces hardcoded sleeps with proper retry-based waiting, eliminates
multi-state timing issues using `STOP`/`START MOVES`, and restructures
storage policies for clarity.

`TestCancelBackgroundMoving` is removed in this commit (will be
re-added with failpoint approach in next commit).

Co-authored-by: Arsen Muk <arsenmuk@clickhouse.com>
alexey-milovidov added a commit that referenced this pull request Feb 10, 2026
Replace the TODO comment left by PR #94853 with a failpoint-based
implementation of the `test_cancel_background_moving_on_stop_moves_query`
test.

Instead of using disk throughput throttling (which was unreliable and
caused the original flakiness), use the
`stop_moving_part_before_swap_with_active` failpoint to deterministically
pause the move after cloning but before swapping. This guarantees the
move is still in progress when `SYSTEM STOP MOVES` cancels it.

Key improvements over the original:
- No server restart needed (no disk_throttling.xml config changes)
- No race condition: the failpoint guarantees the move is in-flight
- Table-scoped `SYSTEM STOP/START MOVES` instead of global
- Uses `get_disk_names` helper from refactored test infrastructure
- Removed `test_cancel_background_moving_on_table_detach` and
  `test_cancel_background_moving_on_zookeeper_disconnect` which had
  the same throttling-based flakiness issues

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

5 participants