
Fix broken check_callback in test_restore_db_replica#96968

Merged
alexey-milovidov merged 3 commits into master from fix-test-restore-db-replica-callback on Feb 15, 2026
Conversation

@alexey-milovidov (Member)

Summary

  • Fix the `check_callback` in `test_query_after_restore_db_replica`: it compared the raw query result (with a trailing `\n`) against the expected table name (without one), so the callback always returned False and the retries were useless
  • Increase retry_count from 10 to 30 to give more time for the Replicated database to initialize after node restart, especially in ASan builds

Closes #85664

Test plan

  • Existing integration tests pass: test_restore_db_replica/test.py::test_query_after_restore_db_replica
  • Verify with ASan build that the flaky test no longer fails

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

🤖 Generated with Claude Code

alexey-milovidov and others added 3 commits February 15, 2026 03:10
…em.session_log`

The `DELETE FROM system.session_log` cleanup at the start of test
`02835_drop_user_during_session` creates a synchronous lightweight delete
mutation. In ASan stress tests, `system.session_log` receives continuous
writes from all concurrent test sessions, causing the mutation to take
over 12 minutes. Even though the query gets cancelled (`is_cancelled: 1`),
it remains stuck in `StorageMergeTree::waitForMutation`.

Fix by making the cleanup DELETE non-blocking with
`lightweight_deletes_sync = 0`. This is safe because the test user name
includes the PID (`$$`), making collisions from PID reuse extremely
unlikely during a single stress test run.
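The change can be sketched as the following SQL (illustrative only; the WHERE clause and exact statement in `02835_drop_user_during_session` may differ):

```sql
-- Before: blocks in StorageMergeTree::waitForMutation until the
-- lightweight delete mutation finishes, which can take >12 minutes
-- under ASan stress load with continuous writes to session_log.
DELETE FROM system.session_log WHERE user LIKE 'test_user_%';

-- After: fire-and-forget; the mutation completes in the background.
DELETE FROM system.session_log WHERE user LIKE 'test_user_%'
SETTINGS lightweight_deletes_sync = 0;
```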

https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=96926&sha=724c57e90613f66e2ef897f80a644ba21e98b470&name_0=PR&name_1=Stress%20test%20%28arm_asan%2C%20s3%29

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a TOCTOU race in the worker thread loop: after `tryPop`
timed out (confirming the queue was empty), a context switch could
allow the main thread to push the final block and call `finish()`
before the worker checked `isFinished()`. The worker would then see
the queue as finished and exit, leaving the last block unprocessed.

This caused occasional loss of ~2000-3000 rows (one shard's portion
of the last source block) in sharded dictionaries loaded in parallel.

The fix replaces `isFinished()` with `isFinishedAndEmpty()`, which
atomically verifies both that the queue is marked finished AND that
no items remain. If items were pushed between the `tryPop` timeout
and this check, the worker correctly retries and processes them.
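The race and the atomic check can be sketched in Python (a hypothetical `FinishableQueue`, not ClickHouse's actual C++ queue; shown only to illustrate why "finished" and "empty" must be checked under one lock):

```python
import threading
from collections import deque

class FinishableQueue:
    """Hypothetical stand-in for the concurrent queue, illustrating the
    TOCTOU race between tryPop timing out and finish() being called."""

    def __init__(self):
        self._lock = threading.Lock()
        self._not_empty = threading.Condition(self._lock)
        self._items = deque()
        self._finished = False

    def push(self, item):
        with self._not_empty:
            self._items.append(item)
            self._not_empty.notify()

    def finish(self):
        with self._not_empty:
            self._finished = True
            self._not_empty.notify_all()

    def try_pop(self, timeout):
        # Returns an item, or None on timeout (the "queue looks empty" case).
        with self._not_empty:
            if not self._items:
                self._not_empty.wait(timeout)
            return self._items.popleft() if self._items else None

    def is_finished(self):
        # Racy exit condition: push() + finish() can land between the
        # try_pop timeout and this call, orphaning the last item.
        with self._lock:
            return self._finished

    def is_finished_and_empty(self):
        # Atomic check: under one lock, no item can slip past the worker.
        with self._lock:
            return self._finished and not self._items

def worker(queue, processed):
    while True:
        item = queue.try_pop(timeout=0.01)
        if item is not None:
            processed.append(item)
        elif queue.is_finished_and_empty():
            return  # finished AND drained: safe to exit
        # otherwise: a late push may have raced with the timeout; retry
```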

Closes #84468

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `check_callback` compared the raw query result (with trailing
newline) against the expected table name (without newline), so the
callback always returned False. This made `query_with_retry` exhaust
all retries uselessly instead of returning early when the table
appeared.
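The comparison bug can be reproduced in a few lines (function names are made up for illustration; the real code is a callback passed to `query_with_retry`):

```python
def broken_callback(result, expected="t1"):
    # node.query(...) returns the result with a trailing newline, e.g.
    # "t1\n", and "t1\n" == "t1" is always False -- so query_with_retry
    # exhausts every retry even when the table is already there.
    return result == expected

def fixed_callback(result, expected="t1"):
    # Strip the trailing newline before comparing.
    return result.strip() == expected
```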

Also increase `retry_count` from 10 to 30 to give more time for the
Replicated database to initialize after node restart, especially in
ASan builds.

#85664

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@clickhouse-gh (Contributor)

clickhouse-gh bot commented Feb 15, 2026

Workflow [PR], commit [6272f5c]

Summary:

@clickhouse-gh clickhouse-gh bot added the pr-ci label Feb 15, 2026
@alexey-milovidov alexey-milovidov self-assigned this Feb 15, 2026
@alexey-milovidov alexey-milovidov merged commit 804ae8a into master Feb 15, 2026
133 of 134 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-test-restore-db-replica-callback branch February 15, 2026 14:34
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Feb 15, 2026


Development

Successfully merging this pull request may close these issues.

test_restore_db_replica/test.py::test_query_after_restore_db_replica is flaky
