Skip to content

Fix TSan "unlock of an unlocked mutex" in CHECK TABLE for ReplicatedMergeTree#94541

Merged
alexey-milovidov merged 1 commit intomasterfrom
fix-tsan-unlock-wrong-thread-check-table
Jan 19, 2026
Merged

Fix TSan "unlock of an unlocked mutex" in CHECK TABLE for ReplicatedMergeTree#94541
alexey-milovidov merged 1 commit intomasterfrom
fix-tsan-unlock-wrong-thread-check-table

Conversation

@alexey-milovidov
Copy link
Copy Markdown
Member

The DataValidationTasks struct stored a std::unique_lock<std::mutex> holding a lock on BackgroundSchedulePoolTaskInfo::exec_mutex. This caused a ThreadSanitizer warning because:

  1. Thread A acquires the lock via pausePartsCheck() -> getExecLock()
  2. DataValidationTasks is stored in a shared_ptr and copied to worker threads during parallel CHECK TABLE execution (via TableCheckTask)
  3. When the last reference is destroyed (potentially on Thread B), the unique_lock destructor tries to unlock the mutex
  4. pthread mutexes must be unlocked by the same thread that locked them, causing the TSan error "unlock of an unlocked mutex (or by a wrong thread)"

The fix replaces pausePartsCheck() (which returned a unique_lock) with temporaryPause() which returns a new TemporaryPause RAII guard. This guard uses deactivate() to pause the background task and activateAndSchedule() in its destructor to resume it. Both operations are thread-safe and can be called from any thread, eliminating the thread-affinity issue.

Changes:

  • Add getTaskInfoPtr() to BackgroundSchedulePoolTaskHolder
  • Add TemporaryPause class to ReplicatedMergeTreePartCheckThread
  • Remove pausePartsCheck() and replace all usages with temporaryPause()
  • Update DataValidationTasks to use TemporaryPause instead of unique_lock

Fixes #87916

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

…ergeTree

The `DataValidationTasks` struct stored a `std::unique_lock<std::mutex>`
holding a lock on `BackgroundSchedulePoolTaskInfo::exec_mutex`. This caused
a ThreadSanitizer warning because:

1. Thread A acquires the lock via `pausePartsCheck()` -> `getExecLock()`
2. `DataValidationTasks` is stored in a `shared_ptr` and copied to worker
   threads during parallel CHECK TABLE execution (via `TableCheckTask`)
3. When the last reference is destroyed (potentially on Thread B), the
   `unique_lock` destructor tries to unlock the mutex
4. pthread mutexes must be unlocked by the same thread that locked them,
   causing the TSan error "unlock of an unlocked mutex (or by a wrong thread)"

The fix replaces `pausePartsCheck()` (which returned a `unique_lock`) with
`temporaryPause()` which returns a new `TemporaryPause` RAII guard. This
guard uses `deactivate()` to pause the background task and `activateAndSchedule()`
in its destructor to resume it. Both operations are thread-safe and can be
called from any thread, eliminating the thread-affinity issue.

Changes:
- Add `getTaskInfoPtr()` to `BackgroundSchedulePoolTaskHolder`
- Add `TemporaryPause` class to `ReplicatedMergeTreePartCheckThread`
- Remove `pausePartsCheck()` and replace all usages with `temporaryPause()`
- Update `DataValidationTasks` to use `TemporaryPause` instead of `unique_lock`

Fixes #87916

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Jan 18, 2026

Workflow [PR], commit [b3f0dd0]

Summary:

job_name test_name status info comment
AST fuzzer (amd_msan) failure
Received signal 6 (STID: 2042-0e62) FAIL cidb

@clickhouse-gh clickhouse-gh bot added the pr-ci label Jan 18, 2026
@alexey-milovidov alexey-milovidov self-assigned this Jan 19, 2026
@alexey-milovidov alexey-milovidov merged commit 40089a6 into master Jan 19, 2026
129 of 132 checks passed
@alexey-milovidov alexey-milovidov deleted the fix-tsan-unlock-wrong-thread-check-table branch January 19, 2026 12:53
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jan 19, 2026
azat added a commit to azat/ClickHouse that referenced this pull request Jan 19, 2026
azat added a commit to azat/ClickHouse that referenced this pull request Jan 19, 2026
/// deactivate() waits for any running task execution to finish
/// and prevents new executions from starting.
/// This is safe to call from any thread.
task->deactivate();
Copy link
Copy Markdown
Member

@antonio2368 antonio2368 Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is same and works as intendend.
ReplicatedMergeTreePartCheckThread::run calls task->schedule which will instantly add the task again before releasing the exec_mutex.

https://github.com/ClickHouse/ClickHouse/blob/master/src/Core/BackgroundSchedulePool.cpp#L213

This is causing issues in CI.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it does check for deactivated before proceeding with execution of function 🤔

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I think I figured it out.
If we have 2 threads executing DROP_RANGE, it will activate the task when it finishes, even though the second DROP_RANGE still didn't finish.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ThreadSanitizer: unlock of an unlocked mutex in test_check_table/test.py

3 participants