Server crash (std::terminate) during ReplicatedMergeTree startup due to race in delete_tmp_ directory removal #94891
Description
Company or project name
No response
Describe what's wrong
DataPartStorageOnDiskBase::clearDirectory crashes with std::terminate when another thread concurrently removes the same delete_tmp_* directory.
During ReplicatedMergeTree startup, two operations run concurrently:
- loadOutdatedDataParts (async): renames the duplicate part to delete_tmp_*, then calls clearDirectory
- clearOldTemporaryDirectories (AttachThread): finds the same delete_tmp_* directory and deletes it
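clearDirectory first tries to unlink the part's files one by one. When that fails, its fallback calls fs::remove_all (via DiskLocal::removeRecursive), which throws filesystem_error (no_such_file_or_directory) when files disappear mid-traversal. The exception propagates to loadOutdatedDataParts, which logs "Loading of outdated parts failed" and calls std::terminate.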
clearOldTemporaryDirectories already handles this with catch (const fs::filesystem_error & e) (MergeTreeData.cpp:3130), but clearDirectory lacks the same defense.
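A minimal sketch of that defense in plain std::filesystem, for illustration only. The helper name removeAllTolerant is hypothetical and this is not ClickHouse code. It assumes that a "no such file or directory" error during recursive removal means another thread is removing the same directory, so it can be treated as success, while any other filesystem error is still rethrown.

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper (not part of ClickHouse): remove a directory tree,
// tolerating the case where a concurrent remover deletes entries first.
// Returns true when removal finished or the only failure was ENOENT.
bool removeAllTolerant(const fs::path & dir)
{
    try
    {
        // Throwing overload, the same call that appears in the error message above.
        fs::remove_all(dir);
        return true;
    }
    catch (const fs::filesystem_error & e)
    {
        // Entries vanished mid-traversal: assume someone else is cleaning up.
        if (e.code() == std::errc::no_such_file_or_directory)
            return true;

        // Permission errors, I/O errors, etc. are real failures.
        throw;
    }
}
```

With a guard of this kind, losing the race to another cleaner would no longer escape as an uncaught exception. Whether clearDirectory should swallow the error, retry, or only log it is a design decision for the actual fix.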
Does it reproduce on the most recent release?
Yes
How to reproduce
Not deterministically reproducible. It requires a ReplicatedMergeTree table with duplicate outdated parts on a local disk at startup.
Expected behavior
No response
Error message and/or stacktrace
From system.text_log (columns: event_time, message, level, thread_name, thread_id, source_file, source_line).
Paths are anonymized, but timestamps, thread names, and stack traces are from actual logs.
2026-01-23 03:59:23,Remove duplicate part /var/lib/clickhouse/disks/disk5/store/ab0/ab067610-de3a-41bf-a1ac-1afe6f249236/20260123_0_80494_8/,Error,ThreadPool,856,"src/Storages/MergeTree/MergeTreeData.cpp; ...",1816
2026-01-23 03:59:23,Removing temporary directory /var/lib/clickhouse/disks/disk5/store/ab0/ab067610-de3a-41bf-a1ac-1afe6f249236/delete_tmp_20260123_0_80494_8/,Warning,BgSchPool,609,"src/Storages/MergeTree/MergeTreeData.cpp; size_t DB::MergeTreeData::clearOldTemporaryDirectories(const String &, size_t, const NameSet &)",2841
2026-01-23 03:59:23,"Cannot quickly remove directory /var/lib/clickhouse/disks/disk5/store/ab0/ab067610-de3a-41bf-a1ac-1afe6f249236/delete_tmp_20260123_0_80494_8 by removing files; fallback to recursive removal. Reason: Code: 458. DB::ErrnoException: Cannot unlink file .../delete_tmp_20260123_0_80494_8/data.bin: , errno: 2, strerror: No such file or directory. (CANNOT_UNLINK)",Error,ThreadPool,856,"src/Storages/MergeTree/DataPartStorageOnDiskBase.cpp; void DB::DataPartStorageOnDiskBase::clearDirectory(...)",907
2026-01-23 03:59:28,"Loading of outdated parts failed. Will terminate to avoid undefined behaviour due to inconsistent set of parts. Exception: std::exception. Code: 1001, type: std::__1::filesystem::filesystem_error, e.what() = filesystem error: in remove_all: No such file or directory ["".../delete_tmp_20260123_0_80494_8/""]",Error,BgSchPool,607,"src/Storages/MergeTree/MergeTreeData.cpp; void DB::MergeTreeData::loadOutdatedDataParts(bool)",2517
2026-01-23 03:59:28,"(version 25.6.13.41 (official build)) (from thread 607) Terminate called for uncaught exception:",Fatal,clickhouse-serv,107,"src/Common/SignalHandlers.cpp; void SignalListener::onTerminate(std::string_view, UInt32) const",393
2026-01-23 03:59:28,"(version 25.6.13.41 (official build)) (from thread 607) (query_id: BgSchPool::...) Received signal Aborted (6)",Fatal,clickhouse-serv,885,"src/Common/SignalHandlers.cpp; void SignalListener::onFault(...)",500
0. std::system_error::system_error(std::error_code, String const&) @ 0x000000001b6225f7
1. std::filesystem::filesystem_error::filesystem_error[abi:ne190107](...) @ 0x000000000fce501f
2. void std::filesystem::__throw_filesystem_error[abi:ne190107](...) @ 0x000000001b5da18d
3. std::filesystem::detail::ErrorHandler<unsigned long>::report(...) @ 0x000000001b5de8e9
4. DB::DiskLocal::removeRecursive(String const&) @ 0x000000001333cfc4
5. DB::DataPartStorageOnDiskBase::clearDirectory(...) @ 0x0000000014db57d1
6. DB::DataPartStorageOnDiskBase::remove(...) @ 0x0000000014db0705
7. DB::IMergeTreeDataPart::remove() @ 0x0000000014dcee4f
8. DB::MergeTreeData::loadOutdatedDataParts(bool)::$_0 @ 0x0000000014f815fc
9. DB::ThreadPoolCallbackRunnerLocal<...>::executeCallback(...) @ 0x00000000113e50f1
10. DB::ThreadPoolCallbackRunnerLocal<...>::operator()(...)::lambda @ 0x00000000113e4edf
11. ThreadPoolImpl<...>::ThreadFromThreadPool::worker() @ 0x000000000fd6f52b
2. ? @ 0x00000000000969fd
3. ? @ 0x0000000000042476
4. ? @ 0x00000000000287f3
5. terminate_handler() @ 0x0000000010012796
6. std::__terminate(void (*)()) @ 0x000000001b644203
7. std::terminate() @ 0x000000001b6441ef
8. DB::MergeTreeData::loadOutdatedDataParts(bool) @ 0x0000000014ed9582
9. DB::BackgroundSchedulePool::threadFunction() @ 0x0000000012cf483e
10. BackgroundSchedulePool::$_1::lambda @ 0x0000000012cf61e2
11. ThreadPoolImpl<std::thread>::ThreadFromThreadPool::worker() @ 0x000000000fd6c752
Additional context
The race was introduced in #42181 when loadOutdatedDataParts became asynchronous, enabling concurrent execution with clearOldTemporaryDirectories in the attach thread (targeting delete_tmp_* since #37906).