Server crash (std::terminate) during ReplicatedMergeTree startup due to race in delete_tmp_ directory removal #94891

@myeongjjun

Description

Company or project name

No response

Describe what's wrong

DataPartStorageOnDiskBase::clearDirectory crashes with std::terminate when another thread concurrently removes the same delete_tmp_* directory.

During ReplicatedMergeTree startup, two operations run concurrently:

  • loadOutdatedDataParts (async): renames a duplicate part to delete_tmp_*, then calls clearDirectory on it
  • clearOldTemporaryDirectories (AttachThread): finds the same delete_tmp_* directory and deletes it

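clearDirectory's fallback calls fs::remove_all (via DiskLocal::removeRecursive), which throws filesystem_error (no_such_file_or_directory) when files disappear mid-traversal. The exception propagates up to loadOutdatedDataParts, which logs "Loading of outdated parts failed" and calls std::terminate.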

clearOldTemporaryDirectories already handles this with catch (const fs::filesystem_error & e) (MergeTreeData.cpp:3130), but clearDirectory lacks the same defense.
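A defense of the same shape could look roughly like the following. This is a minimal sketch under stated assumptions, not the actual ClickHouse patch: removeAllToleratingMissing is a hypothetical helper, and it assumes a missing file or directory during recursive removal always means another remover got there first.

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper, not a ClickHouse API.
// Returns true if remove_all finished normally, false if an ENOENT
// caused by a concurrent remover was swallowed. Other errors are rethrown.
bool removeAllToleratingMissing(const fs::path & dir)
{
    try
    {
        fs::remove_all(dir);
        return true;
    }
    catch (const fs::filesystem_error & e)
    {
        // Entries vanished mid-traversal: someone else is deleting the same tree.
        if (e.code() == std::errc::no_such_file_or_directory)
            return false;
        throw;
    }
}
```

Rethrowing everything except no_such_file_or_directory keeps real failures (permissions, I/O errors) visible. Whether a swallowed ENOENT should also trigger a retry or a final existence check is a design choice this sketch leaves open.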

Does it reproduce on the most recent release?

Yes

How to reproduce

Not deterministically reproducible. Requires ReplicatedMergeTree with duplicate outdated parts on local disk at startup.

Expected behavior

No response

Error message and/or stacktrace

From system.text_log (columns: event_time, message, level, thread_name, thread_id, source_file, source_line).
Paths are anonymized, but timestamps, thread names, and stack traces are from actual logs.

2026-01-23 03:59:23,Remove duplicate part /var/lib/clickhouse/disks/disk5/store/ab0/ab067610-de3a-41bf-a1ac-1afe6f249236/20260123_0_80494_8/,Error,ThreadPool,856,"src/Storages/MergeTree/MergeTreeData.cpp; ...",1816

2026-01-23 03:59:23,Removing temporary directory /var/lib/clickhouse/disks/disk5/store/ab0/ab067610-de3a-41bf-a1ac-1afe6f249236/delete_tmp_20260123_0_80494_8/,Warning,BgSchPool,609,"src/Storages/MergeTree/MergeTreeData.cpp; size_t DB::MergeTreeData::clearOldTemporaryDirectories(const String &, size_t, const NameSet &)",2841

2026-01-23 03:59:23,"Cannot quickly remove directory /var/lib/clickhouse/disks/disk5/store/ab0/ab067610-de3a-41bf-a1ac-1afe6f249236/delete_tmp_20260123_0_80494_8 by removing files; fallback to recursive removal. Reason: Code: 458. DB::ErrnoException: Cannot unlink file .../delete_tmp_20260123_0_80494_8/data.bin: , errno: 2, strerror: No such file or directory. (CANNOT_UNLINK)",Error,ThreadPool,856,"src/Storages/MergeTree/DataPartStorageOnDiskBase.cpp; void DB::DataPartStorageOnDiskBase::clearDirectory(...)",907

2026-01-23 03:59:28,"Loading of outdated parts failed. Will terminate to avoid undefined behaviour due to inconsistent set of parts. Exception: std::exception. Code: 1001, type: std::__1::filesystem::filesystem_error, e.what() = filesystem error: in remove_all: No such file or directory ["".../delete_tmp_20260123_0_80494_8/""]",Error,BgSchPool,607,"src/Storages/MergeTree/MergeTreeData.cpp; void DB::MergeTreeData::loadOutdatedDataParts(bool)",2517

2026-01-23 03:59:28,"(version 25.6.13.41 (official build)) (from thread 607) Terminate called for uncaught exception:",Fatal,clickhouse-serv,107,"src/Common/SignalHandlers.cpp; void SignalListener::onTerminate(std::string_view, UInt32) const",393

2026-01-23 03:59:28,"(version 25.6.13.41 (official build)) (from thread 607) (query_id: BgSchPool::...) Received signal Aborted (6)",Fatal,clickhouse-serv,885,"src/Common/SignalHandlers.cpp; void SignalListener::onFault(...)",500
0. std::system_error::system_error(std::error_code, String const&) @ 0x000000001b6225f7
1. std::filesystem::filesystem_error::filesystem_error[abi:ne190107](...) @ 0x000000000fce501f
2. void std::filesystem::__throw_filesystem_error[abi:ne190107](...) @ 0x000000001b5da18d
3. std::filesystem::detail::ErrorHandler<unsigned long>::report(...) @ 0x000000001b5de8e9
4. DB::DiskLocal::removeRecursive(String const&) @ 0x000000001333cfc4
5. DB::DataPartStorageOnDiskBase::clearDirectory(...) @ 0x0000000014db57d1
6. DB::DataPartStorageOnDiskBase::remove(...) @ 0x0000000014db0705
7. DB::IMergeTreeDataPart::remove() @ 0x0000000014dcee4f
8. DB::MergeTreeData::loadOutdatedDataParts(bool)::$_0 @ 0x0000000014f815fc
9. DB::ThreadPoolCallbackRunnerLocal<...>::executeCallback(...) @ 0x00000000113e50f1
10. DB::ThreadPoolCallbackRunnerLocal<...>::operator()(...)::lambda @ 0x00000000113e4edf
11. ThreadPoolImpl<...>::ThreadFromThreadPool::worker() @ 0x000000000fd6f52b

2. ? @ 0x00000000000969fd
3. ? @ 0x0000000000042476
4. ? @ 0x00000000000287f3
5. terminate_handler() @ 0x0000000010012796
6. std::__terminate(void (*)()) @ 0x000000001b644203
7. std::terminate() @ 0x000000001b6441ef
8. DB::MergeTreeData::loadOutdatedDataParts(bool) @ 0x0000000014ed9582
9. DB::BackgroundSchedulePool::threadFunction() @ 0x0000000012cf483e
10. BackgroundSchedulePool::$_1::lambda @ 0x0000000012cf61e2
11. ThreadPoolImpl<std::thread>::ThreadFromThreadPool::worker() @ 0x000000000fd6c752

Additional context

The race was introduced in #42181 when loadOutdatedDataParts became asynchronous, enabling concurrent execution with clearOldTemporaryDirectories in the attach thread (targeting delete_tmp_* since #37906).

Metadata

    Labels

    potential bug (to be reviewed by developers and confirmed/rejected)
