[RFC] Remove covered parts for fetched part #39737
tavplubix merged 5 commits into ClickHouse:master from
Conversation
@tavplubix can you please take a look and comment on this?
I'm not sure that it will work correctly with "actual part name" optimization...
You mean that actual_new_part_name should be taken into account here?
Doesn't this mean that removePartProducingOpsInRange already works incorrectly when actual_new_part_name is not empty?
> You mean that actual_new_part_name should be taken into account here?
Not precisely. I mean that the logic that "upgrades" the currently executing entry to a covering entry should probably be taken into account. See addFuturePartIfNotCoveredByThem and related methods. It seems a deadlock is possible, because setActualPartName may wait for older entries ("age" is determined by znode_name), but removePartProducingOpsInRange now may wait for all entries covered by part_info. Usually covered entries are older, but not when the "actual part name" logic is involved.
> Doesn't this mean that removePartProducingOpsInRange already works incorrectly when actual_new_part_name is not empty?
No, because currently removePartProducingOpsInRange can be called only by DROP_RANGE or REPLACE_RANGE entries, and actual_new_part_name can be set only for a simple part-producing operation (GET_PART, MERGE_PARTS, MUTATE_PART).
So I've looked at this again, and let me describe the scenarios that might lead to a deadlock:

- queue: [1_0_1_0, 1_1_2_0, 1_0_2_1]
  T1 will execute the fetch of 1_0_1_0, and fetch 1_0_2_1 instead;
  T2 will execute the fetch of 1_1_2_0, with no 1_0_2_1 in future_parts yet.
  This should be safe because T2 will not fetch anything, due to ClickHouse/src/Storages/MergeTree/ReplicatedMergeTreeQueue.cpp
  Lines 1116 to 1125 in d4a7e18
- queue: [1_0_1_0, 1_1_2_0, 1_0_2_1]
  T1 will execute the fetch of 1_0_1_0, and fetch 1_0_2_1 instead;
  T2 will execute the fetch of 1_1_2_0, and also fetch 1_0_2_1 instead.
  This should be safe because of ClickHouse/src/Storages/MergeTree/ReplicatedMergeTreeQueue.cpp
  Lines 1069 to 1084 in d4a7e18
- queue: [1_0_1_0, 1_1_2_0, 1_0_2_1]
  T1 will execute the fetch of 1_0_1_0, and fetch 1_0_2_1 instead;
  T2 will execute the fetch of 1_0_2_1.
  This should be safe, because actual_part_name is also added into
So I don't see any possible deadlock. @tavplubix, maybe you can suggest an example of one?
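To make the coverage reasoning in the scenarios above concrete, here is a minimal sketch (not the real MergeTreePartInfo API, just an illustration) of what "covers" and "disjoint" mean for part names of the simplified form `<partition>_<min_block>_<max_block>_<level>`, ignoring mutation versions:

```python
# Simplified model of MergeTree part names; the real logic lives in
# MergeTreePartInfo in the ClickHouse sources.
from typing import NamedTuple

class PartInfo(NamedTuple):
    partition: str
    min_block: int
    max_block: int
    level: int

    @classmethod
    def parse(cls, name: str) -> "PartInfo":
        partition, lo, hi, level = name.rsplit("_", 3)
        return cls(partition, int(lo), int(hi), int(level))

    def contains(self, other: "PartInfo") -> bool:
        # A part "covers" another if its block range includes the other's.
        return (self.partition == other.partition
                and self.min_block <= other.min_block
                and other.max_block <= self.max_block)

    def disjoint(self, other: "PartInfo") -> bool:
        return (self.partition != other.partition
                or self.max_block < other.min_block
                or other.max_block < self.min_block)

# In the scenarios: 1_0_2_1 covers both 1_0_1_0 and 1_1_2_0, so a thread
# that "upgrades" its fetch to 1_0_2_1 makes the other two entries redundant.
covering = PartInfo.parse("1_0_2_1")
assert covering.contains(PartInfo.parse("1_0_1_0"))
assert covering.contains(PartInfo.parse("1_1_2_0"))
assert not covering.disjoint(PartInfo.parse("1_1_2_0"))
```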
You're right, it seems a deadlock is not possible.
A potential problem is that queue entries may be reordered during recovery of a stale replica, and in rare cases a covering entry may get a smaller znode_name (you can check it by running test 993):
2022.11.08 20:37:42.819521 [ 2601779 ] {} <Trace> test_zc23mw5c.alter_table_0 (ReplicatedMergeTreeQueue): Not executing log entry queue-0000000261 for part -1_5_21_1_47 because it is not disjoint with part -1_10_10_0_46 that is currently executing and another entry queue-0000000262 is newer.
And in this case -1_10_10_0_46 (queue-0000000262) could wait for -1_5_21_1_47 (queue-0000000261), which is currently executing. And -1_5_21_1_47 could wait for the covered -1_10_10_0_46 in removePartProducingOpsInRange.
But currently it's not possible for two reasons:
- We fill covered_entries_to_wait and wait for these entries only when calling isCoveredByFuturePartsImpl from addFuturePartIfNotCoveredByThem from findReplicaHavingCoveringPart. We do it only for "upgrading" a part to a covering part and setting the "actual part name". But findReplicaHavingCoveringPart will never call addFuturePartIfNotCoveredByThem for a covered part (it can only "upgrade", not "downgrade"), so actually setActualPartName will wait only for covered parts. And it seems the comparison of znode_names is redundant.
- Get, attach, merge and mutate entries for non-disjoint parts cannot work concurrently. The only exception is when some entry "upgrades" and sets the "actual part name": it can wait for covered entries in setActualPartName. But by the time it reaches removePartProducingOpsInRange, non-disjoint entries do not work concurrently.
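As a hedged illustration of the anti-deadlock idea discussed here (names and structures are illustrative, not the real ReplicatedMergeTreeQueue code): zero-padded znode names form a total order when compared as strings, so if an "upgrading" entry only ever waits for covered entries that are strictly older, two entries can never end up waiting on each other.

```python
def covered_entries_to_wait(current_znode, covered_znodes):
    # znode names like "queue-0000000261" compare correctly as plain
    # strings because the numeric suffix is zero-padded to a fixed width.
    return sorted(z for z in covered_znodes if z < current_znode)

# The entry queue-0000000262 waits only for the older covered entry.
waits = covered_entries_to_wait(
    "queue-0000000262",
    ["queue-0000000261", "queue-0000000263"],
)
assert waits == ["queue-0000000261"]
```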
I will verify that by adding some assertions and will try to simplify these functions.
P.S. I noticed that after 5ffc38d, DROP_RANGEs are postponed if some covered entries are currently executing, but it should not work this way: DROP_RANGEs should have priority. Will try to fix that too.
Hm, no, p.1 is not true: if we are executing all_1_1_0 and trying to upgrade it to all_1_5_2 while all_1_3_1 is currently executing, then all_1_1_0 will wait for all_1_3_1. So it can wait for covering parts, not only for covered ones. But p.2 seems to be true.
@tavplubix thanks for the details and for finishing the patch!
Maybe you can shed some light on the following?
> Potential problem is that queue entries may be reordered during recovery of stale replica
But why can't they be reordered without recovery?
> and in rare cases covering entry may have less znode_name (you can check it by running test 993):
And if this is possible, then it seems that this check is indeed redundant, no?
> But why can't they be reordered without recovery?
It depends on what "order" is. There are at least 3 orders:

- total order based on znode_name / sequential numbers of log/queue entries
- partial (or total (?) within a partition) order based on MergeTreePartInfo of the parts produced by queue entries
- "order" of entries execution (there's no order actually, it can be almost arbitrary)
This statement was about znode_name order. Entries for bigger (in terms of MergeTreePartInfo) parts usually get greater sequential numbers (and greater znode_names). Just because when we create an entry for a covering part, entries for all source parts were already created in the log. And entries are pulled from log to queue one by one in the same order.
But recovery of a stale replica is tricky: it tries to copy the queue of the source replica (while that queue is concurrently changing) and additionally creates GET_PART entries for all parts that the source replica had at the moment after copying the queue entries. So an entry for a bigger part can get a smaller znode_name even if it had a greater znode_name in the source replica's queue (just like in the log message above). Something similar may happen due to removePartAndEnqueueFetch without recovery.
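The divergence between the two orders can be made concrete with a toy check (assumed data; parts reduced to (min_block, max_block) pairs) that flags queue states where an entry with a smaller znode_name covers an entry with a greater one:

```python
def order_violations(queue):
    """queue: list of (znode_name, (min_block, max_block)), sorted by znode_name."""
    bad = []
    for i, (z1, (lo1, hi1)) in enumerate(queue):
        for z2, (lo2, hi2) in queue[i + 1:]:
            # An older znode whose part strictly covers a newer znode's part.
            if lo1 <= lo2 and hi2 <= hi1 and (lo1, hi1) != (lo2, hi2):
                bad.append((z1, z2))
    return bad

# Normal pull from the log: the covering part comes last, no violation.
assert order_violations(
    [("queue-1", (0, 0)), ("queue-2", (1, 2)), ("queue-3", (0, 2))]) == []
# After recovery, the covering part may get the smaller znode_name:
assert order_violations(
    [("queue-1", (0, 2)), ("queue-2", (1, 2))]) == [("queue-1", "queue-2")]
```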
> And if this is possible then it seems that this check is indeed redundant, no?
No, this check is needed to enforce some order here; comparing znode_name was the simplest way to avoid deadlocks.
> Something similar may happen due to removePartAndEnqueueFetch without recovery.
Oh, I see, thanks. I didn't know that the queue can be filled not only from the ZooKeeper log.
> No, this check is needed to enforce some order here, comparing znode_name was the simplest way to avoid deadlocks.
So let me describe how I understood it, and please correct me if I'm wrong.
The server (in the example from the link you provided) tries to fetch 1_417_417_0 (queue-0000004322), but it had been "upgraded" to fetch 1_300_433_10 instead, and it is still waiting for:

- 1_381_386_1 (queue-0000004859)
- 1_381_394_2 (queue-0000004868)
- 1_381_412_3 (queue-0000004886)
- 1_388_388_0 (queue-0000004008)
- 1_417_417_0 (queue-0000004322) -- it will not try to wait for this one

After your patch it will wait only for 1_388_388_0, but actually it will not start execution until that one is executed, and so a deadlock is not possible after this?
Looks like this is happening because of the same "actual_part_name" optimization? And it is safe because this optimization can only be done locally; it is not stored in ZooKeeper, so there are no issues if the queue entries are reordered during replica clone. And it is also safe for the case with removePartAndEnqueueFetch, since it can create only parts with a greater znode.
Maybe it is worth putting a comment there (for the znode comparison: why it is safe, and what guarantees it has)?
Also, maybe just for the sake of consistency, it is worth making this check regardless of whether covered_entries_to_wait was passed or not? (for shouldExecuteLogEntry)
ClickHouse/src/Storages/MergeTree/ReplicatedMergeTreeQueue.cpp
Lines 1147 to 1155 in 0b4e643
It makes sense to remove entries covered by the fetched part, but I'm just afraid to touch the replication queue unless I really have to :)
Completely understand, and I kind of agree, but it makes the delay inaccurate sometimes.
I thought about this too, but it seems that it may complicate things even more...
Here is an example that I found on production, simplified. Consider the following queue (nothing of this had been processed on this replica):

- GET_PART all_0_0_0 (queue-0000000001)
- GET_PART all_1_1_0 (queue-0000000002)
...
- GET_PART all_0_1_1 (queue-0000000003)
- GET_PART all_2_2_0 (queue-0000000004)
...
- MERGE_PARTS from [all_0_1_1, all_2_2_0] to all_0_2_2 (queue-0000000005)

And now queue-0000000005 started executing (either because of reordering, or because GET_PART failed at that time), and it does not have any of the required parts, so it will fetch them, but not all_0_0_0 and all_1_1_0. So this replica's delay will be set to the time of min(queue-0000000001, queue-0000000002), while it is not true, since the replica already has parts that cover those parts. And since MERGE_PARTS takes 30 minutes, it eventually increased the replica delay to 30 minutes, for a time range of 30 minutes, which is pretty huge.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
ClickHouse special build check - #43084
@azat this change introduced O(n^2) complexity of fetching parts, and most of the time is spent under a mutex.
This patch has another, more major problem: it can lead to intersecting parts. Consider the following example; you have the following in the queue:
Now (T1 - thread1, T2 - thread2):
@tavplubix can you please take a look at this example?

And query of the day: find intersecting parts (though it is not 100% accurate, at least it does not take into account errors during INSERT):

SELECT
event_time,
event_type,
part_name,
exception,
merged_from,
arrayFlatten(arrayMap(x -> range((arrayMap(x -> toUInt64(x), arraySlice(splitByChar('_', x), 2, 2)) AS arr_)[1], (arr_[2]) + 1), merged_from)) AS blocks,
blocks != range(arrayMin(blocks), arrayMax(blocks) + 1) AS intersecting_parts
FROM clusterAllReplicas('drt', system.part_log)
WHERE (length(merged_from) > 0) AND (event_type = 'MergeParts') AND (table_uuid IN (
SELECT uuid
FROM system.tables
WHERE engine LIKE '%Replicated%'
)) AND (length(arrayFilter(x -> endsWith(x, '_0'), merged_from)) = length(merged_from)) AND (intersecting_parts = 1)
LIMIT 1 BY part_name
LIMIT 10
Query id: b71ce22a-6d0f-4fb5-83a1-50ab66db51db
Row 1:
──────
event_time: 2023-02-26 16:06:34
event_type: MergeParts
part_name: f64ee14d51832253179b7471fb3b20e0_15_40_1
exception: Code: 49. DB::Exception: Part f64ee14d51832253179b7471fb3b20e0_15_40_1 intersects part f64ee14d51832253179b7471fb3b20e0_20_37_1 (state Active) It is a bug. (LOGICAL_ERROR) (version 23.1.3.5 (official build))
merged_from: ['f64ee14d51832253179b7471fb3b20e0_15_15_0','f64ee14d51832253179b7471fb3b20e0_16_16_0','f64ee14d51832253179b7471fb3b20e0_17_17_0','f64ee14d51832253179b7471fb3b20e0_18_18_0','f64ee14d51832253179b7471fb3b20e0_19_19_0','f64ee14d51832253179b7471fb3b20e0_38_38_0','f64ee14d51832253179b7471fb3b20e0_39_39_0','f64ee14d51832253179b7471fb3b20e0_40_40_0']
blocks: [15,16,17,18,19,38,39,40]
intersecting_parts: 1
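The heuristic behind the query above can be sketched in Python (an assumed simplification: like the SQL comparison `blocks != range(min, max + 1)`, it flags any non-contiguous or duplicated union of source blocks, with part names reduced to the form `<partition>_<min_block>_<max_block>_<level>`):

```python
def blocks_of(part_name):
    # "<partition>_<min_block>_<max_block>_<level>" -> the block numbers it holds
    lo, hi = map(int, part_name.rsplit("_", 3)[1:3])
    return list(range(lo, hi + 1))

def intersecting(merged_from):
    blocks = sorted(b for part in merged_from for b in blocks_of(part))
    # Contiguous, duplicate-free blocks mean the merge sources were sane.
    return blocks != list(range(blocks[0], blocks[-1] + 1))

# Like the production example above: blocks 15..19 and 38..40 leave a gap,
# so the query reports intersecting_parts = 1 for the resulting merge.
parts = ["uuid_15_15_0", "uuid_16_16_0", "uuid_17_17_0", "uuid_18_18_0",
         "uuid_19_19_0", "uuid_38_38_0", "uuid_39_39_0", "uuid_40_40_0"]
assert intersecting(parts)
assert not intersecting(["uuid_15_16_1", "uuid_17_17_0"])
```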
AFAIR
AFAICS it does:
@tavplubix What did I miss? And if it will not remove the entry from the
Also, maybe you have some thoughts about how this patch introduced intersecting parts?
Hm, yes, you're right, it may remove virtual parts in some unusual cases. But it will not remove it in the case you described above because of
ClickHouse/src/Storages/MergeTree/ActiveDataPartSet.h
Lines 46 to 49 in fa467b3
So, in your example,
It may actually remove something if we have two queue entries with the same virtual part, for example, duplicate
Yeah, looks like you are right!
Actually that line ( ) doesn't do this, and my comment was wrong, but this will be done anyway in removeProcessedEntry
And it will remove
And
But what I found in logs is that

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Remove covered parts for fetched part (to avoid possible growth of replication delay)
Here is an example that I found on production, simplified.
Consider the following queue (nothing of this had been processed on this
replica):
...
...
And now queue-0000000005 started executing (either because
of reordering, or because GET_PART failed at that time), and it
does not have any of the required parts, so it will fetch them, but
not all_0_0_0 and all_1_1_0. So this replica's delay will be set to
the time of min(queue-0000000001, queue-0000000002), while it
is not true, since the replica already has parts that cover those
parts.
And since MERGE_PARTS takes 30 minutes, it eventually increased the replica
delay to 30 minutes, for a time range of 30 minutes, which is pretty huge.
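A minimal sketch (assumed and simplified, not the actual ReplicatedMergeTreeQueue code) of why the stale entries inflate the delay: the reported delay tracks the oldest still-unprocessed queue entry, so GET_PART entries for already-covered parts keep it high until they are removed.

```python
def replica_delay(now, unprocessed_entries):
    """unprocessed_entries: list of (znode_name, create_time in seconds)."""
    if not unprocessed_entries:
        return 0
    return now - min(create_time for _, create_time in unprocessed_entries)

now = 10_000
entries = [
    ("queue-0000000001", now - 1800),  # covered GET_PART, never executed
    ("queue-0000000005", now - 60),    # the long-running MERGE_PARTS
]
assert replica_delay(now, entries) == 1800  # stuck at ~30 minutes
# Once the covered entries are removed, the delay drops:
assert replica_delay(now, [("queue-0000000005", now - 60)]) == 60
```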
Cc: @tavplubix (what do you think about this?)