[RFC] Remove covered parts for fetched part #39737
tavplubix merged 5 commits into ClickHouse:master from
Conversation
@tavplubix can you please take a look and comment on this?
I'm not sure that it will work correctly with "actual part name" optimization...
You mean that actual_new_part_name should be taken into account here?
Doesn't this mean that removePartProducingOpsInRange already works incorrectly when actual_new_part_name is not empty?
> You mean that actual_new_part_name should be taken into account here?
Not precisely. I mean that the logic that "upgrades" the currently executing entry to a covering entry should probably be taken into account. See addFuturePartIfNotCoveredByThem and related methods. It seems a deadlock is possible, because setActualPartName may wait for older entries ("age" is determined by znode_name), but removePartProducingOpsInRange now may wait for all entries covered by part_info. Usually covered entries are older, but not when the "actual part name" logic is involved.
> Doesn't this mean that removePartProducingOpsInRange already works incorrectly when actual_new_part_name is not empty?
No, because currently removePartProducingOpsInRange can be called only by DROP_RANGE or REPLACE_RANGE entries, and actual_new_part_name can be set only for a simple part-producing operation (GET_PART, MERGE_PARTS, MUTATE_PART).
So I've looked at this again, and let me describe the scenarios that might lead to a deadlock:

- queue: [1_0_1_0, 1_1_2_0, 1_0_2_1]
  T1 will execute the fetch of 1_0_1_0, and fetch 1_0_2_1 instead;
  T2 will execute the fetch of 1_1_2_0, with no 1_0_2_1 in future_parts yet.
  This should be safe because T2 will not fetch anything, due to ClickHouse/src/Storages/MergeTree/ReplicatedMergeTreeQueue.cpp
  Lines 1116 to 1125 in d4a7e18
- queue: [1_0_1_0, 1_1_2_0, 1_0_2_1]
  T1 will execute the fetch of 1_0_1_0, and fetch 1_0_2_1 instead;
  T2 will execute the fetch of 1_1_2_0, and also fetch 1_0_2_1 instead.
  This should be safe because of ClickHouse/src/Storages/MergeTree/ReplicatedMergeTreeQueue.cpp
  Lines 1069 to 1084 in d4a7e18
- queue: [1_0_1_0, 1_1_2_0, 1_0_2_1]
  T1 will execute the fetch of 1_0_1_0, and fetch 1_0_2_1 instead;
  T2 will execute the fetch of 1_0_2_1.
  This should be safe, because actual_part_name is also added into
So I don't see any possible deadlock. @tavplubix, maybe you can suggest an example of one?
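To make the coverage reasoning in the scenarios above concrete, here is a minimal sketch (not the real MergeTreePartInfo API, just an illustration) of what "covers" and "disjoint" mean for part names of the simplified form `<partition>_<min_block>_<max_block>_<level>`, ignoring mutation versions:

```python
# Simplified model of MergeTree part names; the real logic lives in
# MergeTreePartInfo in the ClickHouse sources.
from typing import NamedTuple

class PartInfo(NamedTuple):
    partition: str
    min_block: int
    max_block: int
    level: int

    @classmethod
    def parse(cls, name: str) -> "PartInfo":
        partition, lo, hi, level = name.rsplit("_", 3)
        return cls(partition, int(lo), int(hi), int(level))

    def contains(self, other: "PartInfo") -> bool:
        # A part "covers" another if its block range includes the other's.
        return (self.partition == other.partition
                and self.min_block <= other.min_block
                and other.max_block <= self.max_block)

    def disjoint(self, other: "PartInfo") -> bool:
        return (self.partition != other.partition
                or self.max_block < other.min_block
                or other.max_block < self.min_block)

# In the scenarios: 1_0_2_1 covers both 1_0_1_0 and 1_1_2_0, so a thread
# that "upgrades" its fetch to 1_0_2_1 makes the other two entries redundant.
covering = PartInfo.parse("1_0_2_1")
assert covering.contains(PartInfo.parse("1_0_1_0"))
assert covering.contains(PartInfo.parse("1_1_2_0"))
assert not covering.disjoint(PartInfo.parse("1_1_2_0"))
```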
You're right, it seems a deadlock is not possible.
A potential problem is that queue entries may be reordered during recovery of a stale replica, and in rare cases a covering entry may get a smaller znode_name (you can check it by running test 993):
2022.11.08 20:37:42.819521 [ 2601779 ] {} <Trace> test_zc23mw5c.alter_table_0 (ReplicatedMergeTreeQueue): Not executing log entry queue-0000000261 for part -1_5_21_1_47 because it is not disjoint with part -1_10_10_0_46 that is currently executing and another entry queue-0000000262 is newer.
And in this case -1_10_10_0_46 (queue-0000000262) could wait for -1_5_21_1_47 (queue-0000000261), which is currently executing. And -1_5_21_1_47 could wait for the covered -1_10_10_0_46 in removePartProducingOpsInRange.
But currently it's not possible for two reasons:
- We fill covered_entries_to_wait and wait for these entries only when calling isCoveredByFuturePartsImpl from addFuturePartIfNotCoveredByThem from findReplicaHavingCoveringPart. We do it only for "upgrading" a part to a covering part and setting the "actual part name". But findReplicaHavingCoveringPart will never call addFuturePartIfNotCoveredByThem for a covered part (it can only "upgrade", not "downgrade"), so actually setActualPartName will wait only for covered parts. And it seems the comparison of znode_names is redundant.
- Get, attach, merge and mutate entries for non-disjoint parts cannot work concurrently. The only exception is when some entry "upgrades" and sets the "actual part name": it can wait for covered entries in setActualPartName. But by the time it reaches removePartProducingOpsInRange, non-disjoint entries do not work concurrently.
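As a hedged illustration of the anti-deadlock idea discussed here (names and structures are illustrative, not the real ReplicatedMergeTreeQueue code): zero-padded znode names form a total order when compared as strings, so if an "upgrading" entry only ever waits for covered entries that are strictly older, two entries can never end up waiting on each other.

```python
def covered_entries_to_wait(current_znode, covered_znodes):
    # znode names like "queue-0000000261" compare correctly as plain
    # strings because the numeric suffix is zero-padded to a fixed width.
    return sorted(z for z in covered_znodes if z < current_znode)

# The entry queue-0000000262 waits only for the older covered entry.
waits = covered_entries_to_wait(
    "queue-0000000262",
    ["queue-0000000261", "queue-0000000263"],
)
assert waits == ["queue-0000000261"]
```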
I will verify that by adding some assertions and will try to simplify these functions.
P.S. I noticed that after 5ffc38d, DROP_RANGEs are postponed if some covered entries are currently executing, but it should not work this way: DROP_RANGEs should have priority. Will try to fix that too.
Hm, no, p.1 is not true: if we are executing all_1_1_0 and trying to upgrade it to all_1_5_2 while all_1_3_1 is currently executing, then all_1_1_0 will wait for all_1_3_1. So it can wait for covering parts, not only for covered ones. But p.2 seems to be true.
@tavplubix thanks for the details and for finishing the patch!
Maybe you can shed some light on the following?
> Potential problem is that queue entries may be reordered during recovery of stale replica
But why can't they be reordered without recovery?
> and in rare cases covering entry may have less znode_name (you can check it by running test 993):
And if this is possible, then it seems that this check is indeed redundant, no?
> But why can't they be reordered without recovery?
It depends on what "order" is. There are at least 3 orders:

- total order based on znode_name / sequential numbers of log/queue entries
- partial (or total (?) within a partition) order based on MergeTreePartInfo of the parts produced by queue entries
- "order" of entries execution (there's no order actually, it can be almost arbitrary)
This statement was about znode_name order. Entries for bigger (in terms of MergeTreePartInfo) parts usually get greater sequential numbers (and greater znode_names). Just because when we create an entry for a covering part, entries for all source parts were already created in the log. And entries are pulled from log to queue one by one in the same order.
But recovery of a stale replica is tricky: it tries to copy the queue of the source replica (while that queue is concurrently changing) and additionally creates GET_PART entries for all parts that the source replica had at the moment after copying the queue entries. So an entry for a bigger part can get a smaller znode_name even if it had a greater znode_name in the source replica's queue (just like in the log message above). Something similar may happen due to removePartAndEnqueueFetch without recovery.
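The divergence between the two orders can be made concrete with a toy check (assumed data; parts reduced to (min_block, max_block) pairs) that flags queue states where an entry with a smaller znode_name covers an entry with a greater one:

```python
def order_violations(queue):
    """queue: list of (znode_name, (min_block, max_block)), sorted by znode_name."""
    bad = []
    for i, (z1, (lo1, hi1)) in enumerate(queue):
        for z2, (lo2, hi2) in queue[i + 1:]:
            # An older znode whose part strictly covers a newer znode's part.
            if lo1 <= lo2 and hi2 <= hi1 and (lo1, hi1) != (lo2, hi2):
                bad.append((z1, z2))
    return bad

# Normal pull from the log: the covering part comes last, no violation.
assert order_violations(
    [("queue-1", (0, 0)), ("queue-2", (1, 2)), ("queue-3", (0, 2))]) == []
# After recovery, the covering part may get the smaller znode_name:
assert order_violations(
    [("queue-1", (0, 2)), ("queue-2", (1, 2))]) == [("queue-1", "queue-2")]
```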
> And if this is possible then it seems that this check is indeed redundant, no?
No, this check is needed to enforce some order here; comparing znode_name was the simplest way to avoid deadlocks.
> Something similar may happen due to removePartAndEnqueueFetch without recovery.
Oh, I see, thanks. I didn't know that the queue can be filled not only from the ZooKeeper log.
> No, this check is needed to enforce some order here, comparing znode_name was the simplest way to avoid deadlocks.
So let me describe how I understood it, and please correct me if I'm wrong.
The server (in the example from the link you provided) tries to fetch 1_417_417_0 (queue-0000004322), but it had been "upgraded" to fetch 1_300_433_10 instead, and it is still waiting for:

- 1_381_386_1 (queue-0000004859)
- 1_381_394_2 (queue-0000004868)
- 1_381_412_3 (queue-0000004886)
- 1_388_388_0 (queue-0000004008)
- 1_417_417_0 (queue-0000004322) -- it will not try to wait for this one

After your patch it will wait only for 1_388_388_0, but actually it will not start execution until that one is executed, and so a deadlock is not possible after this?
Looks like this is happening because of the same "actual_part_name" optimization? And it is safe because this optimization can only be done locally; it is not stored in ZooKeeper, so there are no issues if the queue entries are reordered during replica clone. And it is also safe for the case with removePartAndEnqueueFetch, since it can create only parts with a greater znode.
Maybe it is worth putting a comment there (for the znode comparison: why it is safe, and what guarantees it has)?
Also, maybe just for the sake of consistency, it is worth making this check regardless of whether covered_entries_to_wait was passed or not? (for shouldExecuteLogEntry)
ClickHouse/src/Storages/MergeTree/ReplicatedMergeTreeQueue.cpp
Lines 1147 to 1155 in 0b4e643
It makes sense to remove entries covered by the fetched part, but I'm just afraid to touch the replication queue unless I really have to :)
Completely understand, and I kind of agree, but it makes the delay inaccurate sometimes.
I thought about this too, but it seems that it may complicate things even more...
Here is an example that I found on production, simplified. Consider the following queue (nothing of this had been processed on this replica):

- GET_PART all_0_0_0 (queue-0000000001)
- GET_PART all_1_1_0 (queue-0000000002)
...
- GET_PART all_0_1_1 (queue-0000000003)
- GET_PART all_2_2_0 (queue-0000000004)
...
- MERGE_PARTS from [all_0_1_1, all_2_2_0] to all_0_2_2 (queue-0000000005)

And now queue-0000000005 started executing (either because of reordering, or because GET_PART failed at that time), and it does not have any of the required parts, so it will fetch them, but not all_0_0_0 and all_1_1_0. So this replica's delay will be set to the time of min(queue-0000000001, queue-0000000002), while it is not true, since the replica already has parts that cover those parts. And since MERGE_PARTS takes 30 minutes, it eventually increased the replica delay to 30 minutes, for a time range of 30 minutes, which is pretty huge.

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
ClickHouse special build check - #43084
@azat this change introduced O(n^2) complexity of fetching parts, and most of the time is spent under a mutex.
This patch has another, more major problem: it can lead to intersecting parts. Consider the following example; you have the following in the queue:
Now (T1 - thread1, T2 - thread2):
@tavplubix can you please take a look at this example?

And query of the day: find intersecting parts (though it is not 100% accurate, at least it does not take into account errors during INSERT):

SELECT
event_time,
event_type,
part_name,
exception,
merged_from,
arrayFlatten(arrayMap(x -> range((arrayMap(x -> toUInt64(x), arraySlice(splitByChar('_', x), 2, 2)) AS arr_)[1], (arr_[2]) + 1), merged_from)) AS blocks,
blocks != range(arrayMin(blocks), arrayMax(blocks) + 1) AS intersecting_parts
FROM clusterAllReplicas('drt', system.part_log)
WHERE (length(merged_from) > 0) AND (event_type = 'MergeParts') AND (table_uuid IN (
SELECT uuid
FROM system.tables
WHERE engine LIKE '%Replicated%'
)) AND (length(arrayFilter(x -> endsWith(x, '_0'), merged_from)) = length(merged_from)) AND (intersecting_parts = 1)
LIMIT 1 BY part_name
LIMIT 10
Query id: b71ce22a-6d0f-4fb5-83a1-50ab66db51db
Row 1:
──────
event_time: 2023-02-26 16:06:34
event_type: MergeParts
part_name: f64ee14d51832253179b7471fb3b20e0_15_40_1
exception: Code: 49. DB::Exception: Part f64ee14d51832253179b7471fb3b20e0_15_40_1 intersects part f64ee14d51832253179b7471fb3b20e0_20_37_1 (state Active) It is a bug. (LOGICAL_ERROR) (version 23.1.3.5 (official build))
merged_from: ['f64ee14d51832253179b7471fb3b20e0_15_15_0','f64ee14d51832253179b7471fb3b20e0_16_16_0','f64ee14d51832253179b7471fb3b20e0_17_17_0','f64ee14d51832253179b7471fb3b20e0_18_18_0','f64ee14d51832253179b7471fb3b20e0_19_19_0','f64ee14d51832253179b7471fb3b20e0_38_38_0','f64ee14d51832253179b7471fb3b20e0_39_39_0','f64ee14d51832253179b7471fb3b20e0_40_40_0']
blocks: [15,16,17,18,19,38,39,40]
intersecting_parts: 1
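The heuristic behind the query above can be sketched in Python (an assumed simplification: like the SQL comparison `blocks != range(min, max + 1)`, it flags any non-contiguous or duplicated union of source blocks, with part names reduced to the form `<partition>_<min_block>_<max_block>_<level>`):

```python
def blocks_of(part_name):
    # "<partition>_<min_block>_<max_block>_<level>" -> the block numbers it holds
    lo, hi = map(int, part_name.rsplit("_", 3)[1:3])
    return list(range(lo, hi + 1))

def intersecting(merged_from):
    blocks = sorted(b for part in merged_from for b in blocks_of(part))
    # Contiguous, duplicate-free blocks mean the merge sources were sane.
    return blocks != list(range(blocks[0], blocks[-1] + 1))

# Like the production example above: blocks 15..19 and 38..40 leave a gap,
# so the query reports intersecting_parts = 1 for the resulting merge.
parts = ["uuid_15_15_0", "uuid_16_16_0", "uuid_17_17_0", "uuid_18_18_0",
         "uuid_19_19_0", "uuid_38_38_0", "uuid_39_39_0", "uuid_40_40_0"]
assert intersecting(parts)
assert not intersecting(["uuid_15_16_1", "uuid_17_17_0"])
```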
AFAIR
AFAICS it does:
@tavplubix What did I miss? And if it will not remove the entry from the
Also, maybe you have some thoughts about how this patch introduced intersecting parts?
Hm, yes, you're right, it may remove virtual parts in some unusual cases. But it will not remove it in the case you described above because of
ClickHouse/src/Storages/MergeTree/ActiveDataPartSet.h
Lines 46 to 49 in fa467b3
So, in your example,
It may actually remove something if we have two queue entries with the same virtual part, for example, duplicate
Yeah, looks like you are right!
Actually that line ( ) doesn't do this, and my comment was wrong, but this will be done anyway in removeProcessedEntry
And it will remove
And
But what I found in logs is that

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Remove covered parts for fetched part (to avoid possible growth of replication delay)
Here is an example that I found on production, simplified.
Consider the following queue (nothing of this had been processed on this
replica):
...
...
And now queue-0000000005 started executing (either because
of reordering, or because GET_PART failed at that time), and it
does not have any of the required parts, so it will fetch them, but
not all_0_0_0 and all_1_1_0. So this replica's delay will be set to
the time of min(queue-0000000001, queue-0000000002), while it
is not true, since the replica already has parts that cover those
parts.
And since MERGE_PARTS takes 30 minutes, it eventually increased the replica
delay to 30 minutes, for a time range of 30 minutes, which is pretty huge.
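A minimal sketch (assumed and simplified, not the actual ReplicatedMergeTreeQueue code) of why the stale entries inflate the delay: the reported delay tracks the oldest still-unprocessed queue entry, so GET_PART entries for already-covered parts keep it high until they are removed.

```python
def replica_delay(now, unprocessed_entries):
    """unprocessed_entries: list of (znode_name, create_time in seconds)."""
    if not unprocessed_entries:
        return 0
    return now - min(create_time for _, create_time in unprocessed_entries)

now = 10_000
entries = [
    ("queue-0000000001", now - 1800),  # covered GET_PART, never executed
    ("queue-0000000005", now - 60),    # the long-running MERGE_PARTS
]
assert replica_delay(now, entries) == 1800  # stuck at ~30 minutes
# Once the covered entries are removed, the delay drops:
assert replica_delay(now, [("queue-0000000005", now - 60)]) == 60
```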
Cc: @tavplubix (what do you think about this?)