[RFC] Remove covered parts for fetched part#39737

Merged
tavplubix merged 5 commits intoClickHouse:masterfrom
azat:fetch-remove-covered
Nov 14, 2022

Conversation

@azat
Member

@azat azat commented Jul 30, 2022

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Remove covered parts for a fetched part (to avoid possible replication delay growth)

Here is an example that I found on production, simplified.

Consider the following queue (none of it had been processed on this replica):

  • GET_PART all_0_0_0 (queue-0000000001)
  • GET_PART all_1_1_0 (queue-0000000002)
    ...
  • GET_PART all_0_1_1 (queue-0000000003)
  • GET_PART all_2_2_0 (queue-0000000004)
    ...
  • MERGE_PARTS from [all_0_1_1, all_2_2_0] to all_0_2_2 (queue-0000000005)

Now queue-0000000005 starts executing (either because of reordering, or because GET_PART failed at that time). It does not have any of the required parts, so it will fetch them, but not all_0_0_0 and all_1_1_0, so the replica delay will be set to the time of min(queue-0000000001, queue-0000000002). This is misleading, since the replica already has parts that cover those parts.

And since MERGE_PARTS takes 30min, it eventually increased the replica delay to 30min, for a time range of 30min, which is pretty huge.

Cc: @tavplubix (what do you think about this?)

@robot-clickhouse robot-clickhouse added the pr-improvement Pull request with some product improvements label Jul 30, 2022
@azat
Member Author

azat commented Aug 5, 2022

@tavplubix can you please take a look and comment on this?

@tavplubix tavplubix self-assigned this Aug 5, 2022
Comment on lines 1032 to 1042
Member


I'm not sure that it will work correctly with "actual part name" optimization...

Member Author


You mean that here actual_new_part_name should be taken into account?

bool simple_op_covered = is_simple_producing_op && part_info.contains(MergeTreePartInfo::fromPartName((*it)->new_part_name, format_version));

Doesn't this mean that removePartProducingOpsInRange already works incorrectly when actual_new_part_name is not empty?

Member


You mean that here actual_new_part_name should be taken into account?

Not precisely. I mean that the logic that "upgrades" the currently executing entry to a covering entry probably should be taken into account. See addFuturePartIfNotCoveredByThem and related methods. It seems like a deadlock is possible, because setActualPartName may wait for older entries ("age" is determined by znode_name), but removePartProducingOpsInRange now may wait for all entries covered by part_info. And usually covered entries are older, but not when the "actual part name" logic is involved.

Doesn't this mean that removePartProducingOpsInRange already works incorrectly when actual_new_part_name is not empty?

No, because currently removePartProducingOpsInRange can be called by DROP_RANGE or REPLACE_RANGE entries only and actual_new_part_name can be set only for simple part-producing operation (GET_PART, MERGE_PARTS, MUTATE_PART)

Member Author


So I've looked at this again, and let me describe the possible scenarios (that might lead to a deadlock):

  • queue: [1_0_1_0, 1_1_2_0, 1_0_2_1]
    T1 will execute fetch of 1_0_1_0, and fetch 1_0_2_1 instead
    T2 will execute fetch of 1_1_2_0, no 1_0_2_1 in future_parts yet
    This should be safe because T2 will not fetch anything because of

    if (covered_entries_to_wait)
    {
        if (entry.znode_name < future_part_elem.second->znode_name)
        {
            out_reason = fmt::format(
                "Not executing log entry {} for part {} "
                "because it is not disjoint with part {} that is currently executing and another entry {} is newer.",
                entry.znode_name, new_part_name, future_part_elem.first, future_part_elem.second->znode_name);
            LOG_TRACE(log, fmt::runtime(out_reason));
            return true;

  • queue: [1_0_1_0, 1_1_2_0, 1_0_2_1]
    T1 will execute fetch of 1_0_1_0, and fetch 1_0_2_1 instead
    T2 will execute fetch of 1_1_2_0, and also fetch 1_0_2_1 instead
    This should be safe because of

    bool ReplicatedMergeTreeQueue::isCoveredByFuturePartsImpl(const LogEntry & entry, const String & new_part_name,
        String & out_reason, std::unique_lock<std::mutex> & /* queue_lock */,
        std::vector<LogEntryPtr> * covered_entries_to_wait) const
    {
        /// Let's check if the same part is now being created by another action.
        auto entry_for_same_part_it = future_parts.find(new_part_name);
        if (entry_for_same_part_it != future_parts.end())
        {
            const LogEntry & another_entry = *entry_for_same_part_it->second;
            out_reason = fmt::format(
                "Not executing log entry {} of type {} for part {} "
                "because another log entry {} of type {} for the same part ({}) is being processed. This shouldn't happen often.",
                entry.znode_name, entry.type, entry.new_part_name,
                another_entry.znode_name, another_entry.type, another_entry.new_part_name);
            LOG_INFO(log, fmt::runtime(out_reason));
            return true;

  • queue: [1_0_1_0, 1_1_2_0, 1_0_2_1]
    T1 will execute fetch of 1_0_1_0, and fetch 1_0_2_1 instead
    T2 will execute fetch of 1_0_2_1
    This should be safe, because actual_part_name is also added into

    if (!queue.future_parts.emplace(entry.actual_new_part_name, entry.shared_from_this()).second)

So I don't see any possible deadlock, @tavplubix maybe you can suggest some example of it?

Member Author


@tavplubix friendly ping

Member


You're right, seems like deadlock is not possible.

A potential problem is that queue entries may be reordered during recovery of a stale replica, and in rare cases a covering entry may have a smaller znode_name (you can check it by running test 993):

2022.11.08 20:37:42.819521 [ 2601779 ] {} <Trace> test_zc23mw5c.alter_table_0 (ReplicatedMergeTreeQueue): Not executing log entry queue-0000000261 for part -1_5_21_1_47 because it is not disjoint with part -1_10_10_0_46 that is currently executing and another entry queue-0000000262 is newer.

And in this case -1_10_10_0_46 (queue-0000000262) could wait for -1_5_21_1_47 (queue-0000000261) that is currently executing. And -1_5_21_1_47 could wait for the covered -1_10_10_0_46 in removePartProducingOpsInRange.

But currently it's not possible for two reasons:

  1. We fill covered_entries_to_wait and wait for these entries only when calling isCoveredByFuturePartsImpl from addFuturePartIfNotCoveredByThem from findReplicaHavingCoveringPart. We do it only for "upgrading" a part to covering part and setting "actual part name". But findReplicaHavingCoveringPart will never call addFuturePartIfNotCoveredByThem for covered part (it can only "upgrade", not "downgrade"), so actually setActualPartName will wait only for covered parts. And seems like the comparison of znode_names is redundant.
  2. Get, attach, merge and mutate entries for not-disjoint parts cannot work concurrently. The only exception is when some entry is "upgraded" and sets the "actual part name": it can wait for covered entries in setActualPartName. But by the time it reaches removePartProducingOpsInRange, not-disjoint entries no longer run concurrently.

I will verify that by adding some assertions and will try to simplify these functions.

P.S. I noticed that after 5ffc38d DROP_RANGEs are postponed if some covered entries are currently executing, but it should not work this way and DROP_RANGEs should have priority. Will try to fix it too.

Member


Hm, no, p.1 is not true: if we are executing all_1_1_0 and trying to upgrade it to all_1_5_2 while all_1_3_1 is currently executing, then all_1_1_0 will wait for all_1_3_1. So it can wait for covering parts, not only for covered ones. But p.2 seems to be true.

Member Author

@azat azat Nov 19, 2022


@tavplubix thanks for the details and for finishing the patch!

Maybe you can shed some light on the following?

A potential problem is that queue entries may be reordered during recovery of a stale replica

But why can't they be reordered without recovery?

and in rare cases a covering entry may have a smaller znode_name (you can check it by running test 993):

And if this is possible then it seems that this check is indeed redundant, no?

if (entry.znode_name < future_part_elem.second->znode_name)

Member


But why can't they be reordered without recovery?

It depends on what "order" means. There are at least 3 orders:

  • total order based on znode_name/sequential numbers of log/queue entries
  • partial (or total (?) within partition) order based on MergeTreePartInfo of parts produced by queue entries
  • "order" of entries execution (there's no order actually, it can be almost arbitrary)

This statement was about znode_name order. Entries for bigger (in terms of MergeTreePartInfo) parts usually get greater sequential numbers (and greater znode_names). Just because when we create an entry for a covering part, entries for all source parts were already created in the log. And entries are pulled from log to queue one by one in the same order.

But recovery of a stale replica is tricky: it tries to copy the queue of the source replica (while this queue is concurrently changing) and additionally creates GET_PART entries for all parts that the source replica had at the moment after copying queue entries. And an entry for a bigger part can get a smaller znode_name even if it had a greater znode_name in the source replica's queue (just like in the log message above). Something similar may happen due to removePartAndEnqueueFetch without recovery.

And if this is possible then it seems that this check is indeed redundant, no?

No, this check is needed to enforce some order here; comparing znode_name was the simplest way to avoid deadlocks.

Member Author

@azat azat Nov 19, 2022


Something similar may happen due to removePartAndEnqueueFetch without recovery.

Oh, I see, thanks. I didn't know that the queue can be filled from somewhere other than the ZooKeeper log.

No, this check is needed to enforce some order here, comparing znode_name was the simplest way to avoid deadlocks

So let me describe how I understood it, and please correct me if I'm wrong.

The server (in the example from the link you provided) tries to fetch 1_417_417_0 (queue-0000004322), but it had been "upgraded" to fetch 1_300_433_10 instead, and it is still waiting for:

  • 1_381_386_1 (queue-0000004859)
  • 1_381_394_2 (queue-0000004868)
  • 1_381_412_3 (queue-0000004886)
  • 1_388_388_0 (queue-0000004008)
  • 1_417_417_0 (queue-0000004322) -- will not try to wait for it

After your patch it will wait only for 1_388_388_0, but actually it will not start executing until that entry has been executed, and so a deadlock is not possible after this?

Looks like this is happening because of the same "actual_part_name" optimization? And it is safe because this optimization can be done only locally: it is not stored in ZooKeeper, so there are no issues if the queue entries are reordered during a replica clone. And it is also safe for the removePartAndEnqueueFetch case, since it can create only entries with a greater znode.

Maybe it is worth putting a comment there (for the znode comparison: why it is safe and what guarantees it has)?

Also, maybe for the sake of consistency it is worth making this check regardless of whether covered_entries_to_wait was passed (for shouldExecuteLogEntry)?

if (entry.znode_name < future_part_elem.second->znode_name)
{
    out_reason = fmt::format(
        "Not executing log entry {} for part {} "
        "because it is not disjoint with part {} that is currently executing and another entry {} is newer.",
        entry.znode_name, new_part_name, future_part_elem.first, future_part_elem.second->znode_name);
    LOG_TRACE(log, fmt::runtime(out_reason));
    return true;
}

@tavplubix
Member

It makes sense to remove entries covered by a fetched part, but I'm just afraid to touch the replication queue unless I really have to :)
Btw, maybe we should also change the way replica delay is calculated, or introduce a smarter metric

@azat
Member Author

azat commented Aug 13, 2022

It makes sense to remove entries covered by a fetched part, but I'm just afraid to touch the replication queue unless I really have to :)

Completely understand, and I kind of agree, but it makes the delay inaccurate sometimes.
So what will be the final decision?

Btw, maybe we should also change the way replica delay is calculated, or introduce a smarter metric

I thought about this too, but it seems that it may complicate things even more...
Plus, if such entries (the ones this patch removes from the queue) stayed in the queue, they would be retried over and over, with noisy messages in the log.

@azat azat marked this pull request as draft August 18, 2022 14:27
@azat azat force-pushed the fetch-remove-covered branch from a3f12a4 to d4bbf92 Compare September 30, 2022 08:03
@azat azat force-pushed the fetch-remove-covered branch from d4bbf92 to 8db31be Compare October 26, 2022 08:10
@tavplubix tavplubix marked this pull request as ready for review November 8, 2022 11:48
@tavplubix
Member

ClickHouse special build check - #43084
Stress test (msan) - clearOldPartsFromFilesystem

@alexey-milovidov
Member

@azat this change introduced O(n^2) complexity of fetching parts, and most of the time is spent under a mutex.
If the number of data parts is large (e.g. 50 000), it becomes extremely slow to recover.

alexey-milovidov added a commit that referenced this pull request Feb 12, 2023
alexey-milovidov added a commit that referenced this pull request Feb 13, 2023
azat added a commit to azat/ClickHouse that referenced this pull request Feb 14, 2023
@azat
Member Author

azat commented Mar 1, 2023

This patch has another, more serious problem: it can lead to intersecting parts.

Consider the following example, you have the following in the queue:

  • all_1_1_0 (GET_PART)
  • ...
  • all_40_40_0 (GET_PART)
  • all_20_37_1 (MERGE_PARTS)

Now (T1 - thread1, T2 - thread2):

@tavplubix can you please take a look at this example? (ReplicatedMergeTree is not a trivial thing)

And the query of the day: find intersecting parts (though it is not 100% accurate; at least it does not take into account errors during INSERT):

SELECT
    event_time,
    event_type,
    part_name,
    exception,
    merged_from,
    arrayFlatten(arrayMap(x -> range((arrayMap(x -> toUInt64(x), arraySlice(splitByChar('_', x), 2, 2)) AS arr_)[1], (arr_[2]) + 1), merged_from)) AS blocks,
    blocks != range(arrayMin(blocks), arrayMax(blocks) + 1) AS intersecting_parts
FROM clusterAllReplicas('drt', system.part_log)
WHERE (length(merged_from) > 0) AND (event_type = 'MergeParts') AND (table_uuid IN (
    SELECT uuid
    FROM system.tables
    WHERE engine LIKE '%Replicated%'
)) AND (length(arrayFilter(x -> endsWith(x, '_0'), merged_from)) = length(merged_from)) AND (intersecting_parts = 1)
LIMIT 1 BY part_name
LIMIT 10

Query id: b71ce22a-6d0f-4fb5-83a1-50ab66db51db

Row 1:
──────
event_time:         2023-02-26 16:06:34
event_type:         MergeParts
part_name:          f64ee14d51832253179b7471fb3b20e0_15_40_1
exception:          Code: 49. DB::Exception: Part f64ee14d51832253179b7471fb3b20e0_15_40_1 intersects part f64ee14d51832253179b7471fb3b20e0_20_37_1 (state Active) It is a bug. (LOGICAL_ERROR) (version 23.1.3.5 (official build))
merged_from:        ['f64ee14d51832253179b7471fb3b20e0_15_15_0','f64ee14d51832253179b7471fb3b20e0_16_16_0','f64ee14d51832253179b7471fb3b20e0_17_17_0','f64ee14d51832253179b7471fb3b20e0_18_18_0','f64ee14d51832253179b7471fb3b20e0_19_19_0','f64ee14d51832253179b7471fb3b20e0_38_38_0','f64ee14d51832253179b7471fb3b20e0_39_39_0','f64ee14d51832253179b7471fb3b20e0_40_40_0']
blocks:             [15,16,17,18,19,38,39,40]
intersecting_parts: 1

@tavplubix
Member

T1: remove 18 entries from the queue (queue.virtual_parts)

AFAIR removePartProducingOpsInRange does not remove anything from virtual_parts, it only removes covered entries from the queue. So it could not lead to intersecting parts.

@azat
Member Author

azat commented Mar 15, 2023

AFAIR removePartProducingOpsInRange does not remove anything from virtual_parts, it only removes covered entries from the queue. So it could not lead to intersecting parts.

AFAICS it does:

@tavplubix What did I miss?

And if it will not remove the entry from virtual_parts, who will?

Also maybe you have some thoughts about how this patch introduced intersecting parts?

@tavplubix
Member

Hm, yes, you're right, it may remove virtual parts in some unusual cases.

But it will not remove it in the case you described above because ActiveDataPartSet contains only "the most top-level" parts (covered parts are replaced with a covering one in ActiveDataPartSet::add), and ActiveDataPartSet::remove looks for exact match:

bool remove(const MergeTreePartInfo & part_info)
{
    return part_info_to_name.erase(part_info) > 0;
}

So, in your example, T1: commit all_20_37_1 (added to active data parts) replaces all_20_20_0 ... all_37_37_0 with all_20_37_1. After that, virtual_parts.remove("all_20_20_0") will do nothing and return false, because there's no such part in virtual_parts. So usually it's safe to call removePartProducingOpsInRange after fetching a part.

It may actually remove something if we have two queue entries with the same virtual part, for example, duplicate GET_PART entries (they can appear after cloneReplica). A part covers itself, so after finishing the first entry we will find the second one and will try to remove it. And in this case, ActiveDataPartSet::remove will find the part and remove it.

@azat
Member Author

azat commented Mar 15, 2023

Yeah, looks like you are right!

So, in your example, T1: commit all_20_37_1 (added to active data parts) replaces all_20_20_0 ... all_37_37_0 with all_20_37_1

Actually that line (replaced_parts = checkPartChecksumsAndCommit(transaction, part, hardlinked_files);) doesn't do this, and my comment was wrong, but this will be done anyway in removeProcessedEntry

And it will remove all_20_20_0 - all_37_37_0 from virtual_parts when it adds all_20_37_1 to it (when it pulls this entry into the queue) in ReplicatedMergeTreeQueue::insertUnlocked

And all_20_37_1 should be in virtual parts, because it was set as actual_part_name -

reject_reason = fmt::format("Log entry for part {} or covering part is not pulled from log to queue yet.", part_name);

But what I found in the logs is that all_20_37_1 had been processed before all_15_15_0 - all_40_40_0 (the insertUnlocked call), and with this order it looks like it was indeed possible to create the intersecting part all_15_40_1. What do you think?

