Cleanup the deduplication in materialized views. #60008
Description
Remove these settings:
deduplicate_blocks_in_dependent_materialized_views
update_insert_deduplication_token_in_dependent_materialized_views
They will become no-ops.
Instead, the deduplication should work in a similar way to when an explicit insert_deduplication_token is specified,
but using the hash of the inserted blocks of data as a token.
How an INSERT query works:
1. Each INSERT from a user is represented as a stream of data. This stream of data is represented by a sequence of one or more blocks (max_insert_block_size). These blocks can be squashed together to form larger blocks (min_insert_block_size_rows, min_insert_block_size_bytes), which also represent a sequence of one or more blocks. The blocks can also be formed by the data sent from different concurrent clients, which is the case of async_insert.
2. Each block from a sequence is processed by the table (IStorage::write method), either sequentially or by multiple blocks in parallel (max_insert_threads), and by materialized views attached to this table, and all materialized views in the chain. Each materialized view does some processing on each block and writes it to its underlying table using the IStorage::write method.
3. The table's write method splits each block by partition key, forming smaller blocks. These blocks are then committed (one by one) atomically to Keeper, and to the filesystem, and to the state of the table in memory. During the atomic commit of each resulting block, it is checked for duplicates, using a set of hashes stored in Keeper.
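
The three steps above can be sketched roughly as follows. This is only an illustration, not ClickHouse code: the function names (squash, split_by_partition, commit) are hypothetical, min_rows plays the role of min_insert_block_size_rows, the in-memory set seen_hashes stands in for the set of hashes stored in Keeper, and SHA-256 over repr() is an assumed stand-in for the real block hash.

```python
import hashlib
from collections import defaultdict


def squash(blocks, min_rows):
    """Merge consecutive blocks until each has at least min_rows rows.

    The last resulting block may be smaller. The result depends only on
    the input sequence, so squashing is deterministic.
    """
    out, current = [], []
    for block in blocks:
        current.extend(block)
        if len(current) >= min_rows:
            out.append(current)
            current = []
    if current:
        out.append(current)
    return out


def split_by_partition(block, partition_key):
    """Split one block into smaller blocks, one per partition value."""
    parts = defaultdict(list)
    for row in block:
        parts[partition_key(row)].append(row)
    return dict(parts)


def commit(partition_id, rows, seen_hashes):
    """Commit one partition-level block unless it is a duplicate.

    seen_hashes is a hypothetical stand-in for the hash set in Keeper.
    Returns True if the block was written, False if it was deduplicated.
    """
    digest = hashlib.sha256(repr((partition_id, rows)).encode()).hexdigest()
    if digest in seen_hashes:
        return False
    seen_hashes.add(digest)
    return True
```

With these helpers, committing the same partition-level block twice writes it only once, which is the duplicate check described in step 3.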
How a deduplication token works:
A user can provide a deduplication token along with an INSERT query using the insert_deduplication_token setting.
1. This token is combined with the deterministic sequential number of a block received by INSERT, possibly after squashing (as the squashing process is also deterministic), but before combining the blocks from different clients in the case of async_insert (in this case, the tokens will correspond to the ranges in the combined block).
2. Then it is combined with the table identifier for each participating table in the chain of materialized views (the table identifier is an immutable and persistent UUID for Atomic and Replicated databases, or is represented by the database and table name in the obsolete Ordinary databases). Then it is combined with the partition identifier of every table's partition. Finally, this combination is hashed, and the resulting hash (or multiple hashes in the case of async_insert) is matched against the set of hashes in Keeper.
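
As a rough sketch of how these components could be combined into one deduplication hash: the function name dedup_hash and its parameter names are hypothetical, and SHA-256 with length-prefixed fields is an assumption chosen for the illustration, not the hash function or encoding ClickHouse actually uses.

```python
import hashlib


def dedup_hash(user_token, block_number, table_id, partition_id):
    """Combine token, block number, table id and partition id into one hash.

    Each component is length-prefixed before hashing so that different
    component tuples cannot collapse into the same byte string.
    """
    h = hashlib.sha256()
    for component in (user_token, str(block_number), table_id, partition_id):
        data = component.encode("utf-8")
        h.update(len(data).to_bytes(8, "little"))
        h.update(data)
    return h.hexdigest()
```

The same token, block number, table and partition always give the same hash, so a retried INSERT maps to the entries already stored in Keeper, while every table in the chain of materialized views and every partition gets its own distinct hash.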
How the deduplication should work without a deduplication token:
Calculate a hash of every block (or of every range of a block, in the case of async_insert) received by INSERT after step 1, and use it in the same way as an explicit deduplication token in the later steps.
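
A minimal sketch of this proposal, under stated assumptions: the Table class, insert function and fail_at parameter are hypothetical, every table has a single partition named "all", SHA-256 over repr() stands in for the real block hash, and a materialized view is modelled as a table with a transform applied to the source block. The token is computed once from the inserted block and reused by every table in the chain.

```python
import hashlib


def content_token(block):
    """Hash of the inserted block, used in place of an explicit token."""
    return hashlib.sha256(repr(block).encode()).hexdigest()


def dedup_key(token, table_id, partition_id):
    """Combine the token with the table and partition identifiers."""
    joined = "\x00".join((token, table_id, partition_id))
    return hashlib.sha256(joined.encode()).hexdigest()


class Table:
    """Hypothetical table; a materialized view is a Table with a transform."""

    def __init__(self, table_id, transform=lambda rows: rows):
        self.table_id = table_id
        self.transform = transform
        self.rows = []
        self.seen = set()  # stand-in for the set of hashes in Keeper

    def write(self, token, block):
        key = dedup_key(token, self.table_id, "all")
        if key in self.seen:
            return False  # duplicate: skip the write
        self.seen.add(key)
        self.rows.extend(self.transform(block))
        return True


def insert(block, chain, fail_at=None):
    """Write a block to every table in the chain, optionally failing midway."""
    token = content_token(block)  # computed once, from the source block
    for i, table in enumerate(chain):
        if i == fail_at:
            raise RuntimeError("simulated partial failure")
        table.write(token, block)
```

If an insert fails after the source table has been written but before the materialized view, repeating the same insert skips the source table and fills in only the view, so the retry is safe.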
Benefits
- Insertion into a table with materialized views becomes idempotent (it is safe to repeat the insertion in the case of partial errors or any other case).
- We take out the calculation of block hash from the MergeTree engine, and it becomes unified with the way we do it for async_insert.
- The hash will be calculated for inserts into all table engines and can be reused by other table engines, such as Memory, or even Distributed. However, we can omit this calculation if there are no tables supporting deduplication in the chain.
- We will remove the obscure settings deduplicate_blocks_in_dependent_materialized_views and update_insert_deduplication_token_in_dependent_materialized_views, which cannot be used by anyone except their authors.