Clean up the deduplication in materialized views.

Remove these settings:

```
deduplicate_blocks_in_dependent_materialized_views
update_insert_deduplication_token_in_dependent_materialized_views
```

They will become no-ops.

Instead, the deduplication should work in a similar way to when an explicit `insert_deduplication_token` is specified,
but using the hash of the inserted blocks of data as a token.

**How an INSERT query works:**

1. Each INSERT from a user is represented as a stream of data, which is a sequence of one or more blocks (`max_insert_block_size`). These blocks can be squashed together to form larger blocks (`min_insert_block_size_rows`, `min_insert_block_size_bytes`), which again form a sequence of one or more blocks. A block can also be formed from data sent by different concurrent clients, as is the case with `async_insert`.

2. Each block in the sequence is processed by the table (the `IStorage::write` method), either sequentially or with multiple blocks in parallel (`max_insert_threads`), and by the materialized views attached to this table, and by all materialized views further down the chain. Each materialized view does some processing on each block and writes the result to its underlying table using the `IStorage::write` method.

3. The table's write method splits each block by partition key, forming smaller blocks. These blocks are then committed (one by one) atomically to Keeper, and to the filesystem, and to the state of the table in memory. During the atomic commit of each resulting block, it is checked for duplicates, using a set of hashes stored in Keeper.
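The three steps above can be sketched as a toy pipeline. This is only an illustration under simplifying assumptions: rows are plain tuples, squashing is by row count only, and the function names (`split_into_blocks`, `squash`, `split_by_partition`) are hypothetical and do not correspond to ClickHouse internals.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, Iterator

Row = tuple
Block = list[Row]


def split_into_blocks(rows: Iterable[Row], max_block_rows: int) -> Iterator[Block]:
    """Cut the incoming stream into blocks of at most max_block_rows rows."""
    block: Block = []
    for row in rows:
        block.append(row)
        if len(block) == max_block_rows:
            yield block
            block = []
    if block:
        yield block


def squash(blocks: Iterable[Block], min_block_rows: int) -> Iterator[Block]:
    """Merge consecutive blocks until each has at least min_block_rows rows.

    The last block may be smaller. The result depends only on the input
    sequence, so the squashing is deterministic.
    """
    pending: Block = []
    for block in blocks:
        pending.extend(block)
        if len(pending) >= min_block_rows:
            yield pending
            pending = []
    if pending:
        yield pending


def split_by_partition(
    block: Block, partition_key: Callable[[Row], Hashable]
) -> dict[Hashable, Block]:
    """Split one block into smaller blocks, one per partition key value."""
    parts: dict[Hashable, Block] = defaultdict(list)
    for row in block:
        parts[partition_key(row)].append(row)
    return dict(parts)
```

Each value returned by `split_by_partition` corresponds to one of the smaller blocks that is committed and checked for duplicates on its own.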

**How a deduplication token works:**

A user can provide a deduplication token along with an INSERT query using the `insert_deduplication_token` setting.

1. This token is combined with the deterministic sequential number of a block received by INSERT, possibly after squashing (as the squashing process is also deterministic), but before combining the blocks from different clients in the case of `async_insert` (in this case, the tokens will correspond to the ranges in the combined block). 

2. Then it is combined with the table identifier of each participating table in the chain of materialized views (the table identifier is an immutable and persistent UUID for `Atomic` and `Replicated` databases, or the database and table name in the obsolete `Ordinary` databases). Then it is combined with the partition identifier of every table's partition. Finally, this combination is hashed, and the resulting hash (or multiple hashes in the case of `async_insert`) is matched against the set of hashes in Keeper.
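As a rough sketch, the combination described in these two steps could look like the following. The hash function (SHA-256), the length-prefixed encoding, and the names `dedup_hash` and `KeeperHashSet` are assumptions made for illustration; the issue does not specify how ClickHouse encodes or hashes the components.

```python
import hashlib


def dedup_hash(user_token: str, block_number: int, table_id: str, partition_id: str) -> str:
    """Combine the token with the block number, table id and partition id, then hash.

    Each component is length-prefixed so that ("ab", "c") and ("a", "bc")
    cannot produce the same byte stream.
    """
    h = hashlib.sha256()
    for part in (user_token, str(block_number), table_id, partition_id):
        data = part.encode("utf-8")
        h.update(len(data).to_bytes(8, "little"))
        h.update(data)
    return h.hexdigest()


class KeeperHashSet:
    """Hypothetical in-memory stand-in for the set of hashes stored in Keeper."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def commit_if_new(self, block_hash: str) -> bool:
        """Return True if the block should be written, False if it is a duplicate."""
        if block_hash in self._seen:
            return False
        self._seen.add(block_hash)
        return True
```

Because the table identifier is part of the hashed value, the same user token yields a different hash for the source table and for every table behind a materialized view, so each table deduplicates independently.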

**How the deduplication should work without a deduplication token:**

Calculate a hash of every block (or of every range of a block in the case of `async_insert`) received by INSERT after step 1 of the token scheme above, and use it in the same way as an explicit deduplication token in the later steps.
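A minimal sketch of this idea, assuming rows are tuples and using SHA-256 over the `repr` of each value purely for illustration (the real hash would be computed over the block's column data). The names `block_content_token`, `effective_token` and `range_tokens` are hypothetical.

```python
import hashlib
from typing import Optional, Sequence


def block_content_token(block: Sequence[tuple]) -> str:
    """Derive a token from the contents of a block."""
    h = hashlib.sha256()
    for row in block:
        # Prefix each row with its column count and each value with its
        # length, so that row and value boundaries are unambiguous.
        h.update(len(row).to_bytes(8, "little"))
        for value in row:
            data = repr(value).encode("utf-8")
            h.update(len(data).to_bytes(8, "little"))
            h.update(data)
    return h.hexdigest()


def effective_token(user_token: Optional[str], block: Sequence[tuple]) -> str:
    """Use the explicit token if one was given, otherwise the content hash."""
    return user_token if user_token else block_content_token(block)


def range_tokens(combined_block: Sequence[tuple], ranges: Sequence[tuple]) -> list:
    """For async_insert: one token per client range [start, end) of the combined block."""
    return [block_content_token(combined_block[start:end]) for start, end in ranges]
```

The value returned by `effective_token` would then be combined with the block number, table identifier and partition identifier exactly as an explicit token would be.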

**Benefits**

1. Insertion into a table with materialized views becomes idempotent (it is safe to repeat the insertion in the case of partial errors or any other case).
2. We take the calculation of the block hash out of the MergeTree engine, and it becomes unified with the way we do it for `async_insert`.
3. The hash will be calculated for inserts into all table engines and can be reused by other table engines, such as `Memory`, or even `Distributed`. However, we can omit this calculation if there are no tables supporting deduplication in the chain.
4. We will remove the obscure settings `deduplicate_blocks_in_dependent_materialized_views` and `update_insert_deduplication_token_in_dependent_materialized_views`, which cannot be used by anyone except their authors.
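The idempotency claim in the first benefit can be illustrated with a toy model of a source table and one materialized view. This is a sketch under strong assumptions: the `Table` class and `insert` function are hypothetical, partitions and block numbers are ignored, and a per-table Python set stands in for the hashes kept in Keeper.

```python
import hashlib


class Table:
    """Toy table with its own set of committed block hashes."""

    def __init__(self, table_id, transform=None):
        self.table_id = table_id
        self.transform = transform  # None for the source table; row -> row for a view
        self.rows = []
        self.seen = set()
        self.fail_next_write = False  # lets the example simulate a partial error

    def write(self, token, block):
        if self.fail_next_write:
            self.fail_next_write = False
            raise RuntimeError(f"simulated failure writing to {self.table_id}")
        # The token comes from the originally inserted block; the table id
        # makes the hash distinct for every table in the chain.
        key = hashlib.sha256(f"{token}:{self.table_id}".encode()).hexdigest()
        if key in self.seen:
            return False  # duplicate, skipped
        self.seen.add(key)
        self.rows.extend(block)
        return True


def insert(source, views, block):
    """Write a block to the source table, then to every attached view."""
    token = hashlib.sha256(repr(block).encode()).hexdigest()
    source.write(token, block)
    for view in views:
        view.write(token, [view.transform(row) for row in block])


source = Table("src")
mv = Table("mv", transform=lambda row: (row[0], row[1] * 2))
block = [(1, 10), (2, 20)]

mv.fail_next_write = True
try:
    insert(source, [mv], block)  # the source commits, the view fails
except RuntimeError:
    pass
insert(source, [mv], block)  # retry: the source skips the block, the view commits it
insert(source, [mv], block)  # full duplicate: nothing changes
```

After the retry, both tables contain the block exactly once, regardless of where the first attempt failed.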


Cleanup the deduplication in materialized views. #60008
