-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Description
Hi!
Describe the unexpected behaviour
I found out a problem with how setting async_load_databases works with materialized views. Restart with async_load_databases enabled can lead to data loss processed by materialized views.
For example, if we have pipeline source_table -> mat_view -> target_table with async load source_table can be loaded before mat_view and target_table, and data inserts can be completed successfully as soon as source_table loaded, and loaded only in source_table.
We faced this problem, after enabling async_load_databases setting and in query_log during restart saw that target_table not
appeared in inserts queries, but after some delay (possible because at that moment mat_view and target_table were loaded) it reappeared.
How to reproduce
We use 24.8.4.13 LTS but it is possibly reproducing since setting was introduced and especially after 24.6 where async_load_databases enabled by default (somehow in documentation the default is false)
Our case
We have source table simi2_cdc.metadata_H for storing history data:
CREATE TABLE simi2_cdc.metadata_H
(
`id` String,
`document_created` DateTime64(6),
`document_class_id` Int32,
...
`meta.csn` Nullable(String),
`meta.deleted` UInt8,
`meta.loaded` DateTime64(6) DEFAULT now()
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/simi2_cdc/metadata_H', '{replica}')
PARTITION BY (document_class_id, toYear(document_created))
PRIMARY KEY id
ORDER BY idtarget table simi2_cdc.metadata_S with ReplacingMergeTree engine for storing state:
CREATE TABLE simi2_cdc.metadata_S
(
`id` String,
`document_created` DateTime64(6),
`document_class_id` Int32,
...
`meta.csn` UInt128,
`meta.deleted` UInt8,
`meta.loaded` DateTime64(6),
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/simi2_cdc/metadata_S', '{replica}', `meta.csn`, `meta.deleted`)
PARTITION BY (document_class_id, toYear(document_created))
PRIMARY KEY id
ORDER BY (id, document_created)and materialized view:
CREATE MATERIALIZED VIEW simi2_cdc.metadata_mv TO simi2_cdc.metadata_S
AS SELECT
`id`,
`document_created`,
`document_class_id`,
...
toUInt128(coalesce(`meta.csn`, '0')) AS `meta.csn`,
`meta.deleted`,
`meta.loaded`
FROM simi2_cdc.metadata_HIn query_log we saw:
SELECT
query_start_time,
substring(query, 1, 40),
tables
FROM system.query_log
WHERE ((query_start_time >= '2024-11-18 21:36:00') AND (query_start_time <= '2024-11-18 22:09:00')) AND (query_kind = 'Insert') AND has(tables, 'simi2_cdc.metadata_H')
ORDER BY query_start_time ASC
FORMAT TSVWithNamesbefore restart:
query_start_time substring(query, 1, 40) tables
...
2024-11-18 21:36:14 INSERT INTO simi2_cdc.metadata_H(`id`,`s ['simi2_cdc.metadata_H','simi2_cdc.metadata_S']
2024-11-18 21:36:14 INSERT INTO simi2_cdc.metadata_H(`id`,`s ['simi2_cdc.metadata_H','simi2_cdc.metadata_S']
2024-11-18 21:36:39 INSERT INTO simi2_cdc.metadata_H(`id`,`s ['simi2_cdc.metadata_H','simi2_cdc.metadata_S']
during restart:
...
2024-11-18 21:54:09 INSERT INTO simi2_cdc.metadata_H(`id`,`s ['simi2_cdc.metadata_H','simi2_cdc.metadata_S']
2024-11-18 21:54:09 INSERT INTO simi2_cdc.metadata_H(`id`,`s ['simi2_cdc.metadata_H','simi2_cdc.metadata_S']
2024-11-18 22:06:15 INSERT INTO simi2_cdc.metadata_H(`id`,`s ['simi2_cdc.metadata_H']
2024-11-18 22:06:15 INSERT INTO simi2_cdc.metadata_H(`id`,`s ['simi2_cdc.metadata_H']
...
and some time after all tables are loaded:
2024-11-18 22:08:26 INSERT INTO simi2_cdc.metadata_H(`id`,`s ['simi2_cdc.metadata_H']
2024-11-18 22:08:26 INSERT INTO simi2_cdc.metadata_H(`id`,`s ['simi2_cdc.metadata_H']
2024-11-18 22:08:38 INSERT INTO simi2_cdc.metadata_H(`id`,`s ['simi2_cdc.metadata_H','simi2_cdc.metadata_S']
2024-11-18 22:08:38 INSERT INTO simi2_cdc.metadata_H(`id`,`s ['simi2_cdc.metadata_H','simi2_cdc.metadata_S']
Expected behavior
Maybe this is unexpected behavior because it can lead to data loss and users might not be aware of this.