Skip to content

Possible loss materialized view data with async_load_databases setting enabled #72589

@hulk8

Description

@hulk8

Hi!

Describe the unexpected behaviour
I found out a problem with how setting async_load_databases works with materialized views. Restart with async_load_databases enabled can lead to data loss processed by materialized views.

For example, if we have pipeline source_table -> mat_view -> target_table with async load source_table can be loaded before mat_view and target_table, and data inserts can be completed successfully as soon as source_table loaded, and loaded only in source_table.
We faced this problem, after enabling async_load_databases setting and in query_log during restart saw that target_table not
appeared in inserts queries, but after some delay (possible because at that moment mat_view and target_table were loaded) it reappeared.

How to reproduce
We use 24.8.4.13 LTS but it is possibly reproducing since setting was introduced and especially after 24.6 where async_load_databases enabled by default (somehow in documentation the default is false)

Our case
We have source table simi2_cdc.metadata_H for storing history data:

CREATE TABLE simi2_cdc.metadata_H
(
    `id` String,
    `document_created` DateTime64(6),
    `document_class_id` Int32,
    ...
    `meta.csn` Nullable(String),
    `meta.deleted` UInt8,
    `meta.loaded` DateTime64(6) DEFAULT now()
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/simi2_cdc/metadata_H', '{replica}')
PARTITION BY (document_class_id, toYear(document_created))
PRIMARY KEY id
ORDER BY id

target table simi2_cdc.metadata_S with ReplacingMergeTree engine for storing state:

CREATE TABLE simi2_cdc.metadata_S
(
    `id` String,
    `document_created` DateTime64(6),
    `document_class_id` Int32,
    ...
    `meta.csn` UInt128,
    `meta.deleted` UInt8,
    `meta.loaded` DateTime64(6),
)
ENGINE = ReplicatedReplacingMergeTree('/clickhouse/tables/{shard}/simi2_cdc/metadata_S', '{replica}', `meta.csn`, `meta.deleted`)
PARTITION BY (document_class_id, toYear(document_created))
PRIMARY KEY id
ORDER BY (id, document_created)

and materialized view:

CREATE MATERIALIZED VIEW simi2_cdc.metadata_mv TO simi2_cdc.metadata_S
AS SELECT
    `id`,
    `document_created`,
    `document_class_id`,
    ...
    toUInt128(coalesce(`meta.csn`, '0')) AS `meta.csn`,
    `meta.deleted`,
    `meta.loaded`
FROM simi2_cdc.metadata_H

In query_log we saw:

SELECT
    query_start_time,
    substring(query, 1, 40),
    tables
FROM system.query_log
WHERE ((query_start_time >= '2024-11-18 21:36:00') AND (query_start_time <= '2024-11-18 22:09:00')) AND (query_kind = 'Insert') AND has(tables, 'simi2_cdc.metadata_H')
ORDER BY query_start_time ASC
FORMAT TSVWithNames

before restart:

query_start_time        substring(query, 1, 40) tables
...
2024-11-18 21:36:14     INSERT INTO simi2_cdc.metadata_H(`id`,`s        ['simi2_cdc.metadata_H','simi2_cdc.metadata_S']
2024-11-18 21:36:14     INSERT INTO simi2_cdc.metadata_H(`id`,`s        ['simi2_cdc.metadata_H','simi2_cdc.metadata_S']
2024-11-18 21:36:39     INSERT INTO simi2_cdc.metadata_H(`id`,`s        ['simi2_cdc.metadata_H','simi2_cdc.metadata_S']

during restart:

...
2024-11-18 21:54:09     INSERT INTO simi2_cdc.metadata_H(`id`,`s        ['simi2_cdc.metadata_H','simi2_cdc.metadata_S']
2024-11-18 21:54:09     INSERT INTO simi2_cdc.metadata_H(`id`,`s        ['simi2_cdc.metadata_H','simi2_cdc.metadata_S']
2024-11-18 22:06:15     INSERT INTO simi2_cdc.metadata_H(`id`,`s        ['simi2_cdc.metadata_H']
2024-11-18 22:06:15     INSERT INTO simi2_cdc.metadata_H(`id`,`s        ['simi2_cdc.metadata_H']
...

and some time after all tables are loaded:

2024-11-18 22:08:26     INSERT INTO simi2_cdc.metadata_H(`id`,`s        ['simi2_cdc.metadata_H']
2024-11-18 22:08:26     INSERT INTO simi2_cdc.metadata_H(`id`,`s        ['simi2_cdc.metadata_H']
2024-11-18 22:08:38     INSERT INTO simi2_cdc.metadata_H(`id`,`s        ['simi2_cdc.metadata_H','simi2_cdc.metadata_S']
2024-11-18 22:08:38     INSERT INTO simi2_cdc.metadata_H(`id`,`s        ['simi2_cdc.metadata_H','simi2_cdc.metadata_S']

Expected behavior
Maybe this is unexpected behavior because it can lead to data loss and users might not be aware of this.

Metadata

Metadata

Assignees

No one assigned

    Labels

    unexpected behaviourResult is unexpected, but not entirely wrong at the same time.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions