Proper Materialized views startup dependencies by ilejn · Pull Request #72123 · ClickHouse/ClickHouse

ilejn · 2024-11-20T08:58:52Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

A materialized view can start too late, e.g. after the Kafka table that streams to it.

ilejn · 2024-11-20T09:04:45Z

Hello @serxa, could you have a look, please, since you worked on async table loading?

The change really cures the issue, although I am still working on it because I am not sure why

TablesLoader and DatabaseCatalog coexist and keep the same or similar dependencies;
we need checkDependencies() for Kafka (StorageKafkaUtil.cpp) and other streaming engines, instead of having proper load dependencies. In other words, why is it that "loading dependencies are a subset of referential dependencies"? What prevents us from treating referential dependencies as loading dependencies?

robot-clickhouse · 2024-11-20T15:05:51Z

This is an automated comment for commit cd1cc0c with description of existing statuses. It's updated for the latest CI running

Check name	Description	Status
Builds	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	❌ failure
Integration tests	The integration tests report. In parenthesis the package type is given, and in square brackets are the optional part/total tests	❌ failure

Successful checks

Check name	Description	Status
AST fuzzer	Runs randomly generated queries to catch program errors. The build type is optionally given in parenthesis. If it fails, ask a maintainer for help	✅ success
BuzzHouse (asan)	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
BuzzHouse (debug)	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
BuzzHouse (msan)	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
BuzzHouse (tsan)	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
BuzzHouse (ubsan)	There's no description for the check yet, please add it to tests/ci/ci_config.py:CHECK_DESCRIPTIONS	✅ success
ClickBench	Runs ClickBench with instant-attach table	✅ success
Compatibility check	Checks that clickhouse binary runs on distributions with old libc versions. If it fails, ask a maintainer for help	✅ success
Docker keeper image	The check to build and optionally push the mentioned image to docker hub	✅ success
Docker server image	The check to build and optionally push the mentioned image to docker hub	✅ success
Docs check	Builds and tests the documentation	✅ success
Fast test	Normally this is the first check that is ran for a PR. It builds ClickHouse and runs most of stateless functional tests, omitting some. If it fails, further checks are not started until it is fixed. Look at the report to see which tests fail, then reproduce the failure locally as described here	✅ success
Install packages	Checks that the built packages are installable in a clear environment	✅ success
Performance Comparison	Measure changes in query performance. The performance test report is described in detail here. In square brackets are the optional part/total tests	✅ success
Stateless tests	Runs stateless functional tests for ClickHouse binaries built in various configurations -- release, debug, with sanitizers, etc	✅ success
Stress test	Runs stateless functional tests concurrently from several clients to detect concurrency-related errors	✅ success
Style check	Runs a set of checks to keep the code style clean. If some of tests failed, see the related log from the report	✅ success
Unit tests	Runs the unit tests for different release types	✅ success
Upgrade check	Runs stress tests on server version from last release and then tries to upgrade it to the version from the PR. It checks if the new server can successfully startup without any errors, crashes or sanitizer asserts	✅ success

ilejn · 2024-11-21T07:38:21Z

Test failures:

test_s3_zero_copy_replication/test.py::test_move_shared_lock_fail_keeper_unavailable (not unique, but rare https://play.clickhouse.com/play?user=play#c2VsZWN0IGNoZWNrX25hbWUsIHRlc3RfbmFtZSwgcHVsbF9yZXF1ZXN0X251bWJlciwgdGVzdF9zdGF0dXMKZnJvbSBjaGVja3MgCndoZXJlICcyMDI0LTEwLTEwJyA8PSBjaGVja19zdGFydF90aW1lICBhbmQgdGVzdF9uYW1lIGlsaWtlICclOnRlc3RfbW92ZV9zaGFyZWRfbG9ja19mYWlsX2tlZXBlcl91bmF2YWlsYWJsZSUnIGFuZCBjaGVja19uYW1lIGlsaWtlICclaW50ZWdyJScgYW5kIHRlc3Rfc3RhdHVzICE9ICdPSycKb3JkZXIgYnkgdGVzdF9kdXJhdGlvbl9tcyBkZXNjCmxpbWl0IDEwMDsKCg==
01086_window_view_cleanup is definitely flaky https://play.clickhouse.com/play?user=play#c2VsZWN0IGNoZWNrX25hbWUsIHRlc3RfbmFtZSwgcHVsbF9yZXF1ZXN0X251bWJlciwgdGVzdF9zdGF0dXMKZnJvbSBjaGVja3MgCndoZXJlICcyMDI0LTEwLTEwJyA8PSBjaGVja19zdGFydF90aW1lICBhbmQgdGVzdF9uYW1lIGlsaWtlICclMDEwODZfd2luZG93X3ZpZXdfY2xlYW51cCUnIGFuZCB0ZXN0X3N0YXR1cyAhPSAnT0snCm9yZGVyIGJ5IHRlc3RfZHVyYXRpb25fbXMgZGVzYwpsaW1pdCAxMDA7Cgo=

src/Storages/IStorage.h

ilejn · 2024-11-22T13:10:34Z

Hello @serxa, could you have a look, please?

serxa · 2024-11-27T13:15:48Z

> TablesLoader and DatabaseCatalog coexist and keep same or similar dependencies

They serve different purposes. DatabaseCatalog is a thing that stores everything consistently, a source of truth if you will. TablesLoader is just a simple helper class that exists to extract common code from two different places: (a) server startup and (b) CREATE/ATTACH TABLE queries. This helper arranges "loading" in a way that is legitimate for the catalog to process without issues. This is the main reason for them to share similar things like dependency graphs. The catalog is just not smart enough to plan the chain (the order) of actions needed to load a database. Previously there was a synchronous loading-by-levels algorithm, which was replaced by AsyncLoader. And now, probably, TablesLoader could become a part of DatabaseCatalog, because it is now much easier to track all the dependencies. But still, it would be a large refactoring with no clear objectives.

> we need checkDependencies() for Kafka (StorageKafkaUtil.cpp) and other streaming engines, instead of having proper load dependencies. In other words, why "loading dependencies are a subset of referential dependencies". What prevents us from treating referential dependencies as loading dependencies?

Any unnecessary load dependency may potentially slow down server startup because more tables will be loaded sequentially instead of in parallel. So we tried our best to separate different kinds of dependencies. IIRC referential dependencies exist only to avoid dropping a table that is mentioned in the definition of another table, but such tables could be loaded in parallel. cc @tavplubix
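
To make the cost of an unnecessary load dependency concrete, here is a minimal Python sketch (not ClickHouse code). It assumes unlimited worker threads and made-up job durations, and computes when the last job finishes with and without chained dependencies. The helper name makespan and the job names are hypothetical.

```python
from graphlib import TopologicalSorter


def makespan(durations, deps):
    """Finish time of the whole job DAG, assuming unlimited workers.

    durations: job name -> duration in seconds (made-up numbers).
    deps: job name -> set of jobs that must finish before it starts.
    """
    graph = {job: set(deps.get(job, ())) for job in durations}
    finish = {}
    # static_order() yields every job after all of its prerequisites.
    for job in TopologicalSorter(graph).static_order():
        start = max((finish[d] for d in graph[job]), default=0)
        finish[job] = start + durations[job]
    return max(finish.values())


durations = {"load a": 3, "load b": 3, "load c": 3}

# No dependencies: all three loads run in parallel.
independent = makespan(durations, {})

# Chained dependencies: the loads run one after another.
chained = makespan(durations, {"load b": {"load a"}, "load c": {"load b"}})

print(independent, chained)  # 3 9
```

Every extra edge can only lengthen the longest path through the graph, which is why keeping referential dependencies out of the load graph matters for startup time.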

ilejn · 2024-11-27T21:42:14Z

@serxa, thanks for reacting and for the clarification.

I have to admit that I am not sure that the current solution is optimal or the most elegant.
Basically, what we do is carefully detect dependencies, then ignore them by starting up streaming tables too early (before their dependencies are ready), only to make all streaming tables call checkDependencies() from their setup methods to wait for the proper moment to start.

I see the following options:

(1) limit the scope to fixing the bug and make this PR suitable for merge (cleanup + rename pushDependencies to preSetup() or postLoad()) as it currently is;
(2) move the addViewDependency call out of MaterializedView, somewhere to TablesLoader, probably to DDLDependencyVisitor (not sure if it is possible, but it should be);
(3) include view dependencies in the load dependencies for tables with a streaming engine.

(3) eliminates the need for checkDependencies(), while these methods are still needed for (1) and (2); but since the code is the same, it can be a single method of DatabaseCatalog.

Suggestions are appreciated.

ilejn · 2024-11-28T08:12:50Z

BTW, if I am not mistaken, the check for ErrorCodes::TOO_MANY_MATERIALIZED_VIEWS is not reliable.

serxa · 2024-12-03T14:10:49Z

@ilejn

> Basically, what we do is carefully detect dependencies, then ignore them by starting up streaming tables too early (before their dependencies are ready), only to make all streaming tables call checkDependencies() from their setup methods to wait for the proper moment to start.

You are completely right, this is nonsense. I think option (3) is almost the right way to do it. We should get rid of checkDependencies() and add correct dependencies in AsyncLoader for views. Now we have the following order of execution in the case of a load dependency (note that dependencies point in the direction opposite to the arrows in the diagram):

flowchart TD
    load_table1["load table1"] --> startup_table1["startup table1"]
    load_table2["load table2"] --> startup_table2["startup table2"]
    load_table1 --> load_table2

This prevents parallel execution of the load jobs. But for views, we could instead use dependencies between startup jobs to get the following order of execution:

flowchart TD
    load_table1["load table1"] --> startup_table1["startup table1"]
    load_table2["load table2"] --> startup_table2["startup table2"]
    startup_table1 --> startup_table2

This won't prevent parallel execution of load jobs. Startup jobs are not long, so ordering them would not be a problem. Whenever a table is accessed by INSERT query it waits (blocks) until the startup job is done. So this logic remains the same and will automagically resolve the issue, replacing the checkDependencies() hack.

To implement this we should pass view_dependencies into TablesLoader and use it for job construction in TablesLoader::startupTablesAsync() method. It could be done in a way similar to TablesLoader::loadTablesAsync(). Iteration should be done in view_dependencies.getTablesSortedByDependency() order as well.
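
The proposal above can be modelled in a few lines of Python. This is only a sketch under assumptions: build_jobs and the startup_after mapping are hypothetical stand-ins for the job construction in TablesLoader and for view_dependencies, and graphlib.TopologicalSorter plays the role of a dependency-sorted iteration. It is not the code from this PR.

```python
from graphlib import TopologicalSorter


def build_jobs(tables, startup_after):
    """Return a mapping: job name -> set of prerequisite job names.

    tables: iterable of table names.
    startup_after: table -> tables whose startup must finish first
                   (hypothetical stand-in for view dependencies).
    """
    jobs = {}
    for table in tables:
        jobs[f"load {table}"] = set()                 # loads stay independent
        jobs[f"startup {table}"] = {f"load {table}"}  # startup follows its own load
    for table, earlier in startup_after.items():
        # Extra edges connect startup jobs only, never load jobs.
        jobs[f"startup {table}"] |= {f"startup {e}" for e in earlier}
    return jobs


jobs = build_jobs(["table1", "table2"], {"table2": ["table1"]})

# One valid execution order; prerequisites always come first.
order = list(TopologicalSorter(jobs).static_order())
for job in order:
    print(job)
```

Because the added edges only connect startup jobs, both load jobs keep an empty prerequisite set and can still be scheduled in parallel.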

Seems not too hard at first glance. Would you like to give it a try in this PR?

serxa · 2024-12-03T14:44:56Z

> BTW, if I am not mistaken, the check for ErrorCodes::TOO_MANY_MATERIALIZED_VIEWS is not reliable.

What do you mean?

ilejn · 2024-12-03T17:08:04Z

> > BTW, if I am not mistaken, the check for ErrorCodes::TOO_MANY_MATERIALIZED_VIEWS is not reliable.

> What do you mean?

We do this check in the StorageMaterializedView ctor based on data collected in StorageMaterializedView::startup.
Switching to creating view dependencies in TablesLoader::loadTablesAsync solves this issue, among others.

ilejn · 2024-12-03T17:28:31Z

> To implement this we should pass view_dependencies into TablesLoader

Yes.
Retrieving view dependencies by DDLDependencyVisitorData is not perfect, but ok for me.

> and use it for job construction in TablesLoader::startupTablesAsync() method. It could be done in a way similar to TablesLoader::loadTablesAsync(). Iteration should be done in view_dependencies.getTablesSortedByDependency() order as well.

Should we limit the scope of this extra loop to streaming-like tables only?

> Seems not too hard at first glance.

Will see ;)

> Would you like to give it a try in this PR?

Yes, sure.

BTW, I have the impression that in the case of streaming tables and their MVs we don't have an implicit synchronization point: "Whenever a table is accessed by INSERT query it waits (blocks) until the startup job is done" does not work for them, which is why checkDependencies() is needed. Does that make sense?
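
The "implicit synchronization point" mentioned here can be illustrated with a toy Python model. It is an assumption-laden sketch, not how ClickHouse implements it: the Table class and its methods are hypothetical, and threading.Event stands in for waiting on the startup job.

```python
import threading


class Table:
    """Toy table whose INSERT path blocks until startup has finished."""

    def __init__(self, name):
        self.name = name
        self.rows = []
        self._started = threading.Event()

    def startup(self):
        # Marks the startup job as done and releases any waiting writers.
        self._started.set()

    def insert(self, row, timeout=5.0):
        # Implicit synchronization point: wait for the startup job.
        if not self._started.wait(timeout):
            raise TimeoutError(f"{self.name} did not start in time")
        self.rows.append(row)


dst = Table("dst")
writer = threading.Thread(target=dst.insert, args=({"x": 1},))
writer.start()   # waits inside insert() if startup() has not run yet
dst.startup()
writer.join()
print(dst.rows)  # [{'x': 1}]
```

A component that pushes rows without going through a waiting path like insert() gets no such guarantee, which matches the concern raised about streaming engines.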

serxa · 2024-12-03T17:57:12Z

> I have an impression that in case of streaming tables and their MVs we don't have an implicit synchronization point

I'm not very deep in that streaming code, but yes, it looks like that is true.

> Should we limit the scope of this extra loop to streaming-like tables only?

No, I believe it is needed for all tables. So it will not be an extra loop, but an improvement of the existing one that is already present there (the current loop just iterates over tables in some unspecified order). In #72589 streaming is not mentioned, but async_load_databases was identified as the root cause. That is simply because without async loading there was no way to do an INSERT in the middle (when the source table is started, but the target is not started yet). With async loading of tables it is now possible.

So we will fix both issues by introducing these new dependencies directly into AsyncLoader DAG.

ilejn · 2024-12-04T07:39:44Z

@serxa, I have several assorted concerns/questions/assumptions (probably not all of them make sense) based on your graphs and explanations.

All loads precede all startups, right? (It would be nice to make the timeline more explicit on the graphs.)
I used to think that startup is much more time-consuming than load.
Your second picture illustrates CREATE MATERIALIZED VIEW mv TO table1 AS SELECT * FROM table2, leaving mv out of scope, right?
The view_dependencies we have in master go from view to table (a view depends on its target table). I am not sure that the current view dependencies by themselves are suitable for tracking dependencies from an MV's target table to its source table(s).
What we probably need is for the MV's source table(s) to depend on the target table.
The whole point of the suggested change is to add more dependencies for startup (besides the current load_table[table_id.getFullTableName()]->goals(), we plan to introduce some extra ones that come from view dependencies).
Why do we need extra dependencies for normal tables? Implicit synchronization upon INSERT/SELECT works fine for them. Of course, we have to avoid a race condition when creating view dependencies, and that should be enough.
For streaming tables, the startup of all dependent (child) tables must be completed by the moment the parent table's startup is called. For normal tables A and B that have an MV between them, we can run the startup methods in parallel. That is why I think that we have to treat streaming tables in a special way.

It might be more convenient to communicate via Telegram. May I write to you (I know your username)?

ilejn · 2025-02-05T09:18:19Z

Looking into the test_storage_s3_queue/test.py::test_registry failure, despite the fact that it is known to be flaky.

Update: it is a matter of delay. I've managed to reproduce this on my laptop by artificially slowing it down.

@kssenii, I suggest increasing the timeout.

--- a/tests/integration/test_storage_s3_queue/test.py
+++ b/tests/integration/test_storage_s3_queue/test.py
@@ -2612,7 +2612,7 @@ def test_registry(started_cluster):
         )
 
     expected_rows = files_to_generate
-    for _ in range(20):
+    for _ in range(40):
         if expected_rows == get_count():
             break
         time.sleep(1)
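
The retry loop in the diff above is a fixed number of one-second polls. An alternative, sketched below in Python using only the standard library, is a deadline-based helper. wait_until is a hypothetical name used for illustration, not an existing helper in the ClickHouse test suite, and the default values are assumptions.

```python
import time


def wait_until(predicate, timeout=40.0, interval=1.0):
    """Poll predicate() until it returns True or the deadline passes.

    Returns True on success and False on timeout. The predicate is always
    evaluated at least once, even with a zero timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example: the condition becomes true on the third poll.
answers = iter([False, False, True])
ok = wait_until(lambda: next(answers), timeout=1.0, interval=0.01)
print(ok)  # True
```

In the test from the diff, the condition would be something like `expected_rows == get_count()`. A deadline keeps the total wait bounded even when each poll itself is slow.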

serxa · 2025-03-27T15:09:18Z

I resolved the new conflicts. It seems mergeable again now. Tests are green except the flaky one.

PR / Bugfix validation (pull_request): failing after 60m

This thing I do not understand:

Run action failed for: [Bugfix validation] with exit code [-15]
Job timed out: [Bugfix validation] exit code [-15]
ERROR: Run action failed with timeout and did not generate JobReport - update dummy report with execution time
Command `python3 /home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/tests/ci/bugfix_validate_check.py` has failed, timeout 2400s is exceeded
Run action done for: [Bugfix validation]
ERROR: Job was killed - generate evidence
INFO:botocore.credentials:Found credentials from IAM Role: ec2_admin
Posting slack message, dry_run [False]
{}
ERROR: Run failed with exit code [241]

serxa · 2025-03-27T15:19:18Z

> This thing I do not understand

It seems to be the same problem of long tests, but it showed an empty report, so I did not realize it at first glance. Let's override the mergeable check one more time and wait for private CI to finish.

ilejn · 2025-03-27T15:23:16Z

> > This thing I do not understand

> It seems to be the same problem of long tests, but it showed an empty report, so I did not realize it at first glance. Let's override the mergeable check one more time and wait for private CI to finish.

Seems reasonable, unfortunately.
The test behaves like the flaky check does.

ilejn · 2025-04-07T08:30:54Z

Hello @serxa, what can I do to have it merged?

serxa · 2025-04-07T11:27:20Z

I think I messed up while resolving conflicts in the private branch, and now there are failed tests. I was distracted, but I'll try to fix it now.

serxa · 2025-04-08T19:37:40Z

src/Databases/DDLDependencyVisitor.cpp

+                        if (!table_id.table_name.empty())
+                        {
+                            mv_to_dependency = table_id;
+                            if (mv_to_dependency->getDatabaseName().empty())


There are two places where getDatabaseName().empty() is called. But getDatabaseName() throws an exception on an empty database. I don't know why it has not been found by public CI, but some private tests found this problem.

ilejn · 2025-04-09

Great finding, I'll fix it by EOD.

ilejn · 2025-04-09

@serxa, could you share anything that facilitates reproduction? I cannot manage it yet.

Probably this code is 'inspired' by

ClickHouse/src/Databases/DDLDependencyVisitor.cpp

Line 160 in 9e59957

if (qualified_name.database.empty())

serxa · 2025-04-11

Well, I believe the problem is specific to private code, because it uses the catalog for tracking dependencies and got an exception because it uses a UUID instead of db.table:

DB::Exception: Received from 172.16.2.10:9000. DB::Exception: Database name is empty: While processing CREATE MATERIALIZED VIEW _ UUID 'ddfb88c6-d551-4467-bc53-b9244f3a8ef0' REFRESH EVERY 1 YEAR SETTINGS all_replicas = 1 APPEND TO INNER UUID '8e252499-6086-443b-8b65-a466db04ccb6' (x Int64) ENGINE = SharedMergeTree('/clickhouse/tables/{uuid}/{shard}', '{replica}') ORDER BY x SETTINGS index_granularity = 8192 DEFINER = default SQL SECURITY DEFINER AS SELECT rand() AS x

serxa · 2025-04-11

Let's just make it similar to what it was. I think it should do the trick:

QualifiedTableName target_name{table_id.database_name, table_id.table_name};
if (target_name.database.empty())
    target_name.database = current_database;
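
The idea behind the suggested fix (read the raw database field and fall back to the current database, instead of calling a getter that throws on an empty name) can be sketched language-neutrally in Python. The names QualifiedName and resolve_target are hypothetical and only mirror the C++ snippet above; this is not the ClickHouse implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualifiedName:
    database: str
    table: str


def resolve_target(database_name, table_name, current_database):
    """Build a qualified name, defaulting an empty database name.

    An empty database_name is treated as "not specified" and replaced by
    current_database, so no exception is needed for that case.
    """
    database = database_name if database_name else current_database
    return QualifiedName(database, table_name)


print(resolve_target("", "target", "default"))
# QualifiedName(database='default', table='target')
```

An explicitly qualified name passes through unchanged, so only targets given without a database pick up the current one.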

serxa · 2025-04-16T13:52:30Z

Private tests should be fixed now. At least the local check does not reveal any problems. But public CI is in bad shape at the moment...

serxa · 2025-04-17T17:18:16Z

@ilejn Let's merge master into this branch one more time

serxa · 2025-04-23T11:07:38Z

Private Sync is finally green, but for some reason CI here does not recognize this fact.

serxa · 2025-04-23T13:48:08Z

I'm so glad it is finally merged. Thanks to everybody involved!

evillique · 2025-06-19T17:17:12Z

For visibility: #82222

bobelev · 2025-11-04T12:00:35Z

Any chance of backporting it to 25.3?

serxa · 2025-11-04T13:17:22Z

> Any chance of backporting it to 25.3?

No chance, this rework is too big.

ilejn marked this pull request as draft November 20, 2024 08:59

nikitamikhaylov added the can be tested Allows running workflows for external contributors label Nov 20, 2024

robot-clickhouse-ci-2 added the pr-not-for-changelog This PR should not be mentioned in the changelog label Nov 20, 2024

ilejn force-pushed the mv_dependencies branch from f63342c to 85631db Compare November 20, 2024 20:26

Enmk reviewed Nov 21, 2024

src/Storages/IStorage.h (outdated)

ilejn mentioned this pull request Nov 21, 2024

Make 01086_window_view_cleanup more stable #72232

Merged

serxa self-assigned this Nov 27, 2024

ilejn changed the title from "Proper Materailzied views startup dependencies" to "Proper Materialized views startup dependencies" Nov 27, 2024

UnamedRus mentioned this pull request Nov 28, 2024

Possible loss materialized view data with async_load_databases setting enabled #72589

Closed

ilejn force-pushed the mv_dependencies branch from 76491d0 to a6eac5a Compare December 22, 2024 23:46

ilejn force-pushed the mv_dependencies branch 3 times, most recently from 088a82b to 618a436 Compare January 10, 2025 09:41

ilejn force-pushed the mv_dependencies branch from 877c24c to 0b39ba8 Compare January 29, 2025 20:48

ilejn force-pushed the mv_dependencies branch 2 times, most recently from 8e82023 to cd1cc0c Compare February 6, 2025 13:11

Merge branch 'master' into mv_dependencies

1bed10e

serxa enabled auto-merge March 27, 2025 15:29

serxa reviewed Apr 8, 2025

serxa and others added 2 commits April 8, 2025 20:41

Merge branch 'master' into mv_dependencies

7abbd2d

mv_dependencies: no StorageID::getDatabaseName because of exception

5e77d50

auto-merge was automatically disabled April 11, 2025 23:08
Head branch was pushed to by a user without write access

ilejn and others added 2 commits April 12, 2025 02:20

Merge remote-tracking branch 'origin/master' into mv_dependencies

770c55d

Merge branch 'master' into mv_dependencies

029e43b

serxa enabled auto-merge April 16, 2025 13:53

Merge remote-tracking branch 'origin/master' into mv_dependencies

0a69d6d

serxa added this pull request to the merge queue Apr 23, 2025

Merged via the queue into ClickHouse:master with commit c7357e5 Apr 23, 2025
113 of 122 checks passed

robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 23, 2025

ilejn mentioned this pull request May 2, 2025

S3Queue consuming data too early, when not all MVs attached to it #70728

Closed

filimonov mentioned this pull request May 2, 2025

KafkaEngine is consuming data too early, "Database XXX doesn't exist" error #8262

Closed

azat mentioned this pull request Jan 16, 2026

Remove redunant logging in DDLDependencyVisitor about "visitTableExpression for" #94407

Merged

serxa mentioned this pull request Mar 17, 2026

delay processing until server has finished loading all the tables #99700

Merged
