Parallel processing right after reading FROM file() #48525
devcrafter merged 15 commits into master
|
Before:
EXPLAIN PIPELINE
SELECT sum(length(base58Encode(URL)))
FROM file('hits_*.parquet')
Query id: daaea006-eee5-4561-b293-2e07a224dd9d
┌─explain──────────────────────────┐
│ (Expression) │
│ ExpressionTransform × 18 │
│ (Aggregating) │
│ Resize 1 → 18 │
│ AggregatingTransform │
│ (Expression) │
│ ExpressionTransform │
│ (ReadFromPreparedSource) │
│ NullSource 0 → 1 │
└──────────────────────────────────┘
SELECT sum(length(base58Encode(URL)))
FROM file('hits.parquet')
Query id: 960c63d0-7e11-4c5b-8d47-4a5800cf6429
┌─sum(length(base58Encode(URL)))─┐
│ 942426048 │
└────────────────────────────────┘
1 row in set. Elapsed: 124.852 sec. Processed 8.87 million rows, 776.83 MB (71.08 thousand rows/s., 6.22 MB/s.)
After:
EXPLAIN PIPELINE
SELECT sum(length(base58Encode(URL)))
FROM file('hits.parquet')
┌─explain──────────────────────────┐
│ (Expression) │
│ ExpressionTransform × 18 │
│ (Aggregating) │
│ Resize 18 → 18 │
│ AggregatingTransform × 18 │
│ StrictResize 18 → 18 │
│ (Expression) │
│ ExpressionTransform × 18 │
│ (ReadFromStorage) │
│ Resize 1 → 18 │
│ File 0 → 1 │
└──────────────────────────────────┘
SELECT sum(length(base58Encode(URL)))
FROM file('hits.parquet')
Query id: 63fd4df3-a180-4961-be32-244e231db9c9
┌─sum(length(base58Encode(URL)))─┐
│ 942426048 │
└────────────────────────────────┘
1 row in set. Elapsed: 7.917 sec. Processed 8.87 million rows, 776.83 MB (1.12 million rows/s., 98.12 MB/s.)
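For a rough sense of scale, the two elapsed times above can be compared directly. The sketch below is not part of the PR; it only redoes the arithmetic on the reported timings. It assumes the 18 streams shown in the After pipeline correspond to 18 worker threads, and the function names are made up for illustration.

```python
def speedup(before_sec: float, after_sec: float) -> float:
    """Ratio of elapsed times: how many times faster the second run is."""
    return before_sec / after_sec


def parallel_efficiency(observed_speedup: float, streams: int) -> float:
    """Fraction of the ideal linear speedup that was actually achieved."""
    return observed_speedup / streams


# Elapsed times reported for the two runs above.
BEFORE_SEC = 124.852
AFTER_SEC = 7.917
# Assumption: 18 streams (from the After pipeline) means 18 threads.
STREAMS = 18

if __name__ == "__main__":
    s = speedup(BEFORE_SEC, AFTER_SEC)
    print(f"speedup: {s:.1f}x")
    print(f"efficiency vs. {STREAMS} streams: {parallel_efficiency(s, STREAMS):.0%}")
```

Under that assumption the change gives roughly a 15.8x speedup on this query, close to 88% of ideal linear scaling.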
|
|
Let's also add a performance test (tests/performance). |
|
Around 50 tests can potentially become flaky due to this change. Tests:
rg --text -e 'from.*file\(' . | rg select | rg -v insert | rg -v Exception | rg -v serverError | rg -v max_threads=1 | cut -d ':' -f1 | cut -d'/' -f2- | uniq | sort
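The ripgrep pipeline above can be read stage by stage. Here is a minimal Python sketch of the same filtering, assuming its input is a list of `path:content` lines as ripgrep prints them when piped. The function name and the sample paths are hypothetical, and it uses a set instead of `uniq | sort`, which gives the same result when matches are grouped by file.

```python
import re

# Stage 1: rg --text -e 'from.*file\(' .  (case-sensitive, matched on content)
FROM_FILE = re.compile(r"from.*file\(")
# Stages 3-6: rg -v ... drops lines containing any of these literals.
EXCLUDED = ("insert", "Exception", "serverError", "max_threads=1")


def candidate_tests(rg_lines):
    """Return sorted, de-duplicated test paths that read via file() in a SELECT."""
    paths = set()
    for entry in rg_lines:
        path, _, content = entry.partition(":")
        if not FROM_FILE.search(content):
            continue
        if "select" not in entry:                    # rg select
            continue
        if any(word in entry for word in EXCLUDED):  # rg -v ...
            continue
        if "/" in path:                              # cut -d'/' -f2- strips "./"
            path = path.split("/", 1)[1]
        paths.add(path)
    return sorted(paths)


if __name__ == "__main__":
    # Hypothetical sample input, not real test files.
    sample = [
        "./tests/queries/0_stateless/a.sql:select * from file('x.csv')",
        "./tests/queries/0_stateless/b.sql:insert into t select * from file('x.csv')",
        "./tests/queries/0_stateless/c.sh:select * from file('x.csv') settings max_threads=1",
    ]
    print(candidate_tests(sample))
```

Tests that insert, expect an exception, or already pin `max_threads=1` are filtered out, so what remains are SELECT queries over `file(...)` whose output could be affected by the added parallelism.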
Will check them. Probably a rule of thumb to change a test:
|
|
Should we do the same for ClickHouse/src/Storages/StorageS3.cpp Line 1098 in 1520f3e? Is it something similar? I don't completely understand what this function does. |
Hope we'll see it in #48727. UPD: it's redundant, since we already create as many sources as there are streams. See d5eb65b |
|
It should be marked as a backward incompatible change, just in case. |
It sounds somewhat controversial. The query on top of |
|
It's just for upgrade notes. We rarely have backward incompatible changes, but something that could break a strange use case will be highlighted in the changelog. |
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Query processing is parallelized right after reading FROM file(...). Related to #38755