
Parallel processing right after reading FROM file() #48525

Merged
devcrafter merged 15 commits into master from parallel-reading-from-file
Apr 10, 2023

Conversation

@devcrafter
Member

@devcrafter devcrafter commented Apr 6, 2023

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Query processing is parallelized right after reading FROM file(...). Related to #38755

@devcrafter
Member Author

devcrafter commented Apr 6, 2023

Before:

EXPLAIN PIPELINE
SELECT sum(length(base58Encode(URL)))
FROM file('hits_*.parquet')

Query id: daaea006-eee5-4561-b293-2e07a224dd9d

┌─explain──────────────────────────┐
│ (Expression)                     │
│ ExpressionTransform × 18         │
│   (Aggregating)                  │
│   Resize 1 → 18                  │
│     AggregatingTransform         │
│       (Expression)               │
│       ExpressionTransform        │
│         (ReadFromPreparedSource) │
│         NullSource 0 → 1         │
└──────────────────────────────────┘

SELECT sum(length(base58Encode(URL)))
FROM file('hits.parquet')

Query id: 960c63d0-7e11-4c5b-8d47-4a5800cf6429

┌─sum(length(base58Encode(URL)))─┐
│                      942426048 │
└────────────────────────────────┘

1 row in set. Elapsed: 124.852 sec. Processed 8.87 million rows, 776.83 MB (71.08 thousand rows/s., 6.22 MB/s.)

After:

EXPLAIN PIPELINE
SELECT sum(length(base58Encode(URL)))
FROM file('hits.parquet')

┌─explain──────────────────────────┐
│ (Expression)                     │
│ ExpressionTransform × 18         │
│   (Aggregating)                  │
│   Resize 18 → 18                 │
│     AggregatingTransform × 18    │
│       StrictResize 18 → 18       │
│         (Expression)             │
│         ExpressionTransform × 18 │
│           (ReadFromStorage)      │
│           Resize 1 → 18          │
│             File 0 → 1           │
└──────────────────────────────────┘

SELECT sum(length(base58Encode(URL)))
FROM file('hits.parquet')

Query id: 63fd4df3-a180-4961-be32-244e231db9c9

┌─sum(length(base58Encode(URL)))─┐
│                      942426048 │
└────────────────────────────────┘

1 row in set. Elapsed: 7.917 sec. Processed 8.87 million rows, 776.83 MB (1.12 million rows/s., 98.12 MB/s.)
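For reference, the elapsed times above amount to roughly a 15.8× speedup, consistent with the reported throughput going from 6.22 MB/s to 98.12 MB/s. A quick sanity check on those numbers:

```sql
-- Timings reported above: 124.852 s before, 7.917 s after
SELECT round(124.852 / 7.917, 1) AS speedup;  -- ≈ 15.8
```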

@devcrafter devcrafter added the pr-performance label (Pull request with some performance improvements) Apr 6, 2023
@alexey-milovidov
Member

Let's also add a performance test (tests/performance).

@alexey-milovidov alexey-milovidov self-assigned this Apr 7, 2023
@devcrafter
Member Author

devcrafter commented Apr 8, 2023

Around 50 tests could potentially become flaky due to this change.

Tests
rg --text -e 'from.*file\(' . | rg select | rg -v insert | rg -v Exception | rg -v serverError | rg -v max_threads=1 | cut -d ':' -f1 | cut -d'/' -f2- | uniq | sort
  • 0_stateless/01545_url_file_format_settings.sql
  • 0_stateless/01825_type_json_multiple_files.sh
  • 0_stateless/01946_test_zstd_decompression_with_escape_sequence_at_the_end_of_buffer.sh
  • 0_stateless/02051_symlinks_to_user_files.sh
  • 0_stateless/02105_table_function_file_partiotion_by.sh
  • 0_stateless/02130_parse_quoted_null.sh
  • 0_stateless/02149_schema_inference_formats_with_schema.sh
  • 0_stateless/02149_schema_inference.sh
  • 0_stateless/02167_format_from_file_extension.sh
  • 0_stateless/02185_orc_corrupted_file.sh
  • 0_stateless/02187_msg_pack_uuid.sh
  • 0_stateless/02211_jsonl_format_extension.sql
  • 0_stateless/02240_tskv_schema_inference_bug.sh
  • 0_stateless/02242_arrow_orc_parquet_nullable_schema_inference.sh
  • 0_stateless/02245_parquet_skip_unknown_type.sh
  • 0_stateless/02246_tsv_csv_best_effort_schema_inference.sh
  • 0_stateless/02247_names_order_in_json_and_tskv.sh
  • 0_stateless/02247_read_bools_as_numbers_json.sh
  • 0_stateless/02267_file_globs_schema_inference.sh
  • 0_stateless/02270_errors_in_files.sh
  • 0_stateless/02286_mysql_dump_input_format.sh
  • 0_stateless/02293_arrow_dictionary_indexes.sql
  • 0_stateless/02293_formats_json_columns.sh
  • 0_stateless/02302_defaults_in_columnar_formats.sql
  • 0_stateless/02313_avro_records_and_maps.sql
  • 0_stateless/02314_avro_null_as_default.sh
  • 0_stateless/02314_csv_tsv_skip_first_lines.sql
  • 0_stateless/02323_null_modifier_in_table_function.sql
  • 0_stateless/02373_heap_buffer_overflow_in_avro.sh
  • 0_stateless/02376_arrow_dict_with_string.sql
  • 0_stateless/02383_arrow_dict_special_cases.sh
  • 0_stateless/02384_nullable_low_cardinality_as_dict_in_arrow.sql
  • 0_stateless/02405_avro_read_nested.sql
  • 0_stateless/02416_input_json_formats.sql
  • 0_stateless/02417_json_object_each_row_format.sql
  • 0_stateless/02421_json_decimals_as_strings.sql
  • 0_stateless/02421_record_errors_row_by_input_format.sh
  • 0_stateless/02454_json_object_each_row_column_for_object_name.sql
  • 0_stateless/02455_one_row_from_csv_memory_usage.sh
  • 0_stateless/02475_bson_each_row_format.sh
  • 0_stateless/02482_capnp_list_of_structs.sh
  • 0_stateless/02482_json_nested_arrays_with_same_keys.sh
  • 0_stateless/02483_capnp_decimals.sh
  • 0_stateless/02500_bson_read_object_id.sh
  • 0_stateless/02511_parquet_orc_missing_columns.sh
  • 0_stateless/02521_avro_union_null_nested.sh
  • 0_stateless/02522_avro_complicate_schema.sh
  • 0_stateless/02541_arrow_duration_type.sh
  • 0_stateless/02566_ipv4_ipv6_binary_formats.sh
  • 0_stateless/02588_parquet_bug.sh
  • 0_stateless/02705_protobuf_debug_abort.sh

I will check them. A probable rule of thumb for updating a test:

  • if the data most likely fits in one block, do nothing
  • if no schema is provided in file(), apply max_threads=1; otherwise add ORDER BY over all columns
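As an illustration of that rule of thumb, a stabilized test might look like this (the file name and schema here are hypothetical):

```sql
-- Case 1: no schema is given, so pin the pipeline to a single thread
-- to keep the output row order stable
SELECT * FROM file('data.csv')
SETTINGS max_threads = 1;

-- Case 2: a schema is given, so keep parallel reading and make the
-- output order deterministic instead
SELECT c1, c2
FROM file('data.csv', 'CSV', 'c1 UInt32, c2 String')
ORDER BY c1, c2;
```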

@devcrafter devcrafter merged commit e3b5072 into master Apr 10, 2023
@devcrafter devcrafter deleted the parallel-reading-from-file branch April 10, 2023 16:30
@Avogar
Member

Avogar commented Apr 12, 2023

Should we do the same for the s3/url/hdfs table functions?
Also, for the s3 table function I see narrowPipe being used:

narrowPipe(pipe, num_streams);

Is it something similar? I don't completely understand what this function does.

@devcrafter
Member Author

devcrafter commented Apr 12, 2023

Should we do the same for the s3/url/hdfs table functions? Also, for the s3 table function I see narrowPipe being used:

narrowPipe(pipe, num_streams);

Is it something similar? I don't completely understand what this function does.

Hope we'll see it in #48727

UPD: it's redundant, since we already create as many sources as there are streams. See d5eb65b

@alexey-milovidov
Member

It should be marked as a backward incompatible change, just in case.
For example, it broke ClickHouse/NoiSQL#4 slightly.

@devcrafter
Member Author

It should be marked as a backward incompatible change, just in case. For example, it broke ClickHouse/NoiSQL#4 slightly.

That sounds somewhat controversial: a query on top of file() can be parallelized at other places in the query pipeline as well. Would such changes also be backward incompatible?

@alexey-milovidov
Member

It's just for upgrade notes. We rarely have backward incompatible changes, but something that could break a strange use case will be highlighted in the changelog.


Labels

backward compatibility, pr-performance (Pull request with some performance improvements)
