Add support for parallel order-preserving CSV write by Mytherin · Pull Request #7368 · duckdb/duckdb

Mytherin · 2023-05-04T21:48:19Z

This PR adds support for parallel, order-preserving CSV writing using batch indexes. This is achieved through a new operator (PhysicalBatchCopyToFile) and two new callbacks in the CopyFunction. In addition, the parallel callback is replaced by the execution_mode callback, which determines which operator to use. The three options are single-threaded PhysicalCopyToFile, multi-threaded PhysicalCopyToFile, or PhysicalBatchCopyToFile.

enum class CopyFunctionExecutionMode { REGULAR_COPY_TO_FILE, PARALLEL_COPY_TO_FILE, BATCH_COPY_TO_FILE };

CopyFunctionExecutionMode CopyToExecutionMode(bool preserve_insertion_order, bool supports_batch_index);
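As a rough illustration of how a copy function might implement the execution_mode callback, the sketch below maps the two flags onto the three modes. The function name ExampleExecutionMode and the exact decision order are assumptions made for illustration; the PR description only states that the callback picks one of the three options.

```cpp
// Sketch only: the enum mirrors the one in this PR; the function body is an assumed policy.
enum class CopyFunctionExecutionMode { REGULAR_COPY_TO_FILE, PARALLEL_COPY_TO_FILE, BATCH_COPY_TO_FILE };

// Hypothetical execution_mode callback for a format that supports batch-wise writing.
CopyFunctionExecutionMode ExampleExecutionMode(bool preserve_insertion_order, bool supports_batch_index) {
	if (!preserve_insertion_order) {
		// Order does not matter: multi-threaded PhysicalCopyToFile.
		return CopyFunctionExecutionMode::PARALLEL_COPY_TO_FILE;
	}
	if (supports_batch_index) {
		// Order matters and batch indexes are available: PhysicalBatchCopyToFile.
		return CopyFunctionExecutionMode::BATCH_COPY_TO_FILE;
	}
	// Order matters but there is no batch index: single-threaded PhysicalCopyToFile.
	return CopyFunctionExecutionMode::REGULAR_COPY_TO_FILE;
}
```

Under this assumed policy, the single-threaded path is only taken when insertion order must be preserved and the source cannot provide batch indexes.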

The two callbacks used for parallel writing are as follows:

unique_ptr<PreparedBatchData> CopyPrepareBatch(ClientContext &context, FunctionData &bind_data,
                                               GlobalFunctionData &gstate,
                                               unique_ptr<ColumnDataCollection> collection);
void CopyFlushBatch(ClientContext &context, FunctionData &bind_data, GlobalFunctionData &gstate,
                    PreparedBatchData &batch);

The operator builds on top of the minimum batch index introduced in #7352. When a new batch is encountered (NextBatch), the old batch is immediately prepared for writing using CopyPrepareBatch. Preparation can be done in parallel by multiple threads and essentially amounts to serializing all to-be-written data for a batch into a buffer.

CopyFlushBatch is called in the correct order for every batch that is under the minimum batch index, and does the actual writing to the file. This callback is always called single-threaded, i.e. writing the buffers to the file is not parallelized, only the construction of the buffers.
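The prepare/flush split described above can be sketched in a self-contained way as follows. This is not the PhysicalBatchCopyToFile implementation: PreparedBatch, BatchWriter, and WriteInOrder are hypothetical stand-ins, a std::string stands in for the output file, and the minimum batch index is passed in by hand rather than tracked by the pipeline.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for PreparedBatchData: one batch serialized into a buffer.
struct PreparedBatch {
	std::string buffer;
};

// Hypothetical stand-in for the operator's global state.
struct BatchWriter {
	std::mutex lock;
	std::map<uint64_t, PreparedBatch> pending; // prepared batches, ordered by batch index
	std::string file;                          // stands in for the output file

	// "Prepare" step: touches no shared state, so many threads can run it at once.
	static PreparedBatch Prepare(const std::vector<int> &rows) {
		PreparedBatch result;
		for (int row : rows) {
			result.buffer += std::to_string(row) + "\n";
		}
		return result;
	}

	// Register a prepared batch under its batch index.
	void AddBatch(uint64_t batch_index, PreparedBatch batch) {
		std::lock_guard<std::mutex> guard(lock);
		pending.emplace(batch_index, std::move(batch));
	}

	// "Flush" step: write every pending batch below min_batch_index, in index order.
	// Holding the lock keeps the actual writing single-threaded.
	void Flush(uint64_t min_batch_index) {
		std::lock_guard<std::mutex> guard(lock);
		while (!pending.empty() && pending.begin()->first < min_batch_index) {
			file += pending.begin()->second.buffer;
			pending.erase(pending.begin());
		}
	}
};

// Prepare each batch on its own thread, then flush everything in batch order.
std::string WriteInOrder(const std::vector<std::vector<int>> &batches) {
	BatchWriter writer;
	std::vector<std::thread> threads;
	for (uint64_t i = 0; i < batches.size(); i++) {
		threads.emplace_back([&writer, &batches, i]() {
			writer.AddBatch(i, BatchWriter::Prepare(batches[i]));
		});
	}
	for (auto &thread : threads) {
		thread.join();
	}
	writer.Flush(batches.size()); // every batch is complete, so all of them can be flushed
	return writer.file;
}
```

Because the pending batches are kept in an ordered map, the output order depends only on the batch indexes and not on which thread finishes preparing first.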

Performance

Writing scales well with the number of threads, as preparing the batches is generally the bottleneck when writing CSV files, at least when writing to a local SSD. Your mileage may vary depending on the storage medium, and we may have to add support for parallel flushing to disk at some point in the future.

Dataset	New (this PR)	preserve_insertion_order=false	1T (single-threaded)
Lineitem SF1	0.5s	0.45s	2.6s
OnTime	1.6s	1.48s	9.5s
ClickBench-Small	5.6s	5.4s	33s

Mytherin added 10 commits May 3, 2023 23:19

Batch copy to CSV file

73afeef

Speed up requires_quotes by using a lookup map

3a0cdce

Add extra order preservation tests

2e94623

Merge branch 'minimumbatchindex' into batchcopytofile

22b32cf

Add more CSV tests, and modify copy API to take a unique ptr to a CDC

ba21bbc

Cannot use batch index/do not care about preserving insertion order for per-thread-output or hive partitioned writes

5a2b2c5

Fix for platforms with signed characters - and fix for unity builds

e62bd79

Missing std::move

b698689

Merge branch 'master' into batchcopytofile

7587b15

Merge branch 'batchcopytofile' of github.com:Mytherin/duckdb into batchcopytofile

a3bddbc

Mytherin merged commit 380ddcf into duckdb:master May 5, 2023

Mytherin mentioned this pull request May 5, 2023

Add support for parallel order-preserving Parquet write #7375

Merged

Mytherin deleted the batchcopytofile branch May 5, 2023 12:10
