Add support for parallel order-preserving CSV write #7368

Merged
Mytherin merged 10 commits into duckdb:master from Mytherin:batchcopytofile
May 5, 2023

@Mytherin Mytherin commented May 4, 2023

This PR adds support for parallel order-preserving CSV writing using batch indexes. This is achieved through a new operator (PhysicalBatchCopyToFile) and two new callbacks in the CopyFunction. In addition, the parallel callback is replaced by the execution_mode callback, which determines which operator to use. The three options are single-threaded PhysicalCopyToFile, multi-threaded PhysicalCopyToFile, or PhysicalBatchCopyToFile.

enum class CopyFunctionExecutionMode { REGULAR_COPY_TO_FILE, PARALLEL_COPY_TO_FILE, BATCH_COPY_TO_FILE };

CopyFunctionExecutionMode CopyToExecutionMode(bool preserve_insertion_order, bool supports_batch_index);
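The body of this callback is not shown here. As a minimal sketch of one plausible mapping from the two flags to the three options above: the enum is re-declared locally so the snippet compiles on its own, and `SketchExecutionMode` and `OperatorFor` are hypothetical names used only for illustration, not DuckDB functions.

```cpp
#include <cassert>

// Re-declared from the PR description so the sketch is standalone.
enum class CopyFunctionExecutionMode { REGULAR_COPY_TO_FILE, PARALLEL_COPY_TO_FILE, BATCH_COPY_TO_FILE };

// Hypothetical callback body (an assumption, not the actual implementation):
// - order does not matter          -> plain multi-threaded copy
// - order matters, batch indexes   -> batch copy (parallel and order-preserving)
// - order matters, no batch index  -> single-threaded copy
CopyFunctionExecutionMode SketchExecutionMode(bool preserve_insertion_order, bool supports_batch_index) {
    if (!preserve_insertion_order) {
        return CopyFunctionExecutionMode::PARALLEL_COPY_TO_FILE;
    }
    if (supports_batch_index) {
        return CopyFunctionExecutionMode::BATCH_COPY_TO_FILE;
    }
    return CopyFunctionExecutionMode::REGULAR_COPY_TO_FILE;
}

// Maps each mode to the operator the PR description associates with it.
const char *OperatorFor(CopyFunctionExecutionMode mode) {
    switch (mode) {
    case CopyFunctionExecutionMode::REGULAR_COPY_TO_FILE:
        return "PhysicalCopyToFile (single-threaded)";
    case CopyFunctionExecutionMode::PARALLEL_COPY_TO_FILE:
        return "PhysicalCopyToFile (multi-threaded)";
    case CopyFunctionExecutionMode::BATCH_COPY_TO_FILE:
        return "PhysicalBatchCopyToFile";
    }
    assert(false && "unknown execution mode");
    return "unknown";
}
```

Under this assumed mapping, the batch operator is only chosen when insertion order must be preserved and the source can supply batch indexes; otherwise the existing operator is used in one of its two modes.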

The two callbacks used for parallel writing are as follows:

unique_ptr<PreparedBatchData> CopyPrepareBatch(ClientContext &context, FunctionData &bind_data,
                                               GlobalFunctionData &gstate,
                                               unique_ptr<ColumnDataCollection> collection);
void CopyFlushBatch(ClientContext &context, FunctionData &bind_data, GlobalFunctionData &gstate,
                    PreparedBatchData &batch);

The operator builds on top of the minimum batch index introduced in #7352. When a new batch is encountered (NextBatch), the old batch is immediately prepared for writing using CopyPrepareBatch. Preparation can be done in parallel by multiple threads and essentially amounts to serializing all to-be-written data for a batch into a buffer.

CopyFlushBatch is called, in batch order, for every batch that is under the minimum batch index, and does the actual writing to the file. This callback is always called single-threaded - i.e. writing the buffers to the file is not parallelized, only the construction of the buffers.
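The prepare/flush split can be illustrated with a small standalone sketch. It is not the PhysicalBatchCopyToFile implementation: `PreparedBatch`, `PrepareBatch`, and `OrderedBatchSink` are hypothetical stand-ins, the "file" is a string, and threads and locking are left out for brevity (a real operator would need to guard the pending map, since batches are prepared concurrently).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Row = std::vector<std::string>;

// Hypothetical stand-in for PreparedBatchData: one batch serialized to CSV text.
struct PreparedBatch {
    std::string buffer;
};

// Analogue of CopyPrepareBatch: pure serialization with no shared state,
// which is what makes this step safe to run on many threads at once.
PreparedBatch PrepareBatch(const std::vector<Row> &rows) {
    PreparedBatch result;
    for (const auto &row : rows) {
        for (std::size_t i = 0; i < row.size(); i++) {
            if (i > 0) {
                result.buffer += ',';
            }
            result.buffer += row[i];
        }
        result.buffer += '\n';
    }
    return result;
}

class OrderedBatchSink {
public:
    // Prepared batches may arrive in any order; they are keyed by batch index.
    void AddBatch(std::size_t batch_index, PreparedBatch batch) {
        bool inserted = pending.emplace(batch_index, std::move(batch)).second;
        assert(inserted && "each batch index should be added once");
        (void)inserted;
    }

    // Analogue of the CopyFlushBatch loop: write every pending batch whose index
    // is below min_batch_index, in ascending order. Assumes the caller guarantees
    // that no batch below min_batch_index can still arrive.
    void FlushBelow(std::size_t min_batch_index) {
        auto it = pending.begin();
        while (it != pending.end() && it->first < min_batch_index) {
            output += it->second.buffer;
            it = pending.erase(it);
        }
    }

    const std::string &Output() const {
        return output;
    }

private:
    std::map<std::size_t, PreparedBatch> pending; // ordered by batch index
    std::string output;                           // stands in for the file
};
```

For example, adding batches 2 and 0 and then calling `FlushBelow(1)` writes only batch 0; batch 2 stays buffered until batch 1 has been added and the minimum batch index has moved past both.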

Performance

The writing scales well with the number of threads as the preparing of batches is generally the bottleneck when writing CSV files - at least when writing to a local SSD. Your mileage may vary depending on the storage medium - and we may have to add support for parallel flushing to disk at some point in the future.

| Dataset | New | preserve_insertion_order=false | 1T |
|---|---|---|---|
| Lineitem SF1 | 0.5s | 0.45s | 2.6s |
| OnTime | 1.6s | 1.48s | 9.5s |
| ClickBench-Small | 5.6s | 5.4s | 33s |

@Mytherin Mytherin merged commit 380ddcf into duckdb:master May 5, 2023
@Mytherin Mytherin deleted the batchcopytofile branch May 5, 2023 12:10