Add support for parallel order-preserving CSV write #7368
Merged
Mytherin merged 10 commits into duckdb:master on May 5, 2023
…or per-thread-output or hive partitioned writes
This PR adds support for parallel order-preserving CSV writing using batch indexes. This is achieved through a new operator (`PhysicalBatchCopyToFile`) and two new callbacks in the `CopyFunction`. In addition, the `parallel` callback is replaced by the `execution_mode` callback, which determines which operator to use. The three options are `PhysicalCopyToFile` single-threaded, `PhysicalCopyToFile` multi-threaded, or `PhysicalBatchCopyToFile`.

The two callbacks used for parallel writing are as follows:
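As a rough illustration of how these pieces could fit together, here is a simplified sketch. Every name in it (`ExecutionMode`, `ChooseExecutionMode`, `OperatorFor`, `PreparedBatch`, `CopyFunctionSketch`) is a hypothetical stand-in, not one of DuckDB's actual declarations. The decision rule in `ChooseExecutionMode` is also an assumption: it guesses that the batch operator is picked when order must be preserved and the source provides batch indexes.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for the value an execution_mode-style callback returns.
enum class ExecutionMode { SingleThreaded, MultiThreaded, BatchOrdered };

// Assumed decision rule (not taken from the PR text):
// - if order does not matter, write in parallel without any ordering;
// - if order matters and batch indexes are available, use the batch operator;
// - otherwise fall back to a single-threaded write.
inline ExecutionMode ChooseExecutionMode(bool preserve_order, bool supports_batch_index) {
    if (!preserve_order) {
        return ExecutionMode::MultiThreaded;
    }
    if (supports_batch_index) {
        return ExecutionMode::BatchOrdered;
    }
    return ExecutionMode::SingleThreaded;
}

// Maps each mode to the operator named in the description above.
inline std::string OperatorFor(ExecutionMode mode) {
    switch (mode) {
    case ExecutionMode::SingleThreaded:
        return "PhysicalCopyToFile (single-threaded)";
    case ExecutionMode::MultiThreaded:
        return "PhysicalCopyToFile (multi-threaded)";
    case ExecutionMode::BatchOrdered:
        return "PhysicalBatchCopyToFile";
    }
    return "unknown";
}

// Simplified, hypothetical shape of a prepared batch: just a serialized buffer.
struct PreparedBatch {
    std::string buffer;
};

// Simplified, hypothetical shape of a copy function with the three callbacks.
struct CopyFunctionSketch {
    // Decides which operator to use.
    std::function<ExecutionMode(bool preserve_order, bool supports_batch_index)> execution_mode;
    // Serializes one batch of rows into a buffer; may run on several threads at once.
    std::function<PreparedBatch(const std::vector<std::string> &rows)> prepare_batch;
    // Appends one prepared buffer to the output; called by one thread at a time, in batch order.
    std::function<void(const PreparedBatch &batch, std::string &output)> flush_batch;
};
```

Under these assumptions, `OperatorFor(ChooseExecutionMode(true, true))` yields `PhysicalBatchCopyToFile`, while a write that does not need to preserve order maps to the multi-threaded `PhysicalCopyToFile`.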
The operator builds on top of the minimum batch index introduced in #7352. When a new batch is encountered (`NextBatch`), the old batch is immediately prepared for writing using `CopyPrepareBatch`. The preparing can be done in parallel by multiple threads - and essentially amounts to serializing all to-be-written data for a batch into a buffer.

The `CopyFlushBatch` is called in the correct order for every batch that is under the minimum batch index - and does the actual writing to the file. This callback is always called single-threaded - i.e. writing the buffers to the file is not parallelized, only the construction of the buffers.

Performance
The writing scales well with the number of threads as the preparing of batches is generally the bottleneck when writing CSV files - at least when writing to a local SSD. Your mileage may vary depending on the storage medium - and we may have to add support for parallel flushing to disk at some point in the future.
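The prepare and flush steps described above can be sketched in isolation. The following is a minimal, single-threaded simulation and not DuckDB code: all names (`PrepareBatch`, `OrderedBatchWriter`, `WriteInOrder`) are hypothetical, threads are replaced by an explicit completion order so the ordering logic is easy to follow, and the "minimum batch index" is assumed to be the smallest batch index that is still being prepared. CSV quoting and escaping are deliberately left out.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Row = std::vector<std::string>;
using Batch = std::vector<Row>;

// Prepare step: serialize one batch into a buffer (no quoting or escaping).
// In a real system this is the part that would run on many threads.
std::string PrepareBatch(const Batch &rows) {
    std::string buffer;
    for (const auto &row : rows) {
        for (std::size_t i = 0; i < row.size(); i++) {
            if (i > 0) {
                buffer += ',';
            }
            buffer += row[i];
        }
        buffer += '\n';
    }
    return buffer;
}

class OrderedBatchWriter {
public:
    // Stores a prepared buffer under its batch index until it may be flushed.
    void AddPrepared(std::size_t batch_index, std::string buffer) {
        prepared_[batch_index] = std::move(buffer);
    }

    // Flush step: append, in ascending index order, every buffered batch
    // whose index is below min_batch_index.
    void FlushBelow(std::size_t min_batch_index) {
        while (!prepared_.empty() && prepared_.begin()->first < min_batch_index) {
            output_ += prepared_.begin()->second;
            prepared_.erase(prepared_.begin());
        }
    }

    const std::string &Output() const {
        return output_;
    }

private:
    std::map<std::size_t, std::string> prepared_; // ordered by batch index
    std::string output_;                          // stands in for the file
};

// Simulates batches finishing preparation in `completion_order`
// and returns what would end up in the file.
std::string WriteInOrder(const std::vector<Batch> &batches,
                         const std::vector<std::size_t> &completion_order) {
    OrderedBatchWriter writer;
    std::set<std::size_t> in_flight;
    for (std::size_t i = 0; i < batches.size(); i++) {
        in_flight.insert(i);
    }
    for (std::size_t index : completion_order) {
        writer.AddPrepared(index, PrepareBatch(batches[index]));
        in_flight.erase(index);
        // Smallest batch still being prepared; everything below it is safe to write.
        std::size_t min_batch_index = in_flight.empty() ? batches.size() : *in_flight.begin();
        writer.FlushBelow(min_batch_index);
    }
    return writer.Output();
}
```

Even when the batches finish in the order 2, 0, 1, the output keeps the original batch order: batch 2 is held back until batches 0 and 1 have been flushed. Only the cheap append in `FlushBelow` is serialized, which is consistent with the observation that preparing the buffers dominates the cost.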