Add Minimum Batch Index + Order Preserving Insertion Rework#7352
Merged
Mytherin merged 44 commits intoduckdb:masterfrom May 4, 2023
Merged
Add Minimum Batch Index + Order Preserving Insertion Rework#7352Mytherin merged 44 commits intoduckdb:masterfrom
Mytherin merged 44 commits intoduckdb:masterfrom
Conversation
…en merging collections and instead override the RowGroupCollection field by changing it to a reference<>
…tes, and add concurrency controls to partial block manager
…re we update and move to the next one
…dex - and also is likely not beneficial if output order matters anyway
…xpensive) map traversions
… to avoid creating scenarios where we alternate between small half-full row groups
… case of insertion cancellation the first_segment may be deleted now that we share the partial block allocations over threads
… into the main table
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR adds a minimum batch index to pipelines that maintain insertion order using batch indexes (see #3700). The minimum batch index signifies that at no point in the future will any thread work on a batch index lower than that one. This, coupled with a new callback in the physical operator (
NextBatch) - allows for better parallel processing with batch indexes. It enables the following scenarios:batch_indexof 7,minimum_batch_indexis 10) - we can write those batches out to disk in-order knowing there will be no more rows that come before that batchminimum_batch_indexis 10, we can merge batches with index 7 and 9 together, knowing a batch index of 8 will never exist)Batch Insert Rework
This PR also reworks the batch insert into DuckDB tables to take advantage of the minimum batch index. The previous batch index insert had several limitations:
7and8would be merged, but not7and9, if8was missing). That was because we had no way of knowing whether or not8would appear at a later time. Gaps in batch indexes are common if filters are present or in case of unions.PartialBlockManagerthat would get flushed independently. For highly compressible data or smaller data sets this could lead to many half-empty blocks being written to disk unnecessarily, compared to the single-threaded case.These issues resulted in order preserving loads (1) generating larger than required databases, and (2) holding more data in memory than required.
This PR fixes those issues by reworking the way the batched data insertion works.
PartialBlockManagerof all threads - this allows colocating of row groups across threads allowing us to match the single-threaded database size1 full row group - 1 row group with 1 rowto3 full row groups - 1 row group with 1 row.These changes make the parallel order-preserving insertion much more robust against varying batch sizes, and should make it comparable to the parallel non-order-preserving insertion in both performance and the database size it generates (in fact - it should often generate smaller databases because data in insertion order is frequently more compressible).
Many Small Row Groups
Many Average Sized Row Groups
Many Big Row Groups