
Add Minimum Batch Index + Order Preserving Insertion Rework#7352

Merged
Mytherin merged 44 commits into duckdb:master from Mytherin:minimumbatchindex on May 4, 2023

Conversation

@Mytherin Mytherin (Collaborator) commented May 3, 2023

This PR adds a minimum batch index to pipelines that maintain insertion order using batch indexes (see #3700). The minimum batch index signifies that at no point in the future will any thread work on a batch index lower than that one. This, coupled with a new callback in the physical operator (NextBatch), allows for better parallel processing with batch indexes. It enables the following scenarios:

  • When the minimum batch index is larger than previously written batches (e.g. a batch with batch_index 7 when the minimum_batch_index is 10), we can write those batches out to disk in order, knowing that no rows will ever arrive that come before them
  • When the minimum batch index is larger than previously written batches, we can merge small batches together (e.g. if the minimum_batch_index is 10, we can merge the batches with indexes 7 and 9, knowing that a batch with index 8 will never exist)
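The flushing scenario above can be sketched as follows. This is a hypothetical Python illustration of the gating logic, not DuckDB's actual C++ implementation; the class and method names are invented for this sketch:

```python
# Hypothetical sketch: buffer out-of-order batches, and flush a batch
# in order once the minimum batch index guarantees that no batch with
# a lower index can still arrive from any thread.

class OrderedSink:
    def __init__(self):
        self.pending = {}   # batch_index -> rows, buffered in memory
        self.written = []   # batch indexes flushed to "disk", in order

    def add_batch(self, batch_index, rows):
        self.pending[batch_index] = rows

    def on_min_batch_index(self, minimum_batch_index):
        # Every pending batch below the minimum is final: no thread will
        # ever produce a smaller batch index, so write them out in order.
        ready = sorted(i for i in self.pending if i < minimum_batch_index)
        for i in ready:
            self.written.append(i)
            del self.pending[i]

sink = OrderedSink()
sink.add_batch(9, ["row a"])
sink.add_batch(7, ["row b"])
sink.on_min_batch_index(8)   # only batch 7 is final
print(sink.written)          # [7]
sink.on_min_batch_index(10)  # now batch 9 is final as well
print(sink.written)          # [7, 9]
```

Note that batch 9 stays buffered after the first callback even though it arrived first: order preservation requires waiting until the minimum batch index passes it.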

Batch Insert Rework

This PR also reworks the batch insert into DuckDB tables to take advantage of the minimum batch index. The previous batch index insert had several limitations:

  • We would only merge adjacent batch indexes (e.g. 7 and 8 would be merged, but not 7 and 9 if 8 was missing). That was because we had no way of knowing whether 8 would still appear at a later time. Gaps in batch indexes are common when filters are present or in the case of unions.
  • Every thread had its own PartialBlockManager that was flushed independently. For highly compressible data or smaller data sets, this could lead to many half-empty blocks being written to disk unnecessarily, compared to the single-threaded case.
  • We would only flush batches to disk if they were larger than a row group, or if they could be merged into batches larger than a row group. This could leave a lot of data that was never optimistically flushed to disk, leading to high memory requirements depending on the batch size.
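The first limitation can be seen in a small sketch. This is a hypothetical illustration (the function is invented for this example, not DuckDB code) of why adjacent-only merging stalls whenever batch indexes have gaps:

```python
# Hypothetical sketch: adjacent-only merging can only combine runs of
# consecutive batch indexes, so a gap permanently blocks a merge.

def merge_adjacent_only(batch_indexes):
    """Group sorted batch indexes into runs of consecutive values."""
    groups, run = [], [batch_indexes[0]]
    for i in batch_indexes[1:]:
        if i == run[-1] + 1:
            run.append(i)   # consecutive: extend the current run
        else:
            groups.append(run)  # gap: the run can never grow further
            run = [i]
    groups.append(run)
    return groups

# With a gap at 8 (e.g. all of its rows were filtered out), 7 and 9
# stay separate forever, since without a minimum batch index we cannot
# know whether batch 8 might still arrive later.
print(merge_adjacent_only([5, 6, 7, 9]))  # [[5, 6, 7], [9]]
```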

These issues resulted in order preserving loads (1) generating larger than required databases, and (2) holding more data in memory than required.

This PR fixes those issues by reworking the way the batched data insertion works.

  • We use the minimum batch index to merge batches, even if they are not adjacent (e.g. if the minimum batch index is 10, we can merge 7 and 9).
  • When we finish writing, we merge the PartialBlockManagers of all threads. This co-locates row groups across threads, allowing us to match the single-threaded database size.
  • We flush merged batches to disk only after they exceed 3 times the row group size. This reduces the worst-case fragmentation from 1 full row group followed by a row group holding a single row, to 3 full row groups followed by a row group holding a single row.
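The new merge policy can be sketched as follows. This is a hypothetical Python illustration, with invented names; the only DuckDB-specific value assumed is the default row group size of 122,880 rows:

```python
# Hypothetical sketch: batches whose index is below the minimum batch
# index are final and may be merged regardless of gaps; a merged group
# is flushed once it exceeds 3 times the row group size.

ROW_GROUP_SIZE = 122_880             # DuckDB's default row group size
FLUSH_THRESHOLD = 3 * ROW_GROUP_SIZE

def plan_flushes(pending, minimum_batch_index):
    """pending: {batch_index: row_count}. Returns (flushes, still_pending)."""
    final = sorted(i for i in pending if i < minimum_batch_index)
    flushes, group, group_rows = [], [], 0
    for i in final:
        group.append(i)
        group_rows += pending[i]
        if group_rows > FLUSH_THRESHOLD:
            flushes.append(group)        # large enough: write to disk
            group, group_rows = [], 0
    flushed = {i for g in flushes for i in g}
    # Batches under the threshold (or not yet final) stay buffered.
    still_pending = {i: n for i, n in pending.items() if i not in flushed}
    return flushes, still_pending

pending = {7: 200_000, 9: 200_000, 12: 50_000}
flushes, rest = plan_flushes(pending, minimum_batch_index=10)
print(flushes)       # [[7, 9]] - merged despite the gap at 8
print(sorted(rest))  # [12]     - batch 12 is not yet final
```

Batches 7 and 9 together hold 400,000 rows, which exceeds the 368,640-row threshold, so they are merged and flushed even though index 8 never appeared.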

These changes make parallel order-preserving insertion much more robust against varying batch sizes, and should make it comparable to parallel non-order-preserving insertion in both performance and the size of the database it generates. In fact, it should often generate smaller databases, because data in insertion order is frequently more compressible.

Many Small Row Groups

```sql
COPY (FROM range(100000000)) TO 'small.parquet' (ROW_GROUP_SIZE 5000);
CREATE TABLE small AS FROM 'small.parquet';
```

| Measure      | v0.7.1 | New   | Single-Threaded | preserve_insertion_order=false |
|--------------|--------|-------|-----------------|--------------------------------|
| Time (s)     | 1.0s   | 0.62s | 2.6s            | 0.5s                           |
| DB Size      | 4.5MB  | 2.4MB | 2.5MB           | 2.8MB                          |
| # Row Groups | 1382   | 822   | 814             | 821                            |

Many Average Sized Row Groups

```sql
COPY (FROM range(100000000)) TO 'medium.parquet' (ROW_GROUP_SIZE 200000);
CREATE TABLE medium AS FROM 'medium.parquet';
```

| Measure      | v0.7.1 | New   | Single-Threaded | preserve_insertion_order=false |
|--------------|--------|-------|-----------------|--------------------------------|
| Time (s)     | 0.36s  | 0.37s | 2.6s            | 0.36s                          |
| DB Size      | 4.0MB  | 2.5MB | 2.5MB           | 2.8MB                          |
| # Row Groups | 997    | 997   | 830             | 819                            |

Many Big Row Groups

```sql
COPY (FROM range(100000000)) TO 'big.parquet' (ROW_GROUP_SIZE 1000000);
CREATE TABLE big AS FROM 'big.parquet';
```

| Measure      | v0.7.1 | New   | Single-Threaded | preserve_insertion_order=false |
|--------------|--------|-------|-----------------|--------------------------------|
| Time (s)     | 0.37s  | 0.37s | 2.6s            | 0.38s                          |
| DB Size      | 3.8MB  | 2.3MB | 2.5MB           | 2.8MB                          |
| # Row Groups | 898    | 898   | 814             | 819                            |

@Mytherin Mytherin merged commit 9b1d80a into duckdb:master May 4, 2023
@Mytherin Mytherin deleted the minimumbatchindex branch May 5, 2023 12:10
