WAL: Write pointers to optimistically written row groups directly, instead of copying over the data by Mytherin · Pull Request #13372 · duckdb/duckdb

Mytherin · 2024-08-09T12:28:05Z

This PR reworks the way that writing optimistically-written data to the WAL works. When data is optimistically written, we currently rely on performing a CHECKPOINT during the commit to persist the optimistically written pages. The checkpoint then updates all of the table metadata to point towards the new pages, and ensures they are used when the database is restarted.

In scenarios where we could not CHECKPOINT immediately - generally due to concurrent activity (e.g. other clients also writing/updating/deleting data) - we would write the data out to the WAL. This would lead to a big performance degradation, as the data would be copied over from the pages that had already been written to the database file into the WAL. Furthermore, as the data in the WAL is uncompressed, the WAL could end up being bigger than the actual database file.

WAL Replay

This PR instead adds a new entry type to the WAL - WALType::ROW_GROUP_DATA. This entry contains a series of data pointers that can be used to reconstruct the row groups directly from the (previously optimistically written) pages. This drastically reduces the size of the data written to the WAL in these scenarios.
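To make the size difference concrete, here is a minimal sketch that compares a WAL entry holding copied values with one holding only pointers. All types and function names below (`BlockPointer`, `ColumnSegmentPointer`, `CopiedEntryBytes`, `PointerEntryBytes`) are hypothetical and chosen for illustration; they are not DuckDB's actual definitions or on-disk layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration only; not DuckDB's definitions.
struct BlockPointer {
    int64_t block_id; // block in the database file that holds the segment
    uint32_t offset;  // byte offset of the segment within that block
};

struct ColumnSegmentPointer {
    uint64_t row_start;   // first row covered by the segment
    uint64_t tuple_count; // number of rows in the segment
    BlockPointer block;   // where the already-written data lives
};

// Copy-based entry (sketch): the WAL carries the uncompressed values themselves.
size_t CopiedEntryBytes(uint64_t rows, size_t columns, size_t bytes_per_value) {
    return rows * columns * bytes_per_value;
}

// Pointer-based entry (sketch): the WAL carries one small record per column segment.
size_t PointerEntryBytes(const std::vector<ColumnSegmentPointer> &segments) {
    return segments.size() * sizeof(ColumnSegmentPointer);
}
```

Under these assumptions, a row group of 122,880 rows with four 8-byte columns needs 3,932,160 bytes when copied, while the pointer-based entry needs only four small fixed-size records, independent of the row count.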

When replaying the WAL, we must be careful not to overwrite pages that contain optimistically written data. Since the pages are not mentioned in the metadata of the main database file, the pages are considered empty/free when loading only the metadata of the main database file. This could potentially lead to accidentally overwriting the data on these pages.

When doing the WAL replay, we already did a two-pass approach over the WAL to detect serialization errors. This change makes this even more essential. In our first pass, we add any blocks that we encounter as part of WALType::ROW_GROUP_DATA entries to the set of in-use blocks. This prevents them from being overwritten by replaying other WAL entries.
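The two-pass idea can be sketched roughly as follows. This is a toy model under stated assumptions, not the code in this PR: `WalEntry`, `EntryType`, `CollectReferencedBlocks` and `BlockAllocator` are hypothetical names, and the allocator simply hands out the lowest free block id.

```cpp
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

// Hypothetical WAL model for illustration; not DuckDB's actual classes.
enum class EntryType { INSERT_TUPLES, ROW_GROUP_DATA };

struct WalEntry {
    EntryType type;
    std::vector<int64_t> block_ids; // blocks referenced by a ROW_GROUP_DATA entry
};

// Pass 1: scan the WAL without applying anything and collect every block
// that a ROW_GROUP_DATA entry points to.
std::set<int64_t> CollectReferencedBlocks(const std::vector<WalEntry> &wal) {
    std::set<int64_t> in_use;
    for (const auto &entry : wal) {
        if (entry.type == EntryType::ROW_GROUP_DATA) {
            in_use.insert(entry.block_ids.begin(), entry.block_ids.end());
        }
    }
    return in_use;
}

// Toy allocator used during pass 2: returns the lowest block id not in use.
class BlockAllocator {
public:
    explicit BlockAllocator(std::set<int64_t> in_use) : in_use_(std::move(in_use)) {}

    int64_t Allocate() {
        int64_t id = 0;
        while (in_use_.count(id) > 0) {
            ++id;
        }
        in_use_.insert(id);
        return id;
    }

private:
    std::set<int64_t> in_use_;
};
```

In this model, seeding the allocator with the result of the first pass means that any block allocated while replaying other entries in the second pass skips the blocks holding optimistically written data. An allocator seeded with an empty set would hand out block 0 first, which is the overwrite the first pass is meant to prevent.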

Serialization Rework

In order to facilitate the serialization of row group metadata to the WAL, this PR reworks the way that serialization of row groups and column data works by moving everything into a separate series of classes - PersistentColumnData, PersistentRowGroupData and PersistentCollectionData. These are also used to write the metadata to the main database file to ensure consistency.
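As a rough illustration of why a single set of classes helps, the sketch below uses one `Serialize`/`Deserialize` pair for a nested collection description, so that any writer (WAL or database file) would produce the same representation. Only the three class names come from this PR; the fields, the `sketch` namespace, the text encoding and both functions are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical layouts: only the class names are taken from the PR text.
struct PersistentColumnData {
    std::vector<int64_t> block_ids; // blocks that hold this column's segments
};

struct PersistentRowGroupData {
    uint64_t start; // first row of the row group
    uint64_t count; // number of rows in the row group
    std::vector<PersistentColumnData> columns;
};

struct PersistentCollectionData {
    std::vector<PersistentRowGroupData> row_groups;
};

// One serialization path, shared by every destination in this sketch.
std::string Serialize(const PersistentCollectionData &collection) {
    std::ostringstream out;
    out << collection.row_groups.size();
    for (const auto &rg : collection.row_groups) {
        out << ' ' << rg.start << ' ' << rg.count << ' ' << rg.columns.size();
        for (const auto &col : rg.columns) {
            out << ' ' << col.block_ids.size();
            for (auto id : col.block_ids) {
                out << ' ' << id;
            }
        }
    }
    return out.str();
}

// Inverse of Serialize: rebuilds the nested description from the text form.
PersistentCollectionData Deserialize(const std::string &text) {
    std::istringstream in(text);
    PersistentCollectionData collection;
    size_t row_group_count = 0;
    in >> row_group_count;
    collection.row_groups.resize(row_group_count);
    for (auto &rg : collection.row_groups) {
        size_t column_count = 0;
        in >> rg.start >> rg.count >> column_count;
        rg.columns.resize(column_count);
        for (auto &col : rg.columns) {
            size_t block_count = 0;
            in >> block_count;
            col.block_ids.resize(block_count);
            for (auto &id : col.block_ids) {
                in >> id;
            }
        }
    }
    return collection;
}

} // namespace sketch
```

Because both directions go through the same pair of functions, a description written for one destination can be read back by the other without a second, slightly different encoder drifting out of sync.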

debug_skip_checkpoint_on_commit

For testing, a new option is added - debug_skip_checkpoint_on_commit. This option can be used to simulate a scenario in which a checkpoint on commit is not possible and we must instead write to the WAL.

Storage Version

Since we are adding a new WAL entry, WAL files written that contain this new WAL entry cannot be replayed by older versions of DuckDB. Since this WAL entry is rarely used in normal operation, and in our own storage format WAL files are also rarely used, we have opted to not consider the target storage serialization version. Instead, when applicable this WAL entry will always be written.


nicku33 · 2024-08-12T14:16:32Z

Dreams really do come true !

When optimistically writing data to disk - there were a few scenarios in which we would not optimistically write row groups:

* For batch insert, when the batches were approximately as large as our internal row group size, we would not always flush them as the `CollectionMerger` would have a collection with a single `ColumnDataCollection` in it
* For regular insert, we would not flush the last row group in `Combine`

For regular insertions, this would not have a large impact as most data would still be written optimistically - but for the optimistic WAL write added in #13372 we need **all** row groups written in sequence to be optimistically written out. By not flushing all row groups, large WAL files would still be created.

Mytherin added 19 commits August 7, 2024 14:18

WIP write block pointers directly to the WAL

abf17c0

WIP: WAL writing of row groups functional, now just missing the replay

147161b

WIP row group collection deserialization

4cfd174

Rework the deserializing code to split up the deserialization from the initialization

a8fa9f1

In WAL read - mark blocks as used prior to replay

19cd01c

Set count correctly in struct column data

e225414

Allow load to be used in loops, and add a setting to test new WAL functionality

5d27202

Flush unflushed row groups explicitly in PhysicalBatchInsert::Finalize

8f752e9

Remove duplicated serialization code - generate PersistentColumnData as well during checkpoint

8b67936

Add test for mixed regular inserts and optimistically written inserts

90d26b8

Handle partially optimistically written row groups

d57fcea

Format fix

970af40

Add missing includes

fc2980a

Merge branch 'main' into walwriteblocks

d70f7b2

Generate

76c9be4

Set properties required for compressed column data deserialization

82e3d3f

Correctly label overflow string blocks as used as well

da3cc9d

Format fix

dbc5dd1

Skip test

0dd6227

duckdb-draftbot marked this pull request as draft August 9, 2024 12:56

Mytherin marked this pull request as ready for review August 9, 2024 12:56

Mytherin merged commit d8a69cc into duckdb:main Aug 11, 2024

github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Aug 11, 2024

chore: Update vendored sources to duckdb/duckdb@d8a69cc

8fce032

Merge pull request duckdb/duckdb#13372 from Mytherin/walwriteblocks

Mytherin deleted the walwriteblocks branch August 28, 2024 14:02

Mytherin mentioned this pull request Nov 8, 2024

Optimistic writes: flush the last row group in all scenarios #14759

Merged

Mytherin mentioned this pull request Jan 6, 2026

Avoid frequent checkpoints triggered by optimistic insertions #20336

Merged

Mytherin merged 19 commits into duckdb:main from Mytherin:walwriteblocks