WAL: Write pointers to optimistically written row groups directly, instead of copying over the data #13372
Merged
Mytherin merged 19 commits into duckdb:main on Aug 11, 2024
Conversation
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request on Aug 11, 2024:
Merge pull request duckdb/duckdb#13372 from Mytherin/walwriteblocks
Contributor
Dreams really do come true!
This was referenced Aug 23, 2024
Mytherin added a commit that referenced this pull request on Nov 8, 2024:
When optimistically writing data to disk, there were a few scenarios in which we would not optimistically write row groups:

* For batch insert, when the batches were approximately as large as our internal row group size, we would not always flush them, as the `CollectionMerger` would have a collection with a single `ColumnDataCollection` in it
* For regular insert, we would not flush the last row group in `Combine`

For regular insertions this did not have a large impact, as most data would still be written optimistically - but for the optimistic WAL write added in #13372 we need **all** row groups written in sequence to be optimistically written out. By not flushing all row groups, large WAL files would still be created.
This PR reworks the way that writing optimistically-written data to the WAL works. When data is optimistically written, we currently rely on performing a `CHECKPOINT` during the commit to persist the optimistically written pages. The checkpoint then updates all of the table metadata to point towards the new pages, and ensures they are used when the database is restarted.

In scenarios where we could not `CHECKPOINT` immediately - generally due to concurrent activity (e.g. other clients also writing/updating/deleting data) - we would write the data out to the WAL. This led to a big performance degradation, as the data was copied over from pages that had already been written to the database file into the WAL. Furthermore, as the data in the WAL is uncompressed, the WAL could end up being bigger than the actual database file.
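To make the difference concrete, here is a minimal sketch of the commit-time decision described above. The names (`CommitPersistence`, `ChoosePersistence`) are hypothetical illustrations, not DuckDB's actual internals; the real logic lives in the transaction commit path.

```cpp
#include <cassert>

// Hypothetical sketch: before this PR, a commit that cannot checkpoint copies
// the optimistically written data into the WAL; after this PR, it logs only
// pointers to the already-written pages.
enum class CommitPersistence { CHECKPOINT, WAL_DATA_COPY, WAL_BLOCK_POINTERS };

CommitPersistence ChoosePersistence(bool can_checkpoint, bool pointers_supported) {
	if (can_checkpoint) {
		// Preferred path: checkpoint updates table metadata to point at the pages.
		return CommitPersistence::CHECKPOINT;
	}
	// Fallback path: write to the WAL, ideally as pointers rather than raw data.
	return pointers_supported ? CommitPersistence::WAL_BLOCK_POINTERS
	                          : CommitPersistence::WAL_DATA_COPY;
}
```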
WAL Replay

This PR instead reworks this by adding a new entry to the WAL - `WALType::ROW_GROUP_DATA`. This entry contains a series of data pointers that can be used to reconstruct the row groups directly from the (previously optimistically written) pages. This drastically reduces the size of the data written to the WAL in these scenarios.

When replaying the WAL, we must be careful not to overwrite pages that contain optimistically written data. Since these pages are not mentioned in the metadata of the main database file, they are considered empty/free when only that metadata is loaded. This could potentially lead to accidentally overwriting the data on these pages.
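A rough sketch of what such a pointer-based entry might look like. Apart from the entry type `WALType::ROW_GROUP_DATA` named above, the struct and field names here are illustrative assumptions, not DuckDB's actual definitions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical shape of a ROW_GROUP_DATA WAL entry: it references blocks that
// already hold the optimistically written data instead of copying the rows.
struct DataPointer {
	uint64_t block_id;   // block containing the previously written data
	uint32_t offset;     // offset of the segment within that block
	uint64_t row_count;  // number of rows covered by this segment
};

struct RowGroupDataEntry {
	std::vector<DataPointer> pointers; // pointers only - no row data is copied
};

// The entry grows with the number of pointers, not the number of rows, which
// is why the WAL stays small in these scenarios.
size_t EntrySize(const RowGroupDataEntry &entry) {
	return entry.pointers.size() * sizeof(DataPointer);
}
```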
When doing the WAL replay, we already did a two-pass approach over the WAL to detect serialization errors. This change makes that even more essential. In our first pass, we add any blocks that we encounter as part of `WALType::ROW_GROUP_DATA` entries to the set of in-use blocks. This prevents them from being overwritten when replaying other WAL entries.
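The first pass can be sketched as follows. This is a simplified illustration under assumed names (`CollectInUseBlocks`, `IsBlockFree`); the actual replay code tracks considerably more state:

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Sketch of the first replay pass: gather every block referenced by a
// ROW_GROUP_DATA entry so those blocks are treated as in-use, not free,
// while the rest of the WAL is replayed.
std::set<uint64_t> CollectInUseBlocks(const std::vector<std::vector<uint64_t>> &row_group_entries) {
	std::set<uint64_t> in_use;
	for (auto &entry : row_group_entries) {
		for (auto block_id : entry) {
			in_use.insert(block_id);
		}
	}
	return in_use;
}

bool IsBlockFree(const std::set<uint64_t> &in_use, uint64_t block_id) {
	// A block only counts as free if no WAL entry points into it; otherwise
	// allocating it during replay would overwrite optimistically written data.
	return in_use.count(block_id) == 0;
}
```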
Serialization Rework

In order to facilitate the serialization of row group metadata to the WAL, this PR reworks the way that serialization of row groups and column data works by moving everything into a separate series of classes - `PersistentColumnData`, `PersistentRowGroupData` and `PersistentCollectionData`. These are also used to write the metadata to the main database file to ensure consistency.
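The relationship between the three classes can be sketched roughly as below. The class names come from the PR; the members shown are illustrative assumptions, not DuckDB's actual fields:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative nesting: a collection holds row groups, a row group holds
// per-column data, and each column's data is a list of data pointers.
struct PersistentColumnData {
	std::vector<uint64_t> block_ids; // hypothetical: pointers to the column's segments
};

struct PersistentRowGroupData {
	std::vector<PersistentColumnData> columns; // one entry per column
	uint64_t start = 0;                        // hypothetical: first row id
	uint64_t count = 0;                        // hypothetical: number of rows
};

struct PersistentCollectionData {
	std::vector<PersistentRowGroupData> row_groups; // the table's row groups
};

// Because the same structures back both the WAL entry and the main database
// file's metadata, the two on-disk representations stay consistent.
uint64_t TotalRows(const PersistentCollectionData &data) {
	uint64_t total = 0;
	for (auto &rg : data.row_groups) {
		total += rg.count;
	}
	return total;
}
```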
debug_skip_checkpoint_on_commit

For testing, a new option is added - `debug_skip_checkpoint_on_commit`. This option can be used to simulate a scenario in which a checkpoint on commit is not possible and we must instead write to the WAL.
Storage Version

Since we are adding a new WAL entry, WAL files that contain it cannot be replayed by older versions of DuckDB. Because this WAL entry is rarely used in normal operation - and in our own storage format WAL files are also rarely used - we have opted not to consider the target storage serialization version. Instead, when applicable, this WAL entry is always written.