Avoid eagerly grabbing the checkpoint lock when performing inserts, instead grab a shared table lock#19937
Merged
Mytherin merged 6 commits into duckdb:main on Nov 26, 2025
Conversation
d-justen pushed a commit to d-justen/duckdb that referenced this pull request on Nov 26, 2025
Avoid eagerly grabbing the checkpoint lock when performing inserts, instead grab a shared table lock (duckdb#19937)

Currently, whenever a database is about to be modified by a statement such as `INSERT` or `DELETE`, we eagerly grab the `SharedCheckpointLock` and store it in the `DuckTransaction`. As long as that transaction lives, no checkpoint can be initiated. This can persist even after the transaction has concluded, because the lock is only released when the transaction is cleaned up, which only happens once no active transaction depends on the data stored within it anymore. If readers need an old snapshot of the data for a long time, these transactions may not be cleaned up for a long time either.

This has two negative effects in concurrent scenarios with several writers, or even a single writer and a single (separate) checkpointer:

* Checkpoints can fail to be performed because the checkpoint lock cannot be grabbed. As a result, we might not checkpoint for extended periods of time, causing the WAL file to grow continuously and preventing data from being compressed / vacuumed.
* When a checkpoint is performed (either forcibly through `FORCE CHECKPOINT`, or when the lock is successfully grabbed), writers must wait for the checkpoint to complete, preventing them from continuing operations and potentially introducing large latency spikes.

This PR makes an initial move towards grabbing the checkpoint lock less often and allowing more operations to happen while a checkpoint is performed (and vice versa: allowing a checkpoint to start in more scenarios). It removes the eager grabbing of the `SharedCheckpointLock` for appends (e.g. through an `INSERT INTO` statement). This is accomplished by passing the intended database modification to `RegisterDBModify`. If the modification is an append, we allow it to proceed without grabbing the shared checkpoint lock. Instead, when performing an append on a table, we grab a shared lock on the table (a lock that was previously used only for scans).

Checkpoints can then still start, but need to wait for appends to finish at the table level (and vice versa). Note that this is still rather limited, because we still grab the `SharedCheckpointLock` when writing to the WAL. That means we still cannot *commit* transactions that have inserted data while a checkpoint is happening (and cannot start a checkpoint while a transaction that inserted data is writing to the WAL). I plan to tackle that in a follow-up PR.
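The locking scheme described above can be sketched with standard C++ primitives: appends take a per-table lock in shared mode, while the checkpointer takes the same lock exclusively per table, so it only waits for in-flight appends on that table instead of blocking all writers globally. This is an illustrative sketch only; `TableLocks` and its methods are hypothetical names, not DuckDB's actual classes.

```cpp
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Hypothetical sketch of per-table shared locking replacing a global
// checkpoint lock. Not DuckDB's actual implementation.
class TableLocks {
public:
	// An append holds the table's lock in shared mode: many appends can
	// proceed concurrently on the same table.
	std::shared_lock<std::shared_mutex> AppendLock(const std::string &table) {
		return std::shared_lock<std::shared_mutex>(GetMutex(table));
	}
	// The checkpointer takes the lock exclusively, per table: it waits only
	// for appends on that table, and new appends wait for it in turn.
	std::unique_lock<std::shared_mutex> CheckpointLock(const std::string &table) {
		return std::unique_lock<std::shared_mutex>(GetMutex(table));
	}

private:
	std::shared_mutex &GetMutex(const std::string &table) {
		std::lock_guard<std::mutex> guard(map_lock);
		auto &entry = mutexes[table];
		if (!entry) {
			entry = std::make_unique<std::shared_mutex>();
		}
		return *entry;
	}
	std::mutex map_lock;
	std::unordered_map<std::string, std::unique_ptr<std::shared_mutex>> mutexes;
};
```

The key property is that the checkpoint and appends on *different* tables no longer contend at all, while appends on the *same* table still serialize against the checkpoint (in both directions), matching the "wait for an append to finish on a table level (and vice versa)" behavior above.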
Mytherin added a commit that referenced this pull request on Nov 26, 2025
In #19937 I fixed an issue with `RevertAppend` potentially leaving the segment tree in an undefined state in certain situations (when we would truncate at exactly the row group boundary). This method is called when a unique / primary key constraint violation is found during the commit process, and the undefined state could then cause issues in subsequent code. This PR backports the fix to v1.4.
Mytherin added a commit that referenced this pull request on Dec 5, 2025
Follow-up from #19937. This PR enables commits to continue while a checkpoint is happening. Currently this is limited to commits that exclusively insert data, as any other changes (deletes, updates, catalog changes, alters, etc.) still eagerly grab the checkpoint lock, which prevents a checkpoint from starting while these changes are pending, and vice versa. It is also limited to inserting data into tables **that do not have indexes**. As part of this PR, appending to a table that has indexes now grabs the checkpoint lock again.

Enabling commits while checkpointing has two consequences for the system that need to be dealt with:

* Checkpointing no longer checkpoints the latest commit.
* While checkpointing, new commits need to be written somewhere in order to be durable. We can no longer write them to the old WAL, as we want to truncate it after our checkpoint is finished.

### Pinned Checkpoint Commit

Previously, the checkpointing code assumed we were always checkpointing the latest commit. This is no longer correct, since the "latest committed data" can now change *while a checkpoint is running*. Instead, we need to choose a commit id on which to checkpoint. When starting a checkpoint, we take the latest commit id and checkpoint based on that commit. Subsequent commits are not written as part of this checkpoint, but can be written as part of a future checkpoint.

To simplify this, we ensure that after a checkpoint starts, any new data is always written to *new row groups*. This is managed in `DataTable::AppendLock`. As a result, when performing the checkpoint we only need to know "do we need to checkpoint this row group or not", rather than having to checkpoint part of a row group. This is handled in the new method `RowGroup::ShouldCheckpointRowGroup`.

#### Free Blocks

Another challenge with the pinned checkpoint commit is how to manage the list of free blocks, i.e. blocks that are present in the file but are not used. Block usage is tracked globally in the `SingleFileBlockManager`. With optimistic writes, we can write to blocks in the storage layer (i.e. make them no longer free). However, if a checkpoint happens at a pinned commit, any optimistic writes that occur after the commit is pinned do not belong to that checkpoint. If we do not write these blocks in the free block list, we might get dangling blocks in case an abort or rollback happens. However, the blocks are not actually free in-memory, as they are being used by the optimistically written data. To solve this, we introduce a new set in the block manager, `newly_used_blocks`, which tracks blocks that are in use but are not yet part of a given checkpoint.

### Checkpoint WAL

New commits that happen while checkpointing have to be written somewhere. To still allow the checkpoint to truncate the WAL and prevent it from growing indefinitely, we introduce the concept of a **checkpoint WAL**: a secondary WAL that can be written to only by concurrent commits while a checkpoint is happening.

When a checkpoint is started, the checkpoint flag is written to the original WAL. The checkpoint flag contains the root metadata block pointer that will be written **when the checkpoint is successful**.

```
main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]
```

The checkpoint flag allows us, during recovery, to figure out whether the checkpoint completed. This determines whether we need to replay the WAL. The current version already does this to deal with a crash between flipping the root block pointer and truncating the WAL; however, in the new version this happens before **any** data is written, instead of only happening at the end.

After this is written, any new commits are set to write to the checkpoint WAL. For example, assume a new commit comes in that inserts some data. We will now have the following situation:

```
main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

After the checkpoint is finished, we have flushed all changes in `main.db.wal` to the main database file, while the changes in `main.db.checkpoint.wal` have not been flushed. All we need to do is move the checkpoint WAL over to replace the original WAL. This leads to the following final result after the checkpoint:

```
main.db.wal
[INSERT #2][COMMIT]
```

#### Recovery

To provide ACID compliance, all commits that have succeeded must be persisted, even across failures. That means any commits written to the checkpoint WAL need to be persisted no matter where we crash. Below is a list of failure modes.

###### Crash Before Checkpoint Complete

Our situation is as follows:

```
main.db
[ROOT #1]

main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

To recover in this situation, we need to replay both `main.db.wal` and `main.db.checkpoint.wal`. The recovery process sees that the checkpoint root does not match the root in the database, and now also checks for the presence of a checkpoint WAL. It then replays them in order (`main.db.wal` -> `main.db.checkpoint.wal`). If this is a `READ_WRITE` connection, it merges the two WALs **except for the checkpoint node** by writing a new WAL that contains the content of both:

```
main.db.recovery.wal
[INSERT #1][COMMIT][INSERT #2][COMMIT]
```

After that completes, it overwrites the main WAL with the recovery WAL and finally removes the checkpoint WAL:

```
mv main.db.recovery.wal main.db.wal
rm main.db.checkpoint.wal
```

###### Crash During Recovery

If we crash during the above recovery process (after the `mv`, before the `rm`), we would have this situation:

```
main.db
[ROOT #1]

main.db.wal
[INSERT #1][COMMIT][INSERT #2][COMMIT]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

This is safe to recover from, because `main.db.wal` does not contain a `CHECKPOINT` node. As such, we will not replay the checkpoint WAL; only `main.db.wal` will be replayed.

###### Crash After Checkpoint Complete, Before WAL Move

Our situation is as follows:

```
main.db
[ROOT #2]

main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

To recover in this situation, we only need to replay `main.db.checkpoint.wal`. The recovery process sees that the checkpoint root matches the root in the database, so it knows it does not need to replay `main.db.wal`. It checks for the presence of the checkpoint WAL and, since it is present, replays it. If this is a `READ_WRITE` connection, it then completes the checkpoint by finalizing the move (i.e. `mv main.db.checkpoint.wal main.db.wal`).

### Other Changes / Fixes

#### Windows: make `FileSystem::MoveFile` behave like Linux/MacOS

On Linux/MacOS, `MoveFile` is used to mean "move and overwrite the target file". On Windows, this would previously fail if the target already exists. This PR makes Windows behave like Linux/MacOS by using `MOVEFILE_REPLACE_EXISTING` in `MoveFileExW`. In addition, because we tend to use `MoveFile` to mean "we want to be certain this file was moved", we also enable the `MOVEFILE_WRITE_THROUGH` flag.

#### SQLLogicTest

While testing this PR, I realized the sqllogictest runner was swallowing exceptions thrown in certain locations and incorrectly reporting tests that should fail as succeeded. This PR fixes that: these exceptions now fail the test run. This revealed a number of failing tests, in particular around the config runners `peg_parser.json` and `encryption.json`, and a few tests in `httpfs`. A few tests were fixed; others were skipped in the config, pending a look at them in the future.
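The pinned-commit decision described under "Pinned Checkpoint Commit" above can be sketched as a commit-id comparison: because data committed after the pin always lands in new row groups, a row group belongs to the checkpoint exactly when it existed at the pinned commit. A minimal sketch, assuming each row group records the commit id that created it; the struct and function here are hypothetical illustrations, only the name mirrors `RowGroup::ShouldCheckpointRowGroup`.

```cpp
#include <cstdint>

// Hypothetical sketch: each row group records the commit id at which it was
// created. Post-pin appends always go to new row groups, so the whole-row-group
// test below is sufficient (no partial row group ever needs checkpointing).
struct RowGroupInfo {
	uint64_t created_at_commit;
};

// A row group is part of the checkpoint iff it already existed at the
// commit id that was pinned when the checkpoint started.
bool ShouldCheckpointRowGroup(const RowGroupInfo &rg, uint64_t pinned_commit_id) {
	return rg.created_at_commit <= pinned_commit_id;
}
```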
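The free-block bookkeeping from the "Free Blocks" section above can be sketched as follows: a block optimistically written after the commit is pinned leaves the in-memory free list, but is still reported as free in the pinned checkpoint's free block list, so an abort or rollback cannot leave it dangling. The class and methods here are hypothetical; only the `newly_used_blocks` set name mirrors the PR.

```cpp
#include <cstdint>
#include <set>

// Hypothetical sketch of the newly_used_blocks bookkeeping; not the actual
// SingleFileBlockManager implementation.
class BlockManagerSketch {
public:
	// An optimistic write after the checkpoint commit was pinned: the block
	// leaves the free list in memory, but is NOT yet owned by the checkpoint.
	void MarkNewlyUsed(uint64_t block_id) {
		free_blocks.erase(block_id);
		newly_used_blocks.insert(block_id);
	}
	// The free block list written by the pinned checkpoint: newly-used blocks
	// are still included, so an abort/rollback cannot produce dangling blocks.
	std::set<uint64_t> FreeListForCheckpoint() const {
		std::set<uint64_t> result = free_blocks;
		result.insert(newly_used_blocks.begin(), newly_used_blocks.end());
		return result;
	}
	// Once the optimistically written data becomes part of a later checkpoint,
	// the block graduates to a regular in-use block.
	void FinalizeBlock(uint64_t block_id) {
		newly_used_blocks.erase(block_id);
	}

	std::set<uint64_t> free_blocks;
	std::set<uint64_t> newly_used_blocks;
};
```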
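The three recovery failure modes above reduce to a small decision table: compare the database header's root pointer to the `NEW_ROOT` recorded by the `CHECKPOINT` entry in the main WAL, and check whether a checkpoint WAL file exists. A hedged sketch of that decision logic, with hypothetical names (this is not DuckDB's recovery code, and `0` standing for "no CHECKPOINT entry" is an assumption of this sketch):

```cpp
#include <cstdint>

// Hypothetical recovery decision table for the scenarios described above.
enum class RecoveryAction {
	ReplayMainOnly,           // no checkpoint in flight (or crash during recovery)
	ReplayMainThenCheckpoint, // crash before the checkpoint completed
	ReplayCheckpointOnly      // crash after the root flip, before the WAL move
};

// db_root: root pointer in the database header.
// wal_checkpoint_root: NEW_ROOT from the CHECKPOINT entry in main.db.wal,
//                      0 if the WAL contains no CHECKPOINT entry (assumption).
// checkpoint_wal_exists: whether main.db.checkpoint.wal is present.
RecoveryAction DecideRecovery(uint64_t db_root, uint64_t wal_checkpoint_root,
                              bool checkpoint_wal_exists) {
	if (wal_checkpoint_root == 0 || !checkpoint_wal_exists) {
		// No CHECKPOINT node in the main WAL (e.g. crash during recovery after
		// the mv, before the rm): ignore any checkpoint WAL, replay main only.
		return RecoveryAction::ReplayMainOnly;
	}
	if (db_root == wal_checkpoint_root) {
		// Checkpoint finished: the main WAL is already in the database file,
		// so only the checkpoint WAL needs replaying.
		return RecoveryAction::ReplayCheckpointOnly;
	}
	// Checkpoint did not finish: replay both WALs, main WAL first.
	return RecoveryAction::ReplayMainThenCheckpoint;
}
```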
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request on Jan 1, 2026
Avoid eagerly grabbing the checkpoint lock when performing inserts, instead grab a shared table lock (duckdb/duckdb#19937)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request on Jan 1, 2026
Avoid eagerly grabbing the checkpoint lock when performing inserts, instead grab a shared table lock (duckdb/duckdb#19937)
Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>