
Avoid eagerly grabbing the checkpoint lock when performing inserts, instead grab a shared table lock#19937

Merged
Mytherin merged 6 commits into duckdb:main from Mytherin:insertnocheckpointlock
Nov 26, 2025

Conversation

@Mytherin
Collaborator

Currently, whenever a database is about to be modified by a statement such as INSERT or DELETE, we eagerly grab the SharedCheckpointLock and store it in the DuckTransaction. As long as that transaction lives, no checkpoint can be initiated. This can persist even after the transaction has concluded, because the lock is only released when the transaction is cleaned up, which only happens once no active transactions depend on the data stored within it anymore. If there are readers that need an old snapshot of the data for a long time, these transactions may not be cleaned up for equally long.

This has two negative effects in concurrent scenarios with several writers, or even a single writer and a single (separate) checkpointer:

  • Checkpoints can fail to be performed, as the checkpoint lock cannot be grabbed. As a result, we might not checkpoint for extended periods of time, leading to the WAL file continuously growing, data not being compressed / vacuumed, etc.
  • When a checkpoint is performed (either forcibly through FORCE CHECKPOINT, or when the lock can finally be grabbed), writers must wait for the checkpoint to complete, preventing them from continuing their operations and potentially introducing large latency spikes.

This PR takes an initial step towards grabbing the checkpoint lock less often and allowing more operations to proceed while a checkpoint is performed (and vice versa: allowing a checkpoint to start in more scenarios). It removes the eager grabbing of the SharedCheckpointLock for appends (e.g. through an INSERT INTO statement). This is accomplished by passing the intended database modification to RegisterDBModify.

If the modification is an append, we allow it to proceed without grabbing the shared checkpoint lock. Instead, when performing an append on a table, we grab a shared lock on that table (the lock previously used for scans). Checkpoints can then still start, but must wait for any in-flight append on that table to finish (and vice versa).
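The new locking behaviour can be summarized in a small sketch (Python for illustration only; `locks_acquired` and the lock names are hypothetical stand-ins, not DuckDB's actual API):

```python
# Illustrative model of which lock an operation grabs for a given
# database modification, before and after this PR. Names are invented
# for the sketch; the real locks live in DuckDB's C++ transaction layer.

def locks_acquired(modification: str, after_this_pr: bool) -> set[str]:
    if modification == "append" and after_this_pr:
        # Appends now only take a shared lock on the target table;
        # a checkpoint takes the exclusive counterpart, so the two
        # serialize per table instead of blocking each other globally.
        return {"shared_table_lock"}
    # Every other modification (DELETE, UPDATE, etc.) still eagerly
    # grabs the shared checkpoint lock for the transaction's lifetime.
    return {"shared_checkpoint_lock"}
```

Under this model, a long-running append no longer blocks checkpoints of other tables, while deletes and updates behave as before.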

Note that this is still rather limited, because we still grab the SharedCheckpointLock when writing to the WAL. That means we still cannot commit transactions that have inserted data while a checkpoint is happening (and cannot start a checkpoint while a transaction that inserted data is writing to the WAL). I plan to tackle that in a follow-up PR.

@Mytherin Mytherin merged commit f38e0ff into duckdb:main Nov 26, 2025
56 checks passed
d-justen pushed a commit to d-justen/duckdb that referenced this pull request Nov 26, 2025
…nstead grab a shared table lock (duckdb#19937)

Mytherin added a commit that referenced this pull request Nov 26, 2025
In #19937 I fixed an issue with
`RevertAppend` potentially leaving the segment tree in an undefined
state in certain situations (when we would truncate at exactly the row
group boundary). This method is called when a unique / primary key
constraint violation is found during the commit process. This could then
cause issues in subsequent code. This PR backports the fix for
this to v1.4.
@Mytherin Mytherin deleted the insertnocheckpointlock branch December 4, 2025 11:30
Mytherin added a commit that referenced this pull request Dec 5, 2025

Follow-up from #19937

This PR enables commits to continue while a checkpoint is happening.
Currently this is limited to commits that exclusively insert data, as
any other changes (deletes, updates, catalog changes, alters, etc) still
eagerly grab the checkpoint lock which will prevent a checkpoint from
starting while these changes are pending, and vice versa. It is also
limited to inserting data into tables **that do not have indexes**. As
part of this PR, appending to a table that has indexes now grabs the
checkpoint lock again.

Enabling commits while checkpointing has two consequences for the system
that need to be dealt with:

* Checkpointing no longer necessarily checkpoints the latest commit.
* While checkpointing, new commits that happen need to be written
somewhere in order for them to be durable. We can no longer write them
to the old WAL as we want to truncate it after our checkpoint is
finished.

### Pinned Checkpoint Commit

Previously, the checkpointing code assumed we were always checkpointing
the latest commit. This is no longer correct, since the "latest
committed data" may now change *while a checkpoint is running*.
Instead, what we need to do is choose a commit id on which we will
checkpoint. When starting a checkpoint we get the latest commit id and
checkpoint based on that commit. Subsequent commits are not written as
part of the checkpoint, but can then be written as part of a future
checkpoint.

To simplify this, we ensure that after a checkpoint starts, any new
data is always written to *new row groups*. This
is managed in `DataTable::AppendLock`. Due to this, when performing the
checkpoint, we only need to know "do we need to checkpoint this row
group or not", rather than having to checkpoint a part of a row group.
This is handled in the new method `RowGroup::ShouldCheckpointRowGroup`.
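Since appends after checkpoint start always land in new row groups, the per-row-group decision reduces to a comparison against the pinned commit id. A minimal sketch (Python for illustration; the signature is invented, the real logic lives in `RowGroup::ShouldCheckpointRowGroup`):

```python
# Hypothetical sketch of the "pinned commit" decision: a row group
# belongs to the checkpoint only if it was committed at or before the
# commit id pinned when the checkpoint started.

def should_checkpoint_row_group(row_group_commit_id: int,
                                pinned_commit_id: int) -> bool:
    # Data written after the checkpoint starts always goes into new
    # row groups, so a whole-row-group check suffices: there is never
    # a need to checkpoint only part of a row group.
    return row_group_commit_id <= pinned_commit_id
```

Row groups committed after the pin simply wait for a future checkpoint.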

#### Free Blocks

Another challenge with the pinned checkpoint commit is how to manage the
list of free blocks - i.e. blocks that are present in the file but are
not used. Block usage is tracked globally in the
`SingleFileBlockManager`. With optimistic writes, we can write to blocks
in the storage layer (i.e. make them no longer free blocks). However, if
a checkpoint happens at a pinned commit, any optimistic writes that
happen after the commit is pinned do not belong to that checkpoint. If
we don't write these blocks in the free block list, we might get
dangling blocks in case an abort or rollback happens. However, the
blocks are not actually free in-memory, as they are being used by the
optimistically written data.

To solve this issue, we introduce a new set in the block manager,
`newly_used_blocks`, which tracks blocks that are in use but not yet
part of a given checkpoint.
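A toy model of this bookkeeping (Python for illustration; `BlockManagerSketch` and its methods are invented names, greatly simplified from the actual `SingleFileBlockManager`):

```python
# Simplified model: blocks optimistically written after the checkpoint's
# commit was pinned are in use in memory, but the running checkpoint
# must still list them as free so an abort/rollback cannot leave
# dangling blocks.

class BlockManagerSketch:
    def __init__(self, free_blocks):
        self.free_blocks = set(free_blocks)  # free as far as the disk knows
        self.newly_used_blocks = set()       # grabbed after the commit pin

    def optimistic_write(self, block_id):
        # The block is no longer free in memory...
        self.free_blocks.discard(block_id)
        self.newly_used_blocks.add(block_id)

    def checkpoint_free_list(self):
        # ...but the checkpoint being written still records it as free,
        # because the optimistic write is not part of that checkpoint.
        return self.free_blocks | self.newly_used_blocks
```

Once the optimistic write commits and is covered by a later checkpoint, the block would leave `newly_used_blocks`.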

### Checkpoint WAL

New commits that happen while checkpointing have to be written
somewhere. In order to still allow for the checkpoint to truncate the
WAL to prevent it from growing indefinitely, we introduce the concept of
a **checkpoint WAL**. This is a secondary WAL that can be written to
only by concurrent commits while a checkpoint is happening.

When a checkpoint is started, the checkpoint flag is written to the
original WAL. The checkpoint flag contains the root metadata block
pointer that will be written **when the checkpoint is successful**.


```
main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]
```

The checkpoint flag allows us, during recovery, to determine whether
the checkpoint completed, which in turn determines whether we need to
replay the WAL. The current version already does this to handle a crash
between flipping the root block pointer and truncating the WAL; in the
new version, however, the flag is written before **any** data is
written instead of only at the end.

After this is written, we set any new commits to write to the checkpoint
WAL. For example, assume a new commit comes in that inserts some data.
We will now have the following situation:

```
main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

After the checkpoint is finished, we have flushed all changes in
`main.db.wal` to the main database file, while the changes in
`main.db.checkpoint.wal` have not been flushed. All we need to do is
move over the checkpoint WAL and have it replace the original WAL. This
will lead us to the following final result after the checkpoint:

```
main.db.wal
[INSERT #2][COMMIT]
```


#### Recovery

In order to provide ACID compliance, all commits that have succeeded must
be persisted even across failures. That means that any commits that are
written to the checkpoint WAL need to be persisted no matter where we
crash. Below is a list of failure modes:

##### Crash Before Checkpoint Complete

Our situation is like this:

```
main.db
[ROOT #1]

main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

In order to recover in this situation, we need to replay both
`main.db.wal` and `main.db.checkpoint.wal`. The recovery process sees
that the checkpoint root does not match the root in the database, and
also checks for the presence of a checkpoint WAL. It then replays
both in order (`main.db.wal` -> `main.db.checkpoint.wal`).

If this is a `READ_WRITE` connection it merges the two WALs **except for
the checkpoint node** by writing a new WAL that contains the content of
both WALs:

```
main.db.recovery.wal
[INSERT #1][COMMIT][INSERT #2][COMMIT]
```

After that completes, it overwrites the main WAL with the recovery WAL.
Finally, it removes the checkpoint WAL.

```
mv main.db.recovery.wal main.db.wal
rm main.db.checkpoint.wal
```

##### Crash During Recovery

If we crash during the above recovery process (after the `mv`, before
the `rm`), we would have this situation:

```
main.db
[ROOT #1]

main.db.wal
[INSERT #1][COMMIT][INSERT #2][COMMIT]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

This is safe to recover from because `main.db.wal` does not contain a
`CHECKPOINT` node. As such, we will not replay the checkpoint WAL, and
only `main.db.wal` will be replayed.

##### Crash After Checkpoint Complete, Before WAL Move


Our situation is like this:

```
main.db
[ROOT #2]

main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

In order to recover in this situation, we need to replay only
`main.db.checkpoint.wal`. The recovery process sees that the
checkpoint root matches the root in the database, so it knows it does
not need to replay `main.db.wal`. It then checks for the presence of
the checkpoint WAL and, since it is present, replays it.

If this is a `READ_WRITE` connection it then completes the checkpoint by
finalizing the move (i.e. `mv main.db.checkpoint.wal main.db.wal`).
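The replay decision across the three failure modes above can be condensed into one sketch (Python for illustration; `wals_to_replay` is an invented function, not DuckDB's recovery code):

```python
# Which WALs recovery replays, based on the database root, the root
# recorded in the WAL's CHECKPOINT flag (None if there is no CHECKPOINT
# entry), and whether a checkpoint WAL file exists.

def wals_to_replay(db_root: int,
                   checkpoint_flag_root,
                   has_checkpoint_wal: bool) -> list:
    if checkpoint_flag_root is None:
        # No checkpoint was in progress; never replay the checkpoint
        # WAL. This also covers the crash-during-recovery case, where
        # the recovery WAL has already replaced main.db.wal.
        return ["main.db.wal"]
    if db_root == checkpoint_flag_root:
        # Checkpoint completed: main.db.wal is already in the database
        # file; only the concurrent commits in the checkpoint WAL remain.
        return ["main.db.checkpoint.wal"] if has_checkpoint_wal else []
    # Checkpoint did not complete: replay both WALs, in order.
    replay = ["main.db.wal"]
    if has_checkpoint_wal:
        replay.append("main.db.checkpoint.wal")
    return replay
```

Each branch corresponds to one of the crash scenarios walked through above.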


### Other Changes / Fixes


#### Windows: make `FileSystem::MoveFile` behave like Linux/MacOS

On Linux/MacOS, `MoveFile` is used to mean "move and overwrite the
target file". On Windows, this would previously fail if the target
already exists. This PR makes Windows behave like Linux/MacOS by
passing `MOVEFILE_REPLACE_EXISTING` to `MoveFileExW`. In addition,
because we tend to use `MoveFile` to mean "we want to be certain this
file was moved", we also enable the `MOVEFILE_WRITE_THROUGH` flag.


#### SQLLogicTest 

While testing this PR, I realized the sqllogictest runner was
swallowing exceptions thrown in certain locations and incorrectly
reporting tests that should have failed as passing. This PR fixes that
and makes these exceptions fail the test run. This revealed a number of
failing tests, in particular around the config runners
`peg_parser.json` and `encryption.json`, and a few tests in `httpfs`.
A few tests were fixed, while others were skipped in the config,
pending future investigation.
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Jan 1, 2026
Avoid eagerly grabbing the checkpoint lock when performing inserts, instead grab a shared table lock (duckdb/duckdb#19937)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Jan 1, 2026
Avoid eagerly grabbing the checkpoint lock when performing inserts, instead grab a shared table lock (duckdb/duckdb#19937)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>