
Avoid eagerly grabbing the checkpoint lock when performing inserts, instead grab a shared table lock#19937

Merged
Mytherin merged 6 commits into duckdb:main from Mytherin:insertnocheckpointlock
Nov 26, 2025

Conversation

@Mytherin
Collaborator

Currently, whenever a database is about to be modified by a statement such as INSERT or DELETE, we eagerly grab the SharedCheckpointLock and store it in the DuckTransaction. As long as that transaction lives, no checkpoint can be initiated. This can persist even after the transaction has concluded, because the lock is only released when the transaction is cleaned up, which only happens once no active transactions depend on the data stored within it anymore. If there are readers that need an old snapshot of the data for a long time, these transactions may not be cleaned up for equally long.

This has two negative effects in concurrent scenarios with several writers, or even a single writer and a single (separate) checkpointer:

  • Checkpoints can fail to be performed, as the checkpoint lock cannot be grabbed. As a result, we might not checkpoint for extended periods of time, leading to the WAL file continuously growing, data not being compressed / vacuumed, etc.
  • When a checkpoint is performed (either forcibly through FORCE CHECKPOINT, or when the lock can finally be grabbed), writers must wait for the checkpoint to complete, preventing them from continuing their operations and potentially introducing large latency spikes.

This PR takes an initial step towards grabbing the checkpoint lock less often and allowing more operations to proceed while a checkpoint is performed (and vice versa: allowing a checkpoint to start in more scenarios). It removes the eager grabbing of the SharedCheckpointLock for appends (e.g. through an INSERT INTO statement). This is accomplished by passing the intended database modification to RegisterDBModify.

If the modification is an append, we allow it to proceed without grabbing the shared checkpoint lock. Instead, when performing an append on a table, we grab a shared lock on that table (the lock previously used for scans). Checkpoints can then still start, but must wait for any in-flight append on that table to finish (and vice versa).
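The new locking behaviour can be summarized in a small sketch (Python for illustration only; `locks_acquired` and the lock names are hypothetical stand-ins, not DuckDB's actual API):

```python
# Illustrative model of which lock an operation grabs for a given
# database modification, before and after this PR. Names are invented
# for the sketch; the real locks live in DuckDB's C++ transaction layer.

def locks_acquired(modification: str, after_this_pr: bool) -> set[str]:
    if modification == "append" and after_this_pr:
        # Appends now only take a shared lock on the target table;
        # a checkpoint takes the exclusive counterpart, so the two
        # serialize per table instead of blocking each other globally.
        return {"shared_table_lock"}
    # Every other modification (DELETE, UPDATE, etc.) still eagerly
    # grabs the shared checkpoint lock for the transaction's lifetime.
    return {"shared_checkpoint_lock"}
```

Under this model, a long-running append no longer blocks checkpoints of other tables, while deletes and updates behave as before.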

Note that this is still rather limited, because we still grab the SharedCheckpointLock when writing to the WAL. That means we still cannot commit transactions that have inserted data while a checkpoint is happening (and cannot start a checkpoint while a transaction that inserted data is writing to the WAL). I plan to tackle that in a follow-up PR.

@Mytherin Mytherin merged commit f38e0ff into duckdb:main Nov 26, 2025
56 checks passed
d-justen pushed a commit to d-justen/duckdb that referenced this pull request Nov 26, 2025
…nstead grab a shared table lock (duckdb#19937)

Mytherin added a commit that referenced this pull request Nov 26, 2025
In #19937 I fixed an issue with
`RevertAppend` potentially leaving the segment tree in an undefined
state in certain situations (when we would truncate at exactly the row
group boundary). This method is called when a unique / primary key
constraint violation is found during the commit process. This could then
cause issues in subsequent code. This PR backports the fix for
this to v1.4.
@Mytherin Mytherin deleted the insertnocheckpointlock branch December 4, 2025 11:30
Mytherin added a commit that referenced this pull request Dec 5, 2025

Follow-up from #19937

This PR enables commits to continue while a checkpoint is happening.
Currently this is limited to commits that exclusively insert data, as
any other changes (deletes, updates, catalog changes, alters, etc) still
eagerly grab the checkpoint lock which will prevent a checkpoint from
starting while these changes are pending, and vice versa. It is also
limited to inserting data into tables **that do not have indexes**. As
part of this PR, appending to a table that has indexes now grabs the
checkpoint lock again.

Enabling commits while checkpointing has two consequences for the system
that need to be dealt with:

* Checkpointing no longer necessarily checkpoints the latest commit.
* While checkpointing, new commits that happen need to be written
somewhere in order for them to be durable. We can no longer write them
to the old WAL as we want to truncate it after our checkpoint is
finished.

### Pinned Checkpoint Commit

Previously, the checkpointing code assumed we were always checkpointing
the latest commit. This is no longer correct, since the "latest
committed data" may now change *while a checkpoint is running*.
Instead, what we need to do is choose a commit id on which we will
checkpoint. When starting a checkpoint we get the latest commit id and
checkpoint based on that commit. Subsequent commits are not written as
part of the checkpoint, but can then be written as part of a future
checkpoint.

To simplify this, we ensure that after a checkpoint starts, any new
data is always written to *new row groups*. This
is managed in `DataTable::AppendLock`. Due to this, when performing the
checkpoint, we only need to know "do we need to checkpoint this row
group or not", rather than having to checkpoint a part of a row group.
This is handled in the new method `RowGroup::ShouldCheckpointRowGroup`.
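Since appends after checkpoint start always land in new row groups, the per-row-group decision reduces to a comparison against the pinned commit id. A minimal sketch (Python for illustration; the signature is invented, the real logic lives in `RowGroup::ShouldCheckpointRowGroup`):

```python
# Hypothetical sketch of the "pinned commit" decision: a row group
# belongs to the checkpoint only if it was committed at or before the
# commit id pinned when the checkpoint started.

def should_checkpoint_row_group(row_group_commit_id: int,
                                pinned_commit_id: int) -> bool:
    # Data written after the checkpoint starts always goes into new
    # row groups, so a whole-row-group check suffices: there is never
    # a need to checkpoint only part of a row group.
    return row_group_commit_id <= pinned_commit_id
```

Row groups committed after the pin simply wait for a future checkpoint.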

#### Free Blocks

Another challenge with the pinned checkpoint commit is how to manage the
list of free blocks - i.e. blocks that are present in the file but are
not used. Block usage is tracked globally in the
`SingleFileBlockManager`. With optimistic writes, we can write to blocks
in the storage layer (i.e. make them no longer free blocks). However, if
a checkpoint happens at a pinned commit, any optimistic writes that
happen after the commit is pinned do not belong to that checkpoint. If
we don't write these blocks in the free block list, we might get
dangling blocks in case an abort or rollback happens. However, the
blocks are not actually free in-memory, as they are being used by the
optimistically written data.

To solve this issue, we introduce a new set in the block manager,
`newly_used_blocks`, which tracks blocks that are in use but not yet
part of a given checkpoint.
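A toy model of this bookkeeping (Python for illustration; `BlockManagerSketch` and its methods are invented names, greatly simplified from the actual `SingleFileBlockManager`):

```python
# Simplified model: blocks optimistically written after the checkpoint's
# commit was pinned are in use in memory, but the running checkpoint
# must still list them as free so an abort/rollback cannot leave
# dangling blocks.

class BlockManagerSketch:
    def __init__(self, free_blocks):
        self.free_blocks = set(free_blocks)  # free as far as the disk knows
        self.newly_used_blocks = set()       # grabbed after the commit pin

    def optimistic_write(self, block_id):
        # The block is no longer free in memory...
        self.free_blocks.discard(block_id)
        self.newly_used_blocks.add(block_id)

    def checkpoint_free_list(self):
        # ...but the checkpoint being written still records it as free,
        # because the optimistic write is not part of that checkpoint.
        return self.free_blocks | self.newly_used_blocks
```

Once the optimistic write commits and is covered by a later checkpoint, the block would leave `newly_used_blocks`.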

### Checkpoint WAL

New commits that happen while checkpointing have to be written
somewhere. In order to still allow for the checkpoint to truncate the
WAL to prevent it from growing indefinitely, we introduce the concept of
a **checkpoint WAL**. This is a secondary WAL that can be written to
only by concurrent commits while a checkpoint is happening.

When a checkpoint is started, the checkpoint flag is written to the
original WAL. The checkpoint flag contains the root metadata block
pointer that will be written **when the checkpoint is successful**.


```
main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]
```

The checkpoint flag allows us, during recovery, to determine whether
the checkpoint completed, which in turn determines whether we need to
replay the WAL. The current version already does this to handle a crash
between flipping the root block pointer and truncating the WAL; in the
new version, however, the flag is written before **any** data is
written instead of only at the end.

After this is written, we set any new commits to write to the checkpoint
WAL. For example, assume a new commit comes in that inserts some data.
We will now have the following situation:

```
main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

After the checkpoint is finished, we have flushed all changes in
`main.db.wal` to the main database file, while the changes in
`main.db.checkpoint.wal` have not been flushed. All we need to do is
move over the checkpoint WAL and have it replace the original WAL. This
will lead us to the following final result after the checkpoint:

```
main.db.wal
[INSERT #2][COMMIT]
```


#### Recovery

In order to provide ACID compliance, all commits that have succeeded must
be persisted even across failures. That means that any commits that are
written to the checkpoint WAL need to be persisted no matter where we
crash. Below is a list of failure modes:

##### Crash Before Checkpoint Complete

Our situation is like this:

```
main.db
[ROOT #1]

main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

In order to recover in this situation, we need to replay both
`main.db.wal` and `main.db.checkpoint.wal`. The recovery process sees
that the checkpoint root does not match the root in the database, and
also checks for the presence of a checkpoint WAL. It then replays
both in order (`main.db.wal` -> `main.db.checkpoint.wal`).

If this is a `READ_WRITE` connection it merges the two WALs **except for
the checkpoint node** by writing a new WAL that contains the content of
both WALs:

```
main.db.recovery.wal
[INSERT #1][COMMIT][INSERT #2][COMMIT]
```

After that completes, it overwrites the main WAL with the recovery WAL.
Finally, it removes the checkpoint WAL.

```
mv main.db.recovery.wal main.db.wal
rm main.db.checkpoint.wal
```

##### Crash During Recovery

If we crash during the above recovery process (after the `mv`, before
the `rm`), we would have this situation:

```
main.db
[ROOT #1]

main.db.wal
[INSERT #1][COMMIT][INSERT #2][COMMIT]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

This is safe to recover from because `main.db.wal` does not contain a
`CHECKPOINT` node. As such, we will not replay the checkpoint WAL, and
only `main.db.wal` will be replayed.

##### Crash After Checkpoint Complete, Before WAL Move


Our situation is like this:

```
main.db
[ROOT #2]

main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

In order to recover in this situation, we need to replay only
`main.db.checkpoint.wal`. The recovery process sees that the
checkpoint root matches the root in the database, so it knows it does
not need to replay `main.db.wal`. It then checks for the presence of
the checkpoint WAL and, since it is present, replays it.

If this is a `READ_WRITE` connection it then completes the checkpoint by
finalizing the move (i.e. `mv main.db.checkpoint.wal main.db.wal`).
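The replay decision across the three failure modes above can be condensed into one sketch (Python for illustration; `wals_to_replay` is an invented function, not DuckDB's recovery code):

```python
# Which WALs recovery replays, based on the database root, the root
# recorded in the WAL's CHECKPOINT flag (None if there is no CHECKPOINT
# entry), and whether a checkpoint WAL file exists.

def wals_to_replay(db_root: int,
                   checkpoint_flag_root,
                   has_checkpoint_wal: bool) -> list:
    if checkpoint_flag_root is None:
        # No checkpoint was in progress; never replay the checkpoint
        # WAL. This also covers the crash-during-recovery case, where
        # the recovery WAL has already replaced main.db.wal.
        return ["main.db.wal"]
    if db_root == checkpoint_flag_root:
        # Checkpoint completed: main.db.wal is already in the database
        # file; only the concurrent commits in the checkpoint WAL remain.
        return ["main.db.checkpoint.wal"] if has_checkpoint_wal else []
    # Checkpoint did not complete: replay both WALs, in order.
    replay = ["main.db.wal"]
    if has_checkpoint_wal:
        replay.append("main.db.checkpoint.wal")
    return replay
```

Each branch corresponds to one of the crash scenarios walked through above.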


### Other Changes / Fixes


#### Windows: make `FileSystem::MoveFile` behave like Linux/MacOS

On Linux/MacOS, `MoveFile` is used to mean "move and overwrite the
target file". On Windows, this would previously fail if the target
already exists. This PR makes Windows behave like Linux/MacOS by
passing `MOVEFILE_REPLACE_EXISTING` to `MoveFileExW`. In addition,
because we tend to use `MoveFile` to mean "we want to be certain this
file was moved", we also enable the `MOVEFILE_WRITE_THROUGH` flag.


#### SQLLogicTest 

While testing this PR, I realized the sqllogictest runner was
swallowing exceptions thrown in certain locations and incorrectly
reporting tests that should have failed as passing. This PR fixes that
and makes these exceptions fail the test run. This revealed a number of
failing tests, in particular around the config runners
`peg_parser.json` and `encryption.json`, and a few tests in `httpfs`.
A few tests were fixed, while others were skipped in the config,
pending future investigation.
github-actions bot pushed a commit to duckdb/duckdb-r that referenced this pull request Jan 1, 2026
Avoid eagerly grabbing the checkpoint lock when performing inserts, instead grab a shared table lock (duckdb/duckdb#19937)
github-actions bot added a commit to duckdb/duckdb-r that referenced this pull request Jan 1, 2026
Avoid eagerly grabbing the checkpoint lock when performing inserts, instead grab a shared table lock (duckdb/duckdb#19937)

Co-authored-by: krlmlr <krlmlr@users.noreply.github.com>