Fix underspecified ROW_NUMBER() OVER(...) clause in test #1

Merged
lnkuiper merged 1 commit into lnkuiper:16989 from kryonix:fix_test_fib_complex on Aug 11, 2025
Conversation

@kryonix commented Aug 11, 2025

There was an underspecified ROW_NUMBER() window function, which could produce correct but non-deterministic query results.
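Why an underspecified window clause is a problem can be sketched outside SQL. The following Python sketch (purely illustrative, not DuckDB code) numbers rows after sorting by a key with ties, the analogue of `ROW_NUMBER() OVER (ORDER BY key)` where `key` is not unique: the numbering then depends on the physical input order, unless a unique tiebreaker is added.

```python
# Illustrative sketch: ROW_NUMBER() over a non-unique ordering key is
# non-deterministic, because rows that tie on the key can come out in any order.

def row_number(rows, order_key):
    """Assign 1-based row numbers after sorting by order_key (stable sort)."""
    ranked = sorted(rows, key=order_key)
    return {row: i + 1 for i, row in enumerate(ranked)}

# Two valid physical orders of the same data; 'a' and 'b' tie on key=1.
data1 = [("a", 1), ("b", 1), ("c", 2)]
data2 = [("b", 1), ("a", 1), ("c", 2)]

# Ordering by the tying column alone: the row number of 'a' depends on input order.
by_key = lambda r: r[1]
assert row_number(data1, by_key)[("a", 1)] != row_number(data2, by_key)[("a", 1)]

# Adding a unique tiebreaker (here: the first column) makes the result deterministic.
by_key_tiebreak = lambda r: (r[1], r[0])
assert row_number(data1, by_key_tiebreak) == row_number(data2, by_key_tiebreak)
```

The fix in this PR is the SQL equivalent of the tiebreaker: fully specifying the ORDER BY inside the OVER(...) clause.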

@lnkuiper lnkuiper merged commit b65e0db into lnkuiper:16989 Aug 11, 2025
lnkuiper pushed a commit that referenced this pull request Aug 18, 2025
I have seen this crashing due to an invalid pointer on which a destructor
is called, on last night's `main` (`2ed9bf887f`), using a unittester
compiled from sources (clang 17) and extensions installed from the
default extension repository.

Basically:
```
DYLD_INSERT_LIBRARIES=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/17/lib/darwin/libclang_rt.asan_osx_dynamic.dylib LOCAL_EXTENSION_REPO=http://extensions.duckdb.org ./build/release/test/unittest --autoloading all --skip-compiled  --order rand test/parquet/test_parquet_schema.test
```
and seeing runtime sanitizer errors such as:
```
==56046==ERROR: AddressSanitizer: container-overflow on address 0x6100000d4dcf at pc 0x000116c7f450 bp 0x00016fc1d170 sp 0x00016fc1d168
READ of size 1 at 0x6100000d4dcf thread T0
    #0 0x000116c7f44c in std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>* std::__1::__uninitialized_allocator_copy_impl[abi:ne190102]<std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*>(std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*)+0x318 (parquet.duckdb_extension:arm64+0xab44c)
    #1 0x000116c7ec90 in void std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__construct_at_end<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, unsigned long)+0x160 (parquet.duckdb_extension:arm64+0xaac90)
    #2 0x000116c7e7d8 in void std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__assign_with_size[abi:ne190102]<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, long)+0x1e0 (parquet.duckdb_extension:arm64+0xaa7d8)
    #3 0x000116e8cd48 in duckdb::ParquetMultiFileInfo::BindReader(duckdb::ClientContext&, duckdb::vector<duckdb::LogicalType, true>&, duckdb::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, true>&, duckdb::MultiFileBindData&)+0xf18 (parquet.duckdb_extension:arm64+0x2b8d48)
    #4 0x000116e6e5fc in duckdb::MultiFileFunction<duckdb::ParquetMultiFileInfo>::MultiFileBindInternal(duckdb::ClientContext&, duckdb::unique_ptr<duckdb::MultiFileReader, std::__1::default_delete<duckdb::MultiFileReader>, true>, duckdb::shared_ptr<duckdb::MultiFileList, true>, duckdb::vector<duckdb::LogicalType, true>&, duckdb::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, true>&, duckdb::MultiFileOptions, duckdb::unique_ptr<duckdb::BaseFileReaderOptions, std::__1::default_delete<duckdb::BaseFileReaderOptions>, true>, duckdb::unique_ptr<duckdb::MultiFileReaderInterface, std::__1::default_delete<duckdb::MultiFileReaderInterface>, true>)+0x1210 (parquet.duckdb_extension:arm64+0x29a5fc)
```

or these failures while using DuckLake:
```
==56079==ERROR: AddressSanitizer: container-overflow on address 0x616000091a78 at pc 0x0001323fc81c bp 0x00016bd0e890 sp 0x00016bd0e888
READ of size 8 at 0x616000091a78 thread T2049
    #0 0x0001323fc818 in duckdb::MultiFileColumnDefinition::~MultiFileColumnDefinition()+0x258 (parquet.duckdb_extension:arm64+0x2a4818)
    #1 0x0001323fb588 in std::__1::vector<duckdb::MultiFileColumnDefinition, std::__1::allocator<duckdb::MultiFileColumnDefinition>>::__destroy_vector::operator()[abi:ne190102]()+0x98 (parquet.duckdb_extension:arm64+0x2a3588)
    #2 0x0001324a09e4 in duckdb::BaseFileReader::~BaseFileReader()+0x2bc (parquet.duckdb_extension:arm64+0x3489e4)
    #3 0x0001324a23ec in duckdb::ParquetReader::~ParquetReader()+0x22c (parquet.duckdb_extension:arm64+0x34a3ec)
```

With these changes, once the `parquet` extension is built by CI, this
works as expected.

I am not sure whether the fix could or should live elsewhere.
lnkuiper pushed a commit that referenced this pull request Sep 3, 2025
…groups, and limiting vacuum operations for the last number of row groups (duckdb#18829)

When performing a checkpoint, we rewrite columns of row groups that have
had modifications. When performing insertions, all columns are modified,
thus all columns of a row group must be rewritten. As a result, any
insertion followed by a checkpoint triggers a full rewrite of the last
row group.

When dealing with wide tables or string columns with large strings,
rewriting a row group can be relatively expensive. This is particularly
the case when only small insertions have been made. For example,
inserting a single row followed by checkpointing can become expensive as
a result of triggering a rewrite of the last row group:

```sql
create table z as
select repeat(sha256(i::varchar), 10) as a,
       repeat(sha256((i**2)::varchar), 10) as b,
       repeat(sha256((i**3)::varchar), 10) as c
from range(100000) r(i);
CHECKPOINT;
.timer on
insert into z values ('a','b','c');
Run Time (s): real 0.006 user 0.004249 sys 0.001186
CHECKPOINT;
Run Time (s): real 0.611 user 1.163009 sys 0.054814
```

The checkpoint takes 0.6s, despite us inserting only a single row.

#### Appending a new row group

This PR solves this problem by appending a new row group when writing to
a table that has been persisted/checkpointed. As a result, we will no
longer modify the (already checkpointed) data. As a trade-off, we end up
with more (smaller) row groups.

#### Vacuuming
As part of vacuuming, we merge adjacent row groups if they fit within
the row group size. As part of this PR - we restrict merging for the
last few row groups. Otherwise, we would end up with the same problem.
We would create row groups like so:

```
100K rows
1 row
```

Then merge them back into a single row group with size `100K+1`,
effectively performing the same rewrite as before.

Instead, for the last row groups in a file, we need a *minimum
threshold* to merge. This is either the `row group size` itself, or 2X
the size of the first row group we are considering. We also always merge
row groups if the total size is `< 2048` (the vector size). This
exponential merging ensures we don't merge on every single insert, while
also ensuring we never have too many row groups.
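The threshold logic above can be sketched as follows. This is a minimal illustration, not DuckDB's actual implementation; the function name and the default row group size of 122,880 rows are assumptions for the sketch.

```python
# Hedged sketch of the merge policy described above: trailing row groups are
# merged only once their combined size reaches a minimum threshold, so a small
# insert does not trigger a re-merge on every checkpoint.

VECTOR_SIZE = 2048  # groups totalling fewer rows than this are always merged

def should_merge(group_sizes, row_group_size=122_880):
    """Decide whether a run of trailing row groups should be merged.

    group_sizes[0] is the first (largest) row group under consideration.
    """
    total = sum(group_sizes)
    if total < VECTOR_SIZE:
        return True  # tiny groups are always merged
    # Minimum threshold: the row group size itself, or 2x the first group.
    threshold = min(row_group_size, 2 * group_sizes[0])
    return total >= threshold

assert not should_merge([8000, 100])           # small tail: left alone
assert should_merge([8000, 4000, 2000, 2000])  # total reaches 2x the first group
assert should_merge([1000, 100])               # below the vector size: always merge
```

Because the threshold doubles with the first group under consideration, merges happen exponentially less often as the tail grows, which matches the row group traces shown below.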

We can see this process in action when starting with 100K rows, and
inserting 100 rows at a time - effectively repeating the below script:

```sql
insert into z select 'a','b','c' from range(100);
checkpoint;
select row_group_id, max(count) from pragma_storage_info('z') group by all order by all;
```

```text
-- insert batch #1
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     100000 │
│            1 │        100 │
└──────────────┴────────────┘

-- insert batch #30
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     100000 │
│            1 │       2000 │
│            2 │       1000 │
└──────────────┴────────────┘
-- insert batch #150
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     100000 │
│            1 │       8000 │
│            2 │       4000 │
│            3 │       2000 │
│            4 │       1000 │
└──────────────┴────────────┘
-- insert batch #228
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     100000 │
│            1 │      16000 │
│            2 │       4000 │
│            3 │       2000 │
│            4 │        800 │
└──────────────┴────────────┘
-- insert batch #229
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     122800 │
│            1 │        100 │
└──────────────┴────────────┘
```


### Performance

Running the above example again, the checkpoint now completes in `0.005`
seconds, instead of `0.6` seconds.

Note that we still perform the compaction at some point, so while most
checkpoints get faster, not all of them do.

Running the above example 300X, here are the timings:

```sql
create table z as
select repeat(sha256(i::varchar), 10) as a,
       repeat(sha256((i**2)::varchar), 10) as b,
       repeat(sha256((i**3)::varchar), 10) as c
from range(100000) r(i);
-- 300X the below insertion
insert into z select 'a','b','c' from range(100);
CHECKPOINT;
```

|           Summary           | v1.3.2  |  New   |
|-----------------------------|---------|--------|
| Total Time                  | 128.82s | 1.00s  |
| Average Time Per Checkpoint | 0.56s   | 0.01s  |
| Max Checkpoint Time         | 0.631s  | 0.526s |

In the above example, there is still a single checkpoint that performs
the full compaction, and thus has the same timing as the original script
had for every checkpoint.
lnkuiper pushed a commit that referenced this pull request Dec 1, 2025
…uckdb#19680) (duckdb#19811)

Fixes duckdb#19680

This fixes a bug where queries using `NOT EXISTS` with `IS DISTINCT
FROM` returned incorrect results due to improper handling of NULL
semantics in the optimizer.

The issue was that the optimizer's deliminator incorrectly treated
`DISTINCT FROM` variants the same as regular equality/inequality
comparisons, which have different NULL handling:
  - `IS DISTINCT FROM`: NULL-aware (`NULL IS DISTINCT FROM NULL` = FALSE)
  - `!=` or `=`: NULL-unaware (`NULL != NULL` = NULL, filters out NULLs)
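The difference in NULL semantics can be sketched with Python's `None` standing in for SQL NULL (an illustration of three-valued logic, not DuckDB internals):

```python
# Sketch of the NULL-semantics difference: SQL '=' is NULL-unaware (a comparison
# involving NULL yields NULL, i.e. unknown), while IS DISTINCT FROM is NULL-aware
# and always yields TRUE or FALSE.

def sql_eq(a, b):
    """SQL '=': returns None (unknown) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b

def is_distinct_from(a, b):
    """SQL 'IS DISTINCT FROM': treats NULL as a comparable value."""
    if a is None and b is None:
        return False
    if a is None or b is None:
        return True
    return a != b

assert sql_eq(None, None) is None             # NULL = NULL -> NULL (filtered out)
assert is_distinct_from(None, None) is False  # NULL IS DISTINCT FROM NULL -> FALSE
assert is_distinct_from(None, 1) is True
```

Treating `IS DISTINCT FROM` as if it were `!=` therefore silently drops the NULL rows that the NULL-aware comparison is supposed to match, which is exactly the bug in the plan below.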


### Incorrect Query Plan

```
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             c2            │
│                           │
│          ~0 rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             #5            │
│__internal_decompress_integ│
│     ral_integer(#3, 1)    │
│             #1            │
│                           │
│          ~0 rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      NESTED_LOOP_JOIN     │
│    ────────────────────   │
│      Join Type: ANTI      │
│    Conditions: c2 != c2   ├──────────────┐
│                           │              │
│          ~0 rows          │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│            NULL           ││            NULL           │
│             #2            ││             #2            │
│            NULL           ││            NULL           │
│             #1            ││             #1            │
│            NULL           ││            NULL           │
│             #0            ││             #0            │
│            NULL           ││            NULL           │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│             #0            ││             #0            │
│__internal_compress_integra││__internal_compress_integra│
│     l_utinyint(#1, 1)     ││     l_utinyint(#1, 1)     │
│             #2            ││             #2            │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│            NULL           ││            NULL           │
│             #0            ││             #0            │
│            NULL           ││            NULL           │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││           FILTER          │
│    ────────────────────   ││    ────────────────────   │
│         Table: t0         ││     (col0 IS NOT NULL)    │
│   Type: Sequential Scan   ││                           │
│      Projections: c2      ││                           │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└───────────────────────────┘└─────────────┬─────────────┘
                             ┌─────────────┴─────────────┐
                             │         SEQ_SCAN          │
                             │    ────────────────────   │
                             │         Table: t0         │
                             │   Type: Sequential Scan   │
                             │      Projections: c2      │
                             │                           │
                             │          ~2 rows          │
                             └───────────────────────────┘
```

The buggy plan shows two critical issues:
```
  ┌─────────────┴─────────────┐
  │      NESTED_LOOP_JOIN     │
  │      Join Type: ANTI      │
  │    Conditions: c2 != c2   │  ← ❌ Wrong (the join condition should be c2 IS DISTINCT FROM c2)
  │          ~0 rows          │
  └─────────────┬─────────────┘
                │
                └─────────────┐
                             ┌┴─────────────┐
                             │   FILTER     │
                             │ (col0 IS NOT │  ← ❌ Wrong (the filter should be removed)
                             │    NULL)     │
                             └──────────────┘
```

### Solution

This PR adds proper support for DISTINCT FROM operators throughout the
optimization pipeline:

1. Preserve DISTINCT FROM semantics in join conversion (src/optimizer/deliminator.cpp):
```cpp
// NOTE: We should NOT convert DISTINCT FROM to != in general
// Only convert if the ORIGINAL join had != or = (not DISTINCT FROM variants)
if (delim_join.join_type != JoinType::MARK &&
    original_join_comparison != ExpressionType::COMPARE_DISTINCT_FROM &&
    original_join_comparison != ExpressionType::COMPARE_NOT_DISTINCT_FROM) {
    // Safe to convert
}
```
2. Skip NULL filters for DISTINCT FROM variants (src/optimizer/deliminator.cpp):
```cpp
// Only add IS NOT NULL filter for regular equality/inequality comparisons
// Do NOT add for DISTINCT FROM variants, as they handle NULL correctly
if (cond.comparison != ExpressionType::COMPARE_NOT_DISTINCT_FROM &&
    cond.comparison != ExpressionType::COMPARE_DISTINCT_FROM) {
    // Add IS NOT NULL filter
}
```
3. Added negation support for COMPARE_DISTINCT_FROM and COMPARE_NOT_DISTINCT_FROM in expression type handling (src/common/enums/expression_type.cpp).
4. Updated the parser to properly negate IS DISTINCT FROM expressions when wrapped with NOT (src/parser/transform/expression/transform_bool_expr.cpp).
5. Added a regression test in test/sql/subquery/exists/test_correlated_exists_with_derived_table.test.
lnkuiper pushed a commit that referenced this pull request Dec 1, 2025
)

We found this issue when using the Python client (because the `.show()`
method propagates a LIMIT): large limit optimizations were getting in
the way of filter pushdowns. The fix is to push the filter down before
applying the limit whenever a filter is present. The idea is from
@Mytherin; I just added a test.

**Original Optimized logical plan:**

```text
┌─────────────────────────────┐
│┌───────────────────────────┐│
││  Optimized Logical Plan   ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│       Expressions: a      │
│                           │
│          ~0 rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│       Expressions: a      │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│        Expressions:       │
│__internal_decompress_integ│
│     ral_bigint(#0, 0)     │
│             #1            │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│          ORDER_BY         │
│    ────────────────────   │
│           rowid           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│        Expressions:       │
│__internal_compress_integra│
│     l_uinteger(#0, 0)     │
│             #1            │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      COMPARISON_JOIN      │
│    ────────────────────   │
│      Join Type: SEMI      │
│                           ├──────────────┐
│        Conditions:        │              │
│      (rowid = rowid)      │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│          SEQ_SCAN         ││           LIMIT           │
│    ────────────────────   ││    ────────────────────   │
│          Table: t         ││                           │
│   Type: Sequential Scan   ││                           │
└───────────────────────────┘└─────────────┬─────────────┘
                             ┌─────────────┴─────────────┐
                             │          SEQ_SCAN         │
                             │    ────────────────────   │
                             │       Filters: a<50       │
                             │          Table: t         │
                             │   Type: Sequential Scan   │
                             │                           │
                             │       ~400,000 rows       │
                             └───────────────────────────┘
```

Logical plan after this PR:

```text
┌─────────────────────────────┐
│┌───────────────────────────┐│
││  Optimized Logical Plan   ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│       Expressions: a      │
│                           │
│          ~0 rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           LIMIT           │
│    ────────────────────   │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│          SEQ_SCAN         │
│    ────────────────────   │
│       Filters: a<50       │
│          Table: t         │
│   Type: Sequential Scan   │
│                           │
│       ~400,000 rows       │
└───────────────────────────┘
```
lnkuiper pushed a commit that referenced this pull request Dec 5, 2025
…kdb#20052)

Follow-up from duckdb#19937

This PR enables commits to continue while a checkpoint is happening.
Currently this is limited to commits that exclusively insert data, as
any other changes (deletes, updates, catalog changes, alters, etc) still
eagerly grab the checkpoint lock which will prevent a checkpoint from
starting while these changes are pending, and vice versa. It is also
limited to inserting data into tables **that do not have indexes**. As
part of this PR, appending to a table that has indexes now grabs the
checkpoint lock again.

Enabling commits while checkpointing has two consequences for the system
that need to be dealt with:

* Checkpointing no longer checkpoints the latest commit. 
* While checkpointing, new commits that happen need to be written
somewhere in order for them to be durable. We can no longer write them
to the old WAL as we want to truncate it after our checkpoint is
finished.

### Pinned Checkpoint Commit

Previously, the checkpointing code assumed we were always checkpointing
the latest commit. This is no longer correct, since what the "latest
committed data" is might now change *while a checkpoint is running*.
Instead, what we need to do is choose a commit id on which we will
checkpoint. When starting a checkpoint we get the latest commit id and
checkpoint based on that commit. Subsequent commits are not written as
part of the checkpoint, but can then be written as part of a future
checkpoint.

In order to simplify this - we ensure that after starting a checkpoint
any new data that is written is always written to *new row groups*. This
is managed in `DataTable::AppendLock`. Due to this, when performing the
checkpoint, we only need to know "do we need to checkpoint this row
group or not", rather than having to checkpoint a part of a row group.
This is handled in the new method `RowGroup::ShouldCheckpointRowGroup`.
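Because post-pin appends always go to new row groups, the decision reduces to a per-row-group comparison against the pinned commit id. A minimal sketch of that decision (illustrative only; DuckDB's actual logic lives in `RowGroup::ShouldCheckpointRowGroup` and is more involved):

```python
# Hedged sketch: with a pinned checkpoint commit, each row group either belongs
# to the checkpoint entirely or not at all; we never checkpoint part of a group.

def should_checkpoint_row_group(row_group_commit_id, pinned_commit_id):
    """Row groups appended at or before the pinned commit belong to this checkpoint."""
    return row_group_commit_id <= pinned_commit_id

pinned = 10  # commit id chosen when the checkpoint started
assert should_checkpoint_row_group(9, pinned)       # committed before the pin
assert should_checkpoint_row_group(10, pinned)      # the pinned commit itself
assert not should_checkpoint_row_group(11, pinned)  # concurrent commit: next checkpoint
```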

#### Free Blocks

Another challenge with the pinned checkpoint commit is how to manage the
list of free blocks - i.e. blocks that are present in the file but are
not used. Block usage is tracked globally in the
`SingleFileBlockManager`. With optimistic writes, we can write to blocks
in the storage layer (i.e. make them no longer free blocks). However, if
a checkpoint happens at a pinned commit, any optimistic writes that
happen after the commit is pinned do not belong to that checkpoint. If
we don't write these blocks in the free block list, we might get
dangling blocks in case an abort or rollback happens. However, the
blocks are not actually free in-memory, as they are being used by the
optimistically written data.

In order to solve this issue we introduce a new set in the block
manager, `newly_used_blocks`. This tracks blocks that are in use, but
are not yet part of a given checkpoint.
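A toy model of this bookkeeping (illustrative, not DuckDB's `SingleFileBlockManager`): blocks grabbed by optimistic writes after the commit is pinned cannot be handed out again in memory, but must still appear in the free list that the pinned checkpoint serializes, since their data is not part of that checkpoint.

```python
# Toy model of the `newly_used_blocks` idea described above.

class BlockManager:
    def __init__(self, free_blocks):
        self.free_blocks = set(free_blocks)   # free on disk and in memory
        self.newly_used_blocks = set()        # in use, but not in this checkpoint

    def allocate_optimistic(self):
        """Allocate a block for an optimistic write while a checkpoint runs."""
        block = self.free_blocks.pop()
        self.newly_used_blocks.add(block)
        return block

    def checkpoint_free_list(self):
        """Free list to serialize with the pinned checkpoint: newly used blocks
        are included, because the checkpoint does not own their data."""
        return self.free_blocks | self.newly_used_blocks

    def finish_checkpoint(self):
        """After the checkpoint completes, the blocks are genuinely in use."""
        self.newly_used_blocks.clear()

mgr = BlockManager({1, 2, 3})
b = mgr.allocate_optimistic()
assert b not in mgr.free_blocks         # never handed out twice in memory
assert b in mgr.checkpoint_free_list()  # but the checkpoint still lists it free
mgr.finish_checkpoint()
assert b not in mgr.checkpoint_free_list()
```

This way, an abort or rollback after the pinned commit cannot leave dangling blocks: the checkpoint's free list never claims ownership of data it did not write.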

### Checkpoint WAL

New commits that happen while checkpointing have to be written
somewhere. In order to still allow for the checkpoint to truncate the
WAL to prevent it from growing indefinitely, we introduce the concept of
a **checkpoint WAL**. This is a secondary WAL that can be written to
only by concurrent commits while a checkpoint is happening.

When a checkpoint is started, the checkpoint flag is written to the
original WAL. The checkpoint flag contains the root metadata block
pointer that will be written **when the checkpoint is successful**.


```
main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]
```

The checkpoint flag allows us to, during recovery, figure out if a
checkpoint was completed or if the checkpoint was not completed. This
determines if we need to replay the WAL. This is already done in the
current version to deal with a crash between flipping the root block
pointer and truncating the WAL, however, in the new version this happens
before **any** data is written instead of only happening at the end.

After this is written, we set any new commits to write to the checkpoint
WAL. For example, assume a new commit comes in that inserts some data.
We will now have the following situation:

```
main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

After the checkpoint is finished, we have flushed all changes in
`main.db.wal` to the main database file, while the changes in
`main.db.checkpoint.wal` have not been flushed. All we need to do is
move over the checkpoint WAL and have it replace the original WAL. This
will lead us to the following final result after the checkpoint:

```
main.db.wal
[INSERT #2][COMMIT]
```


#### Recovery

In order to provide ACID compliance all commits that have succeeded must
be persisted even across failures. That means that any commits that are
written to the checkpoint WAL need to be persisted no matter where we
crash. Below is a list of failure modes:

###### Crash Before Checkpoint Complete

Our situation is like this:

```
main.db
[ROOT #1]

main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

In order to recover in this situation, we need to replay both
`main.db.wal` and `main.db.checkpoint.wal`. The recovery process sees
that the checkpoint root does not match the root in the database, and
now also checks for the presence of a checkpoint WAL. It then replays
them in order (`main.db.wal` -> `main.db.checkpoint.wal`).

If this is a `READ_WRITE` connection it merges the two WALs **except for
the checkpoint node** by writing a new WAL that contains the content of
both WALs:

```
main.db.recovery.wal
[INSERT #1][COMMIT][INSERT #2][COMMIT]
```

After that completes, it overwrites the main WAL with the recovery WAL.
Finally, it removes the checkpoint WAL.

```
mv main.db.recovery.wal main.db.wal
rm main.db.checkpoint.wal
```

###### Crash During Recovery

If we crash during the above recovery process (after mv, before rm) we
would have this situation:

```
main.db
[ROOT #1]

main.db.wal
[INSERT #1][COMMIT][INSERT duckdb#2][COMMIT]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

This is safe to recover from because `main.db.wal` does not contain a
`CHECKPOINT` node. As such, we will not replay the checkpoint WAL, and
only `main.db.wal` will be replayed.

###### Crash After Checkpoint Complete, Before WAL Move

Our situation is like this:

```
main.db
[ROOT #2]

main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

In order to recover in this situation, we only need to replay
`main.db.checkpoint.wal`. The recovery process sees that the checkpoint
root matches the root in the database, so it knows it does not need to
replay `main.db.wal`. It then checks for the presence of the checkpoint
WAL and, since it is present, replays it.

If this is a `READ_WRITE` connection it then completes the checkpoint by
finalizing the move (i.e. `mv main.db.checkpoint.wal main.db.wal`).
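Putting the failure modes together, the recovery decision can be
sketched as a small decision tree. This is an illustrative Python sketch
(the record representation and file names are made up for the example,
not DuckDB's actual structures):

```python
def plan_recovery(db_root, wal_records, checkpoint_wal_exists):
    """Decide which WALs to replay on startup, in order.
    `wal_records` is the parsed main WAL; a CHECKPOINT record carries the
    root it was checkpointing towards."""
    checkpoint = next((r for r in wal_records if r["type"] == "CHECKPOINT"), None)
    if checkpoint is None:
        # No checkpoint in flight (this also covers a crash during recovery,
        # after the recovery WAL replaced the main WAL): replay only the
        # main WAL, ignore any leftover checkpoint WAL.
        return ["main.db.wal"]
    if checkpoint["new_root"] == db_root:
        # Checkpoint completed but the WAL move did not: the main WAL's
        # changes are already in the database file.
        return ["main.db.checkpoint.wal"] if checkpoint_wal_exists else []
    # Checkpoint did not complete: replay both, main WAL first.
    replay = ["main.db.wal"]
    if checkpoint_wal_exists:
        replay.append("main.db.checkpoint.wal")
    return replay
```

Each branch corresponds to one of the crash scenarios above: crash
before checkpoint complete, crash during recovery, and crash after
checkpoint complete but before the WAL move.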


### Other Changes / Fixes


#### Windows: make `FileSystem::MoveFile` behave like Linux/macOS

On Linux/macOS, `MoveFile` is used to mean "move and overwrite the
target file". On Windows, this would previously fail if the target
already exists. This PR makes Windows behave like Linux/macOS by passing
`MOVEFILE_REPLACE_EXISTING` to `MoveFileExW`. In addition, because we
tend to use `MoveFile` to mean "we want to be certain this file was
moved", we also enable the `MOVEFILE_WRITE_THROUGH` flag.
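The same platform split exists in Python's standard library, which makes
for a quick illustration of the semantics (file names here are
arbitrary): `os.rename` overwrites an existing target on POSIX but
raises on Windows, while `os.replace` overwrites on both, which is the
behaviour `MoveFile` now has everywhere.

```python
import os

# A source file and a pre-existing target file.
with open("src.tmp", "w") as f:
    f.write("new contents")
with open("dst.tmp", "w") as f:
    f.write("old contents")

# os.rename would raise FileExistsError on Windows because the target
# exists; os.replace overwrites on POSIX and Windows alike.
os.replace("src.tmp", "dst.tmp")
```

After this, `dst.tmp` contains the source's contents and `src.tmp` is
gone.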


#### SQLLogicTest

While testing this PR, I realized the sqllogictest runner was swallowing
exceptions thrown in certain locations and incorrectly reporting tests
that should have failed as successful. This PR fixes that: these
exceptions now fail the test run. This revealed a bunch of failing
tests, in particular around the config runners `peg_parser.json` and
`encryption.json`, and a few tests in `httpfs`. A few tests were fixed,
but others were skipped in the config, to be looked at in the future.

lnkuiper added a commit that referenced this pull request Dec 11, 2025
…CHAR in `shell_renderer.cpp` (duckdb#20096)

Hi DuckDB Team,

When building the main branch of DuckDB on Linux with my `adbc_scanner`
extension, I often encountered this link error in CI. This PR resolves
it by using `LogicalType(LogicalTypeId::VARCHAR)` instead of
`LogicalType::VARCHAR`. The changed code is in `shell_renderer.cpp`, so
it is not anywhere in my extension code. With this change the
compilation succeeds.

Build error:

```
[2/3] Linking CXX executable duckdb
FAILED: duckdb 
: && /usr/bin/c++ -g -g -O0 -DDEBUG -Wall    -fsanitize=address -fsanitize=undefined -fno-sanitize-recover=all -Wunused -Werror=vla -Wnarrowing -pedantic  tools/shell/linenoise/CMakeFiles/duckdb_linenoise.dir/highlighting.cpp.o tools/shell/linenoise/CMakeFiles/duckdb_linenoise.dir/history.cpp.o tools/shell/linenoise/CMakeFiles/duckdb_linenoise.dir/linenoise.cpp.o tools/shell/linenoise/CMakeFiles/duckdb_linenoise.dir/linenoise-c.cpp.o tools/shell/linenoise/CMakeFiles/duckdb_linenoise.dir/rendering.cpp.o tools/shell/linenoise/CMakeFiles/duckdb_linenoise.dir/terminal.cpp.o extension/CMakeFiles/duckdb_generated_extension_loader.dir/__/codegen/src/generated_extension_loader.cpp.o tools/shell/CMakeFiles/shell.dir/shell_command_line_option.cpp.o tools/shell/CMakeFiles/shell.dir/shell_extension.cpp.o tools/shell/CMakeFiles/shell.dir/shell.cpp.o tools/shell/CMakeFiles/shell.dir/shell_helpers.cpp.o tools/shell/CMakeFiles/shell.dir/shell_metadata_command.cpp.o tools/shell/CMakeFiles/shell.dir/shell_prompt.cpp.o tools/shell/CMakeFiles/shell.dir/shell_renderer.cpp.o tools/shell/CMakeFiles/shell.dir/shell_highlight.cpp.o tools/shell/CMakeFiles/shell.dir/shell_progress_bar.cpp.o tools/shell/CMakeFiles/shell.dir/shell_render_table_metadata.cpp.o tools/shell/CMakeFiles/shell.dir/shell_windows.cpp.o -o duckdb  src/libduckdb_static.a  extension/adbc_scanner/libadbc_scanner_extension.a  extension/core_functions/libcore_functions_extension.a  extension/parquet/libparquet_extension.a  extension/jemalloc/libjemalloc_extension.a  third_party/utf8proc/libduckdb_utf8proc.a  vcpkg_installed/x64-linux/debug/lib/libadbc_driver_manager.a  vcpkg_installed/x64-linux/debug/lib/libtomlplusplus.a  vcpkg_installed/x64-linux/debug/lib/libnanoarrow_static.a  src/libduckdb_static.a  -ldl && :
/usr/bin/ld: src/libduckdb_static.a(ub_duckdb_common.cpp.o):(.rodata+0x4460): multiple definition of `duckdb::LogicalType::VARCHAR'; tools/shell/CMakeFiles/shell.dir/shell_renderer.cpp.o:(.rodata._ZN6duckdb11LogicalType7VARCHARE[_ZN6duckdb11LogicalType7VARCHARE]+0x0): first defined here
collect2: error: ld returned 1 exit status
```

Distro information: Amazon Linux 2023 on AWS

Linux ip-172-16-3-91.ec2.internal 6.1.158-178.288.amzn2023.x86_64 #1 SMP
PREEMPT_DYNAMIC Mon Nov 3 18:38:36 UTC 2025 x86_64 x86_64 x86_64
GNU/Linux

```
[ec2-user@ip-172-16-3-91 duckdb]$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-amazon-linux/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-amazon-linux
Configured with: ../configure --enable-bootstrap --enable-host-pie --enable-host-bind-now --enable-languages=c,c++,fortran,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://github.com/amazonlinux/amazon-linux-2022 --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-11.5.0-20240719/obj-x86_64-amazon-linux/isl-install --enable-multilib --with-linker-hash-style=gnu --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_64=x86-64-v2 --with-arch_32=x86-64 --build=x86_64-amazon-linux --with-build-config=bootstrap-lto --enable-link-serialization=1
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.5.0 20240719 (Red Hat 11.5.0-5) (GCC) 
```

I only knew how to resolve it because I have seen the same problem, and
this fix pattern, in other extensions.

You can see a full build failure here:


https://github.com/Query-farm/adbc_scanner/actions/runs/20048370767/job/57498881373

Thanks,

Rusty
lnkuiper added a commit that referenced this pull request Dec 15, 2025
This PR improves the `CommonSubplanOptimizer` and should put it in
"maintenance mode", at least for now; I think I'm done improving it for
a while.

## Multiple Nested Matching

This PR allows multiple subplans to be matched, rather than just a
single subplan, and allows nesting. Take the following query, for
example:

```sql
explain
select distinct range from range(10)
union all
select distinct range from range(10)
union all
select range % 2 as range from (select distinct range from range(10)) group by range
union all
select range % 2 as range from (select distinct range from range(10)) group by range
union all
select count(*) from (select range % 2 as range from (select distinct range from range(10)) group by range)
union all
select count(*) from (select range % 2 as range from (select distinct range from range(10)) group by range);
```

This query unions 6 queries together. Each query occurs twice, and is
nested in the next query. With the optimizer disabled, this yields the
following plan:
```
┌───────────────────────────┐
│           UNION           ├──────────────┬────────────────────────────┬────────────────────────────┬────────────────────────────┬────────────────────────────┐
└─────────────┬─────────────┘              │                            │                            │                            │                            │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│       HASH_GROUP_BY       ││       HASH_GROUP_BY       ││         PROJECTION        ││         PROJECTION        ││    UNGROUPED_AGGREGATE    ││    UNGROUPED_AGGREGATE    │
│    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
│         Groups: #0        ││         Groups: #0        ││           range           ││           range           ││        Aggregates:        ││        Aggregates:        │
│                           ││                           ││                           ││                           ││        count_star()       ││        count_star()       │
│          ~10 rows         ││          ~10 rows         ││          ~6 rows          ││          ~6 rows          ││                           ││                           │
└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        ││       HASH_GROUP_BY       ││       HASH_GROUP_BY       ││         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
│           range           ││           range           ││         Groups: #0        ││         Groups: #0        ││             42            ││             42            │
│                           ││                           ││                           ││                           ││                           ││                           │
│          ~10 rows         ││          ~10 rows         ││          ~6 rows          ││          ~6 rows          ││          ~6 rows          ││          ~6 rows          │
└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│           RANGE           ││           RANGE           ││         PROJECTION        ││         PROJECTION        ││       HASH_GROUP_BY       ││       HASH_GROUP_BY       │
│    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
│      Function: RANGE      ││      Function: RANGE      ││           range           ││           range           ││         Groups: #0        ││         Groups: #0        │
│                           ││                           ││                           ││                           ││                           ││                           │
│          ~10 rows         ││          ~10 rows         ││          ~10 rows         ││          ~10 rows         ││          ~6 rows          ││          ~6 rows          │
└───────────────────────────┘└───────────────────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
                                                          ┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                          │       HASH_GROUP_BY       ││       HASH_GROUP_BY       ││         PROJECTION        ││         PROJECTION        │
                                                          │    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
                                                          │         Groups: #0        ││         Groups: #0        ││           range           ││           range           │
                                                          │                           ││                           ││                           ││                           │
                                                          │          ~10 rows         ││          ~10 rows         ││          ~10 rows         ││          ~10 rows         │
                                                          └─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
                                                          ┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                          │         PROJECTION        ││         PROJECTION        ││       HASH_GROUP_BY       ││       HASH_GROUP_BY       │
                                                          │    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
                                                          │           range           ││           range           ││         Groups: #0        ││         Groups: #0        │
                                                          │                           ││                           ││                           ││                           │
                                                          │          ~10 rows         ││          ~10 rows         ││          ~10 rows         ││          ~10 rows         │
                                                          └─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
                                                          ┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                          │           RANGE           ││           RANGE           ││         PROJECTION        ││         PROJECTION        │
                                                          │    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
                                                          │      Function: RANGE      ││      Function: RANGE      ││           range           ││           range           │
                                                          │                           ││                           ││                           ││                           │
                                                          │          ~10 rows         ││          ~10 rows         ││          ~10 rows         ││          ~10 rows         │
                                                          └───────────────────────────┘└───────────────────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
                                                                                                                    ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                                                                                    │           RANGE           ││           RANGE           │
                                                                                                                    │    ────────────────────   ││    ────────────────────   │
                                                                                                                    │      Function: RANGE      ││      Function: RANGE      │
                                                                                                                    │                           ││                           │
                                                                                                                    │          ~10 rows         ││          ~10 rows         │
                                                                                                                    └───────────────────────────┘└───────────────────────────┘
```
As we can see, there is a lot of redundancy here. With this PR (and the
optimizer enabled, of course), we get the following plan:
```
┌───────────────────────┐
│          CTE          │
│    ────────────────   │
│       CTE Name:       │
│   __common_subplan_1  │
│                       ├────────────┐
│    Table Index: 79    │            │
│                       │            │
│        ~0 rows        │            │
└───────────┬───────────┘            │
┌───────────┴───────────┐┌───────────┴───────────┐
│     HASH_GROUP_BY     ││          CTE          │
│    ────────────────   ││    ────────────────   │
│       Groups: #0      ││       CTE Name:       │
│                       ││   __common_subplan_2  │
│                       ││                       ├────────────┐
│                       ││    Table Index: 86    │            │
│                       ││                       │            │
│        ~10 rows       ││        ~0 rows        │            │
└───────────┬───────────┘└───────────┬───────────┘            │
┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐
│       PROJECTION      ││     HASH_GROUP_BY     ││          CTE          │
│    ────────────────   ││    ────────────────   ││    ────────────────   │
│         range         ││       Groups: #0      ││       CTE Name:       │
│                       ││                       ││   __common_subplan_3  │
│                       ││                       ││                       ├────────────┐
│                       ││                       ││    Table Index: 91    │            │
│                       ││                       ││                       │            │
│        ~10 rows       ││        ~6 rows        ││        ~0 rows        │            │
└───────────┬───────────┘└───────────┬───────────┘└───────────┬───────────┘            │
┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐
│         RANGE         ││       PROJECTION      ││  UNGROUPED_AGGREGATE  ││         UNION         │
│    ────────────────   ││    ────────────────   ││    ────────────────   ││                       │
│    Function: RANGE    ││         range         ││      Aggregates:      ││                       ├────────────┬────────────────────────┬────────────────────────┬────────────────────────┬────────────────────────┐
│                       ││                       ││      count_star()     ││                       │            │                        │                        │                        │                        │
│        ~10 rows       ││        ~10 rows       ││                       ││                       │            │                        │                        │                        │                        │
└───────────────────────┘└───────────┬───────────┘└───────────┬───────────┘└───────────┬───────────┘            │                        │                        │                        │                        │
                         ┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐
                         │        CTE_SCAN       ││       PROJECTION      ││        CTE_SCAN       ││        CTE_SCAN       ││       PROJECTION      ││       PROJECTION      ││        CTE_SCAN       ││        CTE_SCAN       │
                         │    ────────────────   ││    ────────────────   ││    ────────────────   ││    ────────────────   ││    ────────────────   ││    ────────────────   ││    ────────────────   ││    ────────────────   │
                         │     CTE Index: 79     ││           42          ││     CTE Index: 79     ││     CTE Index: 79     ││         range         ││         range         ││     CTE Index: 91     ││     CTE Index: 91     │
                         │                       ││                       ││                       ││                       ││                       ││                       ││                       ││                       │
                         │        ~10 rows       ││        ~6 rows        ││        ~10 rows       ││        ~10 rows       ││        ~6 rows        ││        ~6 rows        ││         ~1 row        ││         ~1 row        │
                         └───────────────────────┘└───────────┬───────────┘└───────────────────────┘└───────────────────────┘└───────────┬───────────┘└───────────┬───────────┘└───────────────────────┘└───────────────────────┘
                                                  ┌───────────┴───────────┐                                                  ┌───────────┴───────────┐┌───────────┴───────────┐
                                                  │        CTE_SCAN       │                                                  │        CTE_SCAN       ││        CTE_SCAN       │
                                                  │    ────────────────   │                                                  │    ────────────────   ││    ────────────────   │
                                                  │     CTE Index: 86     │                                                  │     CTE Index: 86     ││     CTE Index: 86     │
                                                  │                       │                                                  │                       ││                       │
                                                  │        ~6 rows        │                                                  │        ~6 rows        ││        ~6 rows        │
                                                  └───────────────────────┘                                                  └───────────────────────┘└───────────────────────┘
```
Which has 3 CTEs, 8 CTE scans, and only 3 aggregations (instead of the
original 12!).
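Conceptually, finding duplicated subplans comes down to structural
hashing of plan subtrees. Below is a toy Python sketch of that idea,
under heavy simplification: a plan node is just a `(name, children)`
tuple, whereas DuckDB's actual optimizer works on `LogicalOperator`
trees and also has to compare expressions, table bindings, and more:

```python
from collections import defaultdict

def find_common_subplans(plan):
    """Return groups of structurally identical subtrees that occur more
    than once. Each node is a (name, children) tuple."""
    seen = defaultdict(list)

    def visit(node):
        name, children = node
        # The structural key of a node is its name plus its children's keys.
        key = (name, tuple(visit(child) for child in children))
        seen[key].append(node)
        return key

    visit(plan)
    return [nodes for nodes in seen.values() if len(nodes) > 1]

# Mimicking the example query: the same DISTINCT subplan occurs directly
# twice in the UNION and once more nested under a projection.
distinct = ("HASH_GROUP_BY", [("PROJECTION", [("RANGE", [])])])
union = ("UNION", [distinct, distinct, ("PROJECTION", [distinct])])
dups = find_common_subplans(union)
```

In a real pass, the largest duplicated subtree would then be
materialized as a CTE and its occurrences replaced with CTE scans, which
is what the plan above shows.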

## Fuzzy Plan Matching

Something that can show up in query plans is an "almost exact" subplan
match. Before this PR, an exact match was required. With this PR, we can
do "fuzzy" matching, where the plans are mostly the same, save for the
selected columns. Take the following query, for example:

```sql
-- Create build table
create table build as
select range a, range * 2 b, range * 3 c, range * 4 d
from (select range::utinyint as range from range(11))
where range % 5 = 0;
-- Create probe table
create table probe as
select range e, range * 2 f, range * 3 g, range * 4 h
from (select range::utinyint as range from range(11));
-- View that joins the two
create view my_view as
from probe join build on (build.a = probe.e);
-- Select greatest of all columns, unioned with greatest of just two columns
explain
select greatest(a, b, c, d, e, f, g, h) from my_view
union all
select greatest(b, f) from my_view;
```
Here, one of the unioned queries selects all columns, and the other
selects just two. As we can see, the join (coming from the view) in the
second query is "contained" in the join in the first query, as the first
query selects all columns that the second query needs.
We currently get the following plan (the optimizer doesn't trigger):
```
┌───────────────────────────┐
│           UNION           ├───────────────────────────────────────────┐
└─────────────┬─────────────┘                                           │
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│         PROJECTION        │                             │         PROJECTION        │
│    ────────────────────   │                             │    ────────────────────   │
│ greatest(a, b, c, d, e, f,│                             │       greatest(b, f)      │
│            g, h)          │                             │                           │
│                           │                             │                           │
│          ~3 rows          │                             │          ~3 rows          │
└─────────────┬─────────────┘                             └─────────────┬─────────────┘
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│         PROJECTION        │                             │         HASH_JOIN         │
│    ────────────────────   │                             │    ────────────────────   │
│             e             │                             │      Join Type: INNER     │
│             f             │                             │     Conditions: e = a     │
│             g             │                             │                           │
│             h             │                             │                           │
│             a             │                             │                           ├──────────────┐
│             b             │                             │                           │              │
│             c             │                             │                           │              │
│             d             │                             │                           │              │
│                           │                             │                           │              │
│          ~3 rows          │                             │          ~3 rows          │              │
└─────────────┬─────────────┘                             └─────────────┬─────────────┘              │
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         HASH_JOIN         │                             │         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   │                             │    ────────────────────   ││    ────────────────────   │
│      Join Type: INNER     │                             │        Table: probe       ││        Table: build       │
│     Conditions: e = a     │                             │   Type: Sequential Scan   ││   Type: Sequential Scan   │
│                           │                             │                           ││                           │
│                           ├──────────────┐              │        Projections:       ││        Projections:       │
│                           │              │              │             e             ││             a             │
│                           │              │              │             f             ││             b             │
│                           │              │              │                           ││                           │
│          ~3 rows          │              │              │          ~11 rows         ││          ~3 rows          │
└─────────────┬─────────────┘              │              └───────────────────────────┘└───────────────────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   ││    ────────────────────   │
│        Table: probe       ││        Table: build       │
│   Type: Sequential Scan   ││   Type: Sequential Scan   │
│                           ││                           │
│        Projections:       ││        Projections:       │
│             e             ││             a             │
│             f             ││             b             │
│             g             ││             c             │
│             h             ││             d             │
│                           ││                           │
│          ~11 rows         ││          ~3 rows          │
└───────────────────────────┘└───────────────────────────┘
```

With the improvements to the optimizer in this PR, we now get this plan:
```
┌───────────────────────────┐
│            CTE            │
│    ────────────────────   │
│         CTE Name:         │
│     __common_subplan_1    │
│                           ├───────────────────────────────────────────┐
│      Table Index: 29      │                                           │
│                           │                                           │
│          ~0 rows          │                                           │
└─────────────┬─────────────┘                                           │
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│         HASH_JOIN         │                             │           UNION           │
│    ────────────────────   │                             │                           │
│      Join Type: INNER     │                             │                           │
│     Conditions: e = a     ├──────────────┐              │                           ├──────────────┐
│                           │              │              │                           │              │
│          ~3 rows          │              │              │                           │              │
└─────────────┬─────────────┘              │              └─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          ││         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
│        Table: probe       ││        Table: build       ││ greatest(a, b, c, d, e, f,││       greatest(b, f)      │
│   Type: Sequential Scan   ││   Type: Sequential Scan   ││            g, h)          ││                           │
│                           ││                           ││                           ││                           │
│        Projections:       ││        Projections:       ││                           ││                           │
│             e             ││             a             ││                           ││                           │
│             f             ││             b             ││                           ││                           │
│             g             ││             c             ││                           ││                           │
│             h             ││             d             ││                           ││                           │
│                           ││                           ││                           ││                           │
│          ~11 rows         ││          ~3 rows          ││          ~3 rows          ││          ~3 rows          │
└───────────────────────────┘└───────────────────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
                                                          ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                          │         PROJECTION        ││         PROJECTION        │
                                                          │    ────────────────────   ││    ────────────────────   │
                                                          │             e             ││             #1            │
                                                          │             f             ││             #4            │
                                                          │             g             ││                           │
                                                          │             h             ││                           │
                                                          │             a             ││                           │
                                                          │             b             ││                           │
                                                          │             c             ││                           │
                                                          │             d             ││                           │
                                                          │                           ││                           │
                                                          │          ~3 rows          ││          ~0 rows          │
                                                          └─────────────┬─────────────┘└─────────────┬─────────────┘
                                                          ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                          │          CTE_SCAN         ││          CTE_SCAN         │
                                                          │    ────────────────────   ││    ────────────────────   │
                                                          │       CTE Index: 29       ││       CTE Index: 29       │
                                                          │                           ││                           │
                                                          │          ~3 rows          ││          ~3 rows          │
                                                          └───────────────────────────┘└───────────────────────────┘
```
As we can see, the join is materialized as a CTE, and scanned twice.
Columns that aren't needed are projected out after the CTE scan.

## Benchmark Improvements

With the changes, we now trigger more subplan elimination on TPC-DS and
TPC-H. Here are some results that I collected on my laptop.

TPC-DS SF100 improvements:
Q61: 0.61s -> 0.33s (~1.8x)
Q70: 0.81s -> 0.55s (~1.5x)

TPC-H SF100 improvements:
Q11: 0.16s -> 0.12s (~1.3x)

## Notes

I needed a lot of indirection to get the column bindings to match with
the fuzzy plan matching, which required a lot of `unordered_map`s. To
avoid doing so many allocations, I've also implemented
`arena_unordered_map`, so that the number of allocations grows
logarithmically. I'm not sure how much this helped, but it's
something I wanted to do since we are trying to reduce allocations.
Overall, this optimizer now takes ~10% of total optimization time.
lnkuiper pushed a commit that referenced this pull request Dec 29, 2025
…0283)

Fix for: duckdblabs/duckdb-internal#6809, duckdb#20086

I would like someone to take a look at this before I run CI, to see if
the fix makes sense.

In ConstantOrNullFunction, there is a bug: if the first loop
iteration is a FLAT_VECTOR, the result validity mask is created as a
reference to the validity mask of args.data[idx]. If a subsequent
iteration takes the default branch (say, a DICTIONARY_VECTOR), and we call
result_mask.SetInvalid(i), we are now overwriting the validity mask of
the first input column that the reference was created from.

I believe the fix for this is to call EnsureWritable in the FLAT_VECTOR
case, to make sure the validity mask is not a reference to the input's
validity mask before we call

```cpp
result_mask.Combine(input_mask, args.size()) 
```
(which is where the alias is actually created). 

The reproducer hits this case -- a specific combination of unique index +
update + no checkpointing was leading to this scenario.

For reference, here is the query plan of the last query in the
reproducer, where the bug was occurring. The t1.c0 column is being passed
as a FLAT_VECTOR to ConstantOrNullFunction, and the t0.c1 column is
being passed in as a dictionary vector. Since the argument at index 1 in
ConstantOrNullFunction is the c0 column in the output, we were
writing NULLs into the output, because the filter was overwriting the
validity mask inside ConstantOrNullFunction:

```
┌───────┬───────┬───────┐
│  c0   │  c0   │  c1   │
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│  NULL │     1 │  NULL │
│  NULL │    -1 │  NULL │
└───────┴───────┴───────┘
```

Whereas it should be: 

```
┌───────┬───────┬───────┐
│  c0   │  c0   │  c1   │
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│     0 │     1 │  NULL │
│  NULL │    -1 │  NULL │
└───────┴───────┴───────┘
```

```
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││               Total Time: 9.18s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             c0            │
│             c0            │
│             c1            │
│                           │
│           2 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             #3            │
│             #7            │
│            #11            │
│                           │
│           2 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│  (constant_or_null(false, │
│      c0, c1) IS NULL)     │
│                           │
│           2 rows          │
│          (1.82s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            NULL           │
│             #6            │
│            NULL           │
│             #5            │
│            NULL           │
│             #4            │
│            NULL           │
│             #3            │
│            NULL           │
│             #2            │
│            NULL           │
│             #1            │
│            NULL           │
│             #0            │
│            NULL           │
│                           │
│           2 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            NULL           │
│             #2            │
│            NULL           │
│             #1            │
│            NULL           │
│             #0            │
│            NULL           │
│                           │
│           2 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      POSITIONAL_SCAN      │
│    ────────────────────   │
│           2 rows          ├──────────────┐
│          (7.30s)          │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         TABLE_SCAN        ││         TABLE_SCAN        │
│    ────────────────────   ││    ────────────────────   │
│         Table: t1         ││         Table: t0         │
│   Type: Sequential Scan   ││   Type: Sequential Scan   │
│      Projections: c0      ││                           │
│                           ││        Projections:       │
│                           ││             c1            │
│                           ││             c0            │
│                           ││                           │
│           0 rows          ││           0 rows          │
│          (0.00s)          ││          (0.00s)          │
└───────────────────────────┘└───────────────────────────┘
```