Fix underspecified ROW_NUMBER() OVER(...) clause in test #1

Merged
lnkuiper merged 1 commit into lnkuiper:16989 from kryonix:fix_test_fib_complex on Aug 11, 2025
Conversation

@kryonix commented Aug 11, 2025

There was an underspecified ROW_NUMBER() window function, which could produce correct but non-deterministic query results.
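Why an underspecified window clause is a problem can be sketched outside SQL. The following Python sketch (purely illustrative, not DuckDB code) numbers rows after sorting by a key with ties, the analogue of `ROW_NUMBER() OVER (ORDER BY key)` where `key` is not unique: the numbering then depends on the physical input order, unless a unique tiebreaker is added.

```python
# Illustrative sketch: ROW_NUMBER() over a non-unique ordering key is
# non-deterministic, because rows that tie on the key can come out in any order.

def row_number(rows, order_key):
    """Assign 1-based row numbers after sorting by order_key (stable sort)."""
    ranked = sorted(rows, key=order_key)
    return {row: i + 1 for i, row in enumerate(ranked)}

# Two valid physical orders of the same data; 'a' and 'b' tie on key=1.
data1 = [("a", 1), ("b", 1), ("c", 2)]
data2 = [("b", 1), ("a", 1), ("c", 2)]

# Ordering by the tying column alone: the row number of 'a' depends on input order.
by_key = lambda r: r[1]
assert row_number(data1, by_key)[("a", 1)] != row_number(data2, by_key)[("a", 1)]

# Adding a unique tiebreaker (here: the first column) makes the result deterministic.
by_key_tiebreak = lambda r: (r[1], r[0])
assert row_number(data1, by_key_tiebreak) == row_number(data2, by_key_tiebreak)
```

The fix in this PR is the SQL equivalent of the tiebreaker: fully specifying the ORDER BY inside the OVER(...) clause.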

@lnkuiper lnkuiper merged commit b65e0db into lnkuiper:16989 Aug 11, 2025
lnkuiper pushed a commit that referenced this pull request Aug 18, 2025
I have seen this crashing due to an invalid pointer on which a destructor
is called, on last night's `main` (`2ed9bf887f`), using a unittester
compiled from sources (clang 17) and extensions installed from the
default extension repository.

Basically:
```
DYLD_INSERT_LIBRARIES=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/17/lib/darwin/libclang_rt.asan_osx_dynamic.dylib LOCAL_EXTENSION_REPO=http://extensions.duckdb.org ./build/release/test/unittest --autoloading all --skip-compiled  --order rand test/parquet/test_parquet_schema.test
```
and seeing runtime sanitizer errors such as:
```
==56046==ERROR: AddressSanitizer: container-overflow on address 0x6100000d4dcf at pc 0x000116c7f450 bp 0x00016fc1d170 sp 0x00016fc1d168
READ of size 1 at 0x6100000d4dcf thread T0
    #0 0x000116c7f44c in std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>* std::__1::__uninitialized_allocator_copy_impl[abi:ne190102]<std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*>(std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*)+0x318 (parquet.duckdb_extension:arm64+0xab44c)
    #1 0x000116c7ec90 in void std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__construct_at_end<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, unsigned long)+0x160 (parquet.duckdb_extension:arm64+0xaac90)
    #2 0x000116c7e7d8 in void std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__assign_with_size[abi:ne190102]<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*, long)+0x1e0 (parquet.duckdb_extension:arm64+0xaa7d8)
    #3 0x000116e8cd48 in duckdb::ParquetMultiFileInfo::BindReader(duckdb::ClientContext&, duckdb::vector<duckdb::LogicalType, true>&, duckdb::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, true>&, duckdb::MultiFileBindData&)+0xf18 (parquet.duckdb_extension:arm64+0x2b8d48)
    #4 0x000116e6e5fc in duckdb::MultiFileFunction<duckdb::ParquetMultiFileInfo>::MultiFileBindInternal(duckdb::ClientContext&, duckdb::unique_ptr<duckdb::MultiFileReader, std::__1::default_delete<duckdb::MultiFileReader>, true>, duckdb::shared_ptr<duckdb::MultiFileList, true>, duckdb::vector<duckdb::LogicalType, true>&, duckdb::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, true>&, duckdb::MultiFileOptions, duckdb::unique_ptr<duckdb::BaseFileReaderOptions, std::__1::default_delete<duckdb::BaseFileReaderOptions>, true>, duckdb::unique_ptr<duckdb::MultiFileReaderInterface, std::__1::default_delete<duckdb::MultiFileReaderInterface>, true>)+0x1210 (parquet.duckdb_extension:arm64+0x29a5fc)
```

or these failures while using DuckLake:
```
==56079==ERROR: AddressSanitizer: container-overflow on address 0x616000091a78 at pc 0x0001323fc81c bp 0x00016bd0e890 sp 0x00016bd0e888
READ of size 8 at 0x616000091a78 thread T2049
    #0 0x0001323fc818 in duckdb::MultiFileColumnDefinition::~MultiFileColumnDefinition()+0x258 (parquet.duckdb_extension:arm64+0x2a4818)
    #1 0x0001323fb588 in std::__1::vector<duckdb::MultiFileColumnDefinition, std::__1::allocator<duckdb::MultiFileColumnDefinition>>::__destroy_vector::operator()[abi:ne190102]()+0x98 (parquet.duckdb_extension:arm64+0x2a3588)
    #2 0x0001324a09e4 in duckdb::BaseFileReader::~BaseFileReader()+0x2bc (parquet.duckdb_extension:arm64+0x3489e4)
    #3 0x0001324a23ec in duckdb::ParquetReader::~ParquetReader()+0x22c (parquet.duckdb_extension:arm64+0x34a3ec)
```

With these changes, once the `parquet` extension is built by CI, this
works as expected.

I am not sure whether the fix could or should live elsewhere.
lnkuiper pushed a commit that referenced this pull request Sep 3, 2025
…groups, and limiting vacuum operations for the last number of row groups (duckdb#18829)

When performing a checkpoint, we rewrite columns of row groups that have
had modifications. When performing insertions, all columns are modified,
thus all columns of a row group must be rewritten. As a result, any
insertion followed by a checkpoint triggers a full rewrite of the last
row group.

When dealing with wide tables or string columns with large strings,
rewriting a row group can be relatively expensive. This is particularly
the case when only small insertions have been made. For example,
inserting a single row followed by checkpointing can become expensive as
a result of triggering a rewrite of the last row group:

```sql
create table z as
select repeat(sha256(i::varchar), 10) as a,
       repeat(sha256((i**2)::varchar), 10) as b,
       repeat(sha256((i**3)::varchar), 10) as c
from range(100000) r(i);
CHECKPOINT;
.timer on
insert into z values ('a','b','c');
Run Time (s): real 0.006 user 0.004249 sys 0.001186
CHECKPOINT;
Run Time (s): real 0.611 user 1.163009 sys 0.054814
```

The checkpoint takes 0.6s, despite us inserting only a single row.

#### Appending a new row group

This PR solves this problem by appending a new row group when writing to
a table that has been persisted/checkpointed. As a result, we will no
longer modify the (already checkpointed) data. As a trade-off, we end up
with more (smaller) row groups.

#### Vacuuming
As part of vacuuming, we merge adjacent row groups if they fit within
the row group size. As part of this PR - we restrict merging for the
last few row groups. Otherwise, we would end up with the same problem.
We would create row groups like so:

```
100K rows
1 row
```

Then merge them back into a single row group with size `100K+1`,
effectively performing the same rewrite as before.

Instead, for the last row groups in a file, we need a *minimum
threshold* to merge. This is either the `row group size` itself, or 2X
the size of the first row group we are considering. We also always merge
row groups if the total size is `< 2048` (the vector size). This
exponential merging ensures we don't merge on every single insert, while
also ensuring we never have too many row groups.
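The threshold logic above can be sketched as follows. This is a minimal illustration, not DuckDB's actual implementation; the function name and the default row group size of 122,880 rows are assumptions for the sketch.

```python
# Hedged sketch of the merge policy described above: trailing row groups are
# merged only once their combined size reaches a minimum threshold, so a small
# insert does not trigger a re-merge on every checkpoint.

VECTOR_SIZE = 2048  # groups totalling fewer rows than this are always merged

def should_merge(group_sizes, row_group_size=122_880):
    """Decide whether a run of trailing row groups should be merged.

    group_sizes[0] is the first (largest) row group under consideration.
    """
    total = sum(group_sizes)
    if total < VECTOR_SIZE:
        return True  # tiny groups are always merged
    # Minimum threshold: the row group size itself, or 2x the first group.
    threshold = min(row_group_size, 2 * group_sizes[0])
    return total >= threshold

assert not should_merge([8000, 100])           # small tail: left alone
assert should_merge([8000, 4000, 2000, 2000])  # total reaches 2x the first group
assert should_merge([1000, 100])               # below the vector size: always merge
```

Because the threshold doubles with the first group under consideration, merges happen exponentially less often as the tail grows, which matches the row group traces shown below.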

We can see this process in action when starting with 100K rows, and
inserting 100 rows at a time - effectively repeating the below script:

```sql
insert into z select 'a','b','c' from range(100);
checkpoint;
select row_group_id, max(count) from pragma_storage_info('z') group by all order by all;
```

```text
-- insert batch #1
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     100000 │
│            1 │        100 │
└──────────────┴────────────┘

-- insert batch #30
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     100000 │
│            1 │       2000 │
│            2 │       1000 │
└──────────────┴────────────┘
-- insert batch #150
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     100000 │
│            1 │       8000 │
│            2 │       4000 │
│            3 │       2000 │
│            4 │       1000 │
└──────────────┴────────────┘
-- insert batch #228
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     100000 │
│            1 │      16000 │
│            2 │       4000 │
│            3 │       2000 │
│            4 │        800 │
└──────────────┴────────────┘
-- insert batch #229
┌──────────────┬────────────┐
│ row_group_id │ max(count) │
│    int64     │   int64    │
├──────────────┼────────────┤
│            0 │     122800 │
│            1 │        100 │
└──────────────┴────────────┘
```


### Performance

Running the above example again, the checkpoint now completes in `0.005`
seconds, instead of `0.6` seconds.

Note that we still perform the compaction at some point, so while most
checkpoints get faster, not all of them do.

Running the above example 300X, here are the timings:

```sql
create table z as
select repeat(sha256(i::varchar), 10) as a,
       repeat(sha256((i**2)::varchar), 10) as b,
       repeat(sha256((i**3)::varchar), 10) as c
from range(100000) r(i);
-- 300X the below insertion
insert into z select 'a','b','c' from range(100);
CHECKPOINT;
```

|           Summary           | v1.3.2  |  New   |
|-----------------------------|---------|--------|
| Total Time                  | 128.82s | 1.00s  |
| Average Time Per Checkpoint | 0.56s   | 0.01s  |
| Max Checkpoint Time         | 0.631s  | 0.526s |

In the above example, there is still a single checkpoint that performs
the full compaction, and thus has the same timing as the original script
had for every checkpoint.
lnkuiper pushed a commit that referenced this pull request Dec 1, 2025
…uckdb#19680) (duckdb#19811)

Fixes duckdb#19680

This fixes a bug where queries using `NOT EXISTS` with `IS DISTINCT
FROM` returned incorrect results due to improper handling of NULL
semantics in the optimizer.

The issue was that the optimizer's deliminator incorrectly treated
`DISTINCT FROM` variants the same as regular equality/inequality
comparisons, which have different NULL handling:
  - `IS DISTINCT FROM`: NULL-aware (`NULL IS DISTINCT FROM NULL` = FALSE)
  - `!=` or `=`: NULL-unaware (`NULL != NULL` = NULL, filters out NULLs)
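The difference in NULL semantics can be sketched with Python's `None` standing in for SQL NULL (an illustration of three-valued logic, not DuckDB internals):

```python
# Sketch of the NULL-semantics difference: SQL '=' is NULL-unaware (a comparison
# involving NULL yields NULL, i.e. unknown), while IS DISTINCT FROM is NULL-aware
# and always yields TRUE or FALSE.

def sql_eq(a, b):
    """SQL '=': returns None (unknown) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b

def is_distinct_from(a, b):
    """SQL 'IS DISTINCT FROM': treats NULL as a comparable value."""
    if a is None and b is None:
        return False
    if a is None or b is None:
        return True
    return a != b

assert sql_eq(None, None) is None             # NULL = NULL -> NULL (filtered out)
assert is_distinct_from(None, None) is False  # NULL IS DISTINCT FROM NULL -> FALSE
assert is_distinct_from(None, 1) is True
```

Treating `IS DISTINCT FROM` as if it were `!=` therefore silently drops the NULL rows that the NULL-aware comparison is supposed to match, which is exactly the bug in the plan below.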


### Incorrect Query Plan

```
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             c2            │
│                           │
│          ~0 rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             #5            │
│__internal_decompress_integ│
│     ral_integer(#3, 1)    │
│             #1            │
│                           │
│          ~0 rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      NESTED_LOOP_JOIN     │
│    ────────────────────   │
│      Join Type: ANTI      │
│    Conditions: c2 != c2   ├──────────────┐
│                           │              │
│          ~0 rows          │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│            NULL           ││            NULL           │
│             #2            ││             #2            │
│            NULL           ││            NULL           │
│             #1            ││             #1            │
│            NULL           ││            NULL           │
│             #0            ││             #0            │
│            NULL           ││            NULL           │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│             #0            ││             #0            │
│__internal_compress_integra││__internal_compress_integra│
│     l_utinyint(#1, 1)     ││     l_utinyint(#1, 1)     │
│             #2            ││             #2            │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│            NULL           ││            NULL           │
│             #0            ││             #0            │
│            NULL           ││            NULL           │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││           FILTER          │
│    ────────────────────   ││    ────────────────────   │
│         Table: t0         ││     (col0 IS NOT NULL)    │
│   Type: Sequential Scan   ││                           │
│      Projections: c2      ││                           │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└───────────────────────────┘└─────────────┬─────────────┘
                             ┌─────────────┴─────────────┐
                             │         SEQ_SCAN          │
                             │    ────────────────────   │
                             │         Table: t0         │
                             │   Type: Sequential Scan   │
                             │      Projections: c2      │
                             │                           │
                             │          ~2 rows          │
                             └───────────────────────────┘
```

The buggy plan shows two critical issues:
```
  ┌─────────────┴─────────────┐
  │      NESTED_LOOP_JOIN     │
  │      Join Type: ANTI      │
  │    Conditions: c2 != c2   │  ← ❌ Wrong (the join condition should be c2 IS DISTINCT FROM c2)
  │          ~0 rows          │
  └─────────────┬─────────────┘
                │
                └─────────────┐
                             ┌┴─────────────┐
                             │   FILTER     │
                             │ (col0 IS NOT │  ← ❌ Wrong (the filter should be removed)
                             │    NULL)     │
                             └──────────────┘
```

### Solution

This PR adds proper support for DISTINCT FROM operators throughout the
optimization pipeline:

1. Preserve DISTINCT FROM semantics in join conversion (src/optimizer/deliminator.cpp):
```cpp
// NOTE: We should NOT convert DISTINCT FROM to != in general
// Only convert if the ORIGINAL join had != or = (not DISTINCT FROM variants)
if (delim_join.join_type != JoinType::MARK &&
    original_join_comparison != ExpressionType::COMPARE_DISTINCT_FROM &&
    original_join_comparison != ExpressionType::COMPARE_NOT_DISTINCT_FROM) {
    // Safe to convert
}
```
2. Skip NULL filters for DISTINCT FROM variants (src/optimizer/deliminator.cpp):
```cpp
// Only add IS NOT NULL filter for regular equality/inequality comparisons
// Do NOT add for DISTINCT FROM variants, as they handle NULL correctly
if (cond.comparison != ExpressionType::COMPARE_NOT_DISTINCT_FROM &&
    cond.comparison != ExpressionType::COMPARE_DISTINCT_FROM) {
    // Add IS NOT NULL filter
}
```
3. Added negation support for COMPARE_DISTINCT_FROM and COMPARE_NOT_DISTINCT_FROM in expression type handling (src/common/enums/expression_type.cpp).
4. Updated the parser to properly negate IS DISTINCT FROM expressions when wrapped with NOT (src/parser/transform/expression/transform_bool_expr.cpp).
5. Added a regression test in test/sql/subquery/exists/test_correlated_exists_with_derived_table.test.
lnkuiper pushed a commit that referenced this pull request Dec 1, 2025
)

We found this issue when using the Python client (because the `.show()`
method propagates a LIMIT): large limit optimizations were getting in
the way of filter pushdowns. The fix is to push the filter down before
applying the limit whenever a filter is present. The idea is from
@Mytherin; I just added a test.

**Original Optimized logical plan:**

```text
┌─────────────────────────────┐
│┌───────────────────────────┐│
││  Optimized Logical Plan   ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│       Expressions: a      │
│                           │
│          ~0 rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│       Expressions: a      │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│        Expressions:       │
│__internal_decompress_integ│
│     ral_bigint(#0, 0)     │
│             #1            │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│          ORDER_BY         │
│    ────────────────────   │
│           rowid           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│        Expressions:       │
│__internal_compress_integra│
│     l_uinteger(#0, 0)     │
│             #1            │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      COMPARISON_JOIN      │
│    ────────────────────   │
│      Join Type: SEMI      │
│                           ├──────────────┐
│        Conditions:        │              │
│      (rowid = rowid)      │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│          SEQ_SCAN         ││           LIMIT           │
│    ────────────────────   ││    ────────────────────   │
│          Table: t         ││                           │
│   Type: Sequential Scan   ││                           │
└───────────────────────────┘└─────────────┬─────────────┘
                             ┌─────────────┴─────────────┐
                             │          SEQ_SCAN         │
                             │    ────────────────────   │
                             │       Filters: a<50       │
                             │          Table: t         │
                             │   Type: Sequential Scan   │
                             │                           │
                             │       ~400,000 rows       │
                             └───────────────────────────┘
```

Logical plan after this PR:

```text
┌─────────────────────────────┐
│┌───────────────────────────┐│
││  Optimized Logical Plan   ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│       Expressions: a      │
│                           │
│          ~0 rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           LIMIT           │
│    ────────────────────   │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│          SEQ_SCAN         │
│    ────────────────────   │
│       Filters: a<50       │
│          Table: t         │
│   Type: Sequential Scan   │
│                           │
│       ~400,000 rows       │
└───────────────────────────┘
```
lnkuiper pushed a commit that referenced this pull request Dec 5, 2025
…kdb#20052)

Follow-up from duckdb#19937

This PR enables commits to continue while a checkpoint is happening.
Currently this is limited to commits that exclusively insert data, as
any other changes (deletes, updates, catalog changes, alters, etc) still
eagerly grab the checkpoint lock which will prevent a checkpoint from
starting while these changes are pending, and vice versa. It is also
limited to inserting data into tables **that do not have indexes**. As
part of this PR, appending to a table that has indexes now grabs the
checkpoint lock again.

Enabling commits while checkpointing has two consequences for the system
that need to be dealt with:

* Checkpointing no longer checkpoints the latest commit. 
* While checkpointing, new commits that happen need to be written
somewhere in order for them to be durable. We can no longer write them
to the old WAL as we want to truncate it after our checkpoint is
finished.

### Pinned Checkpoint Commit

Previously, the checkpointing code assumed we were always checkpointing
the latest commit. This is no longer correct, since what the "latest
committed data" is might now change *while a checkpoint is running*.
Instead, what we need to do is choose a commit id on which we will
checkpoint. When starting a checkpoint we get the latest commit id and
checkpoint based on that commit. Subsequent commits are not written as
part of the checkpoint, but can then be written as part of a future
checkpoint.

In order to simplify this - we ensure that after starting a checkpoint
any new data that is written is always written to *new row groups*. This
is managed in `DataTable::AppendLock`. Due to this, when performing the
checkpoint, we only need to know "do we need to checkpoint this row
group or not", rather than having to checkpoint a part of a row group.
This is handled in the new method `RowGroup::ShouldCheckpointRowGroup`.
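Because post-pin appends always go to new row groups, the decision reduces to a per-row-group comparison against the pinned commit id. A minimal sketch of that decision (illustrative only; DuckDB's actual logic lives in `RowGroup::ShouldCheckpointRowGroup` and is more involved):

```python
# Hedged sketch: with a pinned checkpoint commit, each row group either belongs
# to the checkpoint entirely or not at all; we never checkpoint part of a group.

def should_checkpoint_row_group(row_group_commit_id, pinned_commit_id):
    """Row groups appended at or before the pinned commit belong to this checkpoint."""
    return row_group_commit_id <= pinned_commit_id

pinned = 10  # commit id chosen when the checkpoint started
assert should_checkpoint_row_group(9, pinned)       # committed before the pin
assert should_checkpoint_row_group(10, pinned)      # the pinned commit itself
assert not should_checkpoint_row_group(11, pinned)  # concurrent commit: next checkpoint
```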

#### Free Blocks

Another challenge with the pinned checkpoint commit is how to manage the
list of free blocks - i.e. blocks that are present in the file but are
not used. Block usage is tracked globally in the
`SingleFileBlockManager`. With optimistic writes, we can write to blocks
in the storage layer (i.e. make them no longer free blocks). However, if
a checkpoint happens at a pinned commit, any optimistic writes that
happen after the commit is pinned do not belong to that checkpoint. If
we don't write these blocks in the free block list, we might get
dangling blocks in case an abort or rollback happens. However, the
blocks are not actually free in-memory, as they are being used by the
optimistically written data.

In order to solve this issue we introduce a new set in the block
manager, `newly_used_blocks`. This tracks blocks that are in use, but
are not yet part of a given checkpoint.
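A toy model of this bookkeeping (illustrative, not DuckDB's `SingleFileBlockManager`): blocks grabbed by optimistic writes after the commit is pinned cannot be handed out again in memory, but must still appear in the free list that the pinned checkpoint serializes, since their data is not part of that checkpoint.

```python
# Toy model of the `newly_used_blocks` idea described above.

class BlockManager:
    def __init__(self, free_blocks):
        self.free_blocks = set(free_blocks)   # free on disk and in memory
        self.newly_used_blocks = set()        # in use, but not in this checkpoint

    def allocate_optimistic(self):
        """Allocate a block for an optimistic write while a checkpoint runs."""
        block = self.free_blocks.pop()
        self.newly_used_blocks.add(block)
        return block

    def checkpoint_free_list(self):
        """Free list to serialize with the pinned checkpoint: newly used blocks
        are included, because the checkpoint does not own their data."""
        return self.free_blocks | self.newly_used_blocks

    def finish_checkpoint(self):
        """After the checkpoint completes, the blocks are genuinely in use."""
        self.newly_used_blocks.clear()

mgr = BlockManager({1, 2, 3})
b = mgr.allocate_optimistic()
assert b not in mgr.free_blocks         # never handed out twice in memory
assert b in mgr.checkpoint_free_list()  # but the checkpoint still lists it free
mgr.finish_checkpoint()
assert b not in mgr.checkpoint_free_list()
```

This way, an abort or rollback after the pinned commit cannot leave dangling blocks: the checkpoint's free list never claims ownership of data it did not write.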

### Checkpoint WAL

New commits that happen while checkpointing have to be written
somewhere. In order to still allow for the checkpoint to truncate the
WAL to prevent it from growing indefinitely, we introduce the concept of
a **checkpoint WAL**. This is a secondary WAL that can be written to
only by concurrent commits while a checkpoint is happening.

When a checkpoint is started, the checkpoint flag is written to the
original WAL. The checkpoint flag contains the root metadata block
pointer that will be written **when the checkpoint is successful**.


```
main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]
```

The checkpoint flag allows us to, during recovery, figure out if a
checkpoint was completed or if the checkpoint was not completed. This
determines if we need to replay the WAL. This is already done in the
current version to deal with a crash between flipping the root block
pointer and truncating the WAL, however, in the new version this happens
before **any** data is written instead of only happening at the end.

After this is written, we set any new commits to write to the checkpoint
WAL. For example, assume a new commit comes in that inserts some data.
We will now have the following situation:

```
main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

After the checkpoint is finished, we have flushed all changes in
`main.db.wal` to the main database file, while the changes in
`main.db.checkpoint.wal` have not been flushed. All we need to do is
move over the checkpoint WAL and have it replace the original WAL. This
will lead us to the following final result after the checkpoint:

```
main.db.wal
[INSERT #2][COMMIT]
```


#### Recovery

In order to provide ACID compliance all commits that have succeeded must
be persisted even across failures. That means that any commits that are
written to the checkpoint WAL need to be persisted no matter where we
crash. Below is a list of failure modes:

###### Crash Before Checkpoint Complete

Our situation is like this:

```
main.db
[ROOT #1]

main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

In order to recover in this situation, we need to replay both
`main.db.wal` and `main.db.checkpoint.wal`. The recovery process sees
that the checkpoint root does not match the root in the database, and
now also checks for the presence of a checkpoint WAL. It then replays
them in order (`main.db.wal` -> `main.db.checkpoint.wal`).

If this is a `READ_WRITE` connection it merges the two WALs **except for
the checkpoint node** by writing a new WAL that contains the content of
both WALs:

```
main.db.recovery.wal
[INSERT #1][COMMIT][INSERT #2][COMMIT]
```

After that completes, it overwrites the main WAL with the recovery WAL.
Finally, it removes the checkpoint WAL.

```
mv main.db.recovery.wal main.db.wal
rm main.db.checkpoint.wal
```

###### Crash During Recovery

If we crash during the above recovery process (after mv, before rm) we
would have this situation:

```
main.db
[ROOT #1]

main.db.wal
[INSERT #1][COMMIT][INSERT duckdb#2][COMMIT]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

This is safe to recover from because `main.db.wal` does not contain a
`CHECKPOINT` node. As such, we will not replay the checkpoint WAL, and
only `main.db.wal` will be replayed.

###### Crash After Checkpoint Complete, Before WAL Move

Our situation is like this:

```
main.db
[ROOT #2]

main.db.wal
[INSERT #1][COMMIT][CHECKPOINT: NEW_ROOT: #2]

main.db.checkpoint.wal
[INSERT #2][COMMIT]
```

In order to recover in this situation, we only need to replay
`main.db.checkpoint.wal`. The recovery process sees that the checkpoint
root matches the root in the database, so it knows it does not need to
replay `main.db.wal`. It then checks for the presence of the checkpoint
WAL and, since it is present, replays it.

If this is a `READ_WRITE` connection it then completes the checkpoint by
finalizing the move (i.e. `mv main.db.checkpoint.wal main.db.wal`).
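Putting the failure modes together, the recovery decision can be
sketched as a small decision tree. This is an illustrative Python sketch
(the record representation and file names are made up for the example,
not DuckDB's actual structures):

```python
def plan_recovery(db_root, wal_records, checkpoint_wal_exists):
    """Decide which WALs to replay on startup, in order.
    `wal_records` is the parsed main WAL; a CHECKPOINT record carries the
    root it was checkpointing towards."""
    checkpoint = next((r for r in wal_records if r["type"] == "CHECKPOINT"), None)
    if checkpoint is None:
        # No checkpoint in flight (this also covers a crash during recovery,
        # after the recovery WAL replaced the main WAL): replay only the
        # main WAL, ignore any leftover checkpoint WAL.
        return ["main.db.wal"]
    if checkpoint["new_root"] == db_root:
        # Checkpoint completed but the WAL move did not: the main WAL's
        # changes are already in the database file.
        return ["main.db.checkpoint.wal"] if checkpoint_wal_exists else []
    # Checkpoint did not complete: replay both, main WAL first.
    replay = ["main.db.wal"]
    if checkpoint_wal_exists:
        replay.append("main.db.checkpoint.wal")
    return replay
```

Each branch corresponds to one of the crash scenarios above: crash
before checkpoint complete, crash during recovery, and crash after
checkpoint complete but before the WAL move.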


### Other Changes / Fixes


#### Windows: make `FileSystem::MoveFile` behave like Linux/macOS

On Linux/macOS, `MoveFile` is used to mean "move and overwrite the
target file". On Windows, this would previously fail if the target
already exists. This PR makes Windows behave like Linux/macOS by passing
`MOVEFILE_REPLACE_EXISTING` to `MoveFileExW`. In addition, because we
tend to use `MoveFile` to mean "we want to be certain this file was
moved", we also enable the `MOVEFILE_WRITE_THROUGH` flag.
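The same platform split exists in Python's standard library, which makes
for a quick illustration of the semantics (file names here are
arbitrary): `os.rename` overwrites an existing target on POSIX but
raises on Windows, while `os.replace` overwrites on both, which is the
behaviour `MoveFile` now has everywhere.

```python
import os

# A source file and a pre-existing target file.
with open("src.tmp", "w") as f:
    f.write("new contents")
with open("dst.tmp", "w") as f:
    f.write("old contents")

# os.rename would raise FileExistsError on Windows because the target
# exists; os.replace overwrites on POSIX and Windows alike.
os.replace("src.tmp", "dst.tmp")
```

After this, `dst.tmp` contains the source's contents and `src.tmp` is
gone.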


#### SQLLogicTest

While testing this PR, I realized the sqllogictest runner was swallowing
exceptions thrown in certain locations and incorrectly reporting tests
that should have failed as successful. This PR fixes that: these
exceptions now fail the test run. This revealed a bunch of failing
tests, in particular around the config runners `peg_parser.json` and
`encryption.json`, and a few tests in `httpfs`. A few tests were fixed,
but others were skipped in the config, to be looked at in the future.

lnkuiper added a commit that referenced this pull request Dec 11, 2025
…CHAR in `shell_renderer.cpp` (duckdb#20096)

Hi DuckDB Team,

When building the main branch of DuckDB on Linux with my `adbc_scanner`
extension, I often encountered this link error in CI. This PR resolves
it by using `LogicalType(LogicalTypeId::VARCHAR)` instead of
`LogicalType::VARCHAR`. The changed code is in `shell_renderer.cpp`, so
it is not anywhere in my extension code. With this change the
compilation succeeds.

Build error:

```
[2/3] Linking CXX executable duckdb
FAILED: duckdb 
: && /usr/bin/c++ -g -g -O0 -DDEBUG -Wall    -fsanitize=address -fsanitize=undefined -fno-sanitize-recover=all -Wunused -Werror=vla -Wnarrowing -pedantic  tools/shell/linenoise/CMakeFiles/duckdb_linenoise.dir/highlighting.cpp.o tools/shell/linenoise/CMakeFiles/duckdb_linenoise.dir/history.cpp.o tools/shell/linenoise/CMakeFiles/duckdb_linenoise.dir/linenoise.cpp.o tools/shell/linenoise/CMakeFiles/duckdb_linenoise.dir/linenoise-c.cpp.o tools/shell/linenoise/CMakeFiles/duckdb_linenoise.dir/rendering.cpp.o tools/shell/linenoise/CMakeFiles/duckdb_linenoise.dir/terminal.cpp.o extension/CMakeFiles/duckdb_generated_extension_loader.dir/__/codegen/src/generated_extension_loader.cpp.o tools/shell/CMakeFiles/shell.dir/shell_command_line_option.cpp.o tools/shell/CMakeFiles/shell.dir/shell_extension.cpp.o tools/shell/CMakeFiles/shell.dir/shell.cpp.o tools/shell/CMakeFiles/shell.dir/shell_helpers.cpp.o tools/shell/CMakeFiles/shell.dir/shell_metadata_command.cpp.o tools/shell/CMakeFiles/shell.dir/shell_prompt.cpp.o tools/shell/CMakeFiles/shell.dir/shell_renderer.cpp.o tools/shell/CMakeFiles/shell.dir/shell_highlight.cpp.o tools/shell/CMakeFiles/shell.dir/shell_progress_bar.cpp.o tools/shell/CMakeFiles/shell.dir/shell_render_table_metadata.cpp.o tools/shell/CMakeFiles/shell.dir/shell_windows.cpp.o -o duckdb  src/libduckdb_static.a  extension/adbc_scanner/libadbc_scanner_extension.a  extension/core_functions/libcore_functions_extension.a  extension/parquet/libparquet_extension.a  extension/jemalloc/libjemalloc_extension.a  third_party/utf8proc/libduckdb_utf8proc.a  vcpkg_installed/x64-linux/debug/lib/libadbc_driver_manager.a  vcpkg_installed/x64-linux/debug/lib/libtomlplusplus.a  vcpkg_installed/x64-linux/debug/lib/libnanoarrow_static.a  src/libduckdb_static.a  -ldl && :
/usr/bin/ld: src/libduckdb_static.a(ub_duckdb_common.cpp.o):(.rodata+0x4460): multiple definition of `duckdb::LogicalType::VARCHAR'; tools/shell/CMakeFiles/shell.dir/shell_renderer.cpp.o:(.rodata._ZN6duckdb11LogicalType7VARCHARE[_ZN6duckdb11LogicalType7VARCHARE]+0x0): first defined here
collect2: error: ld returned 1 exit status
```

Distro information: Amazon Linux 2023 on AWS

Linux ip-172-16-3-91.ec2.internal 6.1.158-178.288.amzn2023.x86_64 #1 SMP
PREEMPT_DYNAMIC Mon Nov 3 18:38:36 UTC 2025 x86_64 x86_64 x86_64
GNU/Linux

```
[ec2-user@ip-172-16-3-91 duckdb]$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-amazon-linux/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-amazon-linux
Configured with: ../configure --enable-bootstrap --enable-host-pie --enable-host-bind-now --enable-languages=c,c++,fortran,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://github.com/amazonlinux/amazon-linux-2022 --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --enable-plugin --enable-initfini-array --with-isl=/builddir/build/BUILD/gcc-11.5.0-20240719/obj-x86_64-amazon-linux/isl-install --enable-multilib --with-linker-hash-style=gnu --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_64=x86-64-v2 --with-arch_32=x86-64 --build=x86_64-amazon-linux --with-build-config=bootstrap-lto --enable-link-serialization=1
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 11.5.0 20240719 (Red Hat 11.5.0-5) (GCC) 
```

I only knew how to resolve it because I have seen the same problem, and
this fix pattern, in other extensions.

You can see a full build failure here:


https://github.com/Query-farm/adbc_scanner/actions/runs/20048370767/job/57498881373

Thanks,

Rusty
lnkuiper added a commit that referenced this pull request Dec 15, 2025
This PR improves the `CommonSubplanOptimizer` and should put it in
"maintenance mode", at least for now; I think I'm done improving it for
a while.

## Multiple Nested Matching

This PR allows multiple subplans to be matched, rather than just a
single subplan, and allows nesting. Take the following query, for
example:

```sql
explain
select distinct range from range(10)
union all
select distinct range from range(10)
union all
select range % 2 as range from (select distinct range from range(10)) group by range
union all
select range % 2 as range from (select distinct range from range(10)) group by range
union all
select count(*) from (select range % 2 as range from (select distinct range from range(10)) group by range)
union all
select count(*) from (select range % 2 as range from (select distinct range from range(10)) group by range);
```

This query unions 6 queries together. Each query occurs twice, and is
nested in the next query. With the optimizer disabled, this yields the
following plan:
```
┌───────────────────────────┐
│           UNION           ├──────────────┬────────────────────────────┬────────────────────────────┬────────────────────────────┬────────────────────────────┐
└─────────────┬─────────────┘              │                            │                            │                            │                            │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│       HASH_GROUP_BY       ││       HASH_GROUP_BY       ││         PROJECTION        ││         PROJECTION        ││    UNGROUPED_AGGREGATE    ││    UNGROUPED_AGGREGATE    │
│    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
│         Groups: #0        ││         Groups: #0        ││           range           ││           range           ││        Aggregates:        ││        Aggregates:        │
│                           ││                           ││                           ││                           ││        count_star()       ││        count_star()       │
│          ~10 rows         ││          ~10 rows         ││          ~6 rows          ││          ~6 rows          ││                           ││                           │
└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        ││       HASH_GROUP_BY       ││       HASH_GROUP_BY       ││         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
│           range           ││           range           ││         Groups: #0        ││         Groups: #0        ││             42            ││             42            │
│                           ││                           ││                           ││                           ││                           ││                           │
│          ~10 rows         ││          ~10 rows         ││          ~6 rows          ││          ~6 rows          ││          ~6 rows          ││          ~6 rows          │
└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│           RANGE           ││           RANGE           ││         PROJECTION        ││         PROJECTION        ││       HASH_GROUP_BY       ││       HASH_GROUP_BY       │
│    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
│      Function: RANGE      ││      Function: RANGE      ││           range           ││           range           ││         Groups: #0        ││         Groups: #0        │
│                           ││                           ││                           ││                           ││                           ││                           │
│          ~10 rows         ││          ~10 rows         ││          ~10 rows         ││          ~10 rows         ││          ~6 rows          ││          ~6 rows          │
└───────────────────────────┘└───────────────────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
                                                          ┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                          │       HASH_GROUP_BY       ││       HASH_GROUP_BY       ││         PROJECTION        ││         PROJECTION        │
                                                          │    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
                                                          │         Groups: #0        ││         Groups: #0        ││           range           ││           range           │
                                                          │                           ││                           ││                           ││                           │
                                                          │          ~10 rows         ││          ~10 rows         ││          ~10 rows         ││          ~10 rows         │
                                                          └─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
                                                          ┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                          │         PROJECTION        ││         PROJECTION        ││       HASH_GROUP_BY       ││       HASH_GROUP_BY       │
                                                          │    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
                                                          │           range           ││           range           ││         Groups: #0        ││         Groups: #0        │
                                                          │                           ││                           ││                           ││                           │
                                                          │          ~10 rows         ││          ~10 rows         ││          ~10 rows         ││          ~10 rows         │
                                                          └─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
                                                          ┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                          │           RANGE           ││           RANGE           ││         PROJECTION        ││         PROJECTION        │
                                                          │    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
                                                          │      Function: RANGE      ││      Function: RANGE      ││           range           ││           range           │
                                                          │                           ││                           ││                           ││                           │
                                                          │          ~10 rows         ││          ~10 rows         ││          ~10 rows         ││          ~10 rows         │
                                                          └───────────────────────────┘└───────────────────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
                                                                                                                    ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                                                                                    │           RANGE           ││           RANGE           │
                                                                                                                    │    ────────────────────   ││    ────────────────────   │
                                                                                                                    │      Function: RANGE      ││      Function: RANGE      │
                                                                                                                    │                           ││                           │
                                                                                                                    │          ~10 rows         ││          ~10 rows         │
                                                                                                                    └───────────────────────────┘└───────────────────────────┘
```
As we can see, there is a lot of redundancy here. With this PR (and the
optimizer enabled, of course), we get the following plan:
```
┌───────────────────────┐
│          CTE          │
│    ────────────────   │
│       CTE Name:       │
│   __common_subplan_1  │
│                       ├────────────┐
│    Table Index: 79    │            │
│                       │            │
│        ~0 rows        │            │
└───────────┬───────────┘            │
┌───────────┴───────────┐┌───────────┴───────────┐
│     HASH_GROUP_BY     ││          CTE          │
│    ────────────────   ││    ────────────────   │
│       Groups: #0      ││       CTE Name:       │
│                       ││   __common_subplan_2  │
│                       ││                       ├────────────┐
│                       ││    Table Index: 86    │            │
│                       ││                       │            │
│        ~10 rows       ││        ~0 rows        │            │
└───────────┬───────────┘└───────────┬───────────┘            │
┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐
│       PROJECTION      ││     HASH_GROUP_BY     ││          CTE          │
│    ────────────────   ││    ────────────────   ││    ────────────────   │
│         range         ││       Groups: #0      ││       CTE Name:       │
│                       ││                       ││   __common_subplan_3  │
│                       ││                       ││                       ├────────────┐
│                       ││                       ││    Table Index: 91    │            │
│                       ││                       ││                       │            │
│        ~10 rows       ││        ~6 rows        ││        ~0 rows        │            │
└───────────┬───────────┘└───────────┬───────────┘└───────────┬───────────┘            │
┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐
│         RANGE         ││       PROJECTION      ││  UNGROUPED_AGGREGATE  ││         UNION         │
│    ────────────────   ││    ────────────────   ││    ────────────────   ││                       │
│    Function: RANGE    ││         range         ││      Aggregates:      ││                       ├────────────┬────────────────────────┬────────────────────────┬────────────────────────┬────────────────────────┐
│                       ││                       ││      count_star()     ││                       │            │                        │                        │                        │                        │
│        ~10 rows       ││        ~10 rows       ││                       ││                       │            │                        │                        │                        │                        │
└───────────────────────┘└───────────┬───────────┘└───────────┬───────────┘└───────────┬───────────┘            │                        │                        │                        │                        │
                         ┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐┌───────────┴───────────┐
                         │        CTE_SCAN       ││       PROJECTION      ││        CTE_SCAN       ││        CTE_SCAN       ││       PROJECTION      ││       PROJECTION      ││        CTE_SCAN       ││        CTE_SCAN       │
                         │    ────────────────   ││    ────────────────   ││    ────────────────   ││    ────────────────   ││    ────────────────   ││    ────────────────   ││    ────────────────   ││    ────────────────   │
                         │     CTE Index: 79     ││           42          ││     CTE Index: 79     ││     CTE Index: 79     ││         range         ││         range         ││     CTE Index: 91     ││     CTE Index: 91     │
                         │                       ││                       ││                       ││                       ││                       ││                       ││                       ││                       │
                         │        ~10 rows       ││        ~6 rows        ││        ~10 rows       ││        ~10 rows       ││        ~6 rows        ││        ~6 rows        ││         ~1 row        ││         ~1 row        │
                         └───────────────────────┘└───────────┬───────────┘└───────────────────────┘└───────────────────────┘└───────────┬───────────┘└───────────┬───────────┘└───────────────────────┘└───────────────────────┘
                                                  ┌───────────┴───────────┐                                                  ┌───────────┴───────────┐┌───────────┴───────────┐
                                                  │        CTE_SCAN       │                                                  │        CTE_SCAN       ││        CTE_SCAN       │
                                                  │    ────────────────   │                                                  │    ────────────────   ││    ────────────────   │
                                                  │     CTE Index: 86     │                                                  │     CTE Index: 86     ││     CTE Index: 86     │
                                                  │                       │                                                  │                       ││                       │
                                                  │        ~6 rows        │                                                  │        ~6 rows        ││        ~6 rows        │
                                                  └───────────────────────┘                                                  └───────────────────────┘└───────────────────────┘
```
Which has 3 CTEs, 8 CTE scans, and only 3 aggregations (instead of the
original 12!).
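Conceptually, finding duplicated subplans comes down to structural
hashing of plan subtrees. Below is a toy Python sketch of that idea,
under heavy simplification: a plan node is just a `(name, children)`
tuple, whereas DuckDB's actual optimizer works on `LogicalOperator`
trees and also has to compare expressions, table bindings, and more:

```python
from collections import defaultdict

def find_common_subplans(plan):
    """Return groups of structurally identical subtrees that occur more
    than once. Each node is a (name, children) tuple."""
    seen = defaultdict(list)

    def visit(node):
        name, children = node
        # The structural key of a node is its name plus its children's keys.
        key = (name, tuple(visit(child) for child in children))
        seen[key].append(node)
        return key

    visit(plan)
    return [nodes for nodes in seen.values() if len(nodes) > 1]

# Mimicking the example query: the same DISTINCT subplan occurs directly
# twice in the UNION and once more nested under a projection.
distinct = ("HASH_GROUP_BY", [("PROJECTION", [("RANGE", [])])])
union = ("UNION", [distinct, distinct, ("PROJECTION", [distinct])])
dups = find_common_subplans(union)
```

In a real pass, the largest duplicated subtree would then be
materialized as a CTE and its occurrences replaced with CTE scans, which
is what the plan above shows.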

## Fuzzy Plan Matching

Something that can show up in query plans is an "almost exact" subplan
match. Before this PR, an exact match was required. With this PR, we can
do "fuzzy" matching, where the plans are mostly the same, save for the
selected columns. Take the following query, for example:

```sql
-- Create build table
create table build as
select range a, range * 2 b, range * 3 c, range * 4 d
from (select range::utinyint as range from range(11))
where range % 5 = 0;
-- Create probe table
create table probe as
select range e, range * 2 f, range * 3 g, range * 4 h
from (select range::utinyint as range from range(11));
-- View that joins the two
create view my_view as
from probe join build on (build.a = probe.e);
-- Select greatest of all columns, unioned with greatest of just two columns
explain
select greatest(a, b, c, d, e, f, g, h) from my_view
union all
select greatest(b, f) from my_view;
```
Here, one of the unioned queries selects all columns, and the other
selects just two. As we can see, the join (coming from the view) in the
second query is "contained" in the join in the first query, as the first
query selects all columns that the second query needs.
We currently get the following plan (the optimizer doesn't trigger):
```
┌───────────────────────────┐
│           UNION           ├───────────────────────────────────────────┐
└─────────────┬─────────────┘                                           │
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│         PROJECTION        │                             │         PROJECTION        │
│    ────────────────────   │                             │    ────────────────────   │
│ greatest(a, b, c, d, e, f,│                             │       greatest(b, f)      │
│            g, h)          │                             │                           │
│                           │                             │                           │
│          ~3 rows          │                             │          ~3 rows          │
└─────────────┬─────────────┘                             └─────────────┬─────────────┘
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│         PROJECTION        │                             │         HASH_JOIN         │
│    ────────────────────   │                             │    ────────────────────   │
│             e             │                             │      Join Type: INNER     │
│             f             │                             │     Conditions: e = a     │
│             g             │                             │                           │
│             h             │                             │                           │
│             a             │                             │                           ├──────────────┐
│             b             │                             │                           │              │
│             c             │                             │                           │              │
│             d             │                             │                           │              │
│                           │                             │                           │              │
│          ~3 rows          │                             │          ~3 rows          │              │
└─────────────┬─────────────┘                             └─────────────┬─────────────┘              │
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         HASH_JOIN         │                             │         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   │                             │    ────────────────────   ││    ────────────────────   │
│      Join Type: INNER     │                             │        Table: probe       ││        Table: build       │
│     Conditions: e = a     │                             │   Type: Sequential Scan   ││   Type: Sequential Scan   │
│                           │                             │                           ││                           │
│                           ├──────────────┐              │        Projections:       ││        Projections:       │
│                           │              │              │             e             ││             a             │
│                           │              │              │             f             ││             b             │
│                           │              │              │                           ││                           │
│          ~3 rows          │              │              │          ~11 rows         ││          ~3 rows          │
└─────────────┬─────────────┘              │              └───────────────────────────┘└───────────────────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          │
│    ────────────────────   ││    ────────────────────   │
│        Table: probe       ││        Table: build       │
│   Type: Sequential Scan   ││   Type: Sequential Scan   │
│                           ││                           │
│        Projections:       ││        Projections:       │
│             e             ││             a             │
│             f             ││             b             │
│             g             ││             c             │
│             h             ││             d             │
│                           ││                           │
│          ~11 rows         ││          ~3 rows          │
└───────────────────────────┘└───────────────────────────┘
```

With the improvements to the optimizer in this PR, we now get this plan:
```
┌───────────────────────────┐
│            CTE            │
│    ────────────────────   │
│         CTE Name:         │
│     __common_subplan_1    │
│                           ├───────────────────────────────────────────┐
│      Table Index: 29      │                                           │
│                           │                                           │
│          ~0 rows          │                                           │
└─────────────┬─────────────┘                                           │
┌─────────────┴─────────────┐                             ┌─────────────┴─────────────┐
│         HASH_JOIN         │                             │           UNION           │
│    ────────────────────   │                             │                           │
│      Join Type: INNER     │                             │                           │
│     Conditions: e = a     ├──────────────┐              │                           ├──────────────┐
│                           │              │              │                           │              │
│          ~3 rows          │              │              │                           │              │
└─────────────┬─────────────┘              │              └─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││         SEQ_SCAN          ││         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   ││    ────────────────────   ││    ────────────────────   │
│        Table: probe       ││        Table: build       ││ greatest(a, b, c, d, e, f,││       greatest(b, f)      │
│   Type: Sequential Scan   ││   Type: Sequential Scan   ││            g, h)          ││                           │
│                           ││                           ││                           ││                           │
│        Projections:       ││        Projections:       ││                           ││                           │
│             e             ││             a             ││                           ││                           │
│             f             ││             b             ││                           ││                           │
│             g             ││             c             ││                           ││                           │
│             h             ││             d             ││                           ││                           │
│                           ││                           ││                           ││                           │
│          ~11 rows         ││          ~3 rows          ││          ~3 rows          ││          ~3 rows          │
└───────────────────────────┘└───────────────────────────┘└─────────────┬─────────────┘└─────────────┬─────────────┘
                                                          ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                          │         PROJECTION        ││         PROJECTION        │
                                                          │    ────────────────────   ││    ────────────────────   │
                                                          │             e             ││             #1            │
                                                          │             f             ││             #4            │
                                                          │             g             ││                           │
                                                          │             h             ││                           │
                                                          │             a             ││                           │
                                                          │             b             ││                           │
                                                          │             c             ││                           │
                                                          │             d             ││                           │
                                                          │                           ││                           │
                                                          │          ~3 rows          ││          ~0 rows          │
                                                          └─────────────┬─────────────┘└─────────────┬─────────────┘
                                                          ┌─────────────┴─────────────┐┌─────────────┴─────────────┐
                                                          │          CTE_SCAN         ││          CTE_SCAN         │
                                                          │    ────────────────────   ││    ────────────────────   │
                                                          │       CTE Index: 29       ││       CTE Index: 29       │
                                                          │                           ││                           │
                                                          │          ~3 rows          ││          ~3 rows          │
                                                          └───────────────────────────┘└───────────────────────────┘
```
As we can see, the join is materialized as a CTE, and scanned twice.
Columns that aren't needed are projected out after the CTE scan.

## Benchmark Improvements

With the changes, we now trigger more subplan elimination on TPC-DS and
TPC-H. Here are some results that I collected on my laptop.

TPC-DS SF100 improvements:
Q61: 0.61s -> 0.33s (~1.8x)
Q70: 0.81s -> 0.55s (~1.5x)

TPC-H SF100 improvements:
Q11: 0.16s -> 0.12s (~1.3x)

## Notes

I needed a lot of indirection to get the column bindings to match with
the fuzzy plan matching, which required a lot of `unordered_map`s. To
avoid doing so many allocations, I've also implemented
`arena_unordered_map`, so that the number of allocations grows
logarithmically. I'm not sure how much this helped, but it's
something I wanted to do since we are trying to reduce allocations.
Overall, this optimizer now takes ~10% of total optimization time.
lnkuiper pushed a commit that referenced this pull request Dec 29, 2025
…0283)

Fix for: duckdblabs/duckdb-internal#6809, duckdb#20086

I would like someone to take a look at this before I run CI, to see if
the fix makes sense.

In ConstantOrNullFunction, there is a bug: if the first loop
iteration is a FLAT_VECTOR, the result validity mask is created as a
reference to the validity mask of args.data[idx]. If a subsequent
iteration takes the default branch (say, a DICTIONARY_VECTOR), and we call
result_mask.SetInvalid(i), we are now overwriting the validity mask of
the first input column that the reference was created from.

I believe the fix for this is to call EnsureWritable in the FLAT_VECTOR
case, to make sure the validity mask is not a reference to the input's
validity mask before we call

```cpp
result_mask.Combine(input_mask, args.size()) 
```
(which is where the alias is actually created). 

The reproducer hits this case -- a specific combination of unique index +
update + no checkpointing was leading to this scenario.

For reference, here is the query plan of the last query in the
reproducer, where the bug was occurring. The t1.c0 column is being passed
as a FLAT_VECTOR to ConstantOrNullFunction, and the t0.c1 column is
being passed in as a dictionary vector. Since the argument at index 1 in
ConstantOrNullFunction is the c0 column in the output, we were
writing NULLs into the output, because the filter was overwriting the
validity mask inside ConstantOrNullFunction:

```
┌───────┬───────┬───────┐
│  c0   │  c0   │  c1   │
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│  NULL │     1 │  NULL │
│  NULL │    -1 │  NULL │
└───────┴───────┴───────┘
```

Whereas it should be: 

```
┌───────┬───────┬───────┐
│  c0   │  c0   │  c1   │
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│     0 │     1 │  NULL │
│  NULL │    -1 │  NULL │
└───────┴───────┴───────┘
```

```
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││               Total Time: 9.18s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             c0            │
│             c0            │
│             c1            │
│                           │
│           2 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             #3            │
│             #7            │
│            #11            │
│                           │
│           2 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│  (constant_or_null(false, │
│      c0, c1) IS NULL)     │
│                           │
│           2 rows          │
│          (1.82s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            NULL           │
│             #6            │
│            NULL           │
│             #5            │
│            NULL           │
│             #4            │
│            NULL           │
│             #3            │
│            NULL           │
│             #2            │
│            NULL           │
│             #1            │
│            NULL           │
│             #0            │
│            NULL           │
│                           │
│           2 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            NULL           │
│             #2            │
│            NULL           │
│             #1            │
│            NULL           │
│             #0            │
│            NULL           │
│                           │
│           2 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      POSITIONAL_SCAN      │
│    ────────────────────   │
│           2 rows          ├──────────────┐
│          (7.30s)          │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         TABLE_SCAN        ││         TABLE_SCAN        │
│    ────────────────────   ││    ────────────────────   │
│         Table: t1         ││         Table: t0         │
│   Type: Sequential Scan   ││   Type: Sequential Scan   │
│      Projections: c0      ││                           │
│                           ││        Projections:       │
│                           ││             c1            │
│                           ││             c0            │
│                           ││                           │
│           0 rows          ││           0 rows          │
│          (0.00s)          ││          (0.00s)          │
└───────────────────────────┘└───────────────────────────┘
```