Skip to content

Imdb validation and benchmark improvements#5

Closed
Tmonster wants to merge 21 commits intofeaturefrom
imdb_validation_and_benchmark_improvements
Closed

Imdb validation and benchmark improvements#5
Tmonster wants to merge 21 commits intofeaturefrom
imdb_validation_and_benchmark_improvements

Conversation

@Tmonster
Copy link
Owner

Opening here to work on CI

@Tmonster Tmonster force-pushed the imdb_validation_and_benchmark_improvements branch from b9585e6 to 55d9613 Compare November 28, 2022 14:36
@Tmonster
Copy link
Owner Author

Closing, the commit history got very messed up here

@Tmonster Tmonster closed this Nov 28, 2022
@Tmonster
Copy link
Owner Author

Closing, the commit history got very messed up here

@Tmonster Tmonster reopened this Nov 28, 2022
@Tmonster Tmonster changed the base branch from master to feature November 28, 2022 14:40
@Tmonster
Copy link
Owner Author

Reopened because I managed to clean the git history

@Tmonster Tmonster force-pushed the imdb_validation_and_benchmark_improvements branch from 55d9613 to 132cb0d Compare November 29, 2022 15:21
@Tmonster Tmonster force-pushed the imdb_validation_and_benchmark_improvements branch from fcbf465 to a968122 Compare December 5, 2022 16:51
@Tmonster
Copy link
Owner Author

Tmonster commented Dec 7, 2022

Closing because commit history is sorta messed up. fixing it in #10

@Tmonster Tmonster closed this Dec 7, 2022
Tmonster added a commit that referenced this pull request Jul 4, 2024
avoid CatalogType::TABLE_MACRO_ENTRY which can cause SIGABRT
Tmonster pushed a commit that referenced this pull request Oct 24, 2024
Tmonster pushed a commit that referenced this pull request Dec 3, 2024
I was investigating the following crash where a checkpoint task had its
underlying resources being destroyed while it was still running. The two
interesting threads are the following:

```
thread #1, name = 'duckling', stop reason = signal SIGTRAP
    frame #0: 0x0000ffff91bb71ec
    frame #1: 0x0000aaaad73a38e8 duckling`duckdb::InternalException::InternalException(this=<unavailable>, msg=<unavailable>) at exception.cpp:336:2
    frame #2: 0x0000aaaad786eb68 duckling`duckdb::unique_ptr<duckdb::RowGroup, std::default_delete<duckdb::RowGroup>, true>::operator*() const [inlined] duckdb::unique_ptr<duckdb::RowGroup, std::default_delete<duckdb::RowGroup>, true>::AssertNotNull(null=<unavailable>) at unique_ptr.hpp:25:10
    frame #3: 0x0000aaaad786eaf4 duckling`duckdb::unique_ptr<duckdb::RowGroup, std::default_delete<duckdb::RowGroup>, true>::operator*(this=0x0000aaacbb73e008) const at unique_ptr.hpp:34:4
    frame #4: 0x0000aaaad787abbc duckling`duckdb::CheckpointTask::ExecuteTask(this=0x0000aaabec92be60) at row_group_collection.cpp:732:21
    frame #5: 0x0000aaaad7756ea4 duckling`duckdb::BaseExecutorTask::Execute(this=0x0000aaabec92be60, mode=<unavailable>) at task_executor.cpp:72:3
    frame #6: 0x0000aaaad7757e28 duckling`duckdb::TaskScheduler::ExecuteForever(this=0x0000aaaafda30e10, marker=0x0000aaaafda164a8) at task_scheduler.cpp:189:32
    frame #7: 0x0000ffff91a031fc
    frame #8: 0x0000ffff91c0d5c8

thread #120, stop reason = signal 0
    frame #0: 0x0000ffff91c71c24
    frame #1: 0x0000ffff91e1264c
    frame #2: 0x0000ffff91e01888
    frame #3: 0x0000ffff91e018f8
    frame #4: 0x0000ffff91e01c10
    frame #5: 0x0000ffff91e05afc
    frame #6: 0x0000ffff91e05e70
    frame #7: 0x0000aaaad784b63c duckling`duckdb::RowGroup::~RowGroup() [inlined] std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release(this=<unavailable>) at shared_ptr_base.h:184:10
    frame #8: 0x0000aaaad784b5b4 duckling`duckdb::RowGroup::~RowGroup() [inlined] std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count(this=<unavailable>) at shared_ptr_base.h:705:11
    frame #9: 0x0000aaaad784b5ac duckling`duckdb::RowGroup::~RowGroup() [inlined] std::__shared_ptr<duckdb::ColumnData, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr(this=<unavailable>) at shared_ptr_base.h:1154:31
    frame #10: 0x0000aaaad784b5ac duckling`duckdb::RowGroup::~RowGroup() [inlined] duckdb::shared_ptr<duckdb::ColumnData, true>::~shared_ptr(this=<unavailable>) at shared_ptr_ipp.hpp:115:24
    frame #11: 0x0000aaaad784b5ac duckling`duckdb::RowGroup::~RowGroup() [inlined] void std::_Destroy<duckdb::shared_ptr<duckdb::ColumnData, true>>(__pointer=<unavailable>) at stl_construct.h:151:19
    frame #12: 0x0000aaaad784b5ac duckling`duckdb::RowGroup::~RowGroup() [inlined] void std::_Destroy_aux<false>::__destroy<duckdb::shared_ptr<duckdb::ColumnData, true>*>(__first=<unavailable>, __last=<unavailable>) at stl_construct.h:163:6
    frame #13: 0x0000aaaad784b5a0 duckling`duckdb::RowGroup::~RowGroup() [inlined] void std::_Destroy<duckdb::shared_ptr<duckdb::ColumnData, true>*>(__first=<unavailable>, __last=<unavailable>) at stl_construct.h:195:7
    frame #14: 0x0000aaaad784b5a0 duckling`duckdb::RowGroup::~RowGroup() [inlined] void std::_Destroy<duckdb::shared_ptr<duckdb::ColumnData, true>*, duckdb::shared_ptr<duckdb::ColumnData, true>>(__first=<unavailable>, __last=<unavailable>, (null)=<unavailable>) at alloc_traits.h:848:7
    frame #15: 0x0000aaaad784b5a0 duckling`duckdb::RowGroup::~RowGroup() [inlined] std::vector<duckdb::shared_ptr<duckdb::ColumnData, true>, std::allocator<duckdb::shared_ptr<duckdb::ColumnData, true>>>::~vector(this=<unavailable>) at stl_vector.h:680:2
    frame #16: 0x0000aaaad784b5a0 duckling`duckdb::RowGroup::~RowGroup(this=<unavailable>) at row_group.cpp:83:1
    frame #17: 0x0000aaaad786ee18 duckling`std::vector<duckdb::SegmentNode<duckdb::RowGroup>, std::allocator<duckdb::SegmentNode<duckdb::RowGroup>>>::~vector() [inlined] std::default_delete<duckdb::RowGroup>::operator()(this=0x0000aaacbb73e1a8, __ptr=0x0000aaab75ae7860) const at unique_ptr.h:85:2
    frame #18: 0x0000aaaad786ee10 duckling`std::vector<duckdb::SegmentNode<duckdb::RowGroup>, std::allocator<duckdb::SegmentNode<duckdb::RowGroup>>>::~vector() at unique_ptr.h:361:4
    frame #19: 0x0000aaaad786ee08 duckling`std::vector<duckdb::SegmentNode<duckdb::RowGroup>, std::allocator<duckdb::SegmentNode<duckdb::RowGroup>>>::~vector() [inlined] duckdb::SegmentNode<duckdb::RowGroup>::~SegmentNode(this=0x0000aaacbb73e1a0) at segment_tree.hpp:21:8
    frame #20: 0x0000aaaad786ee08 duckling`std::vector<duckdb::SegmentNode<duckdb::RowGroup>, std::allocator<duckdb::SegmentNode<duckdb::RowGroup>>>::~vector() [inlined] void std::_Destroy<duckdb::SegmentNode<duckdb::RowGroup>>(__pointer=0x0000aaacbb73e1a0) at stl_construct.h:151:19
    frame #21: 0x0000aaaad786ee08 duckling`std::vector<duckdb::SegmentNode<duckdb::RowGroup>, std::allocator<duckdb::SegmentNode<duckdb::RowGroup>>>::~vector() [inlined] void std::_Destroy_aux<false>::__destroy<duckdb::SegmentNode<duckdb::RowGroup>*>(__first=0x0000aaacbb73e1a0, __last=0x0000aaacbb751130) at stl_construct.h:163:6
    frame #22: 0x0000aaaad786ede8 duckling`std::vector<duckdb::SegmentNode<duckdb::RowGroup>, std::allocator<duckdb::SegmentNode<duckdb::RowGroup>>>::~vector() [inlined] void std::_Destroy<duckdb::SegmentNode<duckdb::RowGroup>*>(__first=<unavailable>, __last=0x0000aaacbb751130) at stl_construct.h:195:7
    frame #23: 0x0000aaaad786ede8 duckling`std::vector<duckdb::SegmentNode<duckdb::RowGroup>, std::allocator<duckdb::SegmentNode<duckdb::RowGroup>>>::~vector() [inlined] void std::_Destroy<duckdb::SegmentNode<duckdb::RowGroup>*, duckdb::SegmentNode<duckdb::RowGroup>>(__first=<unavailable>, __last=0x0000aaacbb751130, (null)=0x0000fffefc81a908) at alloc_traits.h:848:7
    frame #24: 0x0000aaaad786ede8 duckling`std::vector<duckdb::SegmentNode<duckdb::RowGroup>, std::allocator<duckdb::SegmentNode<duckdb::RowGroup>>>::~vector(this=size=4883) at stl_vector.h:680:2
    frame #25: 0x0000aaaad7857f74 duckling`duckdb::RowGroupCollection::Checkpoint(this=<unavailable>, writer=<unavailable>, global_stats=0x0000fffefc81a9c0) at row_group_collection.cpp:1017:1
    frame #26: 0x0000aaaad788f02c duckling`duckdb::DataTable::Checkpoint(this=0x0000aaab04649e70, writer=0x0000aaab6584ea80, serializer=0x0000fffefc81ab38) at data_table.cpp:1427:14
    frame #27: 0x0000aaaad7881394 duckling`duckdb::SingleFileCheckpointWriter::WriteTable(this=0x0000fffefc81b128, table=0x0000aaab023b78c0, serializer=0x0000fffefc81ab38) at checkpoint_manager.cpp:528:11
    frame #28: 0x0000aaaad787ece4 duckling`duckdb::SingleFileCheckpointWriter::CreateCheckpoint() [inlined] duckdb::SingleFileCheckpointWriter::CreateCheckpoint(this=<unavailable>, obj=0x0000fffefc81ab38)::$_7::operator()(duckdb::Serializer::List&, unsigned long) const::'lambda'(duckdb::Serializer&)::operator()(duckdb::Serializer&) const at checkpoint_manager.cpp:181:43
    frame #29: 0x0000aaaad787ecd8 duckling`duckdb::SingleFileCheckpointWriter::CreateCheckpoint() [inlined] void duckdb::Serializer::List::WriteObject<duckdb::SingleFileCheckpointWriter::CreateCheckpoint()::$_7::operator()(duckdb::Serializer::List&, unsigned long) const::'lambda'(duckdb::Serializer&)>(this=<unavailable>, f=(unnamed class) @ 0x0000600002cbd2b0) at serializer.hpp:385:2
    frame #30: 0x0000aaaad787ecc4 duckling`duckdb::SingleFileCheckpointWriter::CreateCheckpoint() [inlined] duckdb::SingleFileCheckpointWriter::CreateCheckpoint()::$_7::operator()(this=<unavailable>, list=<unavailable>, i=2) const at checkpoint_manager.cpp:181:8
    frame #31: 0x0000aaaad787ecb0 duckling`duckdb::SingleFileCheckpointWriter::CreateCheckpoint() at serializer.hpp:151:4
    frame #32: 0x0000aaaad787ec94 duckling`duckdb::SingleFileCheckpointWriter::CreateCheckpoint(this=0x0000fffefc81b128) at checkpoint_manager.cpp:179:13
    frame #33: 0x0000aaaad78954a8 duckling`duckdb::SingleFileStorageManager::CreateCheckpoint(this=0x0000aaaafe1de140, options=(wal_action = DONT_DELETE_WAL, action = CHECKPOINT_IF_REQUIRED, type = FULL_CHECKPOINT)) at storage_manager.cpp:365:17
    frame #34: 0x0000aaaad78baac0 duckling`duckdb::DuckTransactionManager::Checkpoint(this=0x0000aaaafe167e00, context=<unavailable>, force=<unavailable>) at duck_transaction_manager.cpp:198:18
    frame #35: 0x0000aaaad69d02c0 duckling`md::Server::BackgroundCheckpointIfNeeded(this=0x0000aaaafdbfe900) at server.cpp:1983:31
    frame #36: 0x0000aaaadac5d3f0 duckling`std::thread::_State_impl<std::thread::_Invoker<std::tuple<md::BackgroundCronTask::Start(unsigned long)::$_0>>>::_M_run() [inlined] std::function<void ()>::operator()(this=<unavailable>) const at std_function.h:590:9
    frame #37: 0x0000aaaadac5d3e0 duckling`std::thread::_State_impl<std::thread::_Invoker<std::tuple<md::BackgroundCronTask::Start(unsigned long)::$_0>>>::_M_run() [inlined] md::BackgroundCronTask::Start(unsigned long)::$_0::operator()(this=0x0000aaaafdf169a8) const at background_cron_task.cpp:25:4
    frame #38: 0x0000aaaadac5d30c duckling`std::thread::_State_impl<std::thread::_Invoker<std::tuple<md::BackgroundCronTask::Start(unsigned long)::$_0>>>::_M_run() [inlined] void std::__invoke_impl<void, md::BackgroundCronTask::Start(unsigned long)::$_0>((null)=<unavailable>, __f=0x0000aaaafdf169a8) at invoke.h:61:14
    frame #39: 0x0000aaaadac5d30c duckling`std::thread::_State_impl<std::thread::_Invoker<std::tuple<md::BackgroundCronTask::Start(unsigned long)::$_0>>>::_M_run() [inlined] std::__invoke_result<md::BackgroundCronTask::Start(unsigned long)::$_0>::type std::__invoke<md::BackgroundCronTask::Start(unsigned long)::$_0>(__fn=0x0000aaaafdf169a8) at invoke.h:96:14
    frame #40: 0x0000aaaadac5d30c duckling`std::thread::_State_impl<std::thread::_Invoker<std::tuple<md::BackgroundCronTask::Start(unsigned long)::$_0>>>::_M_run() [inlined] void std::thread::_Invoker<std::tuple<md::BackgroundCronTask::Start(unsigned long)::$_0>>::_M_invoke<0ul>(this=0x0000aaaafdf169a8, (null)=<unavailable>) at std_thread.h:259:13
    frame #41: 0x0000aaaadac5d30c duckling`std::thread::_State_impl<std::thread::_Invoker<std::tuple<md::BackgroundCronTask::Start(unsigned long)::$_0>>>::_M_run() [inlined] std::thread::_Invoker<std::tuple<md::BackgroundCronTask::Start(unsigned long)::$_0>>::operator()(this=0x0000aaaafdf169a8) at std_thread.h:266:11
    frame #42: 0x0000aaaadac5d30c duckling`std::thread::_State_impl<std::thread::_Invoker<std::tuple<md::BackgroundCronTask::Start(unsigned long)::$_0>>>::_M_run(this=0x0000aaaafdf169a0) at std_thread.h:211:13
    frame #43: 0x0000ffff91a031fc
    frame #44: 0x0000ffff91c0d5c8
```

The problem is that if there's an IO exception being thrown in
`RowGroupCollection::Checkpoint` after some (but not all) checkpoint
tasks have been scheduled but before
`checkpoint_state.executor.WorkOnTasks();` is called, it results in an
InternalException / DuckDB crash as the `Checkpoint ` method does not
wait for the scheduled tasks to have completed before destroying the
referenced resources.
Tmonster pushed a commit that referenced this pull request Feb 17, 2025
We had two users crash with the following backtrace:

```
    frame #0: 0x0000ffffab2571ec
    frame #1: 0x0000aaaaac00c5fc duckling`duckdb::InternalException::InternalException(this=<unavailable>, msg=<unavailable>) at exception.cpp:328:2
    frame #2: 0x0000aaaaac1ee418 duckling`duckdb::optional_ptr<duckdb::OptimisticDataWriter, true>::CheckValid(this=<unavailable>) const at optional_ptr.hpp:34:11
    frame #3: 0x0000aaaaac1eea8c duckling`duckdb::MergeCollectionTask::Execute(duckdb::PhysicalBatchInsert const&, duckdb::ClientContext&, duckdb::GlobalSinkState&, duckdb::LocalSinkState&) [inlined] duckdb::optional_ptr<duckdb::OptimisticDataWriter, true>::operator*(this=<unavailable>) at optional_ptr.hpp:43:3
    frame #4: 0x0000aaaaac1eea84 duckling`duckdb::MergeCollectionTask::Execute(this=0x0000aaaaf1b06150, op=<unavailable>, context=0x0000aaaba820d8d0, gstate_p=0x0000aaab06880f00, lstate_p=<unavailable>) at physical_batch_insert.cpp:219:90
    frame #5: 0x0000aaaaac1d2e10 duckling`duckdb::PhysicalBatchInsert::Sink(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSinkInput&) const [inlined] duckdb::PhysicalBatchInsert::ExecuteTask(this=0x0000aaaafa62ab40, context=<unavailable>, gstate_p=0x0000aaab06880f00, lstate_p=0x0000aab12d442960) const at physical_batch_insert.cpp:425:8
    frame #6: 0x0000aaaaac1d2dd8 duckling`duckdb::PhysicalBatchInsert::Sink(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSinkInput&) const [inlined] duckdb::PhysicalBatchInsert::ExecuteTasks(this=0x0000aaaafa62ab40, context=<unavailable>, gstate_p=0x0000aaab06880f00, lstate_p=0x0000aab12d442960) const at physical_batch_insert.cpp:431:9
    frame #7: 0x0000aaaaac1d2dd8 duckling`duckdb::PhysicalBatchInsert::Sink(this=0x0000aaaafa62ab40, context=0x0000aab2fffd7cb0, chunk=<unavailable>, input=<unavailable>) const at physical_batch_insert.cpp:494:4
    frame #8: 0x0000aaaaac353158 duckling`duckdb::PipelineExecutor::ExecutePushInternal(duckdb::DataChunk&, duckdb::ExecutionBudget&, unsigned long) [inlined] duckdb::PipelineExecutor::Sink(this=0x0000aab2fffd7c00, chunk=0x0000aab2fffd7d30, input=0x0000fffec0aba8d8) at pipeline_executor.cpp:521:24
    frame #9: 0x0000aaaaac353130 duckling`duckdb::PipelineExecutor::ExecutePushInternal(this=0x0000aab2fffd7c00, input=0x0000aab2fffd7d30, chunk_budget=0x0000fffec0aba980, initial_idx=0) at pipeline_executor.cpp:332:23
    frame #10: 0x0000aaaaac34f7b4 duckling`duckdb::PipelineExecutor::Execute(this=0x0000aab2fffd7c00, max_chunks=<unavailable>) at pipeline_executor.cpp:201:13
    frame #11: 0x0000aaaaac34f258 duckling`duckdb::PipelineTask::ExecuteTask(duckdb::TaskExecutionMode) [inlined] duckdb::PipelineExecutor::Execute(this=<unavailable>) at pipeline_executor.cpp:278:9
    frame #12: 0x0000aaaaac34f250 duckling`duckdb::PipelineTask::ExecuteTask(this=0x0000aab16dafd630, mode=<unavailable>) at pipeline.cpp:51:33
    frame #13: 0x0000aaaaac348298 duckling`duckdb::ExecutorTask::Execute(this=0x0000aab16dafd630, mode=<unavailable>) at executor_task.cpp:49:11
    frame #14: 0x0000aaaaac356600 duckling`duckdb::TaskScheduler::ExecuteForever(this=0x0000aaaaf0105560, marker=0x0000aaaaf00ee578) at task_scheduler.cpp:189:32
    frame #15: 0x0000ffffab0a31fc
    frame #16: 0x0000ffffab2ad5c8
```

Core dump analysis showed that the assertion `D_ASSERT(lstate.writer);`
in `MergeCollectionTask::Execute` (i.e. it is crashing because
`lstate.writer` is NULLPTR) was not satisfied when
`PhysicalBatchInsert::Sink` was processing merge tasks from (other)
pipeline executors.

My suspicion is that this is only likely to happen for heavily
concurrent workloads (applicable to the two users which crashed). The
patch submitted as part of this PR has addressed the issue for these
users.
Tmonster pushed a commit that referenced this pull request May 12, 2025
Tmonster pushed a commit that referenced this pull request Dec 3, 2025
…uckdb#19680) (duckdb#19811)

Fixes duckdb#19680

This fixes a bug where queries using `NOT EXISTS` with `IS DISTINCT
FROM` returned incorrect results due to improper handling of NULL
semantics in the optimizer.

The issue was that the optimizer's deliminator incorrectly treated
`DISTINCT FROM` variants the same as regular equality/inequality
comparisons, which have different NULL handling:
  - `IS DISTINCT FROM`: NULL-aware (NULL IS DISTINCT FROM NULL = FALSE)
  - != or =: NULL-unaware (NULL != NULL = NULL, filters out NULLs)


### Incorrect Query Plan

```
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             c2            │
│                           │
│          ~0 rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             #5            │
│__internal_decompress_integ│
│     ral_integer(#3, 1)    │
│             #1            │
│                           │
│          ~0 rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      NESTED_LOOP_JOIN     │
│    ────────────────────   │
│      Join Type: ANTI      │
│    Conditions: c2 != c2   ├──────────────┐
│                           │              │
│          ~0 rows          │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│            NULL           ││            NULL           │
│             #2            ││             #2            │
│            NULL           ││            NULL           │
│             #1            ││             #1            │
│            NULL           ││            NULL           │
│             #0            ││             #0            │
│            NULL           ││            NULL           │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│             #0            ││             #0            │
│__internal_compress_integra││__internal_compress_integra│
│     l_utinyint(#1, 1)     ││     l_utinyint(#1, 1)     │
│             #2            ││             #2            │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│            NULL           ││            NULL           │
│             #0            ││             #0            │
│            NULL           ││            NULL           │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││           FILTER          │
│    ────────────────────   ││    ────────────────────   │
│         Table: t0         ││     (col0 IS NOT NULL)    │
│   Type: Sequential Scan   ││                           │
│      Projections: c2      ││                           │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└───────────────────────────┘└─────────────┬─────────────┘
                             ┌─────────────┴─────────────┐
                             │         SEQ_SCAN          │
                             │    ────────────────────   │
                             │         Table: t0         │
                             │   Type: Sequential Scan   │
                             │      Projections: c2      │
                             │                           │
                             │          ~2 rows          │
                             └───────────────────────────┘
```

  The buggy plan shows two critical issues:
```
  ┌─────────────┴─────────────┐
  │      NESTED_LOOP_JOIN     │
  │      Join Type: ANTI      │
  │    Conditions: c2 != c2   │  ← ❌ Wrong(the join conditions should be c2 IS DISTINCT FROM c2)
  │          ~0 rows          │
  └─────────────┬─────────────┘
                │
                └─────────────┐
                             ┌┴─────────────┐
                             │   FILTER     │
                             │ (col0 IS NOT │  ← ❌ Wrong(the filter should be removed)
                             │    NULL)     │
                             └──────────────┘
```

### Solution

This PR adds proper support for DISTINCT FROM operators throughout the
optimization pipeline:

1. Preserve DISTINCT FROM semantics in join
conversion.(src/optimizer/deliminator.cpp)
```
// NOTE: We should NOT convert DISTINCT FROM to != in general
// Only convert if the ORIGINAL join had != or = (not DISTINCT FROM variants)
if (delim_join.join_type != JoinType::MARK &&
    original_join_comparison != ExpressionType::COMPARE_DISTINCT_FROM &&
    original_join_comparison != ExpressionType::COMPARE_NOT_DISTINCT_FROM) {
    // Safe to convert
}
```
2. Skip NULL filters for DISTINCT FROM
variants.(src/optimizer/deliminator.cpp)
```
// Only add IS NOT NULL filter for regular equality/inequality comparisons
// Do NOT add for DISTINCT FROM variants, as they handle NULL correctly
if (cond.comparison != ExpressionType::COMPARE_NOT_DISTINCT_FROM &&
    cond.comparison != ExpressionType::COMPARE_DISTINCT_FROM) {
    // Add IS NOT NULL filter
}
```
3. Added negation support for COMPARE_DISTINCT_FROM and
COMPARE_NOT_DISTINCT_FROM
    in expression type handling.(src/common/enums/expression_type.cpp)
4. Updated parser to properly negate IS DISTINCT FROM expressions when
wrapped with NOT.
(src/parser/transform/expression/transform_bool_expr.cpp)
5. Added regression test in
test/sql/subquery/exists/test_correlated_exists_with_derived_table.test
Tmonster pushed a commit that referenced this pull request Jan 12, 2026
…0283)

Fix for: duckdblabs/duckdb-internal#6809 ,
duckdb#20086

I would like someone to take a look at this before I run CI, to see if
the fix makes sense.

In ConstantOrNullFunction, there is a bug where if the first loop
iteration is a FLAT_VECTOR, the result validity mask is created as a
reference to the validity mask of args.data[idx]. If the subsequent
iteration is the default branch (say, a DICTIONARY_VECTOR), and we call
result_mask.SetInvalid(i), this is now overwriting the validity mask of
the first input column where the reference was created.

I believe the fix for this is to call EnsureWritable in the FLAT_VECTOR
case, to make sure the validity mask is not a reference to the input's
validity mask before we call

```cpp
result_mask.Combine(input_mask, args.size()) 
```
(which is where the alias is actually created). 

The reproducer hits this case -- a specific scenario of unique index +
update + no checkpointing was leading to the this scenario.

For reference, here is the query plan of the last query in the
reproducer, where the bug was occuring. The t1.c0 column is being passed
as a FLAT_VECTOR to constantOrNullFunction, and the t0.c1 column is
being passed in as a dictionary vector. Since the argument at index 1 in
ConstantOrNullFunction is the c0 column in the output, we were
overwriting NULLs into the ouput since the filter was overwriting the
validity mask in ConstantOrNullFunction:

```
┌───────┬───────┬───────┐
│  c0   │  c0   │  c1   │
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│  NULL │     1 │  NULL │
│  NULL │    -1 │  NULL │
└───────┴───────┴───────┘
```

Whereas it should be: 

```
┌───────┬───────┬───────┐
│  c0   │  c0   │  c1   │
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│     0 │     1 │  NULL │
│  NULL │    -1 │  NULL │
└───────┴───────┴───────┘
```

```┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││               Total Time: 9.18s              ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             c0            │
│             c0            │
│             c1            │
│                           │
│           2 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             #3            │
│             #7            │
│            #11            │
│                           │
│           2 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│  (constant_or_null(false, │
│      c0, c1) IS NULL)     │
│                           │
│           2 rows          │
│          (1.82s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            NULL           │
│             #6            │
│            NULL           │
│             #5            │
│            NULL           │
│             #4            │
│            NULL           │
│             #3            │
│            NULL           │
│             #2            │
│            NULL           │
│             #1            │
│            NULL           │
│             #0            │
│            NULL           │
│                           │
│           2 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            NULL           │
│             #2            │
│            NULL           │
│             #1            │
│            NULL           │
│             #0            │
│            NULL           │
│                           │
│           2 rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      POSITIONAL_SCAN      │
│    ────────────────────   │
│           2 rows          ├──────────────┐
│          (7.30s)          │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         TABLE_SCAN        ││         TABLE_SCAN        │
│    ────────────────────   ││    ────────────────────   │
│         Table: t1         ││         Table: t0         │
│   Type: Sequential Scan   ││   Type: Sequential Scan   │
│      Projections: c0      ││                           │
│                           ││        Projections:       │
│                           ││             c1            │
│                           ││             c0            │
│                           ││                           │
│           0 rows          ││           0 rows          │
│          (0.00s)          ││          (0.00s)          │
└───────────────────────────┘└───────────────────────────┘
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant