
HSE course work: feature/in filter operator#79648

Open
afigor2701 wants to merge 7 commits into ClickHouse:master from afigor2701:feature/in_filter_operator

@afigor2701

@afigor2701 afigor2701 commented Apr 27, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Adding new IN_BLOOM_FILTER and IN_CUCKOO_FILTER operators

This PR closes the 'Probabilistic data structures for filtering' issue from #71175.

Documentation entry for user-facing changes

The idea of this task is to provide a probabilistic alternative to the IN (subquery) operator using a bloom filter, a counting bloom filter (to check for elements that likely appeared multiple times), a cuckoo filter, a quotient filter, and a vacuum filter, and to compare all of these algorithms.

The applications are cohort analysis and antifraud.
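To make the idea concrete, here is a minimal sketch of how a bloom filter can stand in for an exact set when answering IN-style membership checks. It is an illustration only and not the code from this PR: the class name `BloomFilterSketch`, its methods, and the double-hashing scheme are assumptions chosen for brevity. The key property it shows is that a key that was added is never reported as absent, while a key that was never added may occasionally be reported as present.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical illustration; not the SetFilter class from the PR.
class BloomFilterSketch
{
public:
    BloomFilterSketch(std::size_t num_bits, std::size_t num_hashes)
        : bits(num_bits, false), hashes(num_hashes)
    {
        assert(num_bits > 0 && num_hashes > 0);
    }

    // Build side: corresponds to inserting the values of the IN subquery.
    void add(const std::string & key)
    {
        for (std::size_t i = 0; i < hashes; ++i)
            bits[position(key, i)] = true;
    }

    // Probe side: false means "definitely not in the set",
    // true means "possibly in the set" (false positives can occur).
    bool mayContain(const std::string & key) const
    {
        for (std::size_t i = 0; i < hashes; ++i)
            if (!bits[position(key, i)])
                return false;
        return true;
    }

private:
    // Double hashing: the i-th bit position is derived from two base hashes.
    std::size_t position(const std::string & key, std::size_t i) const
    {
        const std::uint64_t h1 = std::hash<std::string>{}(key);
        // Second hash: a 64-bit mixer applied to h1 (assumed scheme for this sketch).
        std::uint64_t h2 = h1 ^ 0x9e3779b97f4a7c15ULL;
        h2 ^= h2 >> 30;
        h2 *= 0xbf58476d1ce4e5b9ULL;
        h2 ^= h2 >> 27;
        h2 *= 0x94d049bb133111ebULL;
        h2 ^= h2 >> 31;
        return static_cast<std::size_t>((h1 + i * (h2 | 1)) % bits.size());
    }

    std::vector<bool> bits;
    std::size_t hashes;
};
```

A caller would add every value produced by the subquery and then call `mayContain` for each probed row. Because of false positives, the result is a superset of what an exact IN would return, which is the trade-off the task sets out to evaluate.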

@CLAassistant

CLAassistant commented Apr 27, 2025

CLA assistant check
All committers have signed the CLA.

@afigor2701 afigor2701 changed the title from "Feature/in filter operator" to "HSE course work: feature/in filter operator" Apr 27, 2025
@Femistoclus

Great! Good work! 👏🏼👏🏼👏🏼

@alexey-milovidov alexey-milovidov added the "can be tested" label (Allows running workflows for external contributors) Apr 27, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Apr 27, 2025

Workflow [PR], commit [6919904]

@clickhouse-gh clickhouse-gh bot added the "pr-feature" label (Pull request with new product feature) Apr 27, 2025
@hanfei1991
Member

Could you please write some tests and documents for the new function?

@shankar-iyer shankar-iyer self-assigned this Apr 28, 2025
@shankar-iyer
Member

@afigor2701 Please check the comment from hanfei1991 and check the indentation/style of Set.h

Task [Style check] failed.
  Task [cpp] failed.
    cpp:
        ./src/Interpreters/Set.h:  SetFilter(
        ./src/Interpreters/Set.h:  void initSetVariant(const ColumnRawPtrs & key_columns) override;
        ./src/Interpreters/Set.h:  double targetFPR;

@afigor2701 afigor2701 force-pushed the feature/in_filter_operator branch from 2b5445a to ad818b8 Compare May 14, 2025 21:42
@shankar-iyer
Member

Can you please add documentation for the new functions inBloomFilter, inCuckooFilter, and inVacuumFilter? For reference, you can check how functions are documented under ClickHouse/docs/en/sql-reference/functions. The new functions can go into in-functions.md.

A test case is also needed; please check the examples in ClickHouse/tests/queries/0_stateless.

@shankar-iyer
Member

@afigor2701 Just checking, are you currently working on this? I took the PR for a spin and found a couple of quick issues: 1) There is a crash if the query is processed by multiple threads (no crash if max_threads = 1). 2) For the bloom filter, I am getting too many positives (e.g. only 10 rows should match with the normal IN operator, but with IN_BLOOM_FILTER I am getting 52 matches). Are there parameters to size the bloom filter?
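On the sizing question: for a standard bloom filter holding n items with a target false positive rate p, the usual textbook choices are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, and the expected false positive rate after inserting n items is roughly (1 - e^(-kn/m))^k. The sketch below applies those formulas. The names `BloomParams`, `sizeBloomFilter`, and `expectedFpr` are hypothetical, and whether the PR's `targetFPR` field is used this way is not stated in this conversation, so treat it as one possible way to check whether a filter is undersized rather than a diagnosis of the reported 52 matches.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper types and functions; not part of the PR.
struct BloomParams
{
    std::size_t num_bits;
    std::size_t num_hashes;
};

// Choose the bit count and hash count for an expected number of items
// and a target false positive rate, using the standard formulas.
BloomParams sizeBloomFilter(std::size_t expected_items, double target_fpr)
{
    assert(expected_items > 0 && target_fpr > 0.0 && target_fpr < 1.0);
    const double ln2 = std::log(2.0);
    const double n = static_cast<double>(expected_items);
    const double bits = -n * std::log(target_fpr) / (ln2 * ln2);
    const double hashes = bits / n * ln2;

    BloomParams params;
    params.num_bits = static_cast<std::size_t>(std::ceil(bits));
    params.num_hashes = std::max<std::size_t>(1, static_cast<std::size_t>(std::lround(hashes)));
    return params;
}

// Approximate false positive rate once inserted_items keys are in the filter.
double expectedFpr(const BloomParams & params, std::size_t inserted_items)
{
    const double k = static_cast<double>(params.num_hashes);
    const double n = static_cast<double>(inserted_items);
    const double m = static_cast<double>(params.num_bits);
    return std::pow(1.0 - std::exp(-k * n / m), k);
}
```

For example, `sizeBloomFilter(1000, 0.01)` asks for roughly 9.6 bits per item and 7 hash functions. Feeding the same parameters to `expectedFpr` with a much larger item count shows how quickly the false positive rate grows when the filter is sized for fewer items than it actually receives.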

@afigor2701 afigor2701 closed this Jun 3, 2025
@afigor2701 afigor2701 reopened this Jun 3, 2025
@afigor2701
Author

> @afigor2701 Just checking, are you currently working on this? I took the PR for a spin and found a couple of quick issues: 1) There is a crash if the query is processed by multiple threads (no crash if max_threads = 1). 2) For the bloom filter, I am getting too many positives (e.g. only 10 rows should match with the normal IN operator, but with IN_BLOOM_FILTER I am getting 52 matches). Are there parameters to size the bloom filter?

Yes, I am going to continue working on this issue. Thank you for the comments; I will try to fix these issues.

@clickhouse-gh
Contributor

clickhouse-gh bot commented Jul 8, 2025

Dear @shankar-iyer, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@shankar-iyer shankar-iyer self-assigned this Jul 8, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Aug 12, 2025

Dear @shankar-iyer, this PR hasn't been updated for a while. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

