Support bloom_filter usage for "has" on const array #83945

Merged
shankar-iyer merged 2 commits into ClickHouse:master from dorki:has_const_bf_pr
Jul 21, 2025
dorki commented Jul 17, 2025

Changelog category (leave one):

  • Performance Improvement

Changelog entry

The bloom filter index is now used for conditions like has([c1, c2, ...], column), where column is not of an Array type.
This improves performance for such queries, making them as efficient as the IN operator.

Motivation

This change extends the power of bloom filter indexes to a common query pattern.
Previously, to use a bloom filter on a scalar column, users had to write column IN (c1, c2).
Now, they can also use has([c1, c2], column) syntax and receive the same performance benefit, allowing the query to skip data granules that don't contain the relevant values.
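
As a minimal illustration of why the two forms are interchangeable for a scalar value, the query below evaluates both predicates on a literal. The literal values are arbitrary and only serve the example.

```sql
-- Both predicates test membership of a scalar in a constant list.
SELECT
    has([123, 456, 789], toUInt64(456)) AS via_has,  -- 1: the value is in the array
    toUInt64(456) IN (123, 456, 789)    AS via_in;   -- 1: the value is in the set
```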

Example use

Given a table with a bloom filter index on a non-Array column:

CREATE TABLE users (
    user_id UInt64,
    name String,
    INDEX bf_idx user_id TYPE bloom_filter
) ENGINE = MergeTree ORDER BY user_id;

The following query will now efficiently use the bf_idx index to filter granules, whereas previously it would have resulted in a full table scan:

SELECT name FROM users WHERE has([123, 456, 789], user_id);
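
One way to check whether the index is actually applied is to compare the plans of the two forms with `EXPLAIN indexes = 1`. This is a sketch, not part of the PR: the `INSERT` only generates hypothetical sample data so the table spans several granules, and the result depends on the server version, since this change was later reverted in #84142.

```sql
-- Hypothetical sample data so that the table contains more than one granule.
INSERT INTO users SELECT number, toString(number) FROM numbers(100000);

-- Form added by this PR: look for a "Skip" section naming bf_idx in the output.
EXPLAIN indexes = 1
SELECT name FROM users WHERE has([123, 456, 789], user_id);

-- Pre-existing form, for comparison.
EXPLAIN indexes = 1
SELECT name FROM users WHERE user_id IN (123, 456, 789);
```

Because the table is ordered by `user_id`, the `IN` form may also be narrowed by the primary key, so the granule counts of the two plans are not necessarily identical.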

dorki (Author) commented Jul 17, 2025

I hope I haven’t bitten off more than I can chew with this PR 😅. From what I understand, has(const, column) behaves somewhat similarly to hasAny, in the sense that it requires creating a column on the fly. I was unsure whether I should extract a common static method for creating that column, as the logic feels duplicated.

Also, I’m not fully certain whether the RPNElement should be set to FUNCTION_IN or FUNCTION_HAS_ANY, since both seem to be handled the same way later in mayBeTrueOnGranule. I’d really appreciate any guidance or corrections on this.

alexey-milovidov added the "can be tested" label (Allows running workflows for external contributors) Jul 17, 2025
clickhouse-gh bot (Contributor) commented Jul 17, 2025

Workflow [PR], commit [f99cf3b]

Summary:

clickhouse-gh bot added the "pr-performance" label (Pull request with some performance improvements) Jul 17, 2025
shankar-iyer self-assigned this Jul 18, 2025
shankar-iyer (Member) commented

I am reviewing the PR. To avoid test failures caused by variations in the plan output, could you please record only the lines "Description: bloom_filter.." and "Granules:.." in the reference file? Please take a look at e.g. -
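
A sketch of one way to follow this suggestion, assuming the `users` table from the PR description: wrap the `EXPLAIN` in a subquery and keep only the matching lines, so that unrelated plan changes do not alter the reference file. The exact filter used in the PR's test is not shown on this page, so the `LIKE` patterns here are an assumption.

```sql
-- Keep only the skip-index description and the granule counts from the plan.
SELECT trimLeft(explain)
FROM
(
    EXPLAIN indexes = 1
    SELECT name FROM users WHERE has([123, 456, 789], user_id)
)
WHERE explain LIKE '%Description: bloom_filter%'
   OR explain LIKE '%Granules:%';
```

Note that the `Granules:` pattern also matches the granule count reported for the primary key, if that section is present in the plan.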

shankar-iyer (Member) commented

The 2 failures (Stateless tests (amd_tsan) and Stateless tests (amd_ubsan)) were not seen in the rerun.

shankar-iyer added this pull request to the merge queue Jul 21, 2025
github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 21, 2025
shankar-iyer added this pull request to the merge queue Jul 21, 2025
Merged via the queue into ClickHouse:master with commit dd4bf83 Jul 21, 2025
238 of 242 checks passed
robot-ch-test-poll1 added the "pr-synced-to-cloud" label (The PR is synced to the cloud repo) Jul 21, 2025
azat (Member) commented Jul 21, 2025

Reverted in #84142
