
Query condition cache: introduce selectivity threshold#86076

Merged
rschu1ze merged 5 commits into ClickHouse:master from zhongyuankai:qcc_improvement
Sep 17, 2025

Conversation

@zhongyuankai
Contributor

@zhongyuankai zhongyuankai commented Aug 23, 2025

This is a revert of #85965, which reverted #85253 because of a performance regression.

This PR improves performance by reducing lock contention.

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added setting query_condition_cache_selectivity_threshold (default value: 1.0) which excludes scan results of predicates with low selectivity from insertion into the query condition cache. This makes it possible to reduce the memory consumption of the query condition cache at the cost of a worse cache hit rate.
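A minimal sketch of the gate the changelog entry describes, with illustrative names (not ClickHouse's actual API): a predicate's scan result is cached only if the fraction of matching marks does not exceed the threshold, so the default of 1.0 caches everything.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical selectivity gate: cache the scan result only if the fraction
// of potentially-matching marks does not exceed the configured threshold.
// With the default threshold of 1.0, every result is cached.
bool shouldCacheScanResult(const std::vector<bool> & matching_marks, double selectivity_threshold)
{
    if (matching_marks.empty())
        return true;
    std::size_t matching = 0;
    for (bool mark : matching_marks)
        matching += mark ? 1 : 0;
    double selectivity = static_cast<double>(matching) / matching_marks.size();
    // A predicate that filters out almost nothing (selectivity close to 1.0)
    // produces a cache entry that prunes little data; skip caching it.
    return selectivity <= selectivity_threshold;
}
```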

@zhongyuankai

This comment was marked as resolved.

@rschu1ze rschu1ze self-assigned this Aug 25, 2025
@rschu1ze rschu1ze added the can be tested Allows running workflows for external contributors label Aug 25, 2025
@clickhouse-gh
Contributor

clickhouse-gh bot commented Aug 25, 2025

Workflow [PR], commit [557014a]

Summary:

job_name: Stateless tests (amd_binary, old analyzer, s3 storage, DatabaseReplicated, parallel)
status: failure
failed test: 01201_read_single_thread_in_order (FAIL)

@clickhouse-gh clickhouse-gh bot added the pr-improvement Pull request with some product improvements label Aug 25, 2025
@rschu1ze

This comment was marked as resolved.

@rschu1ze

This comment was marked as resolved.

@zhongyuankai zhongyuankai force-pushed the qcc_improvement branch 2 times, most recently from deb1803 to ced5a0e Compare August 27, 2025 02:27
@rschu1ze rschu1ze force-pushed the qcc_improvement branch 2 times, most recently from 47386db to e088409 Compare August 28, 2025 17:05
/// Marks potentially need to be updated. We don't know for sure because another thread could have updated them in the meantime.
/// Acquire a write lock and attempt the update.
bool is_insert = false;
if (!need_update_marks)
Member


I don't understand why we need to code in l. 146-172. Could you explain a bit more? Is it really needed for correctness or performance?

Contributor Author


  1. When the query QPS is very high and the query condition cache already contains the complete result, there is no need to continue collecting matching marks. This helps avoid frequent acquisition of the mutex and the cache_entry->mutex write lock, which improves performance.
  2. For queries with LIMIT, execution may terminate early before all parts are fully scanned, leaving incomplete results in the query condition cache. This piece of code fills in the missing marks.

@zhongyuankai zhongyuankai force-pushed the qcc_improvement branch 3 times, most recently from 9cc4598 to 1f86902 Compare August 29, 2025 06:41
@zhongyuankai

This comment was marked as resolved.

@rschu1ze

This comment was marked as resolved.

@zhongyuankai

This comment was marked as resolved.

@rschu1ze
Member

@zhongyuankai Sorry for the delay. I have too many other things going on. I'll try to check today.

@rschu1ze
Member

@zhongyuankai I needed to simplify the logic in QueryConditionCacheWriter::addRanges because it looked too complex and I did not understand it (in particular the new dirty bit). Please let me know if you think the rewritten logic is wrong in any way.

The PR message says:

This is a revert of #85965 which reverted #85253 because of a performance regression.

This time I wanted to be more careful with regards to performance regressions.

I ran ClickBench on a current master (on my 48 vCPUs system):

[0.019, 0.001, 0.001],
[0.102, 0.006, 0.006],
[0.045, 0.012, 0.012],
[0.719, 0.015, 0.015],
[1.600, 0.354, 0.337],
[1.785, 0.301, 0.257],
[0.014, 0.007, 0.007],
[0.022, 0.009, 0.010],
[1.502, 0.365, 0.362],
[1.932, 0.379, 0.380],
[0.672, 0.108, 0.104],
[1.265, 0.117, 0.114],
[2.474, 0.430, 0.428],
[3.478, 0.428, 0.421],
[2.130, 0.342, 0.357],
[0.952, 0.471, 0.466],
[4.193, 1.405, 1.335],
[3.365, 0.480, 0.488],
[7.972, 2.592, 2.478],
[0.094, 0.006, 0.006],
[18.434, 0.174, 0.172],
[20.875, 0.061, 0.064],
[25.084, 0.284, 0.275],
[12.371, 0.726, 0.716],
[4.528, 0.086, 0.086],
[2.214, 0.077, 0.079],
[5.191, 0.085, 0.087],
[18.534, 0.412, 0.401],
[15.191, 1.974, 2.013],
[0.058, 0.031, 0.031],
[3.679, 0.183, 0.185],
[9.791, 0.324, 0.327],
[8.869, 3.239, 3.030],
[19.144, 1.970, 1.912],
[19.125, 1.966, 1.908],
[0.298, 0.247, 0.109],
[0.060, 0.038, 0.038],
[0.037, 0.024, 0.024],
[0.040, 0.018, 0.019],
[0.085, 0.064, 0.065],
[0.032, 0.012, 0.013],
[0.029, 0.010, 0.010],
[0.023, 0.008, 0.008],

Then I ran ClickBench on this PR:

[0.034, 0.001, 0.001],
[0.144, 0.006, 0.006],
[0.044, 0.012, 0.012],
[0.743, 0.015, 0.015],
[1.598, 0.338, 0.362],
[1.784, 0.251, 0.250],
[0.015, 0.007, 0.007],
[0.027, 0.009, 0.008],
[1.540, 0.380, 0.364],
[1.969, 0.382, 0.388],
[0.673, 0.108, 0.107],
[1.307, 0.118, 0.119],
[2.452, 0.421, 0.416],
[3.472, 0.410, 0.406],
[2.114, 0.231, 0.356],
[1.019, 0.433, 0.455],
[4.170, 1.343, 1.331],
[3.350, 0.470, 0.484],
[8.086, 2.720, 2.479],
[0.085, 0.018, 0.006],
[18.447, 0.176, 0.314],
[20.752, 0.371, 0.362],
[24.601, 0.509, 0.522],
[12.495, 0.753, 0.707],
[4.560, 0.086, 0.086],
[2.238, 0.077, 0.074],
[5.170, 0.087, 0.091],
[18.588, 0.410, 0.404],
[15.200, 1.978, 1.999],
[0.066, 0.031, 0.031],
[3.666, 0.185, 0.189],
[9.781, 0.318, 0.317],
[8.817, 3.197, 3.176],
[19.146, 1.626, 1.924],
[19.113, 1.653, 1.608],
[0.154, 0.111, 0.234],
[0.061, 0.038, 0.038],
[0.043, 0.027, 0.023],
[0.041, 0.022, 0.018],
[0.091, 0.063, 0.060],
[0.038, 0.013, 0.012],
[0.029, 0.010, 0.010],
[0.025, 0.008, 0.009],

No performance deterioration.

PS: Let's try to avoid force-pushing PRs once someone reviewed them. A force-push potentially rewrites the complete commit history and the reviewer needs to start reviewing from scratch (in principle).

@zhongyuankai
Contributor Author

@zhongyuankai I needed to simplify the logic in QueryConditionCacheWriter::addRanges because it looked too complex and I did not understand it (in particular the new dirty bit). Please let me know if you think the rewritten logic is wrong in any way.

The introduction of the dirty bit is mainly to avoid fetching results from the query cache every time. Instead, we fetch once and cache them in new_entries. If marks are updated during execution, we set the dirty flag and write them back to the query condition cache at the end. Perhaps we can first address the performance issue and then work on simplifying the code.
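The dirty-bit scheme described here can be sketched as follows, with hypothetical names (not the actual QueryConditionCacheWriter): the entry is fetched from the shared cache once and buffered locally, updates set a dirty flag, and the write-back happens only if something actually changed, so a query that finds the cached result already complete never writes.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

using Marks = std::vector<bool>; // true = mark may match

// Hypothetical shared cache; write_count tracks expensive write-backs.
struct SharedCache
{
    std::unordered_map<std::size_t, Marks> entries;
    std::size_t write_count = 0;

    void write(std::size_t key, const Marks & marks)
    {
        entries[key] = marks;
        ++write_count;
    }
};

struct Writer
{
    std::size_t key;
    Marks local_marks; // fetched once, mutated locally
    bool dirty = false;

    Writer(const SharedCache & cache, std::size_t key_, std::size_t num_marks)
        : key(key_)
    {
        auto it = cache.entries.find(key);
        local_marks = (it != cache.entries.end()) ? it->second : Marks(num_marks, true);
    }

    void clearRange(std::size_t begin, std::size_t end)
    {
        for (std::size_t i = begin; i < end; ++i)
            if (local_marks[i])
            {
                local_marks[i] = false;
                dirty = true; // remember that a write-back is needed
            }
    }

    void finalize(SharedCache & cache)
    {
        if (dirty)
            cache.write(key, local_marks); // skipped when nothing changed
    }
};
```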

No performance deterioration.

I think the key to improving performance is to leverage the results already cached in the query cache to check whether the marks need to be updated. Under high QPS, the query condition cache already contains the complete results, so there is no need to write them repeatedly, which also helps avoid frequent write lock acquisitions.

If only this PR #80247 is introduced, it will not improve performance, as your tests have shown, because marks almost always need to be updated.

Perhaps you could rerun ClickBench based on this commit; my understanding is that it brings significant performance improvements.

PS: Let's try to avoid force-pushing PRs once someone reviewed them. A force-push potentially rewrites the complete commit history and the reviewer needs to start reviewing from scratch (in principle).

Okay, I understand.

@rschu1ze
Member

@zhongyuankai Thanks for the reply.

The introduction of the dirty bit is mainly to avoid fetching results from the query cache every time.

I wonder where in the current PR (my last commit) we "fetch results from the query cache every time". Do you mean function QueryConditionCache::read?

I think the key to improving performance is to leverage the results already cached in the query cache to check whether the marks need to be updated. Under high QPS, the query condition cache already contains the complete results, so there is no need to write them repeatedly, which also helps avoid frequent write lock acquisitions.

This confuses me. As far as I understand, each query has its own QueryConditionCacheWriter. At the end of the query, QueryConditionCacheWriter::finalize is called and the new entry is persisted in the cache. What you describe makes sense if, during ::finalize, an entry with the same key already exists in the cache. That would contradict the assumption that each query has its own QueryConditionCacheWriter.

If only this PR #80247 is introduced, it will not improve performance, as your tests have shown, because marks almost always need to be updated.

The goal of this PR is not to improve performance; the goal is to not deteriorate performance compared to current master (which was the reason this same PR was previously reverted). To implement further performance improvements, we can make follow-up PRs on top.

@rschu1ze rschu1ze added this pull request to the merge queue Sep 17, 2025
Merged via the queue into ClickHouse:master with commit 1f99292 Sep 17, 2025
121 of 123 checks passed
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Sep 17, 2025
@zhongyuankai
Contributor Author

@rschu1ze Thanks for the reply.

I wonder where in the current PR (my last commit) we "fetch results from the query cache every time". Do you mean function QueryConditionCache::read?

No, I was referring to this line of code

This confuses me. As far as I understand, each query has its own QueryConditionCacheWriter. At the end of the query, QueryConditionCacheWriter::finalize is called and the new entry is persisted in the cache. What you describe makes sense if - during ::finalize, an entry with the same key exists in the cache already. That would contradict the assumption that each query has its own QueryConditionCacheWriter.

I mainly want to address the case where an entry with the same key already exists in the cache. In QueryConditionCacheWriter::addRanges, I aim to minimize the time the write lock is held. I will submit a PR for this.

This seems somewhat contradictory. Do you have a better idea?

@rschu1ze
Member

Nevermind, let's do this in a separate PR (you can add me as reviewer). Thanks!

@rschu1ze
Member

According to internal performance testing, this PR introduced regressions in ClickBench queries Q20 (by 2x), Q21 (by 8x) and Q22 (by 3x) despite the lockless implementation (compared to the original implementation #86076) and my performance testing (here). I have no idea what caused this and how to check deeper (perhaps try to reproduce with ClickBench again, maybe #87337 will help), but for now I need to revert this PR as it blocks the 25.9 release.

@zhongyuankai
Contributor Author

According to internal performance testing, this PR introduced regressions in ClickBench queries Q20 (by 2x), Q21 (by 8x) and Q22 (by 3x) despite the lockless implementation (compared to the original implementation #86076) and my performance testing (here). I have no idea what caused this and how to check deeper (perhaps try to reproduce with ClickBench again, maybe #87337 will help), but for now I need to revert this PR as it blocks the 25.9 release.

I think introducing PR #87337 can resolve the performance regression.

@zhongyuankai
Contributor Author

@rschu1ze Can we run ClickBench again based on PR #87337?

@rschu1ze
Member

rschu1ze commented Sep 23, 2025

I think introducing PR #87337 can resolve the performance regression.

Could be. #87337 will nevertheless need some time to review and test. We can do it on top of a revert of the revert (which I am preparing right now).

@rschu1ze
Member

@rschu1ze Can we run ClickBench again based on PR #87337?

Sure we can. What confused me is that our internal ClickBench tests showed a regression but my testing (above) did not.

@zhongyuankai
Contributor Author

I think introducing PR #87337 can resolve the performance regression.

Could be. #87337 will nevertheless need some time to review and test. We can do it on top of a revert of the revert (which I am preparing right now).

Ok.

@alexey-milovidov
Member

Is there any real scenario in which the memory consumption of the query condition cache is a problem?

@rschu1ze
Member

Is there any real scenario in which the memory consumption of the query condition cache is a problem?

There is none. The real point of this PR is its new QueryConditionCacheWriter abstraction. It gets rid of dirty in-place modification of existing cache entries in QueryConditionCache::write (cache entries are supposed to be immutable).
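The "immutable cache entries" discipline mentioned here can be sketched as follows (hypothetical names, not ClickHouse's actual code): a writer never modifies a published entry in place; it builds a fresh entry and swaps the shared_ptr in the map, so readers that already hold the old pointer keep a consistent snapshot.

```cpp
#include <cstddef>
#include <memory>
#include <unordered_map>
#include <vector>

using Marks = std::vector<bool>;

// Hypothetical cache mapping keys to immutable mark sets.
struct Cache
{
    std::unordered_map<std::size_t, std::shared_ptr<const Marks>> entries;

    std::shared_ptr<const Marks> read(std::size_t key) const
    {
        auto it = entries.find(key);
        return it == entries.end() ? nullptr : it->second;
    }

    void write(std::size_t key, Marks marks)
    {
        // Publish a brand-new immutable entry instead of mutating the old one.
        entries[key] = std::make_shared<const Marks>(std::move(marks));
    }
};
```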


Labels

can be tested: Allows running workflows for external contributors
pr-improvement: Pull request with some product improvements
pr-synced-to-cloud: The PR is synced to the cloud repo


4 participants