Additional primary key scan after using skip index for query with FINAL clause. by shankar-iyer · Pull Request #78350 · ClickHouse/ClickHouse

shankar-iyer · 2025-03-27T03:50:46Z

Resolves #70292, #31411 and #34243. This change makes a SELECT query with a FINAL clause return correct results on a ReplacingMergeTree table when a skip index is used. It should also significantly improve the performance of FINAL queries and reduce memory usage. The solution takes the initial set of PK ranges retrieved from the skip index and then finds the matching PK ranges across all parts.

Earlier version of this PR: #70210

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

New setting introduced: use_skip_indexes_if_final_exact_mode. If a query on a ReplacingMergeTree table has a FINAL clause, reading only the table ranges selected by skip indexes may produce an incorrect result. This setting ensures that correct results are returned by scanning newer parts that overlap with the primary key ranges returned by the skip index. Set to 0 to disable, 1 to enable.
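
To illustrate why the extra primary key scan matters, here is a minimal Python sketch of the idea, not the actual ClickHouse implementation. Everything in it is an assumption made for illustration: parts are modelled as small sorted lists of `(id, v)` rows, later parts are treated as newer, the granule size is 2, and the helper names (`skip_index_select`, `expand_by_pk`, `final_then_filter`) are hypothetical. For simplicity it rescans all parts rather than only newer ones.

```python
GRANULE = 2  # rows per granule (tiny on purpose, for illustration only)

# Each part is a list of (id, v) rows sorted by the primary key `id`.
# Parts later in the list are treated as newer and win under FINAL.
parts = [
    [(1, 10), (2, 33), (3, 10), (4, 10)],  # older part: id=2 has v=33
    [(2, 99), (3, 99), (5, 33), (6, 10)],  # newer part: id=2 replaced by v=99
]

def granules(part):
    """Split a part into fixed-size granules."""
    return [part[i:i + GRANULE] for i in range(0, len(part), GRANULE)]

def skip_index_select(part, value):
    """minmax-style skip index: keep granules whose [min(v), max(v)] covers value."""
    return [g for g in granules(part)
            if min(r[1] for r in g) <= value <= max(r[1] for r in g)]

def expand_by_pk(selected, all_parts):
    """Add every granule, in any part, whose PK range intersects a selected granule."""
    pk_ranges = [(g[0][0], g[-1][0]) for _, g in selected]
    expanded = []
    for part_no, part in enumerate(all_parts):
        for g in granules(part):
            lo, hi = g[0][0], g[-1][0]
            if any(lo <= r_hi and r_lo <= hi for r_lo, r_hi in pk_ranges):
                expanded.append((part_no, g))
    return expanded

def final_then_filter(selected, value):
    """Emulate FINAL (newest row per id wins), then apply WHERE v = value."""
    latest = {}
    for part_no, g in selected:
        for id_, v in g:
            if id_ not in latest or part_no >= latest[id_][0]:
                latest[id_] = (part_no, v)
    return sorted(id_ for id_, (_, v) in latest.items() if v == value)

# Granules picked by the skip index alone, tagged with their part number.
naive = [(n, g) for n, p in enumerate(parts) for g in skip_index_select(p, 33)]

if __name__ == "__main__":
    print(final_then_filter(naive, 33))                       # [2, 5]: stale id=2 leaks through
    print(final_then_filter(expand_by_pk(naive, parts), 33))  # [5]: matches a full scan
```

In this toy data the skip index drops the granule holding the newer `id=2` row, so FINAL never sees the replacement. Expanding the selection by primary key range brings that granule back in.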

clickhouse-gh · 2025-03-27T04:13:19Z

Workflow [PR], commit [6b0d003]

shankar-iyer · 2025-03-27T04:13:58Z

Greetings @nickitat! I have incorporated the review feedback from #70210 and raised this PR. Regarding the suggestion not to introduce a new index type in EXPLAIN, please check the outputs below comparing PrimaryKeyExpand and Skip. In the case of Skip, we cannot fill in an index name because multiple skip indexes could have been used to perform the filtering, so we would need to fill in something like <internal>. I can also improve the name and description for PrimaryKeyExpand. Let me know your opinion; this PR currently retains PrimaryKeyExpand.

EXPLAIN indexes = 1
SELECT count(id1)
FROM xtx
FINAL
WHERE v = 688899949
SETTINGS use_skip_indexes_if_final_exact_mode = 1, use_skip_indexes_if_final = 1

Query id: be54c2c3-393f-43fc-9794-dfd549b60545

    ┌─explain────────────────────────────────────────────────────────────┐
 1. │ Expression ((Project names + Projection))                          │
 2. │   Aggregating                                                      │
 3. │     Expression (Before GROUP BY)                                   │
 4. │       Filter ((WHERE + Change column names to column identifiers)) │
 5. │         ReadFromMergeTree (default.xtx)                            │
 6. │         Indexes:                                                   │
 7. │           PrimaryKey                                               │
 8. │             Condition: true                                        │
 9. │             Parts: 5/5                                             │
10. │             Granules: 7812502/7812502                              │
11. │           Skip                                                     │
12. │             Name: secondaryidx                                     │
13. │             Description: minmax GRANULARITY 1                      │
14. │             Parts: 3/5                                             │
15. │             Granules: 3/7812502                                    │
16. │           PrimaryKeyExpand                                         │
17. │             Description: PrimaryKeyExpandForFinal                  │
18. │             Parts: 5/3                                             │
19. │             Granules: 14/3                                         │
    └────────────────────────────────────────────────────────────────────┘

changed to

EXPLAIN indexes = 1
SELECT count(id1)
FROM xtx
FINAL
WHERE v = 688899949
SETTINGS use_skip_indexes_if_final_exact_mode = 1, use_skip_indexes_if_final = 1

Query id: be54c2c3-393f-43fc-9794-dfd549b60545

    ┌─explain────────────────────────────────────────────────────────────┐
 1. │ Expression ((Project names + Projection))                          │
 2. │   Aggregating                                                      │
 3. │     Expression (Before GROUP BY)                                   │
 4. │       Filter ((WHERE + Change column names to column identifiers)) │
 5. │         ReadFromMergeTree (default.xtx)                            │
 6. │         Indexes:                                                   │
 7. │           PrimaryKey                                               │
 8. │             Condition: true                                        │
 9. │             Parts: 5/5                                             │
10. │             Granules: 7812502/7812502                              │
11. │           Skip                                                     │
12. │             Name: secondaryidx                                     │
13. │             Description: minmax GRANULARITY 1                      │
14. │             Parts: 3/5                                             │
15. │             Granules: 3/7812502                                    │
16. │           Skip                                                     │
17. │             Name: <internal>                                       │
18. │             Description: Find primary keys in all parts for FINAL  │
19. │             Parts: 5/3                                             │
20. │             Granules: 14/3                                         │
    └────────────────────────────────────────────────────────────────────┘

shankar-iyer · 2025-03-27T10:18:04Z

There are 5 "AST fuzzer" failures and 3 other failures. I checked the "AST fuzzer" failures, and the logs show that the new test case in this PR was being fuzzed and run concurrently at the time of failure. But the logs also show that the new code had not been entered before the failure (for various reasons). The test case inserts 1M rows in 5 parts; maybe something there.

shankar-iyer · 2025-03-28T16:00:56Z

2 Failures -

PR / Integration tests (release, 1/4) - Failure in test_storage_rabbitmq/test_failed_connection.py::test_rabbitmq_restore_failed_connection_without_losses_2 , unlikely due to PR
Stateless tests - Test 02020_alter_table_modify_comment has timed out after 600s. Nothing much in the log apart from this

2025.03.28 08:38:58.263923 [ 78053 ] {866c0874-4f59-4b58-b8fa-42ea8ca3513b} <Debug> executeQuery: (from [::1]:34054) (comment: 02020_alter_table_modify_comment.sh) (query 1, line 1) ALTER TABLE comment_test_table MODIFY COMMENT ''; (stage: Complete)
...
2025.03.28 08:58:12.520765 [ 406 ] {} <Trace> CancellationChecker: Cancelling the task because of the timeout: 600000 ms, query: ALTER TABLE comment_test_table MODIFY COMMENT '';

nickitat · 2025-03-31T13:07:54Z

I'm fine with both. Let's make PrimaryKeyExpand description more clear. Maybe "Selects all granules that intersect by PK values with the previous skip indexes selection".

nickitat · 2025-03-31T14:12:23Z

As for additional testing, I have two ideas:

Try to randomise use_skip_indexes_if_final, but set it only along with use_skip_indexes_if_final_exact_mode. It could break tests that check the number of read marks or explain indexes, but hopefully there are not too many such tests.
Create a randomised test (but reproducible knowing the seed) and check the output with and without use_skip_indexes_if_final.

shankar-iyer · 2025-04-07T05:57:43Z

Implemented the change in the description of EXPLAIN, now looks like -

$ explain indexes=1 select id from tx FINAL where v = 33 order by id SETTINGS use_skip_indexes_if_final=1,use_skip_indexes_if_final_exact_mode=1;

EXPLAIN indexes = 1
SELECT id
FROM tx
FINAL
WHERE v = 33
ORDER BY id ASC
SETTINGS use_skip_indexes_if_final = 1, use_skip_indexes_if_final_exact_mode = 1

Query id: 4c830c90-4cc0-4afc-8d41-e869e18ee485

    ┌─explain────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
 1. │ Expression (Project names)                                                                                         │
 2. │   Sorting (Sorting for ORDER BY)                                                                                   │
 3. │     Expression ((Before ORDER BY + Projection))                                                                    │
 4. │       Filter ((WHERE + Change column names to column identifiers))                                                 │
 5. │         ReadFromMergeTree (default.tx)                                                                             │
 6. │         Indexes:                                                                                                   │
 7. │           PrimaryKey                                                                                               │
 8. │             Condition: true                                                                                        │
 9. │             Parts: 4/4                                                                                             │
10. │             Granules: 13087317/13087317                                                                            │
11. │           Skip                                                                                                     │
12. │             Name: vx                                                                                               │
13. │             Description: minmax GRANULARITY 1                                                                      │
14. │             Parts: 4/4                                                                                             │
15. │             Granules: 55/13087317                                                                                  │
16. │           PrimaryKeyExpand                                                                                         │
17. │             Description: Selects all granules that intersect by PK values with the previous skip indexes selection │
18. │             Parts: 4/4                                                                                             │
19. │             Granules: 492/55                                                                                       │
    └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

19 rows in set. Elapsed: 10.353 sec.

Testing -

CREATE TABLE tx (id UInt32, v UInt32, INDEX vx v TYPE minmax) ENGINE = ReplacingMergeTree ORDER BY (id) SETTINGS index_granularity=64;

SYSTEM STOP MERGES tx;

insert into tx select abs(floor(randUniform(1,100000000))),floor(randExponential(1 / 2)) from numbers(10000000);
..
insert into tx select abs(floor(randUniform(1,100000000))),floor(randExponential(1 / 2)) from numbers(10000000);

Inserted millions of rows and created 13.08 million granules. Then ran queries in set Q1 and set Q2 and verified consolidated output is exactly the same.

Q1

set use_skip_indexes=0;
set use_skip_indexes_if_final=0;
set use_skip_indexes_if_final_exact_mode=0;

select id from tx FINAL where v = 33 order by id;
select id from tx FINAL where v = 32 order by id;
…
-- 100 queries with =, >, < and different predicate values

Q2

set use_skip_indexes=1;
set use_skip_indexes_if_final=1;
set use_skip_indexes_if_final_exact_mode=1;

select id from tx FINAL where v = 33 order by id;
select id from tx FINAL where v = 32 order by id;
-- exact same queries as Q1

Verified that the output is byte-for-byte identical across multiple runs with varying predicate values in Q1 & Q2.

Ran clickhouse-test with random use_skip_indexes_if_final True/False and random use_skip_indexes_if_final_exact_mode True/False - no problems, except for test case 02202_use_skip_indexes_if_final, which now works correctly because of _exact_mode.

shankar-iyer · 2025-04-07T11:00:12Z

4 failures -

Stateless tests (release, ParallelReplicas, s3 storage)
The failed test is 03271_s3_table_function_asterisk_glob, with the error message -

2025-04-07 04:05:59 [1878c3c9d982] 2025.04.07 10:05:57.562727 [ 14526 ] {d21667f6-d62b-4aa9-8f01-7dc1068d445b} <Warning> StorageS3(_table_function.s3Cluster): Failed to list object storage, cannot use hive partitioning. Error: Poco::Exception. Code: 1000, e.code() = 0, Timeout, Stack trace (when copying this message, always include the lines below):

PR / AST fuzzer (debug)
ClickHouse assertion with Logical error: 'Can't set alias of * of Asterisk'.
PR / Integration tests (release, 1/4)
The failure is in test_storage_rabbitmq/test_failed_connection.py::test_rabbitmq_restore_failed_connection_without_losses_
PR / Stress test (debug)
ClickHouse crashed, and the call stack is the same as in "Segfault in RefreshTask::alterRefreshParams" #78693

The failures are unrelated to the work in this PR.

nickitat · 2025-04-07T11:08:34Z

My bad; I should have explained myself clearly. We have a way to randomize settings in tests:

ClickHouse/tests/clickhouse-test

Line 1089 in d8620a6

"use_query_condition_cache": lambda: random.randint(0, 1),

To test different combinations. Let's support new settings there.

Regarding the randomized test case, let's also add it to the whole test fleet to avoid regression.
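
A minimal sketch of how the two settings could be randomized together, so that `use_skip_indexes_if_final` is only ever enabled along with exact mode. The helper name `randomize_final_skip_index_settings` is hypothetical and is not how `tests/clickhouse-test` is actually structured. Only the two setting names come from this PR.

```python
import random

def randomize_final_skip_index_settings(rng: random.Random) -> dict:
    """Hypothetical helper; not part of tests/clickhouse-test."""
    use_if_final = rng.randint(0, 1)
    # Exact mode is forced on whenever skip indexes are used with FINAL,
    # so the combination that may return wrong results is never generated.
    exact_mode = 1 if use_if_final else rng.randint(0, 1)
    return {
        "use_skip_indexes_if_final": use_if_final,
        "use_skip_indexes_if_final_exact_mode": exact_mode,
    }

if __name__ == "__main__":
    print(randomize_final_skip_index_settings(random.Random(0)))
```

Coupling the two values in one place keeps tests that compare results with and without skip indexes meaningful, since the inexact mode is expected to differ.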

nickitat · 2025-04-08T12:02:12Z

tests/queries/0_stateless/02202_use_skip_indexes_if_final.sql

@@ -1,4 +1,6 @@
 -- This tests will show the difference in data with use_skip_indexes_if_final and w/o
+-- Tags: no-random-settings


We'd better fix the values of the specific settings than disable randomization completely:

set use_skip_indexes_if_final=0,use_skip_indexes_if_final_exact_mode = 0;

shankar-iyer · 2025-04-12

Good point, I will correct that!

nickitat · 2025-04-08T12:05:15Z

tests/queries/0_stateless/03244_skip_index_in_final_query_with_pk_rescan_random.sql

+CREATE TABLE st (id Int32, v Int32, r Int32, INDEX bfv v TYPE bloom_filter) ENGINE=ReplacingMergeTree ORDER BY (id) SETTINGS index_granularity = 64;
+SYSTEM STOP MERGES st;
+
+INSERT INTO st SELECT id % 9999999, if(id % 729 = 0, 4, v), 1  FROM (SELECT * FROM generateRandom('id UInt32, v UInt32', toUnixTimestamp(now())) limit 1000000) SETTINGS max_threads = 1;


There should be a way to know the seed to reproduce the failure. I think the simplest solution is to rework the test as a bash script, save the seed into a variable, and print it in case of failure.

shankar-iyer · 2025-04-12

I chose toUnixTimestamp(now()) as the seed to get fresh randomness in every CI run. In the unlikely case of a test failure later, the seed can be deduced from the timestamp in the failure log. I will try to write the bash script and upload it.
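
The "save the seed and print it on failure" pattern discussed here can be sketched as follows. This is an illustration in Python rather than the bash script proposed for the test, and the names `make_rows` and `run_randomized_check` are hypothetical. The data generator only loosely mimics the `(id, v)` rows of the test table.

```python
import random
import time

def make_rows(seed: int, n: int = 1000) -> list:
    """Generate the same pseudo-random (id, v) rows for the same seed."""
    rng = random.Random(seed)
    return [(rng.randrange(9_999_999), rng.randrange(1000)) for _ in range(n)]

def run_randomized_check(seed: int, check) -> bool:
    """Run `check` on seeded data and report the seed if it fails."""
    ok = check(make_rows(seed))
    if not ok:
        # Printing the seed is what makes a CI failure reproducible locally.
        print(f"randomized test failed; rerun with seed={seed}")
    return ok

if __name__ == "__main__":
    seed = int(time.time())  # new seed on every run, as with toUnixTimestamp(now())
    run_randomized_check(seed, lambda rows: len(rows) == 1000)
```

Storing the seed in a variable, instead of calling the clock inside the data generator, means the exact value is available to print and does not have to be deduced from log timestamps.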

shankar-iyer · 2025-04-12T15:00:50Z

Only failure in recent CI run is

test_storage_rabbitmq/test_failed_connection.py::test_rabbitmq_restore_failed_connection_without_losses_1
test_storage_rabbitmq/test_failed_connection.py::test_rabbitmq_restore_failed_connection_without_losses_2

#71049

shankar-iyer · 2025-04-15T12:47:24Z

Not sure what caused the cancellation in the amd_darwin build and PR / Stateless tests (release).

The only other failure, in "Stateless tests (msan, 2/4)", is flaky: #75876

shankar-iyer added 4 commits March 25, 2025 12:33

Moved over sources from PR ClickHouse#70210

cba1f77

Good names for start_value1/end_value1

00c5920

Disable optimization if reverse sorted keys

b327b7f

Code review feedback and SettingsChangesHistory update

c5e1781

shankar-iyer requested a review from nickitat March 27, 2025 03:51

Merge branch 'master' into additional_primary_key_scan_for_final_with…

9ea432e

…_skip_index

clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Mar 27, 2025

Spell-check

11c103e

shankar-iyer added 4 commits March 27, 2025 15:56

Correctly identify ranges filtered by PK

57e83e3

Initialize skip_index_used_in_part

b4f570a

initialized vector<bool> not thread-safe

04dfcdf

Merge branch 'master' into additional_primary_key_scan_for_final_with…

1c68096

…_skip_index

devcrafter assigned nickitat Mar 31, 2025

Merge branch 'master' into additional_primary_key_scan_for_final_with…

1c19fba

…_skip_index

Modified the description of PrimaryKeyExpandForFinal

8a3cb8b

shankar-iyer added 2 commits April 7, 2025 14:51

More tests

d76519b

no-random for 1 test and need new analyzer for other test

b1ad84a

nickitat reviewed Apr 8, 2025

shankar-iyer added 2 commits April 12, 2025 04:54

Test updates and additional check for used skip index

8e75fce

add back enable_analyzer else generateRandom() complains

9cf0560

shankar-iyer added 3 commits April 15, 2025 08:57

Remove no-random tag

acc208c

Merge branch 'master' into additional_primary_key_scan_for_final_with…

fad9939

…_skip_index

Merge master and update SettingsChangesHistory.cpp

6b0d003

nickitat approved these changes Apr 15, 2025

shankar-iyer added this pull request to the merge queue Apr 16, 2025

Merged via the queue into ClickHouse:master with commit 659da48 Apr 16, 2025
119 of 120 checks passed

shankar-iyer deleted the additional_primary_key_scan_for_final_with_skip_index branch April 16, 2025 05:36

robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 16, 2025

shankar-iyer mentioned this pull request Apr 28, 2025

Improvements skip index final exact mode (follow up to PR #78350) #79661

Merged

github-merge-queue bot pushed a commit that referenced this pull request Jun 3, 2025

Merge pull request #79661 from shankar-iyer/improvements_skip_index_f…

741609f

…inal_exact_mode Improvements skip index final exact mode (follow up to PR #78350)

shankar-iyer mentioned this pull request Jun 5, 2025

Default enable use of skip index in exact mode in queries with FINAL #81331

Merged
