Skip to content

Additional primary key scan after using skip index for query with FINAL clause.#78350

Merged
shankar-iyer merged 19 commits intoClickHouse:masterfrom
shankar-iyer:additional_primary_key_scan_for_final_with_skip_index
Apr 16, 2025
Merged

Additional primary key scan after using skip index for query with FINAL clause.#78350
shankar-iyer merged 19 commits intoClickHouse:masterfrom
shankar-iyer:additional_primary_key_scan_for_final_with_skip_index

Conversation

@shankar-iyer
Copy link
Copy Markdown
Member

Resolves #70292, #31411 and #34243 . Solution will make sure that correct results are returned by a SELECT query with FINAL clause on a ReplacingMergeTree table that used a skip index. This should significantly improve performance of FINAL queries and reduce memory usage. Solution takes the initial set of PK ranges retrieved from the skip index and then finds matching PK ranges across all parts.

Earlier version of this PR : #70210

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  • New setting introduced : use_skip_indexes_in_final_exact_mode. If a query on a ReplacingMergeTree table has FINAL clause, reading only table ranges based on skip indexes may produce incorrect result. This setting can ensure that correct results are returned by scanning newer parts that have overlap with primary key ranges returned by the skip index. Set to 0 to disable, 1 to enable

@shankar-iyer shankar-iyer requested a review from nickitat March 27, 2025 03:51
@clickhouse-gh
Copy link
Copy Markdown
Contributor

clickhouse-gh bot commented Mar 27, 2025

Workflow [PR], commit [6b0d003]

@clickhouse-gh clickhouse-gh bot added the pr-performance Pull request with some performance improvements label Mar 27, 2025
@shankar-iyer
Copy link
Copy Markdown
Member Author

Greetings @nickitat ! I have incorporated review feedback from #70210 and raised this PR. Regarding the suggestion to not introduce a new index type in EXPLAIN , please check below outputs comparing PrimayKeyExpand and Skip. In case of Skip, we cannot fill an index name because multiple skip indexes could have been used to perform the filtering. Hence we need to fill something like <internal>. I can also improve the name and description for PrimaryKeyExpand. Let me know your opinion, this PR right now retains PrimaryKeyExpand.

EXPLAIN indexes = 1
SELECT count(id1)
FROM xtx
FINAL
WHERE v = 688899949
SETTINGS use_skip_indexes_if_final_exact_mode = 1, use_skip_indexes_if_final = 1

Query id: be54c2c3-393f-43fc-9794-dfd549b60545

    ┌─explain────────────────────────────────────────────────────────────┐
 1. │ Expression ((Project names + Projection))                          │
 2. │   Aggregating                                                      │
 3. │     Expression (Before GROUP BY)                                   │
 4. │       Filter ((WHERE + Change column names to column identifiers)) │
 5. │         ReadFromMergeTree (default.xtx)                            │
 6. │         Indexes:                                                   │
 7. │           PrimaryKey                                               │
 8. │             Condition: true                                        │
 9. │             Parts: 5/510. │             Granules: 7812502/781250211. │           Skip                                                     │
12. │             Name: secondaryidx                                     │
13. │             Description: minmax GRANULARITY 114. │             Parts: 3/515. │             Granules: 3/781250216. │           PrimaryKeyExpand                                         │
17. │             Description: PrimaryKeyExpandForFinal                  │
18. │             Parts: 5/319. │             Granules: 14/3                                         │
    └────────────────────────────────────────────────────────────────────┘

changed to

EXPLAIN indexes = 1
SELECT count(id1)
FROM xtx
FINAL
WHERE v = 688899949
SETTINGS use_skip_indexes_if_final_exact_mode = 1, use_skip_indexes_if_final = 1

Query id: be54c2c3-393f-43fc-9794-dfd549b60545

    ┌─explain────────────────────────────────────────────────────────────┐
 1. │ Expression ((Project names + Projection))                          │
 2. │   Aggregating                                                      │
 3. │     Expression (Before GROUP BY)                                   │
 4. │       Filter ((WHERE + Change column names to column identifiers)) │
 5. │         ReadFromMergeTree (default.xtx)                            │
 6. │         Indexes:                                                   │
 7. │           PrimaryKey                                               │
 8. │             Condition: true                                        │
 9. │             Parts: 5/510. │             Granules: 7812502/781250211. │           Skip                                                     │
12. │             Name: secondaryidx                                     │
13. │             Description: minmax GRANULARITY 114. │             Parts: 3/515. │             Granules: 3/781250216. │           Skip                                                     │
17. |             Name: <internal>                                       |
18. │             Description: Find primary keys in all parts for FINAL  │
19. │             Parts: 5/320. │             Granules: 14/3                                         │
    └────────────────────────────────────────────────────────────────────┘

@shankar-iyer
Copy link
Copy Markdown
Member Author

There are 5 "AST fuzzer" failures and 3 other failures. I did a check of the "AST fuzzer" failures and the logs show that the new testcase in this PR is being fuzz'ed and concurrently run during the failure. But the logs also show that the new code has not been entered before the time of failure (because of various reasons). The testcase inserts 1M rows in 5 parts, maybe something there.

@shankar-iyer
Copy link
Copy Markdown
Member Author

2 Failures -

  1. PR / Integration tests (release, 1/4) - Failure in test_storage_rabbitmq/test_failed_connection.py::test_rabbitmq_restore_failed_connection_without_losses_2 , unlikely due to PR

  2. Stateless tests - Test 02020_alter_table_modify_comment has timed out after 600s. Nothing much in the log apart from this

2025.03.28 08:38:58.263923 [ 78053 ] {866c0874-4f59-4b58-b8fa-42ea8ca3513b} <Debug> executeQuery: (from [::1]:34054) (comment: 02020_alter_table_modify_comment.sh) (query 1, line 1) ALTER TABLE comment_test_table MODIFY COMMENT ''; (stage: Complete)
...
2025.03.28 08:58:12.520765 [ 406 ] {} <Trace> CancellationChecker: Cancelling the task because of the timeout: 600000 ms, query: ALTER TABLE comment_test_table MODIFY COMMENT '';

@nickitat
Copy link
Copy Markdown
Member

Greetings @nickitat ! I have incorporated review feedback from #70210 and raised this PR. Regarding the suggestion to not introduce a new index type in EXPLAIN , please check below outputs comparing PrimayKeyExpand and Skip. In case of Skip, we cannot fill an index name because multiple skip indexes could have been used to perform the filtering. Hence we need to fill something like <internal>. I can also improve the name and description for PrimaryKeyExpand. Let me know your opinion, this PR right now retains PrimaryKeyExpand.

EXPLAIN indexes = 1
SELECT count(id1)
FROM xtx
FINAL
WHERE v = 688899949
SETTINGS use_skip_indexes_if_final_exact_mode = 1, use_skip_indexes_if_final = 1

Query id: be54c2c3-393f-43fc-9794-dfd549b60545

    ┌─explain────────────────────────────────────────────────────────────┐
 1. │ Expression ((Project names + Projection))                          │
 2. │   Aggregating                                                      │
 3. │     Expression (Before GROUP BY)                                   │
 4. │       Filter ((WHERE + Change column names to column identifiers)) │
 5. │         ReadFromMergeTree (default.xtx)                            │
 6. │         Indexes:                                                   │
 7. │           PrimaryKey                                               │
 8. │             Condition: true                                        │
 9. │             Parts: 5/510. │             Granules: 7812502/781250211. │           Skip                                                     │
12. │             Name: secondaryidx                                     │
13. │             Description: minmax GRANULARITY 114. │             Parts: 3/515. │             Granules: 3/781250216. │           PrimaryKeyExpand                                         │
17. │             Description: PrimaryKeyExpandForFinal                  │
18. │             Parts: 5/319. │             Granules: 14/3                                         │
    └────────────────────────────────────────────────────────────────────┘

changed to

EXPLAIN indexes = 1
SELECT count(id1)
FROM xtx
FINAL
WHERE v = 688899949
SETTINGS use_skip_indexes_if_final_exact_mode = 1, use_skip_indexes_if_final = 1

Query id: be54c2c3-393f-43fc-9794-dfd549b60545

    ┌─explain────────────────────────────────────────────────────────────┐
 1. │ Expression ((Project names + Projection))                          │
 2. │   Aggregating                                                      │
 3. │     Expression (Before GROUP BY)                                   │
 4. │       Filter ((WHERE + Change column names to column identifiers)) │
 5. │         ReadFromMergeTree (default.xtx)                            │
 6. │         Indexes:                                                   │
 7. │           PrimaryKey                                               │
 8. │             Condition: true                                        │
 9. │             Parts: 5/510. │             Granules: 7812502/781250211. │           Skip                                                     │
12. │             Name: secondaryidx                                     │
13. │             Description: minmax GRANULARITY 114. │             Parts: 3/515. │             Granules: 3/781250216. │           Skip                                                     │
17. |             Name: <internal>                                       |
18. │             Description: Find primary keys in all parts for FINAL  │
19. │             Parts: 5/320. │             Granules: 14/3                                         │
    └────────────────────────────────────────────────────────────────────┘

I'm fine with both. Let's make PrimaryKeyExpand description more clear. Maybe "Selects all granules that intersect by PK values with the previous skip indexes selection".

@nickitat
Copy link
Copy Markdown
Member

As for additional testing. I have two ideas:

  1. try to randomise use_skip_indexes_if_final, but set it only along with use_skip_indexes_in_final_exact_mode. It could break tests that check the number of read marks or explain indexes, but hopefully there are not too many such tests.
  2. create a randomised test (but reproducible knowing the seed) and check output with and without use_skip_indexes_if_final.

@shankar-iyer
Copy link
Copy Markdown
Member Author

Implemented the change in the description of EXPLAIN, now looks like -

$ explain indexes=1 select id from tx FINAL where v = 33 order by id SETTINGS use_skip_indexes_if_final=1,use_skip_indexes_if_final_exact_mode=1;

EXPLAIN indexes = 1
SELECT id
FROM tx
FINAL
WHERE v = 33
ORDER BY id ASC
SETTINGS use_skip_indexes_if_final = 1, use_skip_indexes_if_final_exact_mode = 1

Query id: 4c830c90-4cc0-4afc-8d41-e869e18ee485

    ┌─explain────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
 1. │ Expression (Project names)                                                                                         │
 2. │   Sorting (Sorting for ORDER BY)                                                                                   │
 3. │     Expression ((Before ORDER BY + Projection))                                                                    │
 4. │       Filter ((WHERE + Change column names to column identifiers))                                                 │
 5. │         ReadFromMergeTree (default.tx)                                                                             │
 6. │         Indexes:                                                                                                   │
 7. │           PrimaryKey                                                                                               │
 8. │             Condition: true                                                                                        │
 9. │             Parts: 4/410. │             Granules: 13087317/1308731711. │           Skip                                                                                                     │
12. │             Name: vx                                                                                               │
13. │             Description: minmax GRANULARITY 114. │             Parts: 4/415. │             Granules: 55/1308731716. │           PrimaryKeyExpand                                                                                         │
17. │             Description: Selects all granules that intersect by PK values with the previous skip indexes selection │
18. │             Parts: 4/419. │             Granules: 492/55                                                                                       │
    └────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

19 rows in set. Elapsed: 10.353 sec.

Testing -

CREATE TABLE tx (id UInt32, v UInt32, INDEX vx v TYPE minmax) ENGINE = ReplacingMergeTree ORDER BY (id) SETTINGS index_granularity=64;

SYSTEM STOP MERGES x;

insert into tx select abs(floor(randUniform(1,100000000))),floor(randExponential(1 / 2)) from numbers(10000000);
..
insert into tx select abs(floor(randUniform(1,100000000))),floor(randExponential(1 / 2)) from numbers(10000000);

Inserted millions of rows and created 13.08 million granules. Then ran queries in set Q1 and set Q2 and verified consolidated output is exactly the same.

Q1

set use_skip_indexes=0;
set use_skip_indexes_if_final=0;
set use_skip_indexes_if_final_exact_mode=0;

select id from tx FINAL where v = 33 order by id;
select id from tx FINAL where v = 32 order by id;
…
///100 queries with =, >, < and different predicate values

Q2

set use_skip_indexes=1;
set use_skip_indexes_if_final=1;
set use_skip_indexes_if_final_exact_mode=1;

select id from tx FINAL where v = 33 order by id;
select id from tx FINAL where v = 32 order by id;
//exact same queries as Q1

Verified output is exactly byte by byte same form multiple runs with varying predicate values in Q1 & Q2

Ran clickhouse-test with random use_skip_indexes_if_final True/False and random use_skip_indexes_if_final_exact_mode True/False - no problems, except for testcase 02202_use_skip_indexes_if_final which now works correctly because _exact_mode.

@shankar-iyer
Copy link
Copy Markdown
Member Author

4 failures -

  1. Stateless tests (release, ParallalReplicas, s3 storage)
    Test failed is 03271_s3_table_function_asterisk_glob with error message -

2025-04-07 04:05:59 [1878c3c9d982] 2025.04.07 10:05:57.562727 [ 14526 ] {d21667f6-d62b-4aa9-8f01-7dc1068d445b} <Warning> StorageS3(_table_function.s3Cluster): Failed to list object storage, cannot use hive partitioning. Error: Poco::Exception. Code: 1000, e.code() = 0, Timeout, Stack trace (when copying this message, always include the lines below):

  1. PR / AST fuzzer (debug)
    Clickhouse assert with Logical error: 'Can't set alias of * of Asterisk'.

  2. PR / Integration tests (release 1/4)
    Failure is in test_storage_rabbitmq/test_failed_connection.py::test_rabbitmq_restore_failed_connection_without_losses_

  3. PR / Stress test(debug)
    CH crashed and callstack is same as Segfault in RefreshTask::alterRefreshParams #78693

Failures unrelated to work in PR.

@nickitat
Copy link
Copy Markdown
Member

nickitat commented Apr 7, 2025

My bad; I should have explained myself clearly. We have a way to randomize settings in tests:

"use_query_condition_cache": lambda: random.randint(0, 1),

To test different combinations. Let's support new settings there.

Regarding the randomized test case, let's also add it to the whole test fleet to avoid regression.

@@ -1,4 +1,6 @@
-- This tests will show the difference in data with use_skip_indexes_if_final and w/o
-- Tags: no-random-settings
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we better fix values for specific settings than disable randomization completely:

set use_skip_indexes_if_final=0,use_skip_indexes_if_final_exact_mode = 0;

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I will correct that!

CREATE TABLE st (id Int32, v Int32, r Int32, INDEX bfv v TYPE bloom_filter) ENGINE=ReplacingMergeTree ORDER BY (id) SETTINGS index_granularity = 64;
SYSTEM STOP MERGES st;

INSERT INTO st SELECT id % 9999999, if(id % 729 = 0, 4, v), 1 FROM (SELECT * FROM generateRandom('id UInt32, v UInt32', toUnixTimestamp(now())) limit 1000000) SETTINGS max_threads = 1;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a way to know the seed to reproduce the failure. I think the simplest solution is to rework the test as a bash script, save the seed into a variable, and print it in case of failure.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose toUnixTimestamp(now())) as seed to get randomness literally in every CI run. And in case of unlikely test failure tomorrow, deduce the seed from the timestamp in the failure log. I will try to write the bash script and upload back.

@shankar-iyer
Copy link
Copy Markdown
Member Author

Only failure in recent CI run is

test_storage_rabbitmq/test_failed_connection.py::test_rabbitmq_restore_failed_connection_without_losses_1 |  
test_storage_rabbitmq/test_failed_connection.py::test_rabbitmq_restore_failed_connection_without_losses_2 |  

#71049

@shankar-iyer
Copy link
Copy Markdown
Member Author

Not sure what happened for cancellation in amd_darwin build and PR / Stateless tests (release)

Only other failure in"Srateless tests (msan, 2/4)"is flaky #75876

@shankar-iyer shankar-iyer added this pull request to the merge queue Apr 16, 2025
Merged via the queue into ClickHouse:master with commit 659da48 Apr 16, 2025
119 of 120 checks passed
@shankar-iyer shankar-iyer deleted the additional_primary_key_scan_for_final_with_skip_index branch April 16, 2025 05:36
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 16, 2025
github-merge-queue bot pushed a commit that referenced this pull request Jun 3, 2025
…inal_exact_mode

Improvements skip index final exact mode (follow up to PR #78350)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Secondary indexes should be used for queries with FINAL

3 participants