-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Follow up work for top-K query processing (PR #89835) #91645
Description
Following items came up during design / review discussions of #89835
e.g anonymised production query
SELECT
A, B, C, D , E
FROM tableA
WHERE (F = 'cluster1') AND ((hour >= 472624) AND (hour <= 472625))
ORDER BY dateTime DESC
LIMIT 5
5 rows in set. Elapsed: 24.739 sec. Processed 211.51 million rows, 200.90 GB (8.55 million rows/s., 8.12 GB/s.)
Peak memory usage: 929.44 MiB.
211 million rows is about 27000 granules. Assume minmax index is available on dateTime column
-
To help the current optimizations in the PR 89834, it would help if the reads are enqueued in an order that helps identifying good threshold very early in the sorting. So, using the minmax index on
dateTime, identify the granules which contain higher / recent values ofdateTimeand read those granules first and send those rows up the filter/sorting pipeline. Of course we cannot degenerate into random reads, so we can do some batches of sequential reads. Outcome : If a good threshold is identified by reading and sorting the first 10 million or 20 million rows, it will lead to vast data skipping and fast execution (rather than finding out the good threshold when we have read and sorted almost 150 million of the 210 million rows). -
Assume the 27000 granules were shortlisted based on skip indexes on the 2 columns
Fandhour. Guido Moerkotte's original paper on SMA had stated that granules evaluated against a skip index should be classified as qualified / disqualified / ambivalent. If theFcolumn has a minmax or bloom filter index, then the shortlisted granule(s) is ambivalent - we don't know if a matching row will be found forcluster1or not found in a shortlisted granule. But if the skip index onFwas asetindex, then any shortlisted granule becomes qualified. Now if thehourcolumn has a minmax index, then all the shortlisted granules are qualified - because this time it is a range comparision. We will just check if all the conditions were evaluated using skip indexes such that all granules were qualified. Then based ondateTimeminmax index, we just need to read the top-5 granules. it will require a bit of logic to identify the top-5 granules : not just select the granules with the 5 highest max values, but get the 5 highest min values and shortlist all the granules whose max value is greater than the 5th ranked min value). This could turn out to be very useful in queries on logs/full-text index, e.g
SELECT * FROM logs WHERE hasToken(message, 'ECONNREFUSED') ORDER BY timestamp DESC LIMIT 100 -
With a bit of work we should be able to support higher K's - e.g
LIMIT 100000,LIMIT 1000000. Need to locate theK'ththreshold value inMergeSortingTransformwhen there are multiple outputChunks. Currently works well whenKis withinmax_block_size(65409 default). -
Default enable both
use_skip_indexes_for_top_kanduse_top_k_dynamic_filtering-use_top_k_dynamic_filteringcan cause regression if there is no work to do i.e no rows to sort after the user providedWHEREclause -
Lot of rows to sort, use_top_k_dynamic_filtering does very well -
select title,view_count from youtube where title ilike '%hindi%' order by view_count desc limit 5 settings use_query_condition_cache=0,use_top_k_dynamic_filtering=0;
Query id: b4c4aaa7-8fe2-4963-a0b3-cdd3d493f4f5
┌─title─────────────────────────────────────────────────────────────────────────────────────────────┬─view_count─┐
1. │ लकड़ी की काठी | Lakdi ki kathi | Popular Hindi Children Songs | Animated Songs by JingleToons │ 2765085691 │
2. │ Humpty the Train on a Fruits Ride | हम्प्टी ट्रैन और उसके फल दोस्तों से मिलिए | Kiddiestv Hindi │ 2763617397 │
3. │ Tip Tip Baarish Aayee | Hindi Rhymes for Children | Infobells │ 866609516 │
4. │ Hathi Raja Kahan Chale Nursery Rhyme | हाथी राजा कहाँ चले | FunForKidsTV - Hindi │ 583155452 │
5. │ Jaya Janaki Nayaka KHOONKHAR | Full Hindi Dubbed Movie | Bellamkonda Sreenivas, Rakul Preet Singh │ 536930367 │
└───────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┘
5 rows in set. Elapsed: 1.764 sec. Processed 587.31 million rows, 39.25 GB (332.88 million rows/s., 22.25 GB/s.)
Peak memory usage: 200.93 MiB.
select title,view_count from youtube where title ilike '%hindi%' order by view_count desc limit 5 settings use_query_condition_cache=0,use_top_k_dynamic_filtering=1;
Query id: d285ce82-7cde-4143-8153-ba56d891cf29
┌─title─────────────────────────────────────────────────────────────────────────────────────────────┬─view_count─┐
1. │ लकड़ी की काठी | Lakdi ki kathi | Popular Hindi Children Songs | Animated Songs by JingleToons │ 2765085691 │
2. │ Humpty the Train on a Fruits Ride | हम्प्टी ट्रैन और उसके फल दोस्तों से मिलिए | Kiddiestv Hindi │ 2763617397 │
3. │ Tip Tip Baarish Aayee | Hindi Rhymes for Children | Infobells │ 866609516 │
4. │ Hathi Raja Kahan Chale Nursery Rhyme | हाथी राजा कहाँ चले | FunForKidsTV - Hindi │ 583155452 │
5. │ Jaya Janaki Nayaka KHOONKHAR | Full Hindi Dubbed Movie | Bellamkonda Sreenivas, Rakul Preet Singh │ 536930367 │
└───────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┘
5 rows in set. Elapsed: 0.247 sec. Processed 587.31 million rows, 5.76 GB (2.38 billion rows/s., 23.30 GB/s.)
Peak memory usage: 40.50 MiB.
Decent number of rows to sort, elapsed time is okay with use_top_k_dynamic_filtering, but note higher volume of data processed
select title,view_count from youtube where title ilike '%C++%' order by view_count desc limit 5 settings use_query_condition_cache=0,use_top_k_dynamic_filtering=0;
Query id: 54b51e7a-1f32-44cf-94b9-ede1dc232a23
┌─title──────────────────────────────────────────────────────────────────────┬─view_count─┐
1. │ C++ in a Nutshell │ 1598629 │
2. │ How to Set up Visual Studio Code for C and C++ Programming │ 1510064 │
3. │ LEARN OPENCV C++ in 4 HOURS | Including 3x Projects | Computer Vision │ 1336942 │
4. │ Should you Learn C++ in 2018? │ 1296614 │
5. │ Code-It-Yourself! Tetris - Programming from Scratch (Quick and Simple C++) │ 1280601 │
└────────────────────────────────────────────────────────────────────────────┴────────────┘
5 rows in set. Elapsed: 1.548 sec. Processed 587.31 million rows, 36.33 GB (379.51 million rows/s., 23.48 GB/s.)
Peak memory usage: 182.06 MiB.
select title,view_count from youtube where title ilike '%C++%' order by view_count desc limit 5 settings use_query_condition_cache=0,use_top_k_dynamic_filtering=1;
Query id: 8928c8d6-5a86-4d72-9896-cd8b61492f7b
┌─title──────────────────────────────────────────────────────────────────────┬─view_count─┐
1. │ C++ in a Nutshell │ 1598629 │
2. │ How to Set up Visual Studio Code for C and C++ Programming │ 1510064 │
3. │ LEARN OPENCV C++ in 4 HOURS | Including 3x Projects | Computer Vision │ 1336942 │
4. │ Should you Learn C++ in 2018? │ 1296614 │
5. │ Code-It-Yourself! Tetris - Programming from Scratch (Quick and Simple C++) │ 1280601 │
└────────────────────────────────────────────────────────────────────────────┴────────────┘
5 rows in set. Elapsed: 1.379 sec. Processed 587.31 million rows, 40.37 GB (425.99 million rows/s., 29.28 GB/s.)
Peak memory usage: 159.99 MiB.
Not many rows to sort, regression with use_top_k_dynamic_filtering -
select title,view_count from youtube where title ilike '%informix%' order by view_count desc limit 5 settings use_query_condition_cache=0,use_top_k_dynamic_filtering=0;
Query id: 6d2a53c6-3c68-4c5a-9adb-b1401b8d69cc
┌─title──────────────────────────────────────────────────────────────────┬─view_count─┐
1. │ Modernizing the IBM Informix 4GL "stores demo" with Genero │ 7237 │
2. │ Learning about Informix and the Open Admin Tool (OAT) - Webcast Replay │ 5082 │
3. │ Informix 4GL to Java Migration Demo │ 4988 │
4. │ INFORMIXCLUB.COM First Movie │ 4685 │
5. │ installing informix 11.50 on centos 5.5 part2 │ 3911 │
└────────────────────────────────────────────────────────────────────────┴────────────┘
5 rows in set. Elapsed: 1.647 sec. Processed 587.31 million rows, 35.74 GB (356.67 million rows/s., 21.71 GB/s.)
Peak memory usage: 156.43 MiB.
select title,view_count from youtube where title ilike '%informix%' order by view_count desc limit 5 settings use_query_condition_cache=0,use_top_k_dynamic_filtering=1;
Query id: 710464c2-2258-4a0c-b640-ea203cefbed8
┌─title──────────────────────────────────────────────────────────────────┬─view_count─┐
1. │ Modernizing the IBM Informix 4GL "stores demo" with Genero │ 7237 │
2. │ Learning about Informix and the Open Admin Tool (OAT) - Webcast Replay │ 5082 │
3. │ Informix 4GL to Java Migration Demo │ 4988 │
4. │ INFORMIXCLUB.COM First Movie │ 4685 │
5. │ installing informix 11.50 on centos 5.5 part2 │ 3911 │
└────────────────────────────────────────────────────────────────────────┴────────────┘
5 rows in set. Elapsed: 1.912 sec. Processed 587.31 million rows, 40.44 GB (307.19 million rows/s., 21.15 GB/s.)
Peak memory usage: 199.33 MiB.
Use case
Check above
Describe the solution you'd like
Improvements detailed above
Describe alternatives you've considered
No response
Additional context
No response