-
Notifications
You must be signed in to change notification settings - Fork 8.3k
Slow queries when reading in order #40583
Description
Describe the situation
We have a certain data-set with user logs, trying to get all data by certain tags spanning several hours.
Queries with ordering by part of ordering keys work slower than queries which don't use ordering by order-key
How to reproduce
Tried on ClickHouse 22.3 initially, same behavior on 22.7.4.16 too
CREATE statement:
CREATE TABLE logs_time_dt
(
`time` DateTime64(9) Codec(Delta, ZSTD(7)),
`project` LowCardinality(String) CODEC(ZSTD(7)),
`service` LowCardinality(String) CODEC(ZSTD(7)),
`message` String CODEC(ZSTD(7)),
`tags_hash` Array(UInt64) CODEC(ZSTD(7)),
INDEX idx_message message TYPE ngrambf_v1(3, 512, 2, 0) GRANULARITY 3,
INDEX idx_tags_hash tags_hash TYPE bloom_filter(0.01) GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toStartOfHour(time)
ORDER BY (project, service, time)
SETTINGS index_granularity = 1024
example dataset, several hours worth of data, the result set by particular tag is 20k rows:
insert into logs_time_dt
(time, project, service, message, tags_hash)
select
fromUnixTimestamp64Nano(toInt64(toUnixTimestamp64Nano(toDateTime64('2022-08-01',9))+number/(2777)*1e9)),
'test' as project,
'test' as service,
'foo',
[ number % 3000 ]
from system.numbers
limit 60*1e6
slow query reading by key in reverse order:
SELECT *
FROM logs_time_dt
WHERE (project = 'test') AND (service = 'test') AND has(tags_hash, 42)
ORDER BY time DESC
FORMAT `Null`
Query id: f3277eec-3a99-4442-849d-1209ad402548
Ok.
0 rows in set. Elapsed: 8.770 sec. Processed 29.27 million rows, 1.11 GB (3.34 million rows/s., 126.91 MB/s.)
reading in order seems to be faster in this particular example, but still slow:
SELECT *
FROM logs_time_dt
WHERE (project = 'test') AND (service = 'test') AND has(tags_hash, 42)
ORDER BY time ASC
FORMAT `Null`
Query id: a3ce44be-5ea2-4ffc-9c72-7f4fc656a5e1
Ok.
0 rows in set. Elapsed: 3.788 sec. Processed 29.27 million rows, 1.11 GB (7.73 million rows/s., 293.59 MB/s.)
sorting by negative timestamp works a lot faster with this case:
SELECT *
FROM logs_time_dt
WHERE (project = 'test') AND (service = 'test') AND has(tags_hash, 42)
ORDER BY -toUnixTimestamp64Nano(time) ASC
FORMAT `Null`
Query id: e5415b8d-556c-4ffa-910a-81902b252f55
Ok.
0 rows in set. Elapsed: 0.442 sec. Processed 29.27 million rows, 1.11 GB (66.28 million rows/s., 2.52 GB/s.)
Expected performance
I expect the "correct" query with ordering by primary key suffix wouldn't be slower than ordering by a computed expression.