Skip to content

Slow queries when reading in order #40583

@holycheater

Description

@holycheater

Describe the situation
We have a certain data-set with user logs, trying to get all data by certain tags spanning several hours.
Queries with ordering by part of ordering keys work slower than queries which don't use ordering by order-key

How to reproduce
Tried on ClickHouse 22.3 initially, same behavior on 22.7.4.16 too

CREATE statement:

CREATE TABLE logs_time_dt
(
    `time` DateTime64(9) Codec(Delta, ZSTD(7)),
    `project` LowCardinality(String) CODEC(ZSTD(7)),
    `service` LowCardinality(String) CODEC(ZSTD(7)),
    `message` String CODEC(ZSTD(7)),
    `tags_hash` Array(UInt64) CODEC(ZSTD(7)),
    INDEX idx_message message TYPE ngrambf_v1(3, 512, 2, 0) GRANULARITY 3,
    INDEX idx_tags_hash tags_hash TYPE bloom_filter(0.01) GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toStartOfHour(time)
ORDER BY (project, service, time)
SETTINGS index_granularity = 1024

example dataset, several hours worth of data, the result set by particular tag is 20k rows:

insert into logs_time_dt
(time, project, service, message, tags_hash)
select
fromUnixTimestamp64Nano(toInt64(toUnixTimestamp64Nano(toDateTime64('2022-08-01',9))+number/(2777)*1e9)),
'test' as project,
'test' as service,
'foo',
[ number % 3000 ]
from system.numbers
limit 60*1e6

slow query reading by key in reverse order:

SELECT *
FROM logs_time_dt
WHERE (project = 'test') AND (service = 'test') AND has(tags_hash, 42)
ORDER BY time DESC
FORMAT `Null`

Query id: f3277eec-3a99-4442-849d-1209ad402548

Ok.

0 rows in set. Elapsed: 8.770 sec. Processed 29.27 million rows, 1.11 GB (3.34 million rows/s., 126.91 MB/s.)

reading in order seems to be faster in this particular example, but still slow:

SELECT *
FROM logs_time_dt
WHERE (project = 'test') AND (service = 'test') AND has(tags_hash, 42)
ORDER BY time ASC
FORMAT `Null`

Query id: a3ce44be-5ea2-4ffc-9c72-7f4fc656a5e1

Ok.

0 rows in set. Elapsed: 3.788 sec. Processed 29.27 million rows, 1.11 GB (7.73 million rows/s., 293.59 MB/s.)

sorting by negative timestamp works a lot faster with this case:

SELECT *
FROM logs_time_dt
WHERE (project = 'test') AND (service = 'test') AND has(tags_hash, 42)
ORDER BY -toUnixTimestamp64Nano(time) ASC
FORMAT `Null`

Query id: e5415b8d-556c-4ffa-910a-81902b252f55

Ok.

0 rows in set. Elapsed: 0.442 sec. Processed 29.27 million rows, 1.11 GB (66.28 million rows/s., 2.52 GB/s.)

Expected performance
I expect the "correct" query with ordering by primary key suffix wouldn't be slower than ordering by a computed expression.

Metadata

Metadata

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions