Skip to content

MergingSortedTransform has low performance in read in order optimization  #40675

@canhld94

Description

@canhld94

Describe the situation
Queries taken from #40583

CREATE TABLE logs_time_dt
(
    `time` DateTime64(9) Codec(Delta, ZSTD(7)),
    `project` LowCardinality(String) CODEC(ZSTD(7)),
    `service` LowCardinality(String) CODEC(ZSTD(7)),
    `message` String CODEC(ZSTD(7)),
    `tags_hash` Array(UInt64) CODEC(ZSTD(7)),
    INDEX idx_message message TYPE ngrambf_v1(3, 512, 2, 0) GRANULARITY 3,
    INDEX idx_tags_hash tags_hash TYPE bloom_filter(0.01) GRANULARITY 1
)
ENGINE = MergeTree
PARTITION BY toStartOfHour(time)
ORDER BY (project, service, time)
SETTINGS index_granularity = 1024;

insert into logs_time_dt
(time, project, service, message, tags_hash)
select
fromUnixTimestamp64Nano(toInt64(toUnixTimestamp64Nano(toDateTime64('2022-08-01',9))+number/(2777)*1e9)),
'test' as project,
'test' as service,
'foo',
[ number % 3000 ]
from system.numbers
limit 60*1e6;

SELECT *
FROM logs_time_dt
WHERE (project = 'test') AND (service = 'test') AND has(tags_hash, 42)
ORDER BY time
FORMAT `Null`;

Query pipeline:

(Expression)
ExpressionTransform
  (Sorting)
  MergingSortedTransform 451
    (Expression)
    ExpressionTransform × 45
      (Filter)
      FilterTransform × 45
        (ReadFromMergeTree)
        MergeTreeInOrder × 45 01

Performance:

0 rows in set. Elapsed: 3.425 sec. Processed 29.23 million rows, 1.11 GB (8.54 million rows/s., 324.36 MB/s.)

For read in order optimization, sometime the inputs of MergingSortedTransform are already sorted. Therefore, when executing, the processor will read all chunks from input port 0, then all chunks from input port 1, and so on so far (the indices may be not correct, just understand that the processor will scan the input ports sequentially).

Try a simple fix by adding buffer for each input:

(Expression)
ExpressionTransform
 (Sorting)
 MergingSortedTransform 451
   QueueBuffer × 45
     (Expression)
     ExpressionTransform × 45
       (Filter)
       FilterTransform × 45
         (ReadFromMergeTree)
         MergeTreeInOrder × 45 01

Performance:

0 rows in set. Elapsed: 0.281 sec. Processed 29.23 million rows, 1.11 GB (104.21 million rows/s., 3.96 GB/s.)

However, this simple fix has 2 main drawbacks:

  1. Cannot take advantage of limit pushdown (can fix easily)
  2. When the result set is big, OOM can happens, so need to add on-disk buffer.

We can discuss a solution to make read in order and normal read to have similar performance.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions