[GLUTEN-7028][CH][Part-8] Support one pipeline write for partition mergetree #7924

Merged
baibaichen merged 4 commits into apache:main from baibaichen:feature/partition_mergetree
Nov 13, 2024

@baibaichen baibaichen commented Nov 12, 2024

What changes were proposed in this pull request?

(Fixes: #7028)
The following diagram shows the current class hierarchy. SparkPartitionedBaseSink inherits from ClickHouse's DB::PartitionedSink.

WriteStatsBase
  |- MergeTreeStats  <--- collect stats at finish  -----------------------|
  |- WriteStats      <--- collect stats at consume ---|                   |
                                                      |                   |
SparkPartitionedBaseSink                              |                   |
  |- SubstraitPartitionedFileSink      ---create --> SubstraitFileSink    |
  |- SparkMergeTreePartitionedFileSink ---create --> SparkMergeTreeSink --|
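
To make the hierarchy above concrete, here is a minimal sketch of how the two stats collectors and the two partitioned sinks could relate. It is not the Gluten or ClickHouse code: `Chunk`, `Sink`, `createSinkForPartition`, `rows_written`, and `partitionCount` are hypothetical stand-ins, and the real classes build on `DB::PartitionedSink` with different signatures. The only thing it tries to show is the timing difference from the diagram: the file sink reports stats at consume, the MergeTree sink reports at finish.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for DB::Chunk: rows that already share one partition key.
struct Chunk { std::string partition; std::size_t rows; };

struct WriteStatsBase {
    std::size_t rows_written = 0;
    virtual ~WriteStatsBase() = default;
};
struct WriteStats : WriteStatsBase {};      // updated on every consume()
struct MergeTreeStats : WriteStatsBase {};  // updated once per sink, at finish()

struct Sink {
    virtual ~Sink() = default;
    virtual void consume(const Chunk & chunk) = 0;
    virtual void finish() = 0;
};

// File sink: reports rows as soon as they are consumed.
struct SubstraitFileSink : Sink {
    explicit SubstraitFileSink(WriteStats & stats_) : stats(stats_) {}
    void consume(const Chunk & chunk) override { stats.rows_written += chunk.rows; }
    void finish() override {}
    WriteStats & stats;
};

// MergeTree sink: keeps a pending count and reports it only when finished.
struct SparkMergeTreeSink : Sink {
    explicit SparkMergeTreeSink(MergeTreeStats & stats_) : stats(stats_) {}
    void consume(const Chunk & chunk) override { pending_rows += chunk.rows; }
    void finish() override { stats.rows_written += pending_rows; pending_rows = 0; }
    MergeTreeStats & stats;
    std::size_t pending_rows = 0;
};

// Base partitioned sink: routes each chunk to a lazily created per-partition sink.
struct SparkPartitionedBaseSink : Sink {
    void consume(const Chunk & chunk) override {
        auto & sink = sinks[chunk.partition];
        if (!sink)
            sink = createSinkForPartition(chunk.partition);
        assert(sink);
        sink->consume(chunk);
    }
    void finish() override {
        for (auto & entry : sinks)
            entry.second->finish();
    }
    std::size_t partitionCount() const { return sinks.size(); }
protected:
    virtual std::unique_ptr<Sink> createSinkForPartition(const std::string & partition) = 0;
private:
    std::map<std::string, std::unique_ptr<Sink>> sinks;
};

struct SubstraitPartitionedFileSink : SparkPartitionedBaseSink {
    explicit SubstraitPartitionedFileSink(WriteStats & stats_) : stats(stats_) {}
protected:
    std::unique_ptr<Sink> createSinkForPartition(const std::string &) override {
        return std::make_unique<SubstraitFileSink>(stats);
    }
    WriteStats & stats;
};

struct SparkMergeTreePartitionedFileSink : SparkPartitionedBaseSink {
    explicit SparkMergeTreePartitionedFileSink(MergeTreeStats & stats_) : stats(stats_) {}
protected:
    std::unique_ptr<Sink> createSinkForPartition(const std::string &) override {
        return std::make_unique<SparkMergeTreeSink>(stats);
    }
    MergeTreeStats & stats;
};
```

In this sketch, a `MergeTreeStats` object stays at zero until `finish()` is called on the partitioned sink, while a `WriteStats` object grows with every consumed chunk.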

With pipeline write, the partitioned MergeTree path looks like this. It squashes blocks over the whole input before partitioning:

  // spark 3.5
  Input pipeline 
    => PlanSquashingTransform 
      => ApplySquashingTransform 
        => SparkMergeTreePartitionedFileSink
          => SparkMergeTreeSink
          => SparkMergeTreeSink
          => ...
        => MergeTreeStats

This differs from Spark 3.3, which squashes blocks after partitioning, separately for each partition, because there partitioning is triggered by the JVM.

The new implementation is the same as ClickHouse's.
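
The ordering difference between the Spark 3.5 and Spark 3.3 paths can be illustrated with a small sketch. This is an assumption-laden toy, not the actual transforms: `Row`, `Block`, `squash`, `splitByPartition`, `squashThenPartition`, and `partitionThenSquash` are hypothetical names, and the `squash` helper only imitates the idea of merging small blocks until a row threshold is reached.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Row { std::string partition; int value; };
using Block = std::vector<Row>;
using BlockSizes = std::map<std::string, std::vector<std::size_t>>;

// Merge consecutive blocks until each output block holds at least min_rows rows
// (the last block may be smaller). Toy stand-in for the squashing transforms.
std::vector<Block> squash(const std::vector<Block> & input, std::size_t min_rows) {
    assert(min_rows > 0);
    std::vector<Block> out;
    Block current;
    for (const Block & block : input) {
        current.insert(current.end(), block.begin(), block.end());
        if (current.size() >= min_rows) {
            out.push_back(std::move(current));
            current.clear();
        }
    }
    if (!current.empty())
        out.push_back(std::move(current));
    return out;
}

// Split one block into per-partition blocks, keeping row order within a partition.
std::map<std::string, Block> splitByPartition(const Block & block) {
    std::map<std::string, Block> parts;
    for (const Row & row : block)
        parts[row.partition].push_back(row);
    return parts;
}

// Spark 3.5 style: squash the whole input first, then partition each squashed block.
// Returns, per partition, the sizes of the blocks its sink would receive.
BlockSizes squashThenPartition(const std::vector<Block> & input, std::size_t min_rows) {
    BlockSizes received;
    for (const Block & block : squash(input, min_rows))
        for (const auto & part : splitByPartition(block))
            received[part.first].push_back(part.second.size());
    return received;
}

// Spark 3.3 style: partition first, then squash each partition's blocks separately.
BlockSizes partitionThenSquash(const std::vector<Block> & input, std::size_t min_rows) {
    std::map<std::string, std::vector<Block>> per_partition;
    for (const Block & block : input)
        for (auto & part : splitByPartition(block))
            per_partition[part.first].push_back(std::move(part.second));

    BlockSizes received;
    for (const auto & entry : per_partition)
        for (const Block & block : squash(entry.second, min_rows))
            received[entry.first].push_back(block.size());
    return received;
}
```

Both orderings deliver the same rows to each partition; what changes in this toy model is how those rows are grouped into blocks by the time they reach the per-partition sink.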

How was this patch tested?

Using existing UTs.

@github-actions

#7028

@github-actions

Run Gluten Clickhouse CI on x86

@baibaichen baibaichen merged commit c1a3f7b into apache:main Nov 13, 2024
@baibaichen baibaichen deleted the feature/partition_mergetree branch November 13, 2024 01:32
philo-he pushed a commit to philo-he/gluten that referenced this pull request Nov 13, 2024
…rgetree (apache#7924)

* [Refactor] simple refactor
* [Refactor] Remove setStats
* [Refactor] SparkPartitionedBaseSink and WriteStatsBase
* [Refactor] Add explicit SparkMergeTreeWriteSettings(const DB::ContextPtr & context);
* [New] Support writing partition mergetree in one pipeline

Development

Successfully merging this pull request may close these issues.

[CH] Fully Support writing parquet and mergetree in spark 3.5.x with delta protocol