[GLUTEN-7028][CH][Part-5] Refactor: add NativeOutputWriter to unify CHDatasourceJniWrapper#7395

Merged
baibaichen merged 11 commits into apache:main from baibaichen:feature/one-pipeline-native_out
Oct 9, 2024

@baibaichen baibaichen commented Sep 30, 2024

What changes were proposed in this pull request?

Fixes #7028. This is the last refactoring PR in the series; it unifies how information is passed across Spark 3.3 and Spark 3.5.

  1. Add NativeOutputWriter so that we can unify CHDatasourceJniWrapper:
NativeOutputWriter
   | - NormalFileWriter      --> for file-based Parquet and ORC
   | - SparkMergeTreeWriter  --> for MergeTree, based on ClickHouse storage
  2. Use Configuration to pass config from the driver to the workers. This is the standard way Spark does it, and hence we can use the same CHMergeTreeWriterInjects::createOutputWriter definition.
  3. Use WriteRel to pass info from the JVM to C++; see the data structure below. optimization is an Any, and the message Write is added in write_optimization.proto.
WriteRel
   |- tableSchema
   |- namedTable
   |--- advancedExtension
   |----- optimization : Any => Write


message Write {
  message Common {
    string format = 1;
  }
  message ParquetWrite{}
  message OrcWrite{}
  message MergeTreeWrite{
   // ...
  }

  Common common = 1;
  oneof file_format {
    ParquetWrite parquet = 2;
    OrcWrite orc = 3;
    MergeTreeWrite mergetree = 4;
  }
}
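
As a rough illustration of how the `file_format` oneof could drive the choice between the two writers listed in point 1, here is a small Java sketch. It is an assumption-laden model rather than the actual implementation: `FileFormatCase` is a hand-written enum standing in for whatever the protobuf compiler generates for the oneof, and `createWriter` and `describe` are hypothetical names.

```java
public class WriterDispatchSketch {
    // Hand-written stand-in for the oneof case of Write.file_format.
    enum FileFormatCase { PARQUET, ORC, MERGETREE, NOT_SET }

    // Common interface, mirroring the role of NativeOutputWriter in the description.
    interface NativeOutputWriter {
        String describe();
    }

    // File-based writer shared by Parquet and ORC.
    static final class NormalFileWriter implements NativeOutputWriter {
        private final String format;
        NormalFileWriter(String format) {
            this.format = format;
        }
        @Override
        public String describe() {
            return "NormalFileWriter(" + format + ")";
        }
    }

    // Writer for MergeTree, which is backed by ClickHouse storage.
    static final class SparkMergeTreeWriter implements NativeOutputWriter {
        @Override
        public String describe() {
            return "SparkMergeTreeWriter";
        }
    }

    // Hypothetical factory: a single entry point that picks the writer from the oneof case.
    static NativeOutputWriter createWriter(FileFormatCase which) {
        switch (which) {
            case PARQUET:
                return new NormalFileWriter("parquet");
            case ORC:
                return new NormalFileWriter("orc");
            case MERGETREE:
                return new SparkMergeTreeWriter();
            default:
                throw new IllegalArgumentException("Write.file_format is not set");
        }
    }

    public static void main(String[] args) {
        for (FileFormatCase c : new FileFormatCase[] {
                FileFormatCase.PARQUET, FileFormatCase.ORC, FileFormatCase.MERGETREE}) {
            System.out.println(c + " -> " + createWriter(c).describe());
        }
    }
}
```

With a single factory like this, callers only hold a NativeOutputWriter, which is the property that lets one JNI entry point serve both the file-based formats and MergeTree.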

How was this patch tested?

UTs

@github-actions

#7028

@github-actions

Run Gluten Clickhouse CI

github-actions bot commented Oct 4, 2024

Run Gluten Clickhouse CI

@baibaichen baibaichen force-pushed the feature/one-pipeline-native_out branch from 9a13f5f to def1e47 Compare October 8, 2024 01:53
github-actions bot commented Oct 8, 2024

Run Gluten Clickhouse CI

@baibaichen baibaichen force-pushed the feature/one-pipeline-native_out branch from def1e47 to 552e819 Compare October 8, 2024 09:20
github-actions bot commented Oct 8, 2024

Run Gluten Clickhouse CI

@baibaichen baibaichen force-pushed the feature/one-pipeline-native_out branch from 552e819 to c139b8d Compare October 8, 2024 16:51
github-actions bot commented Oct 8, 2024

Run Gluten Clickhouse CI

@baibaichen baibaichen force-pushed the feature/one-pipeline-native_out branch from c139b8d to ef03755 Compare October 9, 2024 01:20
github-actions bot commented Oct 9, 2024

Run Gluten Clickhouse CI

@baibaichen baibaichen force-pushed the feature/one-pipeline-native_out branch from ef03755 to 17bb049 Compare October 9, 2024 03:35
@baibaichen baibaichen marked this pull request as ready for review October 9, 2024 03:35
github-actions bot commented Oct 9, 2024

Run Gluten Clickhouse CI

@baibaichen baibaichen force-pushed the feature/one-pipeline-native_out branch from 17bb049 to 898498e Compare October 9, 2024 06:46
github-actions bot commented Oct 9, 2024

Run Gluten Clickhouse CI

@baibaichen baibaichen changed the title from "[GLUTEN-7028][CH][Part-5]" to "[GLUTEN-7028][CH][Part-5] Refactor: add NativeOutputWriter to unify CHDatasourceJniWrapper" Oct 9, 2024
@baibaichen baibaichen merged commit 5d28de6 into apache:main Oct 9, 2024
@baibaichen baibaichen deleted the feature/one-pipeline-native_out branch October 9, 2024 09:38
baibaichen added a commit to Kyligence/gluten that referenced this pull request Oct 15, 2024
(cherry picked from commit 94e1837a922d5a092226b195d6c3079d320878cb)
baibaichen added a commit that referenced this pull request Oct 15, 2024
* [GLUTEN-1632][CH]Daily Update Clickhouse Version (20241015)

* Fix Build due to ClickHouse/ClickHouse#70135

* Resovle conflict with #7322

* gtest skip since plan is chagned due to #7395

(cherry picked from commit 94e1837a922d5a092226b195d6c3079d320878cb)

---------

Co-authored-by: kyligence-git <gluten@kyligence.io>
Co-authored-by: Chang Chen <baibaichen@gmail.com>
sharkdtu pushed a commit to sharkdtu/gluten that referenced this pull request Nov 11, 2024
…HDatasourceJniWrapper (apache#7395)

* Add NativeOutputWriter

* refactor CHDatasourceJniWrapper

* WriteConfiguration

* using hadoop Configuration to pass parameter

* Implement CHMergeTreeWriterInjects::createNativeWrite

* Rename datasources.clickhouse.ClickhouseMetaSerializer => datasources.mergetree.MetaSerializer

* delete MergeTreeDeltaUtil and move its functionality to StorageMeta

* WriteConfiguration => StorageConfigProvider

* fix prefixof

* WriteConfiguration => StorageConfigProvider 2

* withStorageID

Development

Successfully merging this pull request may close these issues.

[CH] Fully Support writing parquet and mergetree in spark 3.5.x with delta protocol
