[GLUTEN-2877][VL]Feat: Support read iceberg cow table#3043
[GLUTEN-2877][VL]Feat: Support read iceberg cow table#3043liujiayi771 wants to merge 4 commits intoapache:mainfrom
Conversation
|
Thanks for opening a pull request! Could you open an issue for this pull request on Github Issues? https://github.com/oap-project/gluten/issues Then could you also rename commit message and pull request title in the following format? See also: |
|
Run Gluten Clickhouse CI |
|
Run Gluten Clickhouse CI |
|
Run Gluten Clickhouse CI |
| override def filterExprs(): Seq[Expression] = scan match { | ||
| case fileScan: FileScan => | ||
| fileScan.dataFilters ++ pushdownFilters | ||
| case scan if scan.getClass.getSimpleName == "SparkBatchQueryScan" => |
There was a problem hiding this comment.
Why not use isInstanceOf ?
There was a problem hiding this comment.
SparkBatchQueryScan is protected class in iceberg, we can not import this class here.
gluten-core/src/main/scala/org/apache/iceberg/spark/source/GlutenSparkBatchQueryScan.scala
Outdated
Show resolved
Hide resolved
| tasks.map(_.asCombinedScanTask()).foreach { | ||
| task => | ||
| val file = task.files().asScala.head.file() | ||
| file.format() match { |
There was a problem hiding this comment.
Iceberg files maybe the mix of parquet and orc, how can we handle this?
There was a problem hiding this comment.
Maybe this is a constraint, we can document it. Check all the files is a too high expense and this situation is rarely.
There was a problem hiding this comment.
Yes, we should clarify the constraints.
| ) | ||
|
|
||
| val locations = SoftAffinityUtil.getFilePartitionLocations(partition) | ||
| val locations = SoftAffinityUtil.getFilePartitionLocations(partition, partition.files) |
There was a problem hiding this comment.
Can get partition.files in getFilePartitionLocations, don't need to add it as argument.
There was a problem hiding this comment.
The modification here is because the partition file of iceberg cannot be obtained directly through icebergPartition.files.
|
Can you add the test about iceberg table? |
|
I'm afraid it would be hard to maintain if we mixed iceberg code into everywhere on gluten. What this pr does is something like: I wonder if we can make it like: The main difference is that, we can create new modules for iceberg or other Spark downstream projects. We can expose the power of Gluten as extension as much as possile, then these modules can be developed base on Gluten extension. |
|
Run Gluten Clickhouse CI |
Add a new test case VeloxTPCHIcebergSuite. |
I agree with you. We need to add the ability to inject new rules into the scan processing part in gluten-core. We will optimize this later. Our team are also supporting delta, hudi and paimon. support for different lake formats requires a better architecture. |
|
Run Gluten Clickhouse CI |
| override def filterExprs(): Seq[Expression] = scan match { | ||
| case fileScan: FileScan => | ||
| fileScan.dataFilters ++ pushdownFilters | ||
| case scan if scan.getClass.getSimpleName == "SparkBatchQueryScan" => |
There was a problem hiding this comment.
Comparing simple name may be dangerous. What if Delta Lake (or other formats) has the same class name
There was a problem hiding this comment.
Comparing simple name may be dangerous. What if Delta Lake (or other formats) has the same class name
You are right, it will be more accurate to use a name containing package name, but basically the scan names of different formats are different. Gluten uses simple name when judging the scan type.
|
@liujiayi771 Do you work on refactor this PR or plan to do so based on discussion? |
We will post the design in the next couple of days as it is almost finished. After discussing it, we will refactor this pull request. |
|
@yma11 Apologies for the delay. We will completely decouple iceberg, delta, and gluten-core. Many areas will require modifications. |
I see. Look forward for your design. Thanks. |
|
Run Gluten Clickhouse CI |
|
Close this PR as related implementation all done in later PRs. |
What changes were proposed in this pull request?
Support read iceberg cow table in gluten. Resolve the first step in #2877.
How was this patch tested?
Run tpcds benchmark in 1T tpcds iceberg tables.