[VL] Unified design for data lake read support in Gluten + Velox #3378

### Description

There are currently two open PRs in Gluten, one adding [Iceberg COW table read](https://github.com/oap-project/gluten/pull/3043) and one adding [Delta Lake read](https://github.com/oap-project/gluten/pull/2902). There is also an active discussion in Velox about [Iceberg read](https://github.com/facebookincubator/velox/issues/5977) support. Consolidating these ideas and taking Gluten's position into account, we would like to share a draft of a unified design for data lake read support in Gluten.

As stated on this project's home page, one of Gluten's key functions is to transform Spark's whole-stage physical plan into a Substrait plan and send it to the native engine. The same applies to data lake read support, thus:

1. We should avoid hacking the original Spark physical plan nodes. Gluten core has plan transformers that generate the correct plan info in Substrait format and pass it to Velox for reading and computation. So whatever the hack is, it should be doable in the transformer layer; column mapping is one example. IMO, we should do our best to pass the original info in the Spark plan through to Velox, with Gluten acting as a bridge, and consume it correctly on the Velox side, unless that is not doable or Velox can't support it. By the way, a feature like column mapping is common to reading many file formats, so Velox can handle it at its data source level, and the community plans to do so.

2. A clear transformer hierarchy is needed for the different data lake backends. The [Iceberg COW table read](https://github.com/oap-project/gluten/pull/3043) PR adds a new branch for Iceberg-specific processing and relies on a utility class placed in a dedicated folder. I believe more branches will be needed in the future to support other cases, such as MOR. So introducing a new transformer that inherits from `BatchScanExecTransformer` would be a better approach. A possible hierarchy looks like this:


```
        IcebergDataSource?                 DeltaLakeDataSource?                          ?
                       \                             |                             /
                                   \                 |                 /
                                                   velox
                                                     |
                                                 substrait
                                   /                 |                 \
                       /                             |                             \
SparkBatchQueryScanExecTransformer     DeltaLakeScanExecTransformer                    \
                 |                                   |                                   |
     BatchScanExecTransformer          FileSourceScanExecTransformer       HiveTableScanExecTransformer
                       \                             |                             /
                                         BasicScanExecTransformer
```
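
To make the two points above a bit more concrete, here is a minimal Java sketch of how such a hierarchy could carry data-lake-specific info without touching the Spark plan node. It is not Gluten code: every class name ending in `Sketch`, the `addExtraProperties` hook, and the property keys (`datalake`, `columnMapping.*`) are hypothetical, and `ScanInfoSketch` is only a stand-in for whatever would actually be serialized into the Substrait plan. `IcebergScanSketch` is assumed to play the role of `SparkBatchQueryScanExecTransformer` in the diagram.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Demo entry point. Everything in this file is a hypothetical stand-in, not Gluten code.
class DataLakeScanDemo {
    public static void main(String[] args) {
        BasicScanSketch delta = new DeltaLakeScanSketch(
            "parquet",
            Arrays.asList("id", "name"),
            Collections.singletonMap("name", "col-1a2b"));
        ScanInfoSketch info = delta.toScanInfo();
        // The Spark-side column names stay untouched; the mapping travels alongside them.
        System.out.println(info.logicalColumns + " " + info.properties);
    }
}

// Stand-in for the scan information that would be serialized into the Substrait plan.
final class ScanInfoSketch {
    final String fileFormat;
    final List<String> logicalColumns;     // names exactly as they appear in the Spark plan
    final Map<String, String> properties;  // backend-specific hints forwarded to native

    ScanInfoSketch(String fileFormat, List<String> logicalColumns, Map<String, String> properties) {
        this.fileFormat = fileFormat;
        this.logicalColumns = Collections.unmodifiableList(new ArrayList<>(logicalColumns));
        this.properties = Collections.unmodifiableMap(new LinkedHashMap<>(properties));
    }
}

// Root of the hierarchy: logic shared by every scan transformer.
abstract class BasicScanSketch {
    private final String fileFormat;
    private final List<String> outputColumns;

    BasicScanSketch(String fileFormat, List<String> outputColumns) {
        this.fileFormat = fileFormat;
        this.outputColumns = outputColumns;
    }

    // Hook for data-lake-specific details; the default adds nothing.
    protected void addExtraProperties(Map<String, String> props) { }

    // Shared conversion path: subclasses only contribute through the hook.
    final ScanInfoSketch toScanInfo() {
        Map<String, String> props = new LinkedHashMap<>();
        addExtraProperties(props);
        return new ScanInfoSketch(fileFormat, outputColumns, props);
    }
}

class BatchScanSketch extends BasicScanSketch {
    BatchScanSketch(String format, List<String> cols) { super(format, cols); }
}

class FileSourceScanSketch extends BasicScanSketch {
    FileSourceScanSketch(String format, List<String> cols) { super(format, cols); }
}

class HiveTableScanSketch extends BasicScanSketch {
    HiveTableScanSketch(String format, List<String> cols) { super(format, cols); }
}

// Assumed to play the role of SparkBatchQueryScanExecTransformer (Iceberg side).
class IcebergScanSketch extends BatchScanSketch {
    IcebergScanSketch(String format, List<String> cols) { super(format, cols); }

    @Override
    protected void addExtraProperties(Map<String, String> props) {
        props.put("datalake", "iceberg"); // hypothetical key
    }
}

// Plays the role of DeltaLakeScanExecTransformer.
class DeltaLakeScanSketch extends FileSourceScanSketch {
    private final Map<String, String> logicalToPhysical;

    DeltaLakeScanSketch(String format, List<String> cols, Map<String, String> logicalToPhysical) {
        super(format, cols);
        this.logicalToPhysical = logicalToPhysical;
    }

    @Override
    protected void addExtraProperties(Map<String, String> props) {
        props.put("datalake", "delta"); // hypothetical key
        // Forward the column mapping as-is so the native side can resolve physical names.
        for (Map.Entry<String, String> e : logicalToPhysical.entrySet()) {
            props.put("columnMapping." + e.getKey(), e.getValue());
        }
    }
}
```

The idea is that a new backend or a new case such as MOR becomes another subclass overriding the hook, instead of another branch inside an existing transformer, and the original column names from the Spark plan reach the native side unchanged.
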
@YannByron @felipepessoto @liujiayi771 @ulysses-you, please comment on the suggestions above. Thanks.
