
[VL] Unified design for data lake read support in Gluten + Velox #3378

@yma11

Description


Currently there are two PRs open in Gluten, one to support Iceberg COW table read and one to support Delta Lake read. There is also an active discussion in Velox about Iceberg read support. Consolidating these ideas and taking Gluten's position into account, we would like to share a draft unified design for data lake read support in Gluten.

As stated on this project's home page, one of Gluten's key functions is to transform Spark's whole-stage physical plan into a Substrait plan and send it to the native side. This applies to data lake read support as well, thus:

  1. We'd better avoid hacking the original Spark physical plan nodes. Gluten core has plan transformers that generate the correct plan info in Substrait format and pass it to Velox for read and computation. So whatever the hack is, such as column mapping, it should be doable in the transformer layer. IMO, we should do our best to pass the original info in the Spark plan through to Velox, with Gluten acting as a bridge, and consume it correctly on the Velox side, unless that is not doable or Velox can't support it. By the way, one point about a feature like column mapping: it is common to reading many kinds of file formats, so Velox can handle it at its data source level, and the community has a plan to do so.

  2. A clear transformer hierarchy is needed for the different data lake backends. In the Iceberg COW table read PR, a new branch is added to do Iceberg-specific processing, leveraging a utility class placed in a dedicated folder. I believe more branches will be needed in the future to support other cases, like MOR. So introducing a new transformer that inherits from BatchScanExecTransformer would be a better way. The possible hierarchy would look like the following:

            IcebergDataSource?                DeltaLakeDataSource?                       ?
                              \                         |                         /
                                          \             |             /
                                                      velox
                                                        |
                                                    substrait
                                          /             |             \
                              /                         |                         \
    SparkBatchQueryScanExecTransformer    DeltaLakeScanExecTransformer                \
                     |                                  |                                |
         BatchScanExecTransformer         FileSourceScanExecTransformer    HiveTableScanExecTransformer
                              \                         |                         /
                                            BasicScanExecTransformer
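
To make the two points above more concrete, here is a minimal sketch of the idea, not the actual Gluten code. All class names ending in `Sketch`, the `fileFormat()` / `readProperties()` / `toReadRel()` methods, the format tags, and the `columnMapping.` property key are assumptions made up for illustration. It only shows that format-specific behavior can live in subclasses rather than in branches, and that info such as a column mapping can be forwarded to the native side untouched instead of rewriting the Spark plan.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the transformers in the diagram above.
// None of these types, methods, or property keys are Gluten's real API.
abstract class BasicScanSketch {
    protected final List<String> outputColumns;

    BasicScanSketch(List<String> outputColumns) {
        this.outputColumns = List.copyOf(outputColumns);
    }

    // Format tag the native side could use to choose a data source.
    abstract String fileFormat();

    // Format-specific info forwarded as-is to the native side.
    Map<String, String> readProperties() {
        return new LinkedHashMap<>();
    }

    // Shared plan generation: subclasses contribute only the format and the
    // properties, so no per-format branch is needed here.
    final String toReadRel() {
        return "ReadRel(format=" + fileFormat()
            + ", columns=" + outputColumns
            + ", properties=" + readProperties() + ")";
    }
}

class BatchScanSketch extends BasicScanSketch {
    BatchScanSketch(List<String> outputColumns) { super(outputColumns); }

    @Override String fileFormat() { return "parquet"; }
}

// Iceberg specifics live in a subclass instead of a branch in the parent.
class SparkBatchQueryScanSketch extends BatchScanSketch {
    SparkBatchQueryScanSketch(List<String> outputColumns) { super(outputColumns); }

    @Override String fileFormat() { return "iceberg"; }
}

class FileSourceScanSketch extends BasicScanSketch {
    FileSourceScanSketch(List<String> outputColumns) { super(outputColumns); }

    @Override String fileFormat() { return "parquet"; }
}

// Delta specifics: the logical-to-physical column mapping is passed through
// as properties rather than applied by rewriting the Spark plan.
class DeltaLakeScanSketch extends FileSourceScanSketch {
    private final Map<String, String> logicalToPhysical;

    DeltaLakeScanSketch(List<String> outputColumns, Map<String, String> logicalToPhysical) {
        super(outputColumns);
        this.logicalToPhysical = new LinkedHashMap<>(logicalToPhysical);
    }

    @Override String fileFormat() { return "delta"; }

    @Override Map<String, String> readProperties() {
        Map<String, String> props = super.readProperties();
        logicalToPhysical.forEach((logical, physical) ->
            props.put("columnMapping." + logical, physical));
        return props;
    }
}

class ScanSketchDemo {
    public static void main(String[] args) {
        List<BasicScanSketch> scans = List.of(
            new SparkBatchQueryScanSketch(List.of("id", "name")),
            new DeltaLakeScanSketch(List.of("id"), Map.of("id", "col-1")));
        // Each scan is serialized through the same shared code path.
        for (BasicScanSketch scan : scans) {
            System.out.println(scan.toReadRel());
        }
    }
}
```

With this shape, supporting another case such as MOR would presumably mean adding one more subclass rather than another branch in an existing transformer.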

@YannByron @felipepessoto @liujiayi771 @ulysses-you, please comment on the suggestions above. Thanks.
