[VL] Unified design for data lake read support in Gluten + Velox #3378
Description
Currently, two PRs are open in Gluten to support Iceberg COW table reads and Delta Lake reads. There is also an active discussion in Velox about Iceberg read support. Consolidating those ideas and considering Gluten's position, we would like to share a draft of a unified design for data lake read support in Gluten.
As stated on this project's home page, one of Gluten's key functions is to transform Spark's whole-stage physical plan into a Substrait plan and send it to the native backend. The same applies to data lake read support, thus:
- We should avoid hacking the original Spark physical plan nodes. Gluten core has plan transformers that generate the correct plan info in Substrait format and pass it to Velox for reading and computation. So whatever adjustment is needed, such as column mapping, it should be doable in the transformer layer. IMO, we should try our best to pass the original info in the Spark plan through to Velox and consume it correctly on the Velox side, unless that is not doable or Velox cannot support it. By the way, one consideration for a feature like column mapping is that it is common to reading many file formats; Velox can handle it at its data source level, and the community plans to do so.
- A clear transformer hierarchy is needed for the different data lake backends. In the Iceberg COW table read PR, a new branch is added to do Iceberg-specific processing, leveraging a utility class placed in a dedicated folder. In the future, I believe more branches will be needed to support other cases, such as MOR. So introducing a new transformer that inherits from `BatchScanExecTransformer` would be a better approach. A possible hierarchy is as follows:
                  IcebergDataSource?          DeltaLakeDataSource?          ?
                                        \               |               /
                                            \           |           /
                                                \       |       /
                                                      velox
                                                        |
                                                    substrait
                                                /       |       \
                                        /               |               \
    SparkBatchQueryScanExecTransformer    DeltaLakeScanExecTransformer              \
                     |                                  |
         BatchScanExecTransformer         FileSourceScanExecTransformer      HiveTableScanExecTransformer
                                  \                     |                     /
                                            BasicScanExecTransformer
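The lower half of the hierarchy above can be sketched as a small class tree. This is only an illustration in Python, not Gluten's actual Scala code: `ScanNode`, `to_read_rel`, `extra_read_options`, the `file_format` values, and the property-prefix filtering are all hypothetical stand-ins. The point it tries to show is that each data lake backend gets its own transformer subclass, the original scan node is left untouched, and backend-specific info (for example column mapping settings) is forwarded as-is for the native side to consume.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ScanNode:
    """Hypothetical stand-in for a Spark physical scan node (never modified)."""
    table: str
    columns: List[str]
    properties: Dict[str, str] = field(default_factory=dict)


class BasicScanExecTransformer:
    """Logic shared by every scan transformer."""

    file_format = "unknown"  # illustrative tag, not a real Gluten field

    def __init__(self, scan: ScanNode):
        self.scan = scan

    def extra_read_options(self) -> Dict[str, str]:
        # Hook for backend-specific info; the base class forwards nothing.
        return {}

    def to_read_rel(self) -> dict:
        # Simplified, hypothetical stand-in for a Substrait read relation.
        return {
            "table": self.scan.table,
            "columns": list(self.scan.columns),
            "format": self.file_format,
            "options": self.extra_read_options(),
        }


class BatchScanExecTransformer(BasicScanExecTransformer):
    file_format = "parquet"


class FileSourceScanExecTransformer(BasicScanExecTransformer):
    file_format = "parquet"


class HiveTableScanExecTransformer(BasicScanExecTransformer):
    file_format = "hive"


class SparkBatchQueryScanExecTransformer(BatchScanExecTransformer):
    """Iceberg reads: extends the batch scan transformer."""
    file_format = "iceberg"

    def extra_read_options(self) -> Dict[str, str]:
        # Forward table properties unchanged (prefix filter is an assumption).
        return {k: v for k, v in self.scan.properties.items()
                if k.startswith("iceberg.")}


class DeltaLakeScanExecTransformer(FileSourceScanExecTransformer):
    """Delta Lake reads: extends the file source scan transformer."""
    file_format = "delta"

    def extra_read_options(self) -> Dict[str, str]:
        # Forward table properties unchanged (prefix filter is an assumption).
        return {k: v for k, v in self.scan.properties.items()
                if k.startswith("delta.")}


scan = ScanNode("events", ["id", "ts"], {"delta.columnMapping.mode": "name"})
read_rel = DeltaLakeScanExecTransformer(scan).to_read_rel()
print(read_rel["format"], read_rel["options"])
```

With this shape, supporting another case such as MOR would mean adding or extending a subclass rather than adding another branch inside an existing transformer.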
@YannByron @felipepessoto @liujiayi771 @ulysses-you, please comment on the suggestions above. Thanks.