[Gluten-core][VL] Supports DeltaLake 2.2 Read#3376
[Gluten-core][VL] Supports DeltaLake 2.2 Read#3376YannByron wants to merge 1 commit intoapache:mainfrom
Conversation
|
Thanks for opening a pull request! Could you open an issue for this pull request on Github Issues? https://github.com/oap-project/gluten/issues Then could you also rename commit message and pull request title in the following format? See also: |
|
Run Gluten Clickhouse CI |
|
Follow #2902, more previous discussion can be found there. |
|
Run Gluten Clickhouse CI |
5c4dc20 to
403eb40
Compare
|
Run Gluten Clickhouse CI |
403eb40 to
3d6a949
Compare
|
Run Gluten Clickhouse CI |
3d6a949 to
bc6f0f4
Compare
|
Run Gluten Clickhouse CI |
|
@YannByron do you have any suggestion how we can test how broken is Delta when running on Gluten/Velox? I started an experiment, but then realized it doesn't follow the same pattern Gluten uses to test Spark. It seems we extends the existing classes and adds GlutenSQLTestsTrait, example: https://github.com/oap-project/gluten/pull/1381/files. The experiment I started was to actually change Delta code to run the tests using Gluten: 1-) Cherry picked this to Delta 2.2. (this old workaround is not needed anymore and was messing the extension setting): delta-io/delta@975daae 3-) Added a reference to gluten jar in libraryDependencies for delta-core. But it it not a complete solution, most of tests are failing or not running. And the ones which passes, I can't say if they are passing because it is really working fine with Gluten/Velox, or because it falls back to non-native. |
|
Found that _metadata is not working for Delta when column mapping is enabled. _metadata has been recently changed to fall back: #2618 |
|
@YannByron will you work on update this PR based on discussion? |
|
Hi, @felipepessoto, sorry for the late reply. I run TPCDS based on delta table with gluten/velox in EMR. I use Delta 2.2 only with some internal commits (there aren't related to gluten/velox), and add the delta jar in
I'm not really aware of this _metadata. Where to use this during querying delta in gluten/velox env? If _metadata is used in parsing delta log, i think we can ignore for now because there are other factors (like UDFs) which make parsing delta log unsupported and the time cost for this phase was low. |
|
I think _metadata is not heavily used in 2.2, or not at all, but may be needed to replace the input_file_name UDF. In recent versions like 2.4 it is used for deletion vectors, as it needs the _metadata_row_index. I created this repro: It fails because we try to replace every column, and _metadata fields are not in the mapping:
|
What changes were proposed in this pull request?
(Fixes: #2891)
How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)