SPARK-3968 Use parquet-mr filter2 api in spark sql #2841
saucam wants to merge 10 commits into apache:master from
Conversation

Can one of the admins verify this patch?

This PR also fixes:

ok to test

QA tests have started for PR 2841 at commit

QA tests have finished for PR 2841 at commit

Test FAILed.
…ns" since filtering on optional columns is now supported in filter2 api. This reverts commit 98eecf7108b45030d298f04b0ed0d7a80db58761.
2. Use the serialization/deserialization from Parquet library for filter pushdown
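To illustrate the idea behind reusing the Parquet library's serialization for filter pushdown, here is a minimal, self-contained Java sketch of the round trip: the driver serializes a predicate object into a Base64 string (as one would store it in a Hadoop Configuration entry) and the task side reads it back. The `GtPredicate` class and method names are hypothetical stand-ins, not parquet-mr's actual API; parquet-mr's filter predicates are serialized in a similar spirit.

```java
import java.io.*;
import java.util.Base64;

// Hypothetical stand-in for a pushed-down filter predicate.
class GtPredicate implements Serializable {
    final String column;
    final int value;
    GtPredicate(String column, int value) { this.column = column; this.value = value; }
}

public class PredicateRoundTrip {
    // Serialize the predicate to a Base64 string, as one might store it
    // in a job configuration entry on the driver side.
    static String serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return Base64.getEncoder().encodeToString(bos.toByteArray());
    }

    // Deserialize on the task side from the configuration string.
    static Object deserialize(String encoded) throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        GtPredicate p = new GtPredicate("age", 21);
        String conf = serialize(p);                          // driver side
        GtPredicate back = (GtPredicate) deserialize(conf);  // task side
        System.out.println(back.column + " > " + back.value);
    }
}
```

The point of delegating this to the Parquet library is that both ends of the round trip stay in sync with the library's own predicate classes.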
Test build #22211 has started for PR 2841 at commit

Test build #22211 has finished for PR 2841 at commit

Test FAILed.

Can someone help? Where is the build failing? I can make a distribution without errors, and dev/lint-scala also ran successfully ...

There are style violations. Run

Spaces are required before comments:

Test build #22227 has started for PR 2841 at commit

Test build #22227 has finished for PR 2841 at commit

Test FAILed.
…e a record filter
Test build #22238 has started for PR 2841 at commit

Test build #22238 has finished for PR 2841 at commit

Test PASSed.

Test build #22299 has started for PR 2841 at commit

Test build #22299 has finished for PR 2841 at commit

Test PASSed.

Added a unit test for filter pushdown on an optional column

Test build #22339 has started for PR 2841 at commit

Test build #22339 has finished for PR 2841 at commit

Test PASSed.
Small style thing: you should have spaces before { here and in a few other places (search the diff for ){).
Hey @saucam, I took a look at this too because I had tried upgrading to Parquet 1.6 in a different branch to use decimals. I made a few comments above. Apart from that, this PR doesn't seem to have any tests for the new functionality (in particular, skipping row groups) or for the methods that build up Parquet filters. Do you mind adding some of those?
It would be nice to add some tests with == null or >= null as well to make sure these filters work.

The nullable option is set when the field is optional, so I am adding tests for those.
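To make the null-handling expectation concrete, here is a small self-contained Java sketch of the semantics those tests should pin down on an optional (nullable) column: an "is null" check matches only the nulls, while an ordering comparison such as >= x never matches a null value. The class and helper names here are illustrative, not Parquet's API.

```java
import java.util.*;

// Conceptual sketch of filter semantics on a nullable (optional) column.
public class NullableFilterDemo {
    // "== null" style check: matches only null entries.
    static boolean isNull(Integer v) { return v == null; }

    // ">= x" style check: a null value never satisfies an ordering comparison.
    static boolean gte(Integer v, int x) { return v != null && v >= x; }

    public static void main(String[] args) {
        List<Integer> column = Arrays.asList(1, null, 5, null, 9);

        long nullMatches = column.stream().filter(NullableFilterDemo::isNull).count();
        long gteMatches  = column.stream().filter(v -> gte(v, 5)).count();

        // Expect 2 nulls and 2 values >= 5 (5 and 9).
        System.out.println(nullMatches + " " + gteMatches);
    }
}
```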
Hi @mateiz, @marmbrus, thanks for the suggestions, just a few points
in sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetQuerySuite.scala. Suggestions please?
…g on optional columns
Added more tests for filtering on nullable columns

Test build #22448 has started for PR 2841 at commit

Test build #22448 has finished for PR 2841 at commit

Test PASSed.
Alright, thanks for adding the tests. Let's get Michael's feedback on the metadata thing; I don't fully understand it. I guess it allows tasks to query different subsets of the metadata in parallel?
Yes. In the task-side metadata strategy, the tasks are spawned first, and each task then reads the metadata and drops row groups. So if I am using YARN and the data is huge (so the metadata is large), the memory is consumed on the YARN side; but with the client-side metadata strategy, all of the metadata is read on a single node before the tasks are spawned.
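The two strategies can be sketched side by side. This is a conceptual, self-contained Java simulation (not parquet-mr's classes): each row group's footer carries min/max statistics, and a predicate is checked against them. In the client-side strategy the driver reads all footers up front and schedules tasks only for surviving row groups; in the task-side strategy every row group gets a task and each task reads just its own footer and skips itself if ruled out. Both end up with the same survivors, but the metadata is read in different places.

```java
import java.util.*;
import java.util.stream.*;

public class MetadataStrategies {
    // Stand-in for a row group footer with min/max statistics for one column.
    static class RowGroup {
        final int id, min, max;
        RowGroup(int id, int min, int max) { this.id = id; this.min = min; this.max = max; }
    }

    // Predicate: keep row groups that may contain values > threshold.
    static boolean mayMatch(RowGroup rg, int threshold) { return rg.max > threshold; }

    public static void main(String[] args) {
        List<RowGroup> footers = Arrays.asList(
            new RowGroup(0, 1, 10), new RowGroup(1, 11, 20), new RowGroup(2, 21, 30));
        int threshold = 15;

        // Client-side strategy: one node reads *all* footers before any task starts.
        List<Integer> clientSide = footers.stream()
            .filter(rg -> mayMatch(rg, threshold))
            .map(rg -> rg.id)
            .collect(Collectors.toList());

        // Task-side strategy: one task per row group; each reads only its own footer.
        List<Integer> taskSide = new ArrayList<>();
        for (RowGroup rg : footers) {
            if (mayMatch(rg, threshold)) taskSide.add(rg.id);
        }

        System.out.println(clientSide + " " + taskSide); // same survivors either way
    }
}
```

The trade-off in the thread follows directly: client-side pays the whole metadata cost on one JVM before scheduling, task-side spreads that cost across tasks.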
I talked to some Twitter people and they were pretty excited about the task-side metadata reading, because with big datasets they were seeing lots of OOMs before the jobs even started. It could also be pretty good for S3 if we can avoid doing so much work serially on the driver. That said, it seems like it would make features like merging multiple unique schemas impossible, and it's newer / less tested. So we'll want to be able to configure this easily.
Also, it looks like they are switching the default in Parquet to task-side: https://issues.apache.org/jira/browse/PARQUET-122
Cool, that makes sense. Anyway, if this looks good to you, Michael, you should merge it.
Thanks! Merged to master.

Thanks, closed it and assigned it to you.
The parquet-mr project has introduced a new filter API (https://github.com/apache/incubator-parquet-mr/pull/4), along with several fixes. It can also eliminate entire row groups based on column statistics such as min/max.
We can leverage that to further improve the performance of queries with filters.
The filter2 API also introduces the ability to create custom filters. We can create a custom filter for the optimized In clause (InSet), so that elimination happens in the ParquetRecordReader itself.
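To sketch what such a custom InSet filter could buy us: before reading a row group, check its min/max statistics and drop the whole group if no element of the IN-set can fall inside that range. The following self-contained Java example is conceptual only; the class and method names are illustrative, not parquet-mr's filter2 API.

```java
import java.util.*;

public class InSetRowGroupSkip {
    // Stand-in for a row group's min/max statistics on the filtered column.
    static class RowGroupStats {
        final int min, max;
        RowGroupStats(int min, int max) { this.min = min; this.max = max; }
    }

    // canDrop == true means the whole row group is eliminated without reading a row.
    static boolean canDrop(RowGroupStats stats, Set<Integer> inSet) {
        for (int v : inSet) {
            if (v >= stats.min && v <= stats.max) return false; // value may be present
        }
        return true; // no element of the set can occur in this row group
    }

    public static void main(String[] args) {
        Set<Integer> inSet = new HashSet<>(Arrays.asList(42, 77));
        List<RowGroupStats> groups = Arrays.asList(
            new RowGroupStats(0, 10),    // dropped: no set element lies in [0, 10]
            new RowGroupStats(40, 60),   // kept: 42 falls in [40, 60]
            new RowGroupStats(70, 90));  // kept: 77 falls in [70, 90]

        int dropped = 0;
        for (RowGroupStats g : groups) {
            if (canDrop(g, inSet)) dropped++;
        }
        System.out.println("dropped " + dropped + " of " + groups.size());
    }
}
```

Note that min/max pruning is conservative: a kept row group may still contain no matching rows, but a dropped one provably cannot.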