SPARK-3968 Use parquet-mr filter2 api in spark sql #2841
saucam wants to merge 10 commits into apache:master from
Conversation

Can one of the admins verify this patch?

This PR also fixes:

ok to test

QA tests have started for PR 2841 at commit

QA tests have finished for PR 2841 at commit

Test FAILed.
…ns" since filtering on optional columns is now supported in filter2 api. This reverts commit 98eecf7108b45030d298f04b0ed0d7a80db58761.
2. Use the serialization/deserialization from Parquet library for filter pushdown
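To illustrate the idea behind reusing the Parquet library's serialization for filter pushdown, here is a minimal, self-contained Java sketch of the round trip: the driver serializes a predicate object into a Base64 string (as one would store it in a Hadoop Configuration entry) and the task side reads it back. The `GtPredicate` class and method names are hypothetical stand-ins, not parquet-mr's actual API; parquet-mr's filter predicates are serialized in a similar spirit.

```java
import java.io.*;
import java.util.Base64;

// Hypothetical stand-in for a pushed-down filter predicate.
class GtPredicate implements Serializable {
    final String column;
    final int value;
    GtPredicate(String column, int value) { this.column = column; this.value = value; }
}

public class PredicateRoundTrip {
    // Serialize the predicate to a Base64 string, as one might store it
    // in a job configuration entry on the driver side.
    static String serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(obj);
        }
        return Base64.getEncoder().encodeToString(bos.toByteArray());
    }

    // Deserialize on the task side from the configuration string.
    static Object deserialize(String encoded) throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        GtPredicate p = new GtPredicate("age", 21);
        String conf = serialize(p);                          // driver side
        GtPredicate back = (GtPredicate) deserialize(conf);  // task side
        System.out.println(back.column + " > " + back.value);
    }
}
```

The point of delegating this to the Parquet library is that both ends of the round trip stay in sync with the library's own predicate classes.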
Test build #22211 has started for PR 2841 at commit

Test build #22211 has finished for PR 2841 at commit

Test FAILed.

Can someone help? Where is the build failing? I can make a distribution without errors, and dev/lint-scala also ran successfully ...

There are style violations. Run

Spaces are required before comments:

Test build #22227 has started for PR 2841 at commit

Test build #22227 has finished for PR 2841 at commit

Test FAILed.
…e a record filter
Test build #22238 has started for PR 2841 at commit

Test build #22238 has finished for PR 2841 at commit

Test PASSed.

Test build #22299 has started for PR 2841 at commit

Test build #22299 has finished for PR 2841 at commit

Test PASSed.

Added a unit test for filter pushdown on an optional column

Test build #22339 has started for PR 2841 at commit

Test build #22339 has finished for PR 2841 at commit

Test PASSed.
Small style thing: you should have spaces before { here and in a few other places (search the diff for ){).
Hey @saucam, I took a look at this too because I had tried upgrading to Parquet 1.6 in a different branch to use decimals. I made a few comments above. Apart from that, this PR doesn't seem to have any tests for the new functionality (in particular, skipping row groups) or for the methods that build up Parquet filters. Do you mind adding some of those?
It would be nice to add some tests with == null or >= null as well to make sure these filters work.

The nullable option is set when the field is optional, so I am adding tests for those.
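To make the null-handling expectation concrete, here is a small self-contained Java sketch of the semantics those tests should pin down on an optional (nullable) column: an "is null" check matches only the nulls, while an ordering comparison such as >= x never matches a null value. The class and helper names here are illustrative, not Parquet's API.

```java
import java.util.*;

// Conceptual sketch of filter semantics on a nullable (optional) column.
public class NullableFilterDemo {
    // "== null" style check: matches only null entries.
    static boolean isNull(Integer v) { return v == null; }

    // ">= x" style check: a null value never satisfies an ordering comparison.
    static boolean gte(Integer v, int x) { return v != null && v >= x; }

    public static void main(String[] args) {
        List<Integer> column = Arrays.asList(1, null, 5, null, 9);

        long nullMatches = column.stream().filter(NullableFilterDemo::isNull).count();
        long gteMatches  = column.stream().filter(v -> gte(v, 5)).count();

        // Expect 2 nulls and 2 values >= 5 (5 and 9).
        System.out.println(nullMatches + " " + gteMatches);
    }
}
```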
Hi @mateiz, @marmbrus, thanks for the suggestions, just a few points
in sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetQuerySuite.scala. Suggestions please?
…g on optional columns
Added more tests for filtering on nullable columns

Test build #22448 has started for PR 2841 at commit

Test build #22448 has finished for PR 2841 at commit

Test PASSed.
Alright, thanks for adding the tests. Let's get Michael's feedback on the metadata thing; I don't fully understand it. I guess it allows tasks to query different subsets of the metadata in parallel?
Yes. In the task-side metadata strategy, the tasks are spawned first, and each task then reads the metadata and drops row groups. So if I am using YARN and the data is huge (so the metadata is large), the memory is consumed on the YARN side; but with the client-side metadata strategy, all of the metadata is read on a single node before the tasks are spawned.
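The two strategies can be sketched side by side. This is a conceptual, self-contained Java simulation (not parquet-mr's classes): each row group's footer carries min/max statistics, and a predicate is checked against them. In the client-side strategy the driver reads all footers up front and schedules tasks only for surviving row groups; in the task-side strategy every row group gets a task and each task reads just its own footer and skips itself if ruled out. Both end up with the same survivors, but the metadata is read in different places.

```java
import java.util.*;
import java.util.stream.*;

public class MetadataStrategies {
    // Stand-in for a row group footer with min/max statistics for one column.
    static class RowGroup {
        final int id, min, max;
        RowGroup(int id, int min, int max) { this.id = id; this.min = min; this.max = max; }
    }

    // Predicate: keep row groups that may contain values > threshold.
    static boolean mayMatch(RowGroup rg, int threshold) { return rg.max > threshold; }

    public static void main(String[] args) {
        List<RowGroup> footers = Arrays.asList(
            new RowGroup(0, 1, 10), new RowGroup(1, 11, 20), new RowGroup(2, 21, 30));
        int threshold = 15;

        // Client-side strategy: one node reads *all* footers before any task starts.
        List<Integer> clientSide = footers.stream()
            .filter(rg -> mayMatch(rg, threshold))
            .map(rg -> rg.id)
            .collect(Collectors.toList());

        // Task-side strategy: one task per row group; each reads only its own footer.
        List<Integer> taskSide = new ArrayList<>();
        for (RowGroup rg : footers) {
            if (mayMatch(rg, threshold)) taskSide.add(rg.id);
        }

        System.out.println(clientSide + " " + taskSide); // same survivors either way
    }
}
```

The trade-off in the thread follows directly: client-side pays the whole metadata cost on one JVM before scheduling, task-side spreads that cost across tasks.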
I talked to some Twitter people and they were pretty excited about the task-side metadata reading, because with big datasets they were seeing lots of OOMs before the jobs even started. It could also be pretty good for S3 if we can avoid doing so much work serially on the driver. That said, it seems like it would make features like merging multiple unique schemas impossible, and it's newer / less tested. So we'll want to be able to configure this easily.
Also, it looks like they are switching the default in Parquet to task-side: https://issues.apache.org/jira/browse/PARQUET-122
Cool, that makes sense. Anyway, if this looks good to you, Michael, you should merge it.
Thanks! Merged to master.

Thanks, closed it and assigned it to you.
The parquet-mr project has introduced a new filter API (https://github.com/apache/incubator-parquet-mr/pull/4), along with several fixes. It can also eliminate entire row groups based on column statistics such as min/max.
We can leverage that to further improve the performance of queries with filters.
The filter2 API also introduces the ability to create custom filters. We can create a custom filter for the optimized In clause (InSet), so that elimination happens in the ParquetRecordReader itself.
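To sketch what such a custom InSet filter could buy us: before reading a row group, check its min/max statistics and drop the whole group if no element of the IN-set can fall inside that range. The following self-contained Java example is conceptual only; the class and method names are illustrative, not parquet-mr's filter2 API.

```java
import java.util.*;

public class InSetRowGroupSkip {
    // Stand-in for a row group's min/max statistics on the filtered column.
    static class RowGroupStats {
        final int min, max;
        RowGroupStats(int min, int max) { this.min = min; this.max = max; }
    }

    // canDrop == true means the whole row group is eliminated without reading a row.
    static boolean canDrop(RowGroupStats stats, Set<Integer> inSet) {
        for (int v : inSet) {
            if (v >= stats.min && v <= stats.max) return false; // value may be present
        }
        return true; // no element of the set can occur in this row group
    }

    public static void main(String[] args) {
        Set<Integer> inSet = new HashSet<>(Arrays.asList(42, 77));
        List<RowGroupStats> groups = Arrays.asList(
            new RowGroupStats(0, 10),    // dropped: no set element lies in [0, 10]
            new RowGroupStats(40, 60),   // kept: 42 falls in [40, 60]
            new RowGroupStats(70, 90));  // kept: 77 falls in [70, 90]

        int dropped = 0;
        for (RowGroupStats g : groups) {
            if (canDrop(g, inSet)) dropped++;
        }
        System.out.println("dropped " + dropped + " of " + groups.size());
    }
}
```

Note that min/max pruning is conservative: a kept row group may still contain no matching rows, but a dropped one provably cannot.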