
SPARK-3968 Use parquet-mr filter2 api in spark sql #2841

Closed
saucam wants to merge 10 commits into apache:master from saucam:master

Conversation

@saucam

@saucam saucam commented Oct 18, 2014

The parquet-mr project has introduced a new filter api (https://github.com/apache/incubator-parquet-mr/pull/4), along with several fixes. It can also eliminate entire row groups based on statistics such as min/max, which we can leverage to further improve the performance of queries with filters.
The filter2 api also introduces the ability to create custom filters. We can create a custom filter for the optimized IN clause (InSet), so that elimination happens in the ParquetRecordReader itself.
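As a rough illustration of the row-group elimination idea described above (a minimal Python sketch, not the parquet-mr implementation; the function and names here are invented for illustration):

```python
# Hypothetical sketch of statistics-based row-group elimination:
# a row group whose [min, max] range cannot possibly satisfy the
# predicate is skipped without reading any of its pages.

def can_drop(row_group_stats, column, predicate):
    """Return True if no row in the group can match `predicate`.

    `row_group_stats` maps column name -> (min, max);
    `predicate` is an (op, value) pair such as (">", 100)
    or ("in", {1, 2, 3}).
    """
    lo, hi = row_group_stats[column]
    op, value = predicate
    if op == ">":
        return hi <= value          # every row is <= value
    if op == "<":
        return lo >= value
    if op == "==":
        return value < lo or value > hi
    if op == "in":                  # the custom InSet-style filter
        return all(v < lo or v > hi for v in value)
    return False                    # unknown op: cannot safely drop

stats = {"id": (100, 200)}
can_drop(stats, "id", ("==", 50))   # True: 50 is outside [100, 200]
can_drop(stats, "id", (">", 150))   # False: some rows may exceed 150
```

The real logic lives in parquet-mr's filter2 package (StatisticsFilter), which applies the same min/max reasoning per row group before any record is materialized.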

@AmplabJenkins

Can one of the admins verify this patch?

@saucam
Author

saucam commented Oct 20, 2014

This PR also fixes:

https://issues.apache.org/jira/browse/SPARK-1847

@AmplabJenkins

Can one of the admins verify this patch?

@marmbrus
Contributor

ok to test

@SparkQA

SparkQA commented Oct 25, 2014

QA tests have started for PR 2841 at commit e8bf033.

  • This patch does not merge cleanly.

@SparkQA

SparkQA commented Oct 25, 2014

QA tests have finished for PR 2841 at commit e8bf033.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22208/

Yash Datta added 4 commits October 25, 2014 13:50

  • Revert "…ns", since filtering on optional columns is now supported in the filter2 api. This reverts commit 98eecf7108b45030d298f04b0ed0d7a80db58761.
  • Use the serialization/deserialization from the Parquet library for filter pushdown.

@SparkQA

SparkQA commented Oct 25, 2014

Test build #22211 has started for PR 2841 at commit cc7b596.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 25, 2014

Test build #22211 has finished for PR 2841 at commit cc7b596.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22211/

@saucam
Author

saucam commented Oct 25, 2014

Can someone help? Where is the build failing? I can make a distribution without errors, and I also ran dev/lint-scala successfully ...

@marmbrus
Contributor

There are style violations. Run sbt/sbt scalastyle to see what they are.

@marmbrus
Contributor

Spaces are required before comments:

[error] /Users/marmbrus/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:113:6: space.after.comment.start.message
[error] /Users/marmbrus/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:114:6: space.after.comment.start.message
[error] /Users/marmbrus/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:379:4: space.after.comment.start.message
[error] /Users/marmbrus/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:380:4: space.after.comment.start.message
[error] /Users/marmbrus/workspace/spark/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala:472:4: space.after.comment.start.message

@SparkQA

SparkQA commented Oct 26, 2014

Test build #22227 has started for PR 2841 at commit 48163c3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 26, 2014

Test build #22227 has finished for PR 2841 at commit 48163c3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22227/

@SparkQA

SparkQA commented Oct 26, 2014

Test build #22238 has started for PR 2841 at commit ec53e92.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 26, 2014

Test build #22238 has finished for PR 2841 at commit ec53e92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22238/

Contributor

Space between if and (.

@SparkQA

SparkQA commented Oct 27, 2014

Test build #22299 has started for PR 2841 at commit 5f4530e.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 27, 2014

Test build #22299 has finished for PR 2841 at commit 5f4530e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22299/

@saucam
Author

saucam commented Oct 28, 2014

Added a unit test for filter pushdown on an optional column.

@SparkQA

SparkQA commented Oct 28, 2014

Test build #22339 has started for PR 2841 at commit 515df1c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 28, 2014

Test build #22339 has finished for PR 2841 at commit 515df1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22339/

Contributor

Small style thing: you should have spaces before { here and in a few other places (search the diff for ){).

@mateiz
Contributor

mateiz commented Oct 28, 2014

Hey @saucam, I took a look at this too because I had tried upgrading to Parquet 1.6 in a different branch to use decimals. I made a few comments above.

Apart from that, this PR doesn't seem to have any tests for the new functionality (in particular, skipping row groups) or for the methods that build up the Parquet filters. Do you mind adding some?

Contributor

It would be nice to add some tests with == null or >= null as well, to make sure these filters work.

Author

The nullable option is set when the field is optional, so I am adding tests for those.

@saucam
Author

saucam commented Oct 29, 2014

Hi @mateiz, @marmbrus, thanks for the suggestions. Just a few points:

  • I need to know which strategy should be kept as the default (currently we use a different one than the default in the parquet library).
  • This PR adds support for using the filter2 api from the parquet library, which supports row-group filtering. Do we need to add tests to ensure that? Such test cases already exist in the parquet library:

https://github.com/Parquet/parquet-mr/blob/parquet-1.6.0rc3/parquet-hadoop/src/test/java/parquet/filter2/compat/TestRowGroupFilter.java

  • We already have tests for the methods that build the Parquet filters, and they passed: see test("create RecordFilter for simple predicates") in sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetQuerySuite.scala.

Suggestions, please?

@saucam
Author

saucam commented Oct 29, 2014

Added more tests for filtering on nullable columns

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22448 has started for PR 2841 at commit 8282ba0.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 29, 2014

Test build #22448 has finished for PR 2841 at commit 8282ba0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22448/

@mateiz
Contributor

mateiz commented Oct 29, 2014

Alright, thanks for adding the tests. Let's get Michael's feedback on the metadata thing; I don't fully understand it. I guess it allows tasks to query different subsets of the metadata in parallel?

@saucam
Author

saucam commented Oct 30, 2014

Yes. With the task-side metadata strategy, the tasks are spawned first, and each task then reads the metadata and drops row groups itself. So if I am using YARN and the data is huge (hence the metadata is large), the memory is consumed on the YARN side; with the client-side metadata strategy, all of the metadata is read on a single node before the tasks are spawned.
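The task-side vs client-side tradeoff can be sketched roughly as follows (a hedged Python sketch with invented names, not the actual parquet-mr code; `keep` stands for the row-group filter derived from the query predicate):

```python
# Rough sketch of the two metadata-filtering strategies.
# footers: list of (file, row_group_stats) pairs; in reality these
# come from Parquet file footers on HDFS/S3.

def client_side_strategy(footers, keep):
    """Driver reads ALL footers up front, then schedules one task
    per surviving row group. Memory pressure is on the driver."""
    surviving = [f for f in footers if keep(f)]  # all metadata in driver memory
    return [("task", f) for f in surviving]

def task_side_strategy(footers, keep):
    """One task per row group is scheduled immediately; each task
    reads only its own footer and exits early if filtered out."""
    def run_task(footer):
        if not keep(footer):       # metadata read inside the task
            return None            # row group dropped on the executor
        return ("scanned", footer)
    return [r for f in footers if (r := run_task(f)) is not None]
```

Both strategies scan the same surviving row groups; the difference is only where the footer reading (and its memory cost) happens.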

@marmbrus
Contributor

I talked to some Twitter people, and they were pretty excited about task-side metadata reading, because with big datasets they were seeing lots of OOMs before jobs even started. It could also be pretty good for S3 if we can avoid doing so much work serially on the driver. That said, it seems like it would make features like merging multiple unique schemas impossible, and it's newer / less tested. So we'll want to be able to configure this easily.

@marmbrus
Contributor

Also looks like they are switching the default in parquet to task side: https://issues.apache.org/jira/browse/PARQUET-122

@mateiz
Contributor

mateiz commented Oct 31, 2014

Cool, that makes sense. Anyway if this looks good to you, Michael, you should merge it.

@marmbrus
Contributor

Thanks! Merged to master.

@asfgit asfgit closed this in 2e35e24 Oct 31, 2014
@saucam
Author

saucam commented Oct 31, 2014

@marmbrus, @mateiz, thanks for all the help!
@marmbrus, you may want to close this ticket as well:

https://issues.apache.org/jira/browse/SPARK-1847

@mateiz
Contributor

mateiz commented Nov 1, 2014

Thanks, closed it and assigned it to you.
