Skip to content

Conversation

@moomindani
Copy link
Contributor

What changes were proposed in this pull request?

This pull-request is to add support for reading ORC/Parquet files with SymlinkTextInputFormat in Apache Spark.

Why are the changes needed?

Hive style symlink (SymlinkTextInputFormat) is commonly used in different analytic engines including prestodb and prestosql.
Currently SymlinkTextInputFormat works with JSON/CSV files but does not work with ORC/Parquet files in Apache Spark (and Apache Hive).
On the other hand, prestodb and prestosql support SymlinkTextInputFormat with ORC/Parquet files.
This pull-request is to add support for reading ORC/Parquet files with SymlinkTextInputFormat in Apache Spark.

See details in the JIRA. SPARK-32432

Does this PR introduce any user-facing change?

Yes.
Currently Spark returns exceptions if users try to use SymlinkTextInputFormat with ORC/Parquet files.
With this patch, Spark can handle symlink which indicates locations of ORC/Parquet files.

How was this patch tested?

I added a new test suite SymlinkSuite and confirmed it passed.

$ ./build/sbt "project hive" "test-only org.apache.spark.sql.hive.SymlinkSuite"

@SparkQA
Copy link

SparkQA commented Aug 3, 2020

Test build #126954 has finished for PR 29330 at commit c97f003.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@moomindani
Copy link
Contributor Author

retest this please

@maropu maropu changed the title [SPARK-32432] Added support for reading ORC/Parquet files with SymlinkTextInputFormat [SPARK-32432][SQL] Added support for reading ORC/Parquet files with SymlinkTextInputFormat Aug 3, 2020
@SparkQA
Copy link

SparkQA commented Aug 3, 2020

Test build #126962 has finished for PR 29330 at commit c97f003.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@moomindani
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Aug 5, 2020

Test build #127093 has finished for PR 29330 at commit c97f003.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@moomindani
Copy link
Contributor Author

Could you please take a look and review this PR?

@manuzhang
Copy link
Member

cc @cloud-fan @gengliangwang @viirya

@SparkQA
Copy link

SparkQA commented Nov 11, 2020

Test build #130938 has finished for PR 29330 at commit c97f003.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link

@CHENXCHEN CHENXCHEN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporary files generated by hadoop will be read. @moomindani

fileSystem: FileSystem,
symlinkDir: Path): Seq[Path] = {

val symlinkIterator = fileSystem.listFiles(symlinkDir, true)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, you should ignore the name start with "_" or ".", just like this hive implement, cause hadoop will generate temp file start with "_" or ".", we should ignore these files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments, I will reflect your comments on this.

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Mar 12, 2021
@moomindani
Copy link
Contributor Author

Let me keep this PFR open.

@github-actions github-actions bot closed this Mar 13, 2021
@maropu maropu removed the Stale label Mar 13, 2021
@maropu maropu reopened this Mar 13, 2021
@maropu
Copy link
Member

maropu commented Mar 13, 2021

ok to test

@gatorsmile
Copy link
Member

cc @c21 Is it supported in FB?

@c21
Copy link
Contributor

c21 commented Apr 28, 2021

@gatorsmile - we indeed support it, but the number of query everyday is very minimal, after checking the data.

@gatorsmile
Copy link
Member

@c21 Can you help review the PR or open source your internal fix?

@c21
Copy link
Contributor

c21 commented Apr 28, 2021

@gatorsmile - sure I will take a look, thanks.

@jles01
Copy link

jles01 commented Jul 20, 2021

Hi,

Is there any updates on this ?

@SparkQA
Copy link

SparkQA commented Oct 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48459/

@SparkQA
Copy link

SparkQA commented Oct 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48459/

@gengliangwang
Copy link
Member

cc @sadikovi as well

Copy link
Contributor

@sadikovi sadikovi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are not that many details on the JIRA ticket and the provided links don't work - would appreciate it if you could update the description. I will do another pass once the inline comments are addressed

}
PartitionDirectory(values, files)
// Check leaf files since they might be symlink targets
if (files == Nil) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: isEmpty?

val rootPaths: Seq[Path] = if (lazyPruningEnabled) {
Seq(tablePath)
if (SymlinkTextInputFormatUtil.isSymlinkTextFormat(inputFormat)) {
symlinkTargets
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be replaced with SymlinkTextInputFormatUtil.getTargetPathsFromSymlink(fs, tablePath). I don't think we need the if statement above.

@@ -0,0 +1 @@
1,2,3 No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be other examples of a CSV file in the resources, can you use those ones? The same applies to Parquet and ORC.


val testSymlinkPath1 = s"${temp.getCanonicalPath}/symlink_orc1.txt"
val testSymlinkPath2 = s"${temp.getCanonicalPath}/symlink_orc2.txt"
val prefix = System.getProperty("user.dir")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain? Is it possible to use a temporary path here?

import com.google.common.io.CharStreams
import org.apache.hadoop.fs.{FileSystem, Path}

object SymlinkTextInputFormatUtil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add Javadoc to the class and the methods.

inputFormat.equals("org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat")
}

// Mostly copied from SymlinkTextInputFormat#getTargetPathsFromSymlinksDirs of Hive 3.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it compatible with other versions of Hive?

@CHENXCHEN
Copy link

Any update about this? thanks

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 27, 2022
@github-actions github-actions bot closed this May 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants