[SPARK-32432][SQL] Added support for reading ORC/Parquet files with SymlinkTextInputFormat #29330
Conversation
Test build #126954 has finished for PR 29330 at commit
retest this please
Test build #126962 has finished for PR 29330 at commit
retest this please
Test build #127093 has finished for PR 29330 at commit
Could you please take a look and review this PR?
Test build #130938 has finished for PR 29330 at commit
Temporary files generated by Hadoop will be read. @moomindani
    fileSystem: FileSystem,
    symlinkDir: Path): Seq[Path] = {
  val symlinkIterator = fileSystem.listFiles(symlinkDir, true)
Hi, you should ignore file names that start with "_" or ".", just like the Hive implementation does, because Hadoop generates temporary files starting with "_" or "." and these files should be skipped.
Thanks for the comments, I will address them in this PR.
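For illustration, a minimal sketch of the filtering being suggested; the `isVisibleDataFile` helper name is hypothetical and not part of this PR:

```scala
import org.apache.hadoop.fs.Path

// Skip files whose names start with "_" or "." (e.g. "_SUCCESS", ".hive-staging..."),
// which Hadoop/Hive create as temporary or metadata files.
def isVisibleDataFile(path: Path): Boolean = {
  val name = path.getName
  !name.startsWith("_") && !name.startsWith(".")
}

// isVisibleDataFile(new Path("/warehouse/t/_SUCCESS"))       // false
// isVisibleDataFile(new Path("/warehouse/t/part-00000.orc")) // true
```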
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
Let me keep this PR open.
ok to test
cc @c21 Is it supported in FB?
@gatorsmile - we indeed support it, but after checking the data, the number of queries per day is very small.
@c21 Can you help review the PR or open source your internal fix?
@gatorsmile - sure I will take a look, thanks.
Hi, are there any updates on this?
Kubernetes integration test starting
Kubernetes integration test status failure
cc @sadikovi as well |
sadikovi left a comment
There are not that many details on the JIRA ticket and the provided links don't work - I would appreciate it if you could update the description. I will do another pass once the inline comments are addressed.
}
PartitionDirectory(values, files)
// Check leaf files since they might be symlink targets
if (files == Nil) {
nit: isEmpty?
val rootPaths: Seq[Path] = if (lazyPruningEnabled) {
  Seq(tablePath)
  if (SymlinkTextInputFormatUtil.isSymlinkTextFormat(inputFormat)) {
    symlinkTargets
Should be replaced with SymlinkTextInputFormatUtil.getTargetPathsFromSymlink(fs, tablePath). I don't think we need the if statement above.
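For illustration only, a hedged sketch of one way to read this suggestion, assuming the symlink-format check has already happened on this code path; `resolveRootPaths` and its parameters are hypothetical stand-ins for the surrounding variables in the PR:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical stand-in for the surrounding logic; assumes the caller has already
// established that the table's input format is SymlinkTextInputFormat.
def resolveRootPaths(
    lazyPruningEnabled: Boolean,
    fs: FileSystem,
    tablePath: Path,
    nonLazyPruningPaths: => Seq[Path]): Seq[Path] = {
  if (lazyPruningEnabled) {
    // Resolve the symlink manifests under tablePath to the target data file paths.
    SymlinkTextInputFormatUtil.getTargetPathsFromSymlink(fs, tablePath)
  } else {
    nonLazyPruningPaths
  }
}
```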
@@ -0,0 +1 @@
1,2,3
\ No newline at end of file
There should be other examples of CSV files in the resources; can you use those? The same applies to Parquet and ORC.
val testSymlinkPath1 = s"${temp.getCanonicalPath}/symlink_orc1.txt"
val testSymlinkPath2 = s"${temp.getCanonicalPath}/symlink_orc2.txt"
val prefix = System.getProperty("user.dir")
Can you explain? Is it possible to use a temporary path here?
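For illustration, a minimal sketch of deriving every test path from a temporary directory instead of `System.getProperty("user.dir")`; the file names here are hypothetical:

```scala
import java.nio.file.Files

import org.apache.hadoop.fs.Path

// Create an isolated temporary directory for the test.
val tempDir = Files.createTempDirectory("symlink-test").toFile

// Both the data file and the symlink manifest live under the temporary directory,
// so no fixed prefix such as the working directory is needed.
val dataFile = new Path(tempDir.getCanonicalPath, "part-00000.orc")
val symlinkManifest = new Path(tempDir.getCanonicalPath, "symlink_orc1.txt")
```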
import com.google.common.io.CharStreams
import org.apache.hadoop.fs.{FileSystem, Path}

object SymlinkTextInputFormatUtil {
Please add Javadoc to the class and the methods.
  inputFormat.equals("org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat")
}

// Mostly copied from SymlinkTextInputFormat#getTargetPathsFromSymlinksDirs of Hive 3.1
Is it compatible with other versions of Hive?
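To make the Javadoc request and the Hive 3.1 provenance concrete, here is a hedged sketch of what the documented helper could look like; the method bodies follow the behavior described in this thread, not the PR's exact code:

```scala
import java.io.InputStreamReader
import java.nio.charset.StandardCharsets

import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

import com.google.common.io.CharStreams
import org.apache.hadoop.fs.{FileSystem, Path}

/**
 * Helpers for Hive's SymlinkTextInputFormat, whose "data" files are plain-text
 * manifests that list the locations of the real data files, one path per line.
 */
object SymlinkTextInputFormatUtil {

  /** Returns true if `inputFormat` is Hive's SymlinkTextInputFormat. */
  def isSymlinkTextFormat(inputFormat: String): Boolean = {
    inputFormat == "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat"
  }

  /**
   * Recursively reads the manifest files under `symlinkDir` and returns the target
   * paths they list, mirroring the shape of Hive 3.1's
   * SymlinkTextInputFormat#getTargetPathsFromSymlinksDirs.
   */
  def getTargetPathsFromSymlink(fileSystem: FileSystem, symlinkDir: Path): Seq[Path] = {
    val targets = ArrayBuffer.empty[Path]
    val symlinkIterator = fileSystem.listFiles(symlinkDir, true)
    while (symlinkIterator.hasNext) {
      val status = symlinkIterator.next()
      val name = status.getPath.getName
      // Skip hidden/temporary files such as "_SUCCESS" that Hadoop may leave behind.
      if (!name.startsWith("_") && !name.startsWith(".")) {
        val in = fileSystem.open(status.getPath)
        try {
          val reader = new InputStreamReader(in, StandardCharsets.UTF_8)
          CharStreams.readLines(reader).asScala.foreach { line =>
            if (line.nonEmpty) targets += new Path(line)
          }
        } finally {
          in.close()
        }
      }
    }
    targets.toSeq
  }
}
```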
Any update on this? Thanks.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
This pull request adds support for reading ORC/Parquet files with SymlinkTextInputFormat in Apache Spark.
Why are the changes needed?
Hive-style symlink tables (SymlinkTextInputFormat) are commonly used by other analytic engines, including prestodb and prestosql.
Currently, SymlinkTextInputFormat works with JSON/CSV files but not with ORC/Parquet files in Apache Spark (and Apache Hive).
prestodb and prestosql, on the other hand, do support SymlinkTextInputFormat with ORC/Parquet files.
This pull request adds the same support to Apache Spark.
See the JIRA for details: SPARK-32432
Does this PR introduce any user-facing change?
Yes.
Currently, Spark throws an exception if users try to use SymlinkTextInputFormat with ORC/Parquet files.
With this patch, Spark can handle symlink manifests that point to the locations of ORC/Parquet files.
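For context, a hedged usage sketch of the kind of table this enables; the table name, schema, and location are made up, while the SerDe and input/output format class names are Hive's:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("symlink-parquet-example")
  .enableHiveSupport()
  .getOrCreate()

// A symlink table: each file under LOCATION is a text manifest whose lines are the
// paths of the real Parquet data files.
spark.sql(
  """CREATE EXTERNAL TABLE symlink_parquet_table (id INT, name STRING)
    |ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    |STORED AS
    |  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
    |  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    |LOCATION '/warehouse/symlink_parquet_table'
    |""".stripMargin)

// Before this patch such a query fails; with it, Spark reads the Parquet files
// listed in the manifests.
spark.sql("SELECT * FROM symlink_parquet_table").show()
```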
How was this patch tested?
I added a new test suite, SymlinkSuite, and confirmed that it passes.