[SPARK-32432][SQL] Add support for reading ORC/Parquet files of SymlinkTextInputFormat table And Fix Analyze for SymlinkTextInputFormat table #35734
Conversation
Could you please take a look and review this PR?

This PR is a bit like this one, but it is too old to support Spark 3.2.

ok to test

Can one of the admins verify this patch?

cc @cloud-fan could you help take a look when you have time? Thanks.
sql/core/src/main/scala/org/apache/spark/sql/execution/command/CommandUtils.scala
I think the problem today is we don't have a good abstraction for this feature at the framework level. This is a special Hive file format that changes the behavior of file listing, while in Spark the … This PR simply adds special handling of …
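For context, the file-listing behavior being discussed can be sketched roughly as follows. This is a minimal, self-contained illustration of the SymlinkTextInputFormat contract, not the actual Hive or Spark code; `resolveSymlinkTargets` is a hypothetical helper name:

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical illustration of the SymlinkTextInputFormat contract: each file under
// the table/partition root is not data but a "manifest" whose lines are the paths of
// the real data files, which may live anywhere on the filesystem.
def resolveSymlinkTargets(manifest: Path): Seq[String] =
  Files.readAllLines(manifest).asScala.toSeq
    .map(_.trim)
    .filter(_.nonEmpty)
```

This is why a file-listing step that only walks the table root never sees the real data files: it sees only the manifests.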
PartitionDirectory(values, files)
// Check leaf files since they might be symlink targets
if (files.isEmpty) {
  val status: Seq[FileStatus] = leafFiles.get(path) match {
Are the symlink targets in leafFiles? I think leaf files are listed from the table root.
A SymlinkTextInputFormat table specifies the list of files for a table/partition based on the content of a text file.
As the example in the PR description shows, the real data files are not in the table root.
> the real data file is not in the table root

Yeah, but leafFiles are listed from the table root, so how could the symlink targets be in leaf files?
For a partitioned table we have the same behavior: the symlink manifest also sits in the root directory of each partition.
Same feeling here. Not sure if …
val isSymlinkTextFormat = SymlinkTextInputFormatUtil.isSymlinkTextFormat(relation.tableMeta)

val symlinkTargets = if (isSymlinkTextFormat) {
  SymlinkTextInputFormatUtil.getTargetPathsFromSymlink(fs, tablePath)
When the table is partitioned and lazyPruningEnabled = false, this value is not used, so I think we don't need to compute it this early.
Agree with @cloud-fan that we don't have a good abstraction for this feature. Then we could directly list files according to the file format, and everywhere else we could just call this API. Then we wouldn't need to check whether it's …
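The abstraction suggested above could be sketched like this. All names here are hypothetical, not actual Spark APIs; the point is only the shape of the idea — each file format decides how a path listed under the table root maps to real data files, so callers never special-case SymlinkTextInputFormat:

```scala
// Hypothetical sketch of a framework-level file-listing abstraction.
trait FileListingStrategy {
  /** Map a listed path to the real data-file paths; `readLines` reads a text file. */
  def resolveDataFiles(listedPath: String, readLines: String => Seq[String]): Seq[String]
}

object DirectListing extends FileListingStrategy {
  // Normal formats: the listed file is itself the data file.
  def resolveDataFiles(p: String, readLines: String => Seq[String]): Seq[String] = Seq(p)
}

object SymlinkManifestListing extends FileListingStrategy {
  // Symlink format: the listed file is a manifest whose lines are the data files.
  def resolveDataFiles(p: String, readLines: String => Seq[String]): Seq[String] =
    readLines(p).map(_.trim).filter(_.nonEmpty)
}
```

With something like this, file listing, ANALYZE, and scan planning would all go through the same entry point instead of each carrying an `isSymlinkTextFormat` branch.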
if (paths.isEmpty) {
  Seq(tablePath)
} else if (isSymlinkTextFormat) {
@AngersZhuuuu it is also used here when lazyPruningEnabled = false.
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Why are the changes needed?
Before this PR, the analyzed table size is just the `manifest` file size, which is less than `spark.sql.autoBroadcastJoinThreshold`; after this PR, analyzing the table will give a size of 19999.

We have files:

content of hdfs:///path/to/table/manifest:

table ddl:
See details in the JIRA: SPARK-32432.
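The size difference described above can be illustrated with a small sketch. The names and numbers below are hypothetical; only the principle — summing the sizes of the symlink targets instead of the sizes of the manifest files — reflects what this PR changes:

```scala
// Hypothetical sketch of the ANALYZE fix: before, the table size is the sum of the
// file sizes under the table root (the manifests themselves); after, it is the sum
// of the sizes of the symlink targets those manifests point to.
final case class FileEntry(path: String, length: Long)

def sizeBefore(filesUnderRoot: Seq[FileEntry]): Long =
  filesUnderRoot.map(_.length).sum

def sizeAfter(filesUnderRoot: Seq[FileEntry],
              targetsOf: String => Seq[FileEntry]): Long =
  filesUnderRoot.flatMap(f => targetsOf(f.path)).map(_.length).sum
```

A tiny manifest pointing at large data files thus no longer makes the table look small enough to broadcast.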
Does this PR introduce any user-facing change?

Yes. The analyzed table size changes from the `manifest` file size, as described above.

How was this patch tested?
Added a unit test: org.apache.spark.sql.hive.SymlinkSuite.