
Conversation

@CHENXCHEN commented Mar 4, 2022

What changes were proposed in this pull request?

  1. Add support for reading ORC/Parquet files of SymlinkTextInputFormat tables
  2. Fix the table size computed by ANALYZE TABLE for SymlinkTextInputFormat tables

Why are the changes needed?

  1. Both Trino (PrestoSQL) and PrestoDB support reading ORC/Parquet files of SymlinkTextInputFormat tables, but Spark does not: Spark supports the other formats but not ORC/Parquet. This PR adds support for reading ORC/Parquet files of SymlinkTextInputFormat tables.
  2. In the example below, ANALYZE TABLE computes a table size of 100 (the manifest size) instead of 19999 (the data size), which affects join optimization: Spark may broadcast what is actually a large table whenever the manifest size is below spark.sql.autoBroadcastJoinThreshold. After this PR, the analyzed table size is 19999 (a resolution sketch follows the example below).

We have the following files:

size   filepath
100    hdfs:///path/to/table/manifest
9999   hdfs:///path/to/other/part-1.parquet.orc
10000  hdfs:///path/to/other/part-2.parquet.orc

Content of hdfs:///path/to/table/manifest:

hdfs:///path/to/other/part-1.parquet.orc
hdfs:///path/to/other/part-2.parquet.orc

Table DDL:

CREATE EXTERNAL TABLE symlink_orc ( name STRING, version DOUBLE, sort INT )
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs:///path/to/table';

See details in the JIRA: SPARK-32432.
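
For illustration only (this is not code from the PR), a minimal Scala sketch of how a symlink manifest could be resolved to the real data files, and how summing their sizes yields 19999 rather than the 100-byte manifest size, using just the Hadoop FileSystem API:

import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SymlinkManifestExample {
  // Each non-empty line of a symlink manifest is the path of a real data file.
  def readTargets(fs: FileSystem, manifest: Path): Seq[Path] = {
    val in = new BufferedReader(new InputStreamReader(fs.open(manifest), StandardCharsets.UTF_8))
    try {
      Iterator.continually(in.readLine()).takeWhile(_ != null)
        .map(_.trim).filter(_.nonEmpty).map(new Path(_)).toList
    } finally {
      in.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val manifest = new Path("hdfs:///path/to/table/manifest")
    val fs = manifest.getFileSystem(new Configuration())
    val targets = readTargets(fs, manifest)
    // Summing the target files' sizes gives 9999 + 10000 = 19999 in the example above,
    // rather than the 100-byte size of the manifest itself.
    val totalSize = targets.map(p => fs.getFileStatus(p).getLen).sum
    println(s"resolved ${targets.size} target files, total $totalSize bytes")
  }
}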

Does this PR introduce any user-facing change?

Yes

  1. Before this PR, an exception was thrown when reading the ORC/Parquet files of a SymlinkTextInputFormat table; now such reads work.
  2. Analyzing a SymlinkTextInputFormat table now computes the size of the real data files (e.g. the ORC/Parquet files), not the size of the manifest file, as described above.

How was this patch tested?

Added Unit Test: org.apache.spark.sql.hive.SymlinkSuite
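
The suite itself is not shown in this thread; a rough, hypothetical sketch of what one of its test cases might look like (class name, mixins, and setup are assumptions, not the actual SymlinkSuite code):

import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.hive.test.TestHiveSingleton
import org.apache.spark.sql.test.SQLTestUtils

class SymlinkSketchSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {

  test("read ORC data through a SymlinkTextInputFormat manifest") {
    withTempDir { dataDir =>
      withTempDir { tableDir =>
        withTable("symlink_orc") {
          // Write ORC data outside the table location.
          spark.range(3)
            .selectExpr("cast(id as string) as name", "cast(id as double) as version", "cast(id as int) as sort")
            .write.mode("overwrite").orc(dataDir.getCanonicalPath)

          // The manifest under the table location points at the real ORC files.
          val targets = dataDir.listFiles().filter(_.getName.endsWith(".orc")).map(_.toURI.toString)
          java.nio.file.Files.write(
            new java.io.File(tableDir, "manifest").toPath,
            targets.mkString("\n").getBytes("UTF-8"))

          sql(
            s"""CREATE EXTERNAL TABLE symlink_orc (name STRING, version DOUBLE, sort INT)
               |ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
               |STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
               |OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
               |LOCATION '${tableDir.toURI}'""".stripMargin)

          // Before this PR this read threw an exception; with the change it should succeed.
          checkAnswer(sql("SELECT count(*) FROM symlink_orc"), Row(3L))
        }
      }
    }
  }
}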

@CHENXCHEN CHENXCHEN changed the title [SPARK-32432][SQL] Add support for reading ORC/Parquet files with SymlinkTextInputFormat And Fix Analyze for SymlinkTextInputFormat [SPARK-32432][SQL][WIP] Add support for reading ORC/Parquet files with SymlinkTextInputFormat And Fix Analyze for SymlinkTextInputFormat Mar 4, 2022
@github-actions github-actions bot added the SQL label Mar 4, 2022
@CHENXCHEN CHENXCHEN changed the title [SPARK-32432][SQL][WIP] Add support for reading ORC/Parquet files with SymlinkTextInputFormat And Fix Analyze for SymlinkTextInputFormat [SPARK-32432][SQL] Add support for reading ORC/Parquet files with SymlinkTextInputFormat And Fix Analyze for SymlinkTextInputFormat Mar 4, 2022
@CHENXCHEN (Author)

Could you please take a look and review this PR?

@CHENXCHEN CHENXCHEN changed the title [SPARK-32432][SQL] Add support for reading ORC/Parquet files with SymlinkTextInputFormat And Fix Analyze for SymlinkTextInputFormat [SPARK-32432][SQL] Add support for reading ORC/Parquet files of SymlinkTextInputFormat table And Fix Analyze for SymlinkTextInputFormat table Mar 4, 2022
@CHENXCHEN (Author)

This PR is a bit like this one, but that one is too old to support Spark 3.2.
In addition, this PR adds a fix for the analyzed table size.

@CHENXCHEN (Author)

ok to test

@AmplabJenkins

Can one of the admins verify this patch?

@CHENXCHEN (Author)

cc @cloud-fan could you help take a look when you have time? Thanks.

@cloud-fan (Contributor)

I think the problem today is that we don't have a good abstraction for this feature at the framework level. This is a special Hive file format that changes the behavior of file listing, while in Spark the FileIndex API assumes the file-listing behavior is unrelated to the file format.

This PR simply adds special handling of SymlinkTextInputFormat in several places, and I'm OK with it if it's super hard to come up with a good abstraction for it, but we should give it a try first.

cc @viirya @dongjoon-hyun @AngersZhuuuu
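
For context, the FileIndex contract looks roughly like the sketch below (member signatures recalled from Spark 3.x and possibly not exact); note that the file format never appears in it, which is the abstraction gap being discussed:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.datasources.PartitionDirectory
import org.apache.spark.sql.types.StructType

// Approximate shape of org.apache.spark.sql.execution.datasources.FileIndex:
// listing is driven purely by root paths and filters, so an input format such as
// SymlinkTextInputFormat has no hook to redirect listing to the manifest targets.
trait FileIndexSketch {
  def rootPaths: Seq[Path]
  def listFiles(
      partitionFilters: Seq[Expression],
      dataFilters: Seq[Expression]): Seq[PartitionDirectory]
  def inputFiles: Array[String]
  def refresh(): Unit
  def sizeInBytes: Long
  def partitionSchema: StructType
}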

PartitionDirectory(values, files)
// Check leaf files since they might be symlink targets
if (files.isEmpty) {
val status: Seq[FileStatus] = leafFiles.get(path) match {
Member

Are Symlink targets in leaf files? I think leaf files are listed from table root?

Author

A SymlinkTextInputFormat table specifies the list of files for a table/partition based on the content of a text file (the manifest).
As the example in the PR description shows, the real data file is not in the table root.

Member

the real data file is not in the table root

Yea, but leafFiles are listed from the table root, so how could symlink targets be in the leaf files?

Author

For a partitioned table we have the same behavior: the symlink manifest is also in the root directory of the partition.

@viirya (Member) commented Mar 7, 2022

Same feeling here. Not sure if SymlinkTextInputFormat is the only case with custom file listing behavior.

val isSymlinkTextFormat = SymlinkTextInputFormatUtil.isSymlinkTextFormat(relation.tableMeta)

val symlinkTargets = if (isSymlinkTextFormat) {
SymlinkTextInputFormatUtil.getTargetPathsFromSymlink(fs, tablePath)
Contributor

When the table is partitioned and lazyPruningEnabled = false, this value is not used; I think we don't need to compute it this early.

@AngersZhuuuu (Contributor)

Agree with @cloud-fan that we don't have a good abstraction for this feature.
IMO, we can add a new method under CatalogTable and CatalogTablePartition, such as

getFileList(fileFormat: String, fs: FileSystem)

Then we can list files directly according to the file format, and everywhere else we can just call this API, so we don't need to check for SymlinkTextInputFormat all over the place.

WDYT @cloud-fan @viirya @dongjoon-hyun
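
A minimal sketch of what such a helper could look like; the explicit location parameter and the manifest-reading logic are illustrative assumptions, not code from this PR or from Spark:

import scala.io.Source

import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

object CatalogFileListingSketch {

  private val SymlinkInputFormat = "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat"

  // List the data files of a table/partition rooted at `location`.
  def getFileList(location: Path, fileFormat: String, fs: FileSystem): Seq[FileStatus] = {
    if (fileFormat == SymlinkInputFormat) {
      // For symlink tables, every file under the location is a manifest whose lines
      // are the paths of the real data files (possibly outside the table location).
      fs.listStatus(location).filter(_.isFile).toSeq.flatMap { manifest =>
        val source = Source.fromInputStream(fs.open(manifest.getPath), "UTF-8")
        try {
          source.getLines().map(_.trim).filter(_.nonEmpty)
            .map(line => fs.getFileStatus(new Path(line))).toList
        } finally {
          source.close()
        }
      }
    } else {
      // For every other format, the files under the location are the data files themselves.
      fs.listStatus(location).filter(_.isFile).toSeq
    }
  }
}

A real method on CatalogTable/CatalogTablePartition would presumably take the location and input format from its own storage descriptor instead of parameters.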


if (paths.isEmpty) {
Seq(tablePath)
} else if (isSymlinkTextFormat) {
Author

@AngersZhuuuu it is also used here when lazyPruningEnabled = false

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 27, 2022
@github-actions github-actions bot closed this Sep 28, 2022