
Conversation

@CHENXCHEN commented Mar 4, 2022

What changes were proposed in this pull request?

  1. Add support for reading ORC/Parquet files of SymlinkTextInputFormat tables
  2. Fix the table size computed by ANALYZE TABLE for SymlinkTextInputFormat tables

Why are the changes needed?

  1. Both Trino (PrestoSQL) and PrestoDB support reading ORC/Parquet files of SymlinkTextInputFormat tables, but Spark does not: Spark supports the other formats but not ORC/Parquet. This PR adds support for reading ORC/Parquet files of SymlinkTextInputFormat tables.
  2. In the example below, ANALYZE TABLE computes a table size of 100 (the manifest size) instead of 19999 (the data size), which affects join optimization: Spark may broadcast what is actually a large table whenever the manifest size is below spark.sql.autoBroadcastJoinThreshold. After this PR, the analyzed table size is 19999 (a resolution sketch follows the example below).

We have the following files:

size   filepath
100    hdfs:///path/to/table/manifest
9999   hdfs:///path/to/other/part-1.parquet.orc
10000  hdfs:///path/to/other/part-2.parquet.orc

Content of hdfs:///path/to/table/manifest:

hdfs:///path/to/other/part-1.parquet.orc
hdfs:///path/to/other/part-2.parquet.orc

Table DDL:

CREATE EXTERNAL TABLE symlink_orc ( name STRING, version DOUBLE, sort INT )
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs:///path/to/table';

See details in the JIRA: SPARK-32432.
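
For illustration only (this is not code from the PR), a minimal Scala sketch of how a symlink manifest could be resolved to the real data files, and how summing their sizes yields 19999 rather than the 100-byte manifest size, using just the Hadoop FileSystem API:

import java.io.{BufferedReader, InputStreamReader}
import java.nio.charset.StandardCharsets

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SymlinkManifestExample {
  // Each non-empty line of a symlink manifest is the path of a real data file.
  def readTargets(fs: FileSystem, manifest: Path): Seq[Path] = {
    val in = new BufferedReader(new InputStreamReader(fs.open(manifest), StandardCharsets.UTF_8))
    try {
      Iterator.continually(in.readLine()).takeWhile(_ != null)
        .map(_.trim).filter(_.nonEmpty).map(new Path(_)).toList
    } finally {
      in.close()
    }
  }

  def main(args: Array[String]): Unit = {
    val manifest = new Path("hdfs:///path/to/table/manifest")
    val fs = manifest.getFileSystem(new Configuration())
    val targets = readTargets(fs, manifest)
    // Summing the target files' sizes gives 9999 + 10000 = 19999 in the example above,
    // rather than the 100-byte size of the manifest itself.
    val totalSize = targets.map(p => fs.getFileStatus(p).getLen).sum
    println(s"resolved ${targets.size} target files, total $totalSize bytes")
  }
}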

Does this PR introduce any user-facing change?

Yes

  1. Before this PR, an exception was thrown when reading the ORC/Parquet files of a SymlinkTextInputFormat table; now such reads work.
  2. Analyzing a SymlinkTextInputFormat table now computes the size of the real data files (e.g. the ORC/Parquet files), not the size of the manifest file, as described above.

How was this patch tested?

Added Unit Test: org.apache.spark.sql.hive.SymlinkSuite
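
The suite itself is not shown in this thread; a rough, hypothetical sketch of what one of its test cases might look like (class name, mixins, and setup are assumptions, not the actual SymlinkSuite code):

import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.hive.test.TestHiveSingleton
import org.apache.spark.sql.test.SQLTestUtils

class SymlinkSketchSuite extends QueryTest with SQLTestUtils with TestHiveSingleton {

  test("read ORC data through a SymlinkTextInputFormat manifest") {
    withTempDir { dataDir =>
      withTempDir { tableDir =>
        withTable("symlink_orc") {
          // Write ORC data outside the table location.
          spark.range(3)
            .selectExpr("cast(id as string) as name", "cast(id as double) as version", "cast(id as int) as sort")
            .write.mode("overwrite").orc(dataDir.getCanonicalPath)

          // The manifest under the table location points at the real ORC files.
          val targets = dataDir.listFiles().filter(_.getName.endsWith(".orc")).map(_.toURI.toString)
          java.nio.file.Files.write(
            new java.io.File(tableDir, "manifest").toPath,
            targets.mkString("\n").getBytes("UTF-8"))

          sql(
            s"""CREATE EXTERNAL TABLE symlink_orc (name STRING, version DOUBLE, sort INT)
               |ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
               |STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
               |OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
               |LOCATION '${tableDir.toURI}'""".stripMargin)

          // Before this PR this read threw an exception; with the change it should succeed.
          checkAnswer(sql("SELECT count(*) FROM symlink_orc"), Row(3L))
        }
      }
    }
  }
}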

@CHENXCHEN CHENXCHEN changed the title [SPARK-32432][SQL] Add support for reading ORC/Parquet files with SymlinkTextInputFormat And Fix Analyze for SymlinkTextInputFormat [SPARK-32432][SQL][WIP] Add support for reading ORC/Parquet files with SymlinkTextInputFormat And Fix Analyze for SymlinkTextInputFormat Mar 4, 2022
@github-actions github-actions bot added the SQL label Mar 4, 2022
@CHENXCHEN CHENXCHEN changed the title [SPARK-32432][SQL][WIP] Add support for reading ORC/Parquet files with SymlinkTextInputFormat And Fix Analyze for SymlinkTextInputFormat [SPARK-32432][SQL] Add support for reading ORC/Parquet files with SymlinkTextInputFormat And Fix Analyze for SymlinkTextInputFormat Mar 4, 2022
@CHENXCHEN (Author)

Could you please take a look and review this PR?

@CHENXCHEN CHENXCHEN changed the title [SPARK-32432][SQL] Add support for reading ORC/Parquet files with SymlinkTextInputFormat And Fix Analyze for SymlinkTextInputFormat [SPARK-32432][SQL] Add support for reading ORC/Parquet files of SymlinkTextInputFormat table And Fix Analyze for SymlinkTextInputFormat table Mar 4, 2022
@CHENXCHEN (Author)

This PR is a bit like this one, but that one is too old to support Spark 3.2.
In addition, this PR adds a fix for the analyzed table size.

@CHENXCHEN (Author)

ok to test

@AmplabJenkins

Can one of the admins verify this patch?

@CHENXCHEN (Author)

cc @cloud-fan could you help take a look when you have time? Thanks.

@cloud-fan (Contributor)

I think the problem today is that we don't have a good abstraction for this feature at the framework level. This is a special Hive file format that changes the behavior of file listing, while in Spark the FileIndex API assumes the file-listing behavior is unrelated to the file format.

This PR simply adds special handling of SymlinkTextInputFormat in several places, and I'm OK with it if it's super hard to come up with a good abstraction for it, but we should give it a try first.

cc @viirya @dongjoon-hyun @AngersZhuuuu
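
For context, the FileIndex contract looks roughly like the sketch below (member signatures recalled from Spark 3.x and possibly not exact); note that the file format never appears in it, which is the abstraction gap being discussed:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.execution.datasources.PartitionDirectory
import org.apache.spark.sql.types.StructType

// Approximate shape of org.apache.spark.sql.execution.datasources.FileIndex:
// listing is driven purely by root paths and filters, so an input format such as
// SymlinkTextInputFormat has no hook to redirect listing to the manifest targets.
trait FileIndexSketch {
  def rootPaths: Seq[Path]
  def listFiles(
      partitionFilters: Seq[Expression],
      dataFilters: Seq[Expression]): Seq[PartitionDirectory]
  def inputFiles: Array[String]
  def refresh(): Unit
  def sizeInBytes: Long
  def partitionSchema: StructType
}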

PartitionDirectory(values, files)
// Check leaf files since they might be symlink targets
if (files.isEmpty) {
val status: Seq[FileStatus] = leafFiles.get(path) match {
Member

Are Symlink targets in leaf files? I think leaf files are listed from table root?

Author

A SymlinkTextInputFormat table specifies the list of files for a table/partition based on the content of a text file (the manifest).
As the example in the PR description shows, the real data file is not in the table root.

Member

the real data file is not in the table root

Yea, but leafFiles are listed from the table root, so how could symlink targets be in the leaf files?

Author

For a partitioned table we have the same behavior: the symlink manifest is also in the root directory of the partition.

@viirya (Member) commented Mar 7, 2022

Same feeling here. Not sure if SymlinkTextInputFormat is the only case with custom file listing behavior.

val isSymlinkTextFormat = SymlinkTextInputFormatUtil.isSymlinkTextFormat(relation.tableMeta)

val symlinkTargets = if (isSymlinkTextFormat) {
SymlinkTextInputFormatUtil.getTargetPathsFromSymlink(fs, tablePath)
Contributor

When the table is partitioned and lazyPruningEnabled = false, this value is not used; I think we don't need to compute it this early.

@AngersZhuuuu (Contributor)

Agree with @cloud-fan that we don't have a good abstraction for this feature.
IMO, we can add a new method under CatalogTable and CatalogTablePartition, such as

getFileList(fileFormat: String, fs: FileSystem)

Then we can list files directly according to the file format, and everywhere else we can just call this API, so we don't need to check for SymlinkTextInputFormat all over the place.

WDYT @cloud-fan @viirya @dongjoon-hyun
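
A minimal sketch of what such a helper could look like; the explicit location parameter and the manifest-reading logic are illustrative assumptions, not code from this PR or from Spark:

import scala.io.Source

import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

object CatalogFileListingSketch {

  private val SymlinkInputFormat = "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat"

  // List the data files of a table/partition rooted at `location`.
  def getFileList(location: Path, fileFormat: String, fs: FileSystem): Seq[FileStatus] = {
    if (fileFormat == SymlinkInputFormat) {
      // For symlink tables, every file under the location is a manifest whose lines
      // are the paths of the real data files (possibly outside the table location).
      fs.listStatus(location).filter(_.isFile).toSeq.flatMap { manifest =>
        val source = Source.fromInputStream(fs.open(manifest.getPath), "UTF-8")
        try {
          source.getLines().map(_.trim).filter(_.nonEmpty)
            .map(line => fs.getFileStatus(new Path(line))).toList
        } finally {
          source.close()
        }
      }
    } else {
      // For every other format, the files under the location are the data files themselves.
      fs.listStatus(location).filter(_.isFile).toSeq
    }
  }
}

A real method on CatalogTable/CatalogTablePartition would presumably take the location and input format from its own storage descriptor instead of parameters.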


if (paths.isEmpty) {
Seq(tablePath)
} else if (isSymlinkTextFormat) {
Author

@AngersZhuuuu it is also used here when lazyPruningEnabled = false

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 27, 2022
@github-actions github-actions bot closed this Sep 28, 2022