
Conversation

@AngersZhuuuu (Contributor) commented Aug 25, 2021

What changes were proposed in this pull request?

Consider such cases:

  1. If we close (kill) a job while it is doing a dynamic partition insert, it leaves such a staging dir under the table's path. Making the staging dir customizable, as Hive does, lets us avoid leaving staging dirs under the table path.
  2. In Hive's API, if we specify a staging dir instead of using the default one (under the table path), it can rename directly to the target path and avoid many HDFS file operations. In Spark, currently only dynamic partition insert supports a staging dir; we can do this like [SPARK-36563][SQL] dynamicPartitionOverwrite can direct rename to targetPath instead of partition path one by one when targetPath is empty #33811
  3. We can add a file commit protocol that supports a staging dir for all types of insert; then, when we use that commit protocol, we can:
    • Insert into a non-partitioned table from itself
    • Insert into a partitioned table's static partition while reading data from the target partition
    • Insert into different partitions using static partitions together

Why are the changes needed?

Make the staging dir of Spark data source inserts customizable, so that we can do more optimizations based on it.

Does this PR introduce any user-facing change?

Users can define the staging dir via spark.exec.stagingDir.
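
For illustration, a minimal usage sketch (assuming the spark.exec.stagingDir key quoted above and placeholder table names; the exact key and defaults may differ in the final patch):

// Hypothetical usage: point the insert staging dir outside the table location.
spark.conf.set("spark.exec.stagingDir", "/tmp/spark-insert-staging")

// With dynamic partition overwrite enabled, data is then staged under the
// configured dir instead of a .spark-staging-* dir under the table path.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.sql(
  """INSERT OVERWRITE TABLE target_table PARTITION (dt)
    |SELECT value, dt FROM source_table""".stripMargin)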

How was this patch tested?

Added UT

@AngersZhuuuu (Contributor Author)

ping @dongjoon-hyun

@github-actions github-actions bot added the SQL label Aug 25, 2021
@SparkQA commented Aug 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47250/

@SparkQA commented Aug 25, 2021

Test build #142750 has finished for PR 33828 at commit 2031f5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 25, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47250/

    .booleanConf
    .createWithDefault(true)

  val FILE_COMMIT_STAGING_DIR =
Contributor Author

Since this is always SQL related, I added this config in the SQL part, although it is used in the core part.

Member

It's reasonable, but it's an assumption based on this PR AS-IS scope. Some other PRs may try to use it later.

Contributor Author

> It's reasonable, but it's an assumption based on this PR AS-IS scope. Some other PRs may try to use it later.

I'm planning to optimize the commit protocol too and have discussed it with @cloud-fan: https://docs.google.com/document/d/13yzpIUAmgQaJ1Jnu0kqQ4DORDxQoZJmWaJMAVMajdi0/edit#
If we can do it like that, I think it will be more convenient for us to do more optimization on the SQL side.

Contributor Author

> It's reasonable, but it's an assumption based on this PR AS-IS scope. Some other PRs may try to use it later.

@dongjoon-hyun I think the current code is ready for review now, and I have updated the description in detail.

@SparkQA commented Aug 25, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47254/

@SparkQA commented Aug 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47256/

@SparkQA commented Aug 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47256/

@SparkQA commented Aug 25, 2021

Test build #142754 has finished for PR 33828 at commit c29f55e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 25, 2021

Test build #142756 has finished for PR 33828 at commit 2604c9f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 25, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47264/

@SparkQA commented Aug 25, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47264/

@SparkQA commented Aug 25, 2021

Test build #142764 has finished for PR 33828 at commit 71f6b17.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented Aug 27, 2021

cc @mridulm, @Ngone51 for the core part addition.

@SparkQA commented Oct 12, 2021

Test build #144123 has finished for PR 33828 at commit 1947cbf.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 12, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48600/

@SparkQA commented Oct 12, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/48604/

@SparkQA commented Dec 13, 2021

Test build #146116 has finished for PR 33828 at commit 632d725.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member)

cc @steveloughran FYI

@steveloughran (Contributor)

ooh, commit protocols. wonderful and scary. of course, the future is delta and iceberg, isn't it?

@steveloughran (Contributor) left a comment

As I do the new committer for abfs and gcs (please review this week! at apache/hadoop#2971), I've been wondering how we could support staging dirs in filesystems where file renames are fast and correct (dir rename is non-atomic on gcs, file rename fails really dramatically on abfs when caller exceeds allocated capacity, hence throttling and recovery by etag checks in the new committer).
what if you defined a standard method name getStagingDir(): String which you look for through reflection and pick up? alternatively, if you want the ability to probe a committer to see if it is on an fs where rename works, we could coordinate using StreamCapabilities to add a probe you can use. this is in hadoop-2 so you can use it in your code and compile everywhere; i can add it in the new committer.

Can I also note that it is time Spark moved to the v2 committer API and org.apache.hadoop.mapreduce.OutputCommitter over the mapred package? it will simplify the bridging i have to do.
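
A rough Scala sketch of the reflection idea above; the getStagingDir method name is only the suggestion from this comment, not an existing Hadoop or Spark API:

import java.lang.reflect.Method

import org.apache.hadoop.mapreduce.OutputCommitter

// Hypothetical probe: look for a zero-arg `getStagingDir(): String` method on
// the committer via reflection and pick it up if present.
def stagingDirOf(committer: OutputCommitter): Option[String] = {
  try {
    val m: Method = committer.getClass.getMethod("getStagingDir")
    Option(m.invoke(committer)).map(_.toString)
  } catch {
    case _: NoSuchMethodException => None
  }
}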

    engineType: String,
    jobId: String): Path = {
  val extURI = path.toUri
  if (extURI.getScheme == "viewfs") {
Contributor

always a bit brittle using the uri scheme. any way to avoid?

Contributor Author

> always a bit brittle using the uri scheme. any way to avoid?

To be honest, I don't know why we need to handle viewfs separately. After searching for the original pull request that introduced this part, I don't see any discussion, so I am not sure about it. This part is just moved here.

Contributor

probably there to ensure you don't create staging dirs in a different fs

Contributor Author

> probably there to ensure you don't create staging dirs in a different fs

Yea, I remember that Hive checks the tempLocation's and targetOutputPath's FS and encryption zone when calling HiveMetastore loadTable/loadPartition. We should move that logic here.

Contributor Author

> probably there to ensure you don't create staging dirs in a different fs

How about the current version?
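
A sketch of that kind of guard, assuming we only verify that both paths resolve to the same FileSystem (the encryption-zone check Hive also performs is omitted here):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: reject a staging dir that lives on a different
// filesystem than the target table path, which is roughly what the viewfs
// special case and Hive's loadTable/loadPartition checks guard against.
def validateSameFileSystem(stagingDir: Path, targetPath: Path, conf: Configuration): Unit = {
  val stagingFs = stagingDir.getFileSystem(conf)
  val targetFs = targetPath.getFileSystem(conf)
  if (stagingFs.getUri != targetFs.getUri) {
    throw new IllegalArgumentException(
      s"Staging dir $stagingDir (${stagingFs.getUri}) and target path " +
        s"$targetPath (${targetFs.getUri}) must be on the same filesystem")
  }
}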

class StagingInsertSuite extends QueryTest with SharedSparkSession {
  import testImplicits._

  val stagingDir = Utils.createTempDir()
Contributor

be nice if this could be subclassed so we could add tests which tried this on, say abfs or gcs

Contributor Author

> be nice if this could be subclassed so we could add tests which tried this on, say abfs or gcs

You mean, use abfs or gcs as the staging dir?

Contributor

i mean write tests which can be subclassed and then retargeted at a remote store. a lot of the spark relations tests assume local fs everywhere, for example, and i had to copy and paste them for cloud storage testing

Contributor Author

> i mean write tests which can be subclassed and then retargeted at a remote store. a lot of the spark relations tests assume local fs everywhere, for example, and i had to copy and paste them for cloud storage testing

Got it, let me try
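
A sketch of what such a subclassable suite could look like; the class name, the stagingRoot hook and the spark.exec.stagingDir key are illustrative assumptions, not code from this PR:

import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession
import org.apache.spark.util.Utils

// Hypothetical base suite: the staging root is an overridable method, so a
// cloud-storage subclass can point it at an abfs:// or gs:// location
// instead of the local-filesystem default.
abstract class StagingInsertSuiteBase extends QueryTest with SharedSparkSession {

  // Subclasses override this with a remote-store path.
  protected def stagingRoot: String = Utils.createTempDir().getCanonicalPath

  test("insert with customized staging dir") {
    withSQLConf("spark.exec.stagingDir" -> stagingRoot) {
      withTable("t") {
        sql("CREATE TABLE t (id INT, p INT) USING parquet PARTITIONED BY (p)")
        sql("INSERT OVERWRITE TABLE t PARTITION (p = 1) SELECT 1")
        checkAnswer(sql("SELECT id, p FROM t"), Row(1, 1))
      }
    }
  }
}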

Contributor Author

@steveloughran How about the current version? Is it what you want?

@steveloughran (Contributor)

  • propose using spark.sql.sources.writeJobUUID as the job id when set; more uniqueness, and it should be set everywhere.
  • core design looks ok. but i don't see why you couldn't support concurrent jobs just by having different subdirs of __temporary for different job IDs/UUIDs, and an option to disable cleanup. (and instructions to do it later, which you'd need to do anyway).
  • because that use of _temporary/0 in the file output committer is only there so that on a restart the MR AM lets the committer use _temporary/1 (using the app attempt number for the subdir) and then move the committed task data from job attempt 0 into its own dir, recovering all existing work. spark doesn't need that.
  • it'd be good for you to try out my manifest committer against hdfs with your workloads. it is designed to be a lot faster in job commit because all listing of task output directory trees is done in task commit, and job commit does everything in parallel (listing of manifests, loading of manifests, creating dest dirs, file rename). some of the options you don't need for hdfs (parallel delete of task attempt temp dirs), but I still expect a massive speedup of job commit, though not as much as for stores where listing and rename are slower.

The reason i don't explicitly target HDFS is that it means I can cut out that testing/QE and focus on abfs and gcs, using benchmarks from there to tune the algorithm. For example, it turns out that mkdirs on gcs is slow, so you should check for existence first; that is now done in task commit, which adds duplicate probes there, but knowing that abfs does async page prefetch on a `listStatusIterator()` call, i can do the `getFileStatus(destDir)` call after making the list call and have it complete while the first page of list results is coming in.
https://github.com/steveloughran/hadoop/blob/mr/MAPREDUCE-7341-manifest-committer/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/committer/manifest/stages/TaskAttemptScanDirectoryStage.java#L150

numbers for HDFS would only distract me, but you will see much faster parallel job commits on "real world" partitioned trees

@steveloughran (Contributor)

oh, also, I'm thinking of making some gcs enhancements which turn off some checks under __temporary/ paths, breaking "strict" fs semantics but delivering performance through reduced io

  • skipping all overwrite/parent is dir/dest is not a directory checks when creating a file
  • not worrying about recreating parent dir markers after renaming or deleting files
    ... etc. S3A will do the same under paths with __magic an element above it, saves a HEAD and a LIST for every parquet file written (it sets overwrite=false when creating files, for no reason at all)

so you should always use _temporary as one path element in your staging dir to get any of those benefits
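
A small sketch of a staging-path layout that follows that advice, keeping _temporary as a path element and using the write job UUID for uniqueness (the layout itself is an assumption for illustration):

import org.apache.hadoop.fs.Path

// Hypothetical layout: <staging root>/_temporary/<job UUID>, so concurrent
// jobs get distinct dirs while object-store optimizations keyed off the
// `_temporary` path element still apply.
def jobStagingPath(stagingRoot: String, jobUUID: String): Path =
  new Path(new Path(stagingRoot, "_temporary"), jobUUID)

// e.g. jobStagingPath("/warehouse/.staging", "<job-uuid>") ->
//   /warehouse/.staging/_temporary/<job-uuid>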

@AngersZhuuuu (Contributor Author)

>   • propose using spark.sql.sources.writeJobUUID as the job id when set; more uniqueness, and it should be set everywhere.

Right now all places use Spark's job id. I can do this after this PR since it's a separate change.

>   • core design looks ok. but i don't see why you couldn't support concurrent jobs just by having different subdirs of __temporary for different job IDs/UUIDs, and an option to disable cleanup. (and instructions to do it later, which you'd need to do anyway).

Because if two jobs write to different partitions of the same table, they have the same output path ${table_location}/_temporary/0.
If one job succeeds, it deletes that path, and then the other job's data is lost, as sketched below.
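
A minimal sketch of that collision with the default committer layout (the table location is a made-up example):

// Two concurrent inserts into different partitions of the same table stage
// under the same directory with the default FileOutputCommitter:
val tableLocation = "/warehouse/tbl"               // made-up location
val jobAStaging = s"$tableLocation/_temporary/0"   // INSERT ... PARTITION (p=1)
val jobBStaging = s"$tableLocation/_temporary/0"   // INSERT ... PARTITION (p=2)
assert(jobAStaging == jobBStaging)
// When job A commits, it deletes ${tableLocation}/_temporary and wipes job B's
// in-flight task output; a per-job staging dir avoids the collision.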

>   • because that use of _temporary/0 in the file output committer is only there so that on a restart the MR AM lets the committer use _temporary/1 (using the app attempt number for the subdir) and then move the committed task data from job attempt 0 into its own dir, recovering all existing work. spark doesn't need that.

This is because Spark still uses FileOutputCommitter and keeps this behavior. If we can rewrite the commit protocol, we can avoid it.

>   • it'd be good for you to try out my manifest committer against hdfs with your workloads. it is designed to be a lot faster in job commit because all listing of task output directory trees is done in task commit, and job commit does everything in parallel (listing of manifests, loading of manifests, creating dest dirs, file rename). some of the options you don't need for hdfs (parallel delete of task attempt temp dirs), but I still expect a massive speedup of job commit, though not as much as for stores where listing and rename are slower.

Yea, I will try this later. It's a very useful design and can reduce the pressure on HDFS a lot. I need to check this with our HDFS team too.

@AngersZhuuuu (Contributor Author)

> oh, also, I'm thinking of making some gcs enhancements which turn off some checks under __temporary/ paths, breaking "strict" fs semantics but delivering performance through reduced io
>
>   • skipping all overwrite/parent is dir/dest is not a directory checks when creating a file
>   • not worrying about recreating parent dir markers after renaming or deleting files
>     ... etc. S3A will do the same under paths with __magic an element above it, saves a HEAD and a LIST for every parquet file written (it sets overwrite=false when creating files, for no reason at all)
>
> so you should always use _temporary as one path element in your staging dir to get any of those benefits

I have checked how to use the manifest commit protocol, but I found a problem:

class PathOutputCommitProtocol(
    jobId: String,
    dest: String,
    dynamicPartitionOverwrite: Boolean = false)
  extends HadoopMapReduceCommitProtocol(jobId, dest, false) with Serializable {

  if (dynamicPartitionOverwrite) {
    // until there's explicit extensions to the PathOutputCommitProtocols
    // to support the spark mechanism, it's left to the individual committer
    // choice to handle partitioning.
    throw new IOException(PathOutputCommitProtocol.UNSUPPORTED)
  }

In current Spark code, PathOutputCommitProtocol doesn't support dynamicPartitionOverwrite, which means we can't use your feature in the dynamic partition overwrite case.

We need to make some changes to support this. WDYT? cc @steveloughran @HyukjinKwon @cloud-fan @viirya
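
For reference, a sketch of the combination that currently fails (assuming the spark-hadoop-cloud module is on the classpath; the table names are placeholders):

// Ask Spark to use the cloud committer binding...
spark.conf.set(
  "spark.sql.sources.commitProtocolClass",
  "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
// ...and enable dynamic partition overwrite.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// Expected to fail at write time with IOException(PathOutputCommitProtocol.UNSUPPORTED),
// per the constructor check quoted above.
spark.sql(
  """INSERT OVERWRITE TABLE target_table PARTITION (dt)
    |SELECT value, dt FROM source_table""".stripMargin)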

@steveloughran (Contributor) commented Mar 17, 2022

that dynamic partition work overlapped with the committer extension work and the s3a committer.

It broke the merge; those lines you've found show the workaround.

A key problem with the spark code is that it assumes file rename is a good way to commit work. AFAIK, it doesn't assume that directory renames are atomic, but unless file renames are fast, performance is going to be unsatisfactory.

And on S3, file rename is O(data), so applications which use it to promote work (hello hive!) really suffer.
I think that's why I have never looked for a good solution here.
Things are different on azure and google cloud, where file rename usually(*) works. This means that we could look at what needs to be done.

Is there anything written up on this commit protocol I could look at to see what could be done?

At the very least we could have the known committer implementations support StreamCapabilities.hasCapability() with some rename-related capabilities of the FS we could indirectly ask for (fast file rename, fast dir rename, atomic dir rename), which would let spark know what was actually viable at all. but those are really fs capabilities; you can't really expect the committer itself to know what the fs does, except in the case of the s3a committer, which is hard-coded to one fs whose semantics are known (though amplidata and netapp s3 devices do have fast file copy/rename even there...)

  • look at all the code related to etags, rename recovery and preemptive rate limiting....
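
A sketch of how such a probe could look from the Spark side; the capability string is an invented placeholder, since no such constants exist yet:

import org.apache.hadoop.fs.StreamCapabilities
import org.apache.hadoop.mapreduce.OutputCommitter

// Hypothetical capability name; a real version would use constants agreed
// with the Hadoop side rather than this invented string.
val FastFileRename = "fs.capability.rename.file.fast"

// Probe the committer (if it opts in to StreamCapabilities) before deciding
// whether a rename-based staging-dir strategy is viable on this filesystem.
def supportsFastRename(committer: OutputCommitter): Boolean = committer match {
  case c: StreamCapabilities => c.hasCapability(FastFileRename)
  case _ => false // unknown committer: assume renames may be slow
}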

@AngersZhuuuu (Contributor Author)

> Is there anything written up on this commit protocol I could look at to see what could be done?

The whole plan is here: https://github.com/apache/spark/pull/35319/files
You could help check the newly added committer's (SQLPathOutputCommitter) logic. This committer is adapted from FileOutputCommitter: it writes files to the staging path and then, when committing data, commits them to the work path.
So the number of file operations is the same, and it avoids the conflicts I mentioned in the PR description.
Also, you could help check whether the committer (SQLPathOutputCommitter) can be optimized with some of your ideas about avoiding unnecessary operations.

@AngersZhuuuu (Contributor Author)

@steveloughran I have seen your comments on #35319. I think we should make the changes (the ones from #35319 (comment) and #35319 (comment) have been done) in this PR as the first step. OK?

@steveloughran (Contributor)

Ok

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 30, 2022
@github-actions github-actions bot closed this Oct 1, 2022
@beatbull

Hi, sadly this PR got closed (automatically due to inactivity). We'd be interested in this feature & config option since the ".spark-staging-*" folders are causing trouble, e.g. when using hive-partitioned tables in BigQuery (via external table) or BigTable (same issue). AFAIS there is no setting in BigQuery & BigTable to ignore folders starting with "." or some other pattern.

We have dynamic partitioning on an in-between level, which generates the .spark-staging folders inside the Hive partition path. E.g. partitioning would regularly look like a=foo/b=42/c=2022-11-11. If c is dynamically partitioned, we get a=foo/b=42/.spark-staging-3f91233b-4992-4e05-baec-3b4533535b9d/c=2022-11-11. Even if the Spark job succeeds, these .spark-staging folders can still cause queries to fail in e.g. BigQuery/BigTable while they exist temporarily. The error in BigQuery is something like:

error message: Incompatible partition schemas.
Expected schema ([a:TYPE_STRING, b:TYPE_INT64, c:TYPE_DATE]) has 3 columns. Observed schema ([a, b]) has 2 columns.
File: a=foo/b=42/.spark-staging-3f91233b-4992-4e05-baec-3b4533535b9d/c=2022-11-11

Obviously this could also be seen as a problem of BigQuery & BigTable, but having the staging dir configurable in Spark and moving it outside of the table path would definitely help here. Any chance this PR could be revived?
