[CELEBORN-1402] SparkShuffleManager print warning log for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar #2482

SteNicholas · 2024-04-28T18:01:23Z

What changes were proposed in this pull request?

SparkShuffleManager print warning log for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar via --jar or spark.jars.

Why are the changes needed?

When spark.executor.userClassPathFirst is enabled with ShuffleManager defined in user jar, the ClassLoader of handle is ChildFirstURLClassLoader, which is different from CelebornShuffleHandle of which the ClassLoader is AppClassLoader in SparkShuffleManager#getWriter/getReader. The local test log is as follows:

./bin/spark-sql --master yarn --deploy-mode client \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.jars=/tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false

./bin/spark-sql --master yarn --deploy-mode client --jars /tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false

24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: sun.misc.Launcher$AppClassLoader

It causes that SparkShuffleManager fallback to vanilla Spark SortShuffleManager for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar before apache/spark#43627. After SPARK-45762, the ClassLoader of handle and CelebornShuffleHandle are both ChildFirstURLClassLoader.

24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: org.apache.spark.util.ChildFirstURLClassLoader

Therefore, SparkShuffleManager should print warning log to remind for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual test.

SteNicholas · 2024-04-28T18:02:18Z

Ping @mridulm, @lifulong, @FMX.

mridulm · 2024-04-29T05:56:47Z

We ask users to copy celeborn client side jar to spark jars directory - that should ensure this problem does not occur, right ?

So this would apply only when users are leveraging --jars/spark.jars in order to specify celeborn jar ?
If yes, detect and add warning only when that difference exists ?

SteNicholas · 2024-04-29T06:03:58Z

@mridulm, I left some points according to your reply:

We could not ask user to copy celeborn client side jar to spark jars directory. It's required that the celeborn client jar is placed in HDFS directory instead of spark jars directory in some production environment like bilibili. We could allow user to specify the celeborn client jar and provide the warning log for the cases that fallback to SortManager.
This problem only occurs in the case that spark.executor.userClassPathFirst=true and use --jars/spark.jars to specify the celeborn client jar, which has been given reproduce command in description. The company Zhihu @lifulong has met this problem in their production practice. The solution of this problem has two ways: disable spark.executor.userClassPathFirst or cherry pick [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order spark#43627.
Meanwhile, I have mentioned this point in warning log that fallback to SortManager when setting spark.executor.userClassPathFirst to true and use ShuffleManager defined in user jar for kindly reminder.
BTW, I have also mentioned that this problem does not occur after SPARK-45762.

Do you have any suggestion to the warning log?

lifulong · 2024-04-29T07:27:37Z

Ping @mridulm, @lifulong, @FMX.

LGTM

lifulong

LGTM

mridulm · 2024-04-30T05:35:16Z

@SteNicholas That makes sense, there are definitely scenarios where we cannot update the spark distribution with additional jars.

The reason why I brought that up is, currently the warning will always be emitted ... even if celeborn jars are part of spark distribution and so not susceptible to the issue.

Given this, can be add to the if condition to check if this is a problematic case ?

SteNicholas · 2024-05-05T12:00:28Z

@mridulm, I got your main point of concern. I have addressed above comment and reduced the scope of printing warning logs. PTAL.

FMX

LGTM. Thanks.

client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkShuffleManager.java

README.md

client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkShuffleManager.java

SteNicholas · 2024-05-06T12:56:25Z

@cxzl25, I have followed the check checkUserClassPathFirst in spark-2 module. PTAL.

SteNicholas · 2024-05-07T05:52:52Z

@mridulm, PTAL.

README.md

client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkShuffleManager.java

docs/deploy.md

SteNicholas · 2024-05-07T13:05:50Z

@mridulm, @pan3793, @cfmcgrady, @cxzl25, I have address above comments. PTAL.

mridulm · 2024-05-08T06:03:02Z

Hi @SteNicholas, which modes did you test this in ?
I did a quick check against local cluster, spark standalone and local yarn - and keep getting this and variants of this with 3.1 (which is what I would expect - since spark does not support it).

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1956)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:402)
	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.shuffle.celeborn.SparkShuffleManager
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:207)
	at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:275)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:338)
	at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:205)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:451)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
	... 4 more

This should cause the executor to fail, and so wont get to this PR.
Am I missing something here ?

I tested this with:
./bin/spark-submit --conf spark.executor.userClassPathFirst=true --conf spark.jars=./celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar --master 'yarn' --deploy-mode cluster --num-executors 2 --executor-memory 1g --executor-cores 1 --class com.mridul.test.Test ./test.jar

(basic celeborn config is there in my defaults config)

mridulm · 2024-05-08T06:08:33Z

~~Let me recheck that - I think the celeborn diff did not apply in my local build~~

Works fine with celeborn jar in spark jar directory, not with --jars for 3.1: which is what I would expect actually (given SPARK-45762).

How were you able to test/validate this @SteNicholas ? It is possible I am missing some steps here ?

SteNicholas · 2024-05-08T07:41:05Z

@mridulm, I just tested with Spark SQL via above command. Did you ever try to test Spark SQL to verify?
Meanwhile, why does the above application fail with Caused by: java.lang.ClassNotFoundException: org.apache.spark.shuffle.celeborn.SparkShuffleManager? The celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar should be in classpath.

mridulm · 2024-05-08T22:58:36Z

Hi @SteNicholas , I was able to reproduce this behavior when celeborn jar is there in both spark platform jars and is specified by users via --jars.
Does this match your deployment experience ?

If yes, in this case, it might simply be easier to ask users not to add celeborn jar to their --jars ?
(sc.parallelize(1 to 40).repartition(20).flatMap(_ => (1 to 1500000).iterator.map(num => num)).repartition(25).count() - though the specific query should not matter in this case)

Please let me know if I am missing anything.

SteNicholas · 2024-05-09T07:21:14Z

@mridulm, my deployment experience is that the celeborn client jar is not placed in jars directory but placed in HDFS which path is set to spark.jars for using SparkShuffleManager. The reason why put the celeborn client jar to HDFS is that we need to decouple the celebrity client release from the spark release. Meanwhile, we could not ask the user not to do this behavior when user need to use --jar way. This is just internal deployment experience. Please let me know if anything wrong.

mridulm · 2024-05-09T11:14:49Z

Before 4.0, Apache Spark does not support shuffle manager from user jars - so I am not sure how this is working for you ... unless the fix was cherry picked/independently fixed in the Spark fork you are running perhaps ?

Once 4.0 is released, this patch will become relevant: so the PR is useful and I am in favor of merging it ... I want to make sure we have the right expectations: given spark does not support this feature before 4.0, trying this will fail for users 😃

cxzl25

LGTM. Because the final fallback is to Spark's SortShuffleWriter, this may lead to inaccurate data. See SPARK-48037

…tor.userClassPathFirst=true with ShuffleManager defined in user jar

SteNicholas · 2024-05-16T06:44:55Z

@mridulm, I uses the internal Spark 3.1 version to verify this. Anyway, this pull request could help user to confirm whether to use Celeborn in SparkShuffleManager.
@mridulm, @pan3793, @cfmcgrady, @cxzl25, could this pull request be merged?

SteNicholas · 2024-05-17T03:04:15Z

Merged to main(v0.5.0).

lifulong approved these changes Apr 29, 2024

View reviewed changes

SteNicholas force-pushed the CELEBORN-1402 branch from 883ca9c to 7e09a35 Compare May 5, 2024 11:58

FMX approved these changes May 6, 2024

View reviewed changes

cxzl25 reviewed May 6, 2024

View reviewed changes

client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkShuffleManager.java Outdated Show resolved Hide resolved

cfmcgrady reviewed May 6, 2024

View reviewed changes

README.md Show resolved Hide resolved

cfmcgrady reviewed May 6, 2024

View reviewed changes

client-spark/spark-3/src/main/java/org/apache/spark/shuffle/celeborn/SparkShuffleManager.java Outdated Show resolved Hide resolved

SteNicholas force-pushed the CELEBORN-1402 branch 2 times, most recently from 4c47827 to cc39797 Compare May 6, 2024 12:50

SteNicholas force-pushed the CELEBORN-1402 branch from cc39797 to 2912c21 Compare May 6, 2024 15:53

SteNicholas requested review from cfmcgrady and cxzl25 May 6, 2024 15:58

mridulm reviewed May 7, 2024

View reviewed changes

README.md Outdated Show resolved Hide resolved

client-spark/spark-2/src/main/java/org/apache/spark/shuffle/celeborn/SparkShuffleManager.java Outdated Show resolved Hide resolved

pan3793 reviewed May 7, 2024

View reviewed changes

docs/deploy.md Outdated Show resolved Hide resolved

SteNicholas requested review from mridulm and pan3793 May 7, 2024 13:23

mridulm approved these changes May 9, 2024

View reviewed changes

cxzl25 approved these changes May 9, 2024

View reviewed changes

[CELEBORN-1402] SparkShuffleManager print warning log for spark.execu…

3bbdec4

…tor.userClassPathFirst=true with ShuffleManager defined in user jar

SteNicholas force-pushed the CELEBORN-1402 branch from 5729a71 to 3bbdec4 Compare May 16, 2024 06:43

SteNicholas closed this in 9908035 May 17, 2024

[CELEBORN-1402] SparkShuffleManager print warning log for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar #2482

[CELEBORN-1402] SparkShuffleManager print warning log for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar #2482

Uh oh!

Conversation

SteNicholas commented Apr 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

SteNicholas commented Apr 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mridulm commented Apr 29, 2024

Uh oh!

SteNicholas commented Apr 29, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lifulong commented Apr 29, 2024

Uh oh!

lifulong left a comment

Choose a reason for hiding this comment

Uh oh!

mridulm commented Apr 30, 2024

Uh oh!

SteNicholas commented May 5, 2024

Uh oh!

FMX left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

SteNicholas commented May 6, 2024

Uh oh!

SteNicholas commented May 7, 2024

Uh oh!

Uh oh!

Uh oh!

Uh oh!

SteNicholas commented May 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mridulm commented May 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mridulm commented May 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SteNicholas commented May 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mridulm commented May 8, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SteNicholas commented May 9, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mridulm commented May 9, 2024

Uh oh!

cxzl25 left a comment

Choose a reason for hiding this comment

Uh oh!

SteNicholas commented May 16, 2024

Uh oh!

SteNicholas commented May 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

SteNicholas commented Apr 28, 2024 •

edited

Loading

SteNicholas commented Apr 28, 2024 •

edited

Loading

SteNicholas commented Apr 29, 2024 •

edited

Loading

SteNicholas commented May 7, 2024 •

edited

Loading

mridulm commented May 8, 2024 •

edited

Loading

mridulm commented May 8, 2024 •

edited

Loading

SteNicholas commented May 8, 2024 •

edited

Loading

mridulm commented May 8, 2024 •

edited

Loading

SteNicholas commented May 9, 2024 •

edited

Loading

SteNicholas commented May 17, 2024 •

edited

Loading