Skip to content

Conversation

@SteNicholas
Copy link
Member

@SteNicholas SteNicholas commented Apr 28, 2024

What changes were proposed in this pull request?

SparkShuffleManager print warning log for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar via --jar or spark.jars.

Why are the changes needed?

When spark.executor.userClassPathFirst is enabled with ShuffleManager defined in user jar, the ClassLoader of handle is ChildFirstURLClassLoader, which is different from CelebornShuffleHandle of which the ClassLoader is AppClassLoader in SparkShuffleManager#getWriter/getReader. The local test log is as follows:

./bin/spark-sql --master yarn --deploy-mode client \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.jars=/tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false

./bin/spark-sql --master yarn --deploy-mode client --jars /tmp/celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar \
--conf spark.celeborn.master.endpoints=localhost:9099 \
--conf spark.executor.userClassPathFirst=true \
--conf spark.shuffle.manager=org.apache.spark.shuffle.celeborn.SparkShuffleManager \
--conf spark.shuffle.service.enabled=false
24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: sun.misc.Launcher$AppClassLoader

It causes that SparkShuffleManager fallback to vanilla Spark SortShuffleManager for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar before apache/spark#43627. After SPARK-45762, the ClassLoader of handle and CelebornShuffleHandle are both ChildFirstURLClassLoader.

24/04/28 18:03:31 [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] WARN SparkShuffleManager: [getWriter] handle classloader: org.apache.spark.util.ChildFirstURLClassLoader, CelebornShuffleHandle classloader: org.apache.spark.util.ChildFirstURLClassLoader

Therefore, SparkShuffleManager should print warning log to remind for spark.executor.userClassPathFirst=true with ShuffleManager defined in user jar.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual test.

@SteNicholas
Copy link
Member Author

SteNicholas commented Apr 28, 2024

Ping @mridulm, @lifulong, @FMX.

@mridulm
Copy link
Contributor

mridulm commented Apr 29, 2024

We ask users to copy celeborn client side jar to spark jars directory - that should ensure this problem does not occur, right ?

So this would apply only when users are leveraging --jars/spark.jars in order to specify celeborn jar ?
If yes, detect and add warning only when that difference exists ?

@SteNicholas
Copy link
Member Author

SteNicholas commented Apr 29, 2024

@mridulm, I left some points according to your reply:

  • We could not ask user to copy celeborn client side jar to spark jars directory. It's required that the celeborn client jar is placed in HDFS directory instead of spark jars directory in some production environment like bilibili. We could allow user to specify the celeborn client jar and provide the warning log for the cases that fallback to SortManager.
  • This problem only occurs in the case that spark.executor.userClassPathFirst=true and use --jars/spark.jars to specify the celeborn client jar, which has been given reproduce command in description. The company Zhihu @lifulong has met this problem in their production practice. The solution of this problem has two ways: disable spark.executor.userClassPathFirst or cherry pick [SPARK-45762][CORE] Support shuffle managers defined in user jars by changing startup order spark#43627.
  • Meanwhile, I have mentioned this point in warning log that fallback to SortManager when setting spark.executor.userClassPathFirst to true and use ShuffleManager defined in user jar for kindly reminder.
  • BTW, I have also mentioned that this problem does not occur after SPARK-45762.

Do you have any suggestion to the warning log?

@lifulong
Copy link
Contributor

Ping @mridulm, @lifulong, @FMX.

LGTM

Copy link
Contributor

@lifulong lifulong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mridulm
Copy link
Contributor

mridulm commented Apr 30, 2024

@SteNicholas That makes sense, there are definitely scenarios where we cannot update the spark distribution with additional jars.

The reason why I brought that up is, currently the warning will always be emitted ... even if celeborn jars are part of spark distribution and so not susceptible to the issue.

Given this, can be add to the if condition to check if this is a problematic case ?

@SteNicholas
Copy link
Member Author

@mridulm, I got your main point of concern. I have addressed above comment and reduced the scope of printing warning logs. PTAL.

Copy link
Contributor

@FMX FMX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks.

@SteNicholas SteNicholas force-pushed the CELEBORN-1402 branch 2 times, most recently from 4c47827 to cc39797 Compare May 6, 2024 12:50
@SteNicholas
Copy link
Member Author

@cxzl25, I have followed the check checkUserClassPathFirst in spark-2 module. PTAL.

@SteNicholas
Copy link
Member Author

@mridulm, PTAL.

@SteNicholas
Copy link
Member Author

SteNicholas commented May 7, 2024

@mridulm, @pan3793, @cfmcgrady, @cxzl25, I have address above comments. PTAL.

@SteNicholas SteNicholas requested review from mridulm and pan3793 May 7, 2024 13:23
@mridulm
Copy link
Contributor

mridulm commented May 8, 2024

Hi @SteNicholas, which modes did you test this in ?
I did a quick check against local cluster, spark standalone and local yarn - and keep getting this and variants of this with 3.1 (which is what I would expect - since spark does not support it).

Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1956)
	at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:402)
	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
	at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.shuffle.celeborn.SparkShuffleManager
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:348)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:207)
	at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:275)
	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:338)
	at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:205)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:451)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
	at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1938)
	... 4 more

This should cause the executor to fail, and so wont get to this PR.
Am I missing something here ?

I tested this with:
./bin/spark-submit --conf spark.executor.userClassPathFirst=true --conf spark.jars=./celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar --master 'yarn' --deploy-mode cluster --num-executors 2 --executor-memory 1g --executor-cores 1 --class com.mridul.test.Test ./test.jar

(basic celeborn config is there in my defaults config)

@mridulm
Copy link
Contributor

mridulm commented May 8, 2024

Let me recheck that - I think the celeborn diff did not apply in my local build

Works fine with celeborn jar in spark jar directory, not with --jars for 3.1: which is what I would expect actually (given SPARK-45762).

How were you able to test/validate this @SteNicholas ? It is possible I am missing some steps here ?

@SteNicholas
Copy link
Member Author

SteNicholas commented May 8, 2024

@mridulm, I just tested with Spark SQL via above command. Did you ever try to test Spark SQL to verify?
Meanwhile, why does the above application fail with Caused by: java.lang.ClassNotFoundException: org.apache.spark.shuffle.celeborn.SparkShuffleManager? The celeborn-client-spark-3-shaded_2.12-0.5.0-SNAPSHOT.jar should be in classpath.

@mridulm
Copy link
Contributor

mridulm commented May 8, 2024

Hi @SteNicholas , I was able to reproduce this behavior when celeborn jar is there in both spark platform jars and is specified by users via --jars.
Does this match your deployment experience ?

If yes, in this case, it might simply be easier to ask users not to add celeborn jar to their --jars ?
(sc.parallelize(1 to 40).repartition(20).flatMap(_ => (1 to 1500000).iterator.map(num => num)).repartition(25).count() - though the specific query should not matter in this case)

Please let me know if I am missing anything.

@SteNicholas
Copy link
Member Author

SteNicholas commented May 9, 2024

@mridulm, my deployment experience is that the celeborn client jar is not placed in jars directory but placed in HDFS which path is set to spark.jars for using SparkShuffleManager. The reason why put the celeborn client jar to HDFS is that we need to decouple the celebrity client release from the spark release. Meanwhile, we could not ask the user not to do this behavior when user need to use --jar way. This is just internal deployment experience. Please let me know if anything wrong.

@mridulm
Copy link
Contributor

mridulm commented May 9, 2024

Before 4.0, Apache Spark does not support shuffle manager from user jars - so I am not sure how this is working for you ... unless the fix was cherry picked/independently fixed in the Spark fork you are running perhaps ?

Once 4.0 is released, this patch will become relevant: so the PR is useful and I am in favor of merging it ... I want to make sure we have the right expectations: given spark does not support this feature before 4.0, trying this will fail for users 😃

Copy link
Contributor

@cxzl25 cxzl25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Because the final fallback is to Spark's SortShuffleWriter, this may lead to inaccurate data. See SPARK-48037

…tor.userClassPathFirst=true with ShuffleManager defined in user jar
@SteNicholas
Copy link
Member Author

@mridulm, I uses the internal Spark 3.1 version to verify this. Anyway, this pull request could help user to confirm whether to use Celeborn in SparkShuffleManager.
@mridulm, @pan3793, @cfmcgrady, @cxzl25, could this pull request be merged?

@SteNicholas
Copy link
Member Author

SteNicholas commented May 17, 2024

Merged to main(v0.5.0).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants