[VL] Flaky Celeborn tests #11103
Closed
Description
Backend
VL (Velox)
Bug description
https://github.com/apache/incubator-gluten/actions/runs/19399179485/job/55503865154?pr=11095
There are failed queries.
Query q1 failed by error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 28.0 failed 1 times, most recent failure: Lost task 2.0 in stage 28.0 (TID 34) (ea04227c6a83 executor driver): org.apache.gluten.exception.GlutenException: Error during calling Java code from native code: org.apache.celeborn.common.exception.CelebornIOException: Register shuffle failed for shuffle 1, reason: RESERVE_SLOTS_FAILED
at org.apache.celeborn.client.ShuffleClientImpl.registerShuffleInternal(ShuffleClientImpl.java:746)
at org.apache.celeborn.client.ShuffleClientImpl.registerShuffle(ShuffleClientImpl.java:547)
at org.apache.celeborn.client.ShuffleClientImpl.lambda$getPartitionLocation$4(ShuffleClientImpl.java:609)
at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
at org.apache.celeborn.common.util.JavaUtils$ConcurrentHashMapForJDK8.computeIfAbsent(JavaUtils.java:492)
at org.apache.celeborn.client.ShuffleClientImpl.getPartitionLocation(ShuffleClientImpl.java:605)
at org.apache.celeborn.client.ShuffleClientImpl.pushOrMergeData(ShuffleClientImpl.java:970)
at org.apache.celeborn.client.ShuffleClientImpl.mergeData(ShuffleClientImpl.java:1362)
at org.apache.spark.shuffle.CelebornPartitionPusher.pushPartitionData(CelebornPartitionPusher.scala:61)
at org.apache.gluten.vectorized.ShuffleWriterJniWrapper.stop(Native Method)
at org.apache.spark.shuffle.VeloxCelebornColumnarShuffleWriter.internalWrite(VeloxCelebornColumnarShuffleWriter.scala:100)
at org.apache.spark.shuffle.CelebornColumnarShuffleWriter.write(CelebornColumnarShuffleWriter.scala:113)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Gluten version
No response
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs