[VL] Flaky Celeborn tests #11103

@zhouyuan

Description

Backend

VL (Velox)

Bug description

https://github.com/apache/incubator-gluten/actions/runs/19399179485/job/55503865154?pr=11095

There are failed queries.

Query q1 failed by error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 28.0 failed 1 times, most recent failure: Lost task 2.0 in stage 28.0 (TID 34) (ea04227c6a83 executor driver): org.apache.gluten.exception.GlutenException: Error during calling Java code from native code: org.apache.celeborn.common.exception.CelebornIOException: Register shuffle failed for shuffle 1, reason: RESERVE_SLOTS_FAILED
	at org.apache.celeborn.client.ShuffleClientImpl.registerShuffleInternal(ShuffleClientImpl.java:746)
	at org.apache.celeborn.client.ShuffleClientImpl.registerShuffle(ShuffleClientImpl.java:547)
	at org.apache.celeborn.client.ShuffleClientImpl.lambda$getPartitionLocation$4(ShuffleClientImpl.java:609)
	at java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1660)
	at org.apache.celeborn.common.util.JavaUtils$ConcurrentHashMapForJDK8.computeIfAbsent(JavaUtils.java:492)
	at org.apache.celeborn.client.ShuffleClientImpl.getPartitionLocation(ShuffleClientImpl.java:605)
	at org.apache.celeborn.client.ShuffleClientImpl.pushOrMergeData(ShuffleClientImpl.java:970)
	at org.apache.celeborn.client.ShuffleClientImpl.mergeData(ShuffleClientImpl.java:1362)
	at org.apache.spark.shuffle.CelebornPartitionPusher.pushPartitionData(CelebornPartitionPusher.scala:61)
	at org.apache.gluten.vectorized.ShuffleWriterJniWrapper.stop(Native Method)
	at org.apache.spark.shuffle.VeloxCelebornColumnarShuffleWriter.internalWrite(VeloxCelebornColumnarShuffleWriter.scala:100)
	at org.apache.spark.shuffle.CelebornColumnarShuffleWriter.write(CelebornColumnarShuffleWriter.scala:113)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Gluten version

No response

Spark version

None

Spark configurations

No response

System information

No response

Labels

bug, triage
