[GLUTEN-11088][VL] fix `bloomFilter` in GlutenDataFrameStatSuite by marin-ma · Pull Request #11211 · apache/gluten

marin-ma · 2025-11-27T14:44:17Z

Due to the changes in apache/spark#43391, building the bloom filter for the stat function is now offloaded to Gluten. However, df.stat.bloomFilter hard-codes a call to BloomFilterImpl.readFrom, Spark's own bloom filter implementation, to decode the bloom filter binary.

To use df.stat.bloomFilter, users should explicitly set spark.gluten.sql.native.bloomFilter=false
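
As a minimal sketch of that workaround, assuming the job is launched through spark-submit with the Gluten plugin already configured elsewhere: the object name, column name, row count, and filter parameters below are illustrative only, and whether the flag can also be changed on a running session is not confirmed here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.BloomFilter

object BloomFilterWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bloom-filter-workaround")
      // Assumption: the Gluten plugin itself is enabled via spark-defaults.conf or --conf.
      // This flag keeps the bloom filter on Spark's own implementation so that
      // df.stat.bloomFilter can decode the resulting binary.
      .config("spark.gluten.sql.native.bloomFilter", "false")
      .getOrCreate()

    // Illustrative data: a single Long column named "id".
    val df = spark.range(0, 1000).toDF("id")

    // expectedNumItems = 1000, fpp = 0.03 are arbitrary example values.
    val filter: BloomFilter = df.stat.bloomFilter("id", 1000L, 0.03)
    println(filter.mightContainLong(42L))

    spark.stop()
  }
}
```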

Related issue: #11088

marin-ma · 2025-11-27T14:47:19Z

I suppose this scenario is similar to ANSI mode and should be categorised as a limitation of the Spark 4.0 support.

To use df.stat.bloomFilter, users should explicitly set spark.gluten.sql.native.bloomFilter=false

Perhaps we should document it somewhere. @zhouyuan @jinchengchenghh Any suggestions?

jinchengchenghh · 2025-11-27T15:04:29Z

Can we offload df.stat.bloomFilter?

zhztheplayer · 2025-11-27T15:07:19Z

Did you run into Not yet implemented errors?

marin-ma · 2025-11-27T15:11:17Z

Can we offload df.stat.bloomFilter?

@jinchengchenghh It's being offloaded in Spark 4.0 after SPARK-45565. The test failure is caused by the hard-coded BloomFilter.readFrom in Spark's code: https://github.com/apache/spark/blob/v4.0.1/sql/api/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L583

Besides, this test case also exercises expectedFpp and bitSize, which are unsupported APIs in VeloxBloomFilter: https://github.com/apache/spark/blob/v4.0.1/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala#L529
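
The decoding failure can be illustrated with a toy sketch in plain Scala. The two byte layouts, the header constants, and every name below are hypothetical and do not reflect the real Spark or Velox serialization formats. The only point is that a decoder hard-wired to one layout rejects bytes produced by a different implementation.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream, IOException}

object HardCodedDecoderSketch {
  // Hypothetical header values; NOT the real Spark or Velox formats.
  val JvmFormatVersion: Int = 1
  val NativeFormatMagic: Int = 0x56424631

  // Serializes a header followed by the word count and the bit words.
  def encode(header: Int, bits: Array[Long]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeInt(header)
    out.writeInt(bits.length)
    bits.foreach(w => out.writeLong(w))
    out.flush()
    bytes.toByteArray
  }

  // A decoder hard-coded to the "JVM" layout: any other header is rejected.
  def readJvmFormat(data: Array[Byte]): Array[Long] = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    val version = in.readInt()
    if (version != JvmFormatVersion) {
      throw new IOException(s"Unexpected version number ($version)")
    }
    val numWords = in.readInt()
    Array.fill(numWords)(in.readLong())
  }
}
```

Bytes written with JvmFormatVersion round-trip through readJvmFormat, while bytes written with NativeFormatMagic raise an IOException, which mirrors the situation where the caller cannot choose the decoder.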

marin-ma · 2025-11-27T15:13:20Z

Did you run into Not yet implemented errors?

@zhztheplayer I haven't run into this error, but there are indeed some unimplemented APIs in VeloxBloomFilter, meaning we cannot use VeloxBloomFilter for the stat function.

jinchengchenghh · 2025-11-27T15:16:09Z

I don't know the code path that calls BloomFilter.readFrom. Can we update the plan, or find some other way, to offload df.stat.bloomFilter to native so that it calls VeloxBloomFilter? We could then implement the unimplemented functions in it.

marin-ma · 2025-11-27T15:19:35Z

@jinchengchenghh The bloom filter aggregation here is offloaded in Spark 4.0, but not in Spark 3.x. BloomFilter.readFrom is not part of the plan.

zhztheplayer · 2025-11-27T15:24:51Z

I think there are two options to solve this issue:

1. Extend the CallerInfo utility to selectively offload the bloom filter expression.
2. (Perhaps eventually) Implement the unimplemented APIs in VeloxBloomFilter.
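
One way to picture the first option is a stack-inspection check, sketched below in plain Scala. This is not Gluten's actual CallerInfo API: the object name, both method names, and the class-name prefix used to recognise the stat-function caller are assumptions for illustration.

```scala
object CallerInspectionSketch {
  // Returns true if any frame on the current thread's stack belongs to a
  // class whose fully qualified name starts with the given prefix.
  def calledFrom(classNamePrefix: String): Boolean =
    Thread.currentThread().getStackTrace.exists(_.getClassName.startsWith(classNamePrefix))

  // Hypothetical planner hook: keep the bloom filter aggregation on vanilla
  // Spark when the query was triggered from the DataFrame stat functions,
  // and offload it natively otherwise.
  // Assumption: the stat-function frames carry this class-name prefix.
  def shouldOffloadBloomFilterAgg(): Boolean =
    !calledFrom("org.apache.spark.sql.DataFrameStatFunctions")
}
```

With a check of this kind, the fallback applies only to plans built under df.stat.bloomFilter, so other bloom filter aggregations could still be offloaded.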

marin-ma · 2025-11-27T15:30:35Z

Thanks. @zhztheplayer @jinchengchenghh

Looks like 1 is the most feasible solution before we implement 2. Let me try to address 1 in this PR.

And we still need to figure out how to address the hardcoded BloomFilter.readFrom in Spark.

jinchengchenghh · 2025-11-27T15:40:24Z

Maybe all the DataFrame stat functions have a similar issue, since the information they return is native information. We could supply a utility class to enhance the current df.stat.xxx.

✅ Option 1 (recommended): extend df.stat with an implicit class

No changes to the Spark source are needed, and the API is not broken.

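import org.apache.spark.sql.DataFrame

// A plain wrapper rather than a subclass: in Spark 3.x DataFrameStatFunctions
// is final with a private[sql] constructor, so it cannot be extended like this.
class GlutenDataFrameStatFunctions(df: DataFrame) {

  def glutenApproxQuantile(cols: Seq[String]): DataFrame = {
    // your gluten implementation
    df // placeholder: returns the input unchanged
  }
}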

object GlutenStatImplicits {
  implicit class GlutenStatOps(df: DataFrame) {
    def glutenStat: GlutenDataFrameStatFunctions =
      new GlutenDataFrameStatFunctions(df)
  }
}


Usage:

import GlutenStatImplicits._

df.glutenStat.glutenApproxQuantile(Seq("col1"))


✔ No modifications to Spark
✔ Compatible across Spark versions
✔ Can incorporate any Gluten optimization

marin-ma · 2025-11-27T15:57:53Z

@jinchengchenghh If I understand it correctly, this requires users to modify their code to use Gluten's implementation? In that case, calling df.stat.xxx directly under Gluten would still fail.

jinchengchenghh · 2025-11-27T16:13:38Z

yes

jinchengchenghh

Looks good

marin-ma · 2025-11-27T17:58:36Z

Run Gluten Clickhouse CI on x86

cannot fix

c2c620d

github-actions bot added the CORE works for Gluten Core label Nov 27, 2025

marin-ma marked this pull request as draft November 27, 2025 15:31

marin-ma changed the title ~~[GLUTEN-11088][VL] Disable bloomFilter in GlutenDataFrameStatSuite~~ [GLUTEN-11088][VL] fix bloomFilter in GlutenDataFrameStatSuite Nov 27, 2025

fallback bloomfilter in CallerInfo

0af6d8d

marin-ma marked this pull request as ready for review November 27, 2025 15:58

github-actions bot added the VELOX label Nov 27, 2025

jinchengchenghh approved these changes Nov 27, 2025

View reviewed changes

marin-ma merged commit a54a803 into apache:main Nov 28, 2025
59 of 60 checks passed

marin-ma mentioned this pull request Dec 1, 2025

[VL] Track on Spark-4.0 failed unit tests #11088

Open
