
[GLUTEN-11088][VL] fix bloomFilter in GlutenDataFrameStatSuite #11211

Merged: marin-ma merged 2 commits into apache:main from marin-ma:GlutenDataFrameStatSuite on Nov 28, 2025

Conversation

@marin-ma
Contributor

@marin-ma marin-ma commented Nov 27, 2025

Due to the changes in apache/spark#43391, the bloom filter stat function is offloaded to Gluten when building the bloom filter. However, df.stat.bloomFilter hard-codes a call to BloomFilterImpl.readFrom, which is Spark's own bloom filter implementation, to decode the bloom filter binary.

To use df.stat.bloomFilter, users should explicitly set spark.gluten.sql.native.bloomFilter=false
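
A minimal sketch of that workaround, assuming the Gluten plugin itself is already configured for the session and that the flag is honored when passed through the session builder. The config key is the one named above; the app name, data, column name, and sizing arguments are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

object BloomFilterWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bloom-filter-workaround") // illustrative name
      // Key taken from this PR: keep bloom filter building on the Spark side.
      .config("spark.gluten.sql.native.bloomFilter", "false")
      .getOrCreate()

    val df = spark.range(0, 1000).toDF("id")

    // expectedNumItems = 1000 and fpp = 0.03 are illustrative values.
    val filter = df.stat.bloomFilter("id", 1000L, 0.03)
    println(filter.mightContainLong(42L))

    spark.stop()
  }
}
```

The same key could presumably also be passed with --conf at submit time; whether it can be toggled on a running session is not stated in this thread.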

Related issue: #11088

@github-actions github-actions bot added the CORE works for Gluten Core label Nov 27, 2025
@marin-ma
Contributor Author

I suppose this scenario is similar to ANSI mode, and should be categorised as a limitation of the Spark 4.0 support.

To use df.stat.bloomFilter, users should explicitly set spark.gluten.sql.native.bloomFilter=false

Perhaps we should document it somewhere. @zhouyuan @jinchengchenghh Any suggestions?

@jinchengchenghh
Contributor

Can we offload df.stat.bloomFilter?

@zhztheplayer
Member

Did you run into "Not yet implemented" errors?

@marin-ma
Contributor Author

Can we offload df.stat.bloomFilter?

@jinchengchenghh It has been offloaded in Spark 4.0 since SPARK-45565. The test failure is caused by the hard-coded BloomFilter.readFrom in Spark's code: https://github.com/apache/spark/blob/v4.0.1/sql/api/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L583

Besides, this test case also exercises the expectedFpp and bitSize APIs, which are unsupported in VeloxBloomFilter: https://github.com/apache/spark/blob/v4.0.1/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala#L529
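
For reference, a small sketch of what the Spark side expects, using only Spark's own org.apache.spark.util.sketch.BloomFilter (it needs the spark-sketch artifact on the classpath). BloomFilter.readFrom decodes Spark's serialization format, and the decoded filter then answers bitSize and expectedFpp, the two calls the test relies on. The object and method names here are made up for illustration and are not part of Gluten or of the test suite.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import org.apache.spark.util.sketch.BloomFilter

object SketchRoundTrip {
  // Serialize with Spark's writer, then decode with the same readFrom
  // that df.stat.bloomFilter is described as calling in this thread.
  def roundTrip(filter: BloomFilter): BloomFilter = {
    val out = new ByteArrayOutputStream()
    filter.writeTo(out)
    BloomFilter.readFrom(new ByteArrayInputStream(out.toByteArray))
  }

  def main(args: Array[String]): Unit = {
    // 1000 expected items and 3% false-positive rate are illustrative.
    val original = BloomFilter.create(1000L, 0.03)
    original.putLong(42L)

    val decoded = roundTrip(original)
    println(decoded.mightContainLong(42L))
    println(decoded.bitSize() == original.bitSize())
    println(decoded.expectedFpp())
  }
}
```

A binary produced by a different implementation would have to follow the same layout for this readFrom call to work, which is the mismatch discussed above.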

@marin-ma
Contributor Author

Did you run into "Not yet implemented" errors?

@zhztheplayer I haven't run into this error, but there are indeed some unimplemented APIs in VeloxBloomFilter, meaning we cannot use VeloxBloomFilter for the stat function.

@jinchengchenghh
Contributor

I'm not familiar with the code path that calls BloomFilter.readFrom. Can we update the plan, or find some other way, to offload df.stat.bloomFilter to native so that it calls VeloxBloomFilter? We can then implement the unimplemented functions in it.

@marin-ma
Contributor Author

marin-ma commented Nov 27, 2025

@jinchengchenghh The bloom filter aggregation here is offloaded in Spark 4.0, but not in Spark 3.x. BloomFilter.readFrom is not part of the plan.

@zhztheplayer
Member

I think there are two options to solve this issue:

  1. Extend the CallerInfo utility to selectively offload the bloom filter expression.
  2. (Perhaps eventually) Implement the unimplemented APIs in VeloxBloomFilter.
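
Option 1 could look roughly like the sketch below. This is not Gluten's actual CallerInfo API; it only assumes that the decision can be made by checking whether a DataFrameStatFunctions frame is on the current call stack. All names are hypothetical.

```scala
object CallerCheckSketch {
  // Hypothetical marker: any frame whose class name contains this string
  // is treated as a df.stat.xxx call site.
  private val StatFunctionsMarker = "DataFrameStatFunctions"

  def calledFrom(marker: String, stack: Array[StackTraceElement]): Boolean =
    stack.exists(_.getClassName.contains(marker))

  // Offload the bloom filter aggregate unless the query was issued
  // through df.stat, where Spark decodes the result itself.
  def shouldOffloadBloomFilter(
      stack: Array[StackTraceElement] = Thread.currentThread().getStackTrace()): Boolean =
    !calledFrom(StatFunctionsMarker, stack)
}
```

With a check like this, queries that use the bloom filter aggregate outside df.stat would still be offloaded, while df.stat.bloomFilter would fall back to Spark.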

@marin-ma
Contributor Author

marin-ma commented Nov 27, 2025

Thanks. @zhztheplayer @jinchengchenghh

Looks like 1 is the most feasible solution before we implement 2. Let me try to address 1 in this PR.

And we still need to figure out how to address the hardcoded BloomFilter.readFrom in Spark.

@marin-ma marin-ma marked this pull request as draft November 27, 2025 15:31
@jinchengchenghh
Contributor

jinchengchenghh commented Nov 27, 2025

Maybe all the DataFrame stat functions have a similar issue, since the information involved is native information. We could supply a utility class to enhance the current df.stat.xxx.

✅ Option 1 (recommended): extend df.stat with an implicit class

No need to modify the Spark source code, and it does not break the API.

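import org.apache.spark.sql.DataFrame

// Plain wrapper rather than a subclass: in Spark 3.x, DataFrameStatFunctions
// is final with a private[sql] constructor, so it cannot be extended.
class GlutenDataFrameStatFunctions(df: DataFrame) {

  def glutenApproxQuantile(cols: Seq[String]): DataFrame = {
    // your gluten implementation
    df // placeholder: return something
  }
}

object GlutenStatImplicits {
  // Adds df.glutenStat to any DataFrame once GlutenStatImplicits._ is imported.
  implicit class GlutenStatOps(df: DataFrame) {
    def glutenStat: GlutenDataFrameStatFunctions =
      new GlutenDataFrameStatFunctions(df)
  }
}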


Usage:

import GlutenStatImplicits._

df.glutenStat.glutenApproxQuantile(Seq("col1"))


✔ Does not modify Spark
✔ Compatible across Spark versions
✔ Can incorporate any Gluten optimization

@marin-ma marin-ma changed the title from "[GLUTEN-11088][VL] Disable bloomFilter in GlutenDataFrameStatSuite" to "[GLUTEN-11088][VL] fix bloomFilter in GlutenDataFrameStatSuite" Nov 27, 2025
@marin-ma
Contributor Author

marin-ma commented Nov 27, 2025

@jinchengchenghh If I understand correctly, this requires users to modify their code to use Gluten's implementation? In that case, directly calling df.stat.xxx with Gluten would still fail.

@marin-ma marin-ma marked this pull request as ready for review November 27, 2025 15:58
@github-actions github-actions bot added the VELOX label Nov 27, 2025
@jinchengchenghh
Contributor

yes

Contributor

@jinchengchenghh jinchengchenghh left a comment


Looks good

@marin-ma
Contributor Author

Run Gluten Clickhouse CI on x86

@marin-ma marin-ma merged commit a54a803 into apache:main Nov 28, 2025
59 of 60 checks passed

Labels: CORE (works for Gluten Core), VELOX
