[GLUTEN-11088][VL] fix bloomFilter in GlutenDataFrameStatSuite #11211
marin-ma merged 2 commits into apache:main
|
I suppose this scenario is similar to ANSI mode and should be categorised as a limitation of the Spark 4.0 support.
Perhaps we should document it somewhere. @zhouyuan @jinchengchenghh Any suggestions? |
|
Can we offload df.stat.bloomFilter? |
|
Did you meet |
@jinchengchenghh It's being offloaded in spark 4.0 after SPARK-45565. The test failure is caused by the hard-coded Besides, this test case also tests unsupported API |
@zhztheplayer I haven't run into this error, but there are indeed some unimplemented APIs in |
|
I don't know the code path that calls BloomFilter.readFrom. Can we update the plan, or find some other way, to offload df.stat.bloomFilter to native? Then it could call VeloxBloomFilter, and we can implement the unimplemented functions in it. |
|
@jinchengchenghh The bloom filter agg here is being offloaded in spark 4.0, but not in spark 3.x. |
|
I think there are two options to solve this issue:
|
|
Thanks. @zhztheplayer @jinchengchenghh Looks like 1 is the most feasible solution before we implement 2. Let me try to address 1 in this PR. And we still need to figure out how to address the hardcoded |
|
Maybe all the DataFrame stat functions have a similar issue, since the information they return is in native format. We could supply a utility class to enhance the current df.stat.xxx |
|
@jinchengchenghh If I understand it correctly, it requires users to modify the code to use Gluten's implementation? In this case directly call |
|
yes |
|
Run Gluten Clickhouse CI on x86 |
Due to the changes in apache/spark#43391, the bloom filter stat function is offloaded into Gluten when building the bloom filter. However, df.stat.bloomFilter actually hard-codes a call to BloomFilterImpl.readFrom, which is Spark's bloom filter implementation, to decode the bloom filter binary.
To use df.stat.bloomFilter, users should explicitly set spark.gluten.sql.native.bloomFilter=false.
Related issue: #11088
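For illustration, here is a minimal sketch of the workaround described above. It assumes a Spark session in which Gluten is already enabled through the usual deployment settings (not shown here), and it assumes the key can be supplied when the session is built. The object name, app name, column name, and sizes are hypothetical; only the configuration key and df.stat.bloomFilter come from this PR.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.BloomFilter

// Hypothetical example object; not part of the PR.
object BloomFilterStatExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bloom-filter-stat-example")
      .master("local[1]")
      // Key and value taken from the PR description: keep the bloom filter
      // aggregate on the Spark side so that the binary handed to
      // BloomFilterImpl.readFrom is in Spark's own format.
      // Gluten's own plugin settings are assumed to be configured elsewhere.
      .config("spark.gluten.sql.native.bloomFilter", "false")
      .getOrCreate()

    // A single long column named "id" with values 0 until 1000.
    val df = spark.range(0, 1000).toDF("id")

    // Standard Spark API: bloomFilter(colName, expectedNumItems, fpp).
    val filter: BloomFilter = df.stat.bloomFilter("id", 1000L, 0.03)

    // Bloom filters have no false negatives, so an inserted value must match.
    assert(filter.mightContainLong(500L))

    spark.stop()
  }
}
```

Whether the key can also be changed on a running session is not stated in this thread, so the sketch sets it at session creation.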