[SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan#33955
[SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.Nan#33955AngersZhuuuu wants to merge 24 commits intoapache:masterfrom
Conversation
|
ping @cloud-fan |
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
Outdated
Show resolved
Hide resolved
|
|
||
| private var containsNull = false | ||
| private var containsDoubleNaN = false | ||
| private var containsFloatNaN = false |
There was a problem hiding this comment.
The data added to this set will always be the same data type. I think we can just have a single containsNaN flag.
There was a problem hiding this comment.
The data added to this set will always be the same data type. I think we can just have a single
containsNaNflag.
I have thought about this too, but since it can support any type, so keep this may be better?
There was a problem hiding this comment.
Maybe we should do the null/nan check at the caller side
class SQLOpenHashSet ... {
def add(k: T)
def addNull()
def addNaN()
}
// caller side
if (row.isNullAt...) {
set.addNull()
} else {
...
if (java.lang.Double.isNaN(value)) {
set.addNaN()
}
}
There was a problem hiding this comment.
How about current
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
Outdated
Show resolved
Hide resolved
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
Outdated
Show resolved
Hide resolved
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
Outdated
Show resolved
Hide resolved
|
Test build #143142 has finished for PR 33955 at commit
|
|
Test build #143152 has finished for PR 33955 at commit
|
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
|
Test build #143165 has finished for PR 33955 at commit
|
|
ping @cloud-fan |
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
|
Test build #143182 has finished for PR 33955 at commit
|
|
Test build #143186 has finished for PR 33955 at commit
|
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
…rsZhuuuu/spark into SPARK-36702-WrapOpenHashSet
|
Kubernetes integration test unable to build dist. exiting with code: 1 |
...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
Outdated
Show resolved
Hide resolved
|
@HyukjinKwon This commit pass the check ac8bce8 https://github.com/apache/spark/pull/34006/commits |
|
I fetched the latest master and the test passed on my side. |
|
Thanks guys.This is possibly flaky. I'll keep my eyes on the build. |
|
I still see this test failure, see https://github.com/apache/spark/runs/3628995384. Shall we revert this PR? |
|
Actually there are some more: https://github.com/apache/spark/runs/3619357249 |
|
This is so weird. There is no randomness in the test. How frequently do we see the test failure? |
…lder in array functions ### What changes were proposed in this pull request? In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element. ### Why are the changes needed? Fix a potential bug. Somehow we can hit this bug sometimes after #33955 . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #34029 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…lder in array functions ### What changes were proposed in this pull request? In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element. ### Why are the changes needed? Fix a potential bug. Somehow we can hit this bug sometimes after #33955 . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #34029 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4145498) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…lder in array functions ### What changes were proposed in this pull request? In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element. ### Why are the changes needed? Fix a potential bug. Somehow we can hit this bug sometimes after #33955 . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes #34029 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4145498) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…lder in array functions In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element. Fix a potential bug. Somehow we can hit this bug sometimes after #33955 . No existing tests Closes #34029 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4145498) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…at.Nan
### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayDistinct won't show duplicated `NaN` value
### How was this patch tested?
Added UT
Closes #33993 from AngersZhuuuu/SPARK-36741.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…at.Nan
### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayDistinct won't show duplicated `NaN` value
### How was this patch tested?
Added UT
Closes #33993 from AngersZhuuuu/SPARK-36741.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e356f6a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…at.Nan
### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayDistinct won't show duplicated `NaN` value
### How was this patch tested?
Added UT
Closes #33993 from AngersZhuuuu/SPARK-36741.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e356f6a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…at.Nan
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
Fix bug
ArrayDistinct won't show duplicated `NaN` value
Added UT
Closes #33993 from AngersZhuuuu/SPARK-36741.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e356f6a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…oat.NaN
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayIntersect won't show equal `NaN` value
### How was this patch tested?
Added UT
Closes #33995 from AngersZhuuuu/SPARK-36754.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…oat.NaN
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayIntersect won't show equal `NaN` value
### How was this patch tested?
Added UT
Closes #33995 from AngersZhuuuu/SPARK-36754.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2fc7f2f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…oat.NaN
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayIntersect won't show equal `NaN` value
### How was this patch tested?
Added UT
Closes #33995 from AngersZhuuuu/SPARK-36754.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2fc7f2f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…oat.NaN
### What changes were proposed in this pull request?
For query
```
select array_intersect(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN], but it should return [].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayIntersect won't show equal `NaN` value
### How was this patch tested?
Added UT
Closes #33995 from AngersZhuuuu/SPARK-36754.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2fc7f2f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
….NaN
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayExcept won't show handle equal `NaN` value
### How was this patch tested?
Added UT
Closes #33994 from AngersZhuuuu/SPARK-36753.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
….NaN
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayExcept won't show handle equal `NaN` value
### How was this patch tested?
Added UT
Closes #33994 from AngersZhuuuu/SPARK-36753.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a7cbe69)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
….NaN
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayExcept won't show handle equal `NaN` value
### How was this patch tested?
Added UT
Closes #33994 from AngersZhuuuu/SPARK-36753.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a7cbe69)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
….NaN
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on #33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayExcept won't show handle equal `NaN` value
### How was this patch tested?
Added UT
Closes #33994 from AngersZhuuuu/SPARK-36753.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a7cbe69)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
….NaN
### What changes were proposed in this pull request?
For query
```
select array_except(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [NaN, 1d], but it should return [1d].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on apache/spark#33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayExcept won't show handle equal `NaN` value
### How was this patch tested?
Added UT
Closes #33994 from AngersZhuuuu/SPARK-36753.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
For query
```
select array_union(array(cast('nan' as double), cast('nan' as double)), array())
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr we add a wrap for OpenHashSet that can handle `null`, `Double.NaN`, `Float.NaN` together
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayUnion won't show duplicated `NaN` value
### How was this patch tested?
Added UT
Closes apache#33955 from AngersZhuuuu/SPARK-36702-WrapOpenHashSet.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f71f377)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…and Float.NaN ### What changes were proposed in this pull request? According to apache#33955 (comment) use normalized NaN ### Why are the changes needed? Use normalized NaN for duplicated NaN value ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Exiting UT Closes apache#34003 from AngersZhuuuu/SPARK-36702-FOLLOWUP. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 6380859) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
…lder in array functions ### What changes were proposed in this pull request? In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element. ### Why are the changes needed? Fix a potential bug. Somehow we can hit this bug sometimes after apache#33955 . ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes apache#34029 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit 4145498) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
…at.Nan
### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` too.
In this pr fix this based on apache#33955
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
ArrayDistinct won't show duplicated `NaN` value
### How was this patch tested?
Added UT
Closes apache#33993 from AngersZhuuuu/SPARK-36741.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e356f6a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
For query
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by
OpenHashSetcan't handleDouble.NaNandFloat.NaNtoo.In this pr we add a wrap for OpenHashSet that can handle
null,Double.NaN,Float.NaNtogetherWhy are the changes needed?
Fix bug
Does this PR introduce any user-facing change?
ArrayUnion won't show duplicated
NaNvalueHow was this patch tested?
Added UT