[SPARK-45599][CORE] Use object equality in OpenHashSet #45036
nchammas wants to merge 19 commits into apache:master from nchammas:SPARK-45599-plus-and-minus-zero
Conversation
Build looks good. @revans2 and @peter-toth - What do you think? (Tagging you since I noticed you are watching the linked JIRA ticket.)
Yes, I was watching this ticket because it is a correctness issue. cc @cloud-fan and @HeartSaVioR as it would be great to include the fix in 3.5.1.
I have no expertise in this area, so I will leave the decisions on which version(s) to merge the change into, and on whether we want to safeguard the behavioral change behind an escape hatch, to others. I can hold cutting the tag of RC1 for this.
```scala
test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
```
This is a bit tricky and it's better if we can find a reference system that defines this semantic. In Spark, 0.0 == -0.0, and in GROUP BY, 0.0 and -0.0 are considered to be in the same group and normalized to 0.0.
Yes, it probably makes sense to just fix this particular issue, as OpenHashSet is used in many other places.
> This is a bit tricky and it's better if we can find a reference system that defines this semantic.
```scala
scala> import java.util.HashSet
import java.util.HashSet

scala> val h = new HashSet[Double]()
val h: java.util.HashSet[Double] = []

scala> h.add(0.0)
val res0: Boolean = true

scala> h.add(-0.0)
val res1: Boolean = true

scala> h.size()
val res2: Int = 2
```

The doc for `HashSet.add` states:
> More formally, adds the specified element e to this set if this set contains no element e2 such that Objects.equals(e, e2). If this set already contains the element, the call leaves the set unchanged and returns false.
In other words, `java.util.HashSet` uses `equals` and not `==`, and therefore it considers 0.0 and -0.0 distinct elements.
So this PR brings `OpenHashSet` more in line with the semantics of `java.util.HashSet`.
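To make the contrast concrete, here is a small self-contained sketch (plain Scala, no Spark) of the two equality notions in play. `java.util.Objects.equals` is the check that `HashSet` documents itself against:

```scala
import java.util.Objects

object EqualitySemantics {
  def main(args: Array[String]): Unit = {
    // Cooperative (primitive) equality: IEEE 754 semantics.
    println(0.0 == -0.0)               // true
    println(Double.NaN == Double.NaN)  // false

    // Object equality on boxed values, as used by java.util.HashSet.
    println(Objects.equals(Double.box(0.0), Double.box(-0.0)))              // false
    println(Objects.equals(Double.box(Double.NaN), Double.box(Double.NaN))) // true
  }
}
```

Boxed `java.lang.Double` compares via `doubleToLongBits`, which is why the two notions disagree on exactly these values.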
> In Spark, 0.0 == -0.0, and in GROUP BY, 0.0 and -0.0 are considered to be in the same group and normalized to 0.0.
This PR does not change this behavior. I noticed, however, that we do not have any tests currently to check that -0.0 is normalized and grouped as you describe, so I went ahead and added such a test in 2bfc605.
Does this address your concern? Or are you suggesting that we should normalize -0.0 to 0.0 across the board?
Consider another interesting case where java.util.HashSet and OpenHashSet differ:
```scala
scala> val h = new HashSet[Double]()
val h: java.util.HashSet[Double] = []

scala> h.add(Double.NaN)
val res9: Boolean = true

scala> h.add(Double.NaN)
val res10: Boolean = false

scala> h.contains(Double.NaN)
val res11: Boolean = true

scala> h.size()
val res12: Int = 1
```

On master, `OpenHashSet` does something obviously wrong:

```scala
val set = new OpenHashSet[Double]()
set.add(Double.NaN)
set.add(Double.NaN)
set.size                 // returns 2
set.contains(Double.NaN) // returns false
```

This could possibly lead to a bug like the one reported in SPARK-45599 but in reverse, where a new NaN row is added rather than dropped. I will see if I can construct such a scenario as a demonstration. But regardless, I think this behavior is incorrect by itself.
Note also that the docstring for OpenHashSet seems to imply that it is meant to be a faster but semantically equivalent alternative to java.util.HashSet:
If that's true, then we should perhaps add property based tests to ensure alignment between the two implementations, but I'll leave that as a potential future improvement.
Spark's OpenHashSet does not have to match java.util.HashSet. What matters is the SQL semantic. Can you highlight which functions/operators are using this OpenHashSet and what is the impact of this change to the SQL semantic?
> Spark's `OpenHashSet` does not have to match `java.util.HashSet`. What matters is the SQL semantic.
Whether or not OpenHashSet matches java.util.HashSet, I want to emphasize for the record that OpenHashSet mishandles 0.0/-0.0 and NaN. Its behavior is simply incorrect. These tests fail on master in ways that can only be described as bugs, regardless of whatever SQL semantics we want to preserve.
This comment explains the root cause. Basically, it is a mistake to combine hash code-based lookups with cooperative equality, at least in the way we are doing it in OpenHashSet.
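To sketch that root cause (a hypothetical illustration, not Spark's actual hashing code): the hash-table invariant "keys that compare equal must hash equal" breaks down under cooperative equality, and NaN additionally fails to match itself:

```scala
object BrokenInvariant {
  // Hash of the IEEE 754 bit pattern, as java.lang.Double.hashCode defines it.
  def bitsHash(d: Double): Int = java.lang.Double.hashCode(d)

  def main(args: Array[String]): Unit = {
    // Cooperative equality says 0.0 and -0.0 are equal...
    println(0.0 == -0.0)                                  // true
    // ...but their bit-pattern hashes differ, so a lookup can probe the wrong slot.
    println(bitsHash(0.0) == bitsHash(-0.0))              // false
    // NaN hashes consistently, yet a probe comparing keys with == can never match it.
    println(bitsHash(Double.NaN) == bitsHash(Double.NaN)) // true
    println(Double.NaN == Double.NaN)                     // false
  }
}
```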
But I understand what you are saying. Fixing bugs in OpenHashSet doesn't help us if it also breaks users' SQL.
> Can you highlight which functions/operators are using this `OpenHashSet`
I've updated the PR description with a summary of what uses OpenHashSet.
As a side note, I believe that if we accept the change proposed here, we should be able to eliminate SQLOpenHashSet. SQLOpenHashSet was created specifically to work around the bugs in OpenHashSet that we are addressing in this PR. See #33955 and #33993.
> and what is the impact of this change to the SQL semantic?
I've updated the PR description with a diff of what tests pass or fail on master vs. this branch. Please take a look and let me know if you think we need any more tests. I know we are touching a sensitive code path and I appreciate the need for caution.
Thanks @HeartSaVioR. BTW I don't think this should be a blocker for 3.5.1 as this is not a regression, but if we can find a quick fix for this issue it would be good to include it in the release.
core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala
```scala
/**
 * Check if a key exists at the provided position using object equality rather than
 * cooperative equality. Otherwise, hash sets will mishandle values for which `==`
 * and `equals` return different results, like 0.0/-0.0 and NaN/NaN.
```
I was told that in Scala `==` is the same as `equals`, but `eq` is a different operator. I need to refresh my knowledge now :)
Yes, the differences are subtle:
```scala
scala> 0.0 == -0.0
val res0: Boolean = true

scala> 0.0 equals -0.0
val res1: Boolean = false

scala> Double.NaN == Double.NaN
val res2: Boolean = false

scala> Double.NaN equals Double.NaN
val res3: Boolean = true
```

There is a long discussion on the Scala forums from 2017 about this difference and some of the problems it causes:
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala
```scala
set.add(Double.NaN)
set.add(Double.NaN)
assert(set.contains(Double.NaN))
assert(set.size == 1)
```
do we actually waste space in OpenHashSet to store all the NaN values?
Yes sir. On master, this is the actual behavior of `OpenHashSet`:

```scala
// ...OpenHashSet$mcD$sp@21b327e6 did not contain NaN
assert(set.contains(Double.NaN))
// ...OpenHashSet$mcD$sp@1f09db1e had size 2 instead of expected size 1
assert(set.size == 1)
```

Every NaN will get its own entry in `OpenHashSet` on master. So if we add 1,000,000 NaNs to the set, NaN will have 1,000,000 entries in there. And `.contains()` will still return false. :D
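The slot-per-NaN behavior is easy to reproduce with a toy open-addressing table (a hypothetical sketch, not Spark's actual `OpenHashSet` code): because `data(pos) == k` is false whenever `k` is NaN, the probe loop never recognizes a previously stored NaN and always claims a fresh slot.

```scala
object NaNProbe {
  val capacity = 8
  val data = new Array[Double](capacity)
  val occupied = new Array[Boolean](capacity)

  def add(k: Double): Unit = {
    var pos = java.lang.Double.hashCode(k) & (capacity - 1)
    // `data(pos) == k` never holds when k is NaN, so the probe walks past every slot.
    while (occupied(pos) && !(data(pos) == k)) pos = (pos + 1) & (capacity - 1)
    data(pos) = k
    occupied(pos) = true
  }

  def size: Int = occupied.count(identity)

  def main(args: Array[String]): Unit = {
    add(Double.NaN)
    add(Double.NaN)
    add(Double.NaN)
    println(size) // 3: one slot per NaN inserted
  }
}
```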
```scala
test("SPARK-45599: 0.0 and -0.0 should count distinctly") {
  // Exactly these elements provided in roughly this order trigger a condition where lookups of
  // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
  // and inconsistently if `==` is used to check for key equality.
```
shall we mention the NaN behavior as well? All NaN values are the same.
I tweaked the test name. Is that what you had in mind?
This comment explains why we need exactly the following elements to trigger the 0.0/-0.0 miscount. It doesn't always happen (which is part of what kept this bug hidden for so long).
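For intuition, the collision-then-merge can be reproduced in a toy open-addressing table (a hypothetical sketch, not Spark's actual code): when 0.0 and -0.0 happen to probe from the same slot, a `==` key check merges them, while object equality keeps them distinct.

```scala
object DistinctCount {
  // Insert keys into a tiny linear-probing table, counting distinct entries
  // according to the supplied equality check.
  def countDistinct(eq: (Double, Double) => Boolean, keys: Seq[Double]): Int = {
    val capacity = 8
    val data = new Array[Double](capacity)
    val occupied = new Array[Boolean](capacity)
    var n = 0
    for (k <- keys) {
      var pos = java.lang.Double.hashCode(k) & (capacity - 1)
      while (occupied(pos) && !eq(data(pos), k)) pos = (pos + 1) & (capacity - 1)
      if (!occupied(pos)) { data(pos) = k; occupied(pos) = true; n += 1 }
    }
    n
  }

  def main(args: Array[String]): Unit = {
    // In this toy table both keys probe from the same starting slot.
    println(countDistinct(_ == _, Seq(0.0, -0.0)))      // 1: merged by cooperative equality
    println(countDistinct(_.equals(_), Seq(0.0, -0.0))) // 2: kept apart by object equality
  }
}
```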
@revans2 - Just to confirm, have you rerun your tests against this branch (the ones that exposed this bug) and do they pass now?
If we go this direction and change …
Yes, you are right. Probably the …
Is there anything more we're waiting on here? (Perhaps just an extra confirmation from @revans2 that the original fuzz test he ran now passes?)
thanks, merging to master/3.5!
### What changes were proposed in this pull request?

Change `OpenHashSet` to use object equality instead of cooperative equality when looking up keys.

### Why are the changes needed?

This brings the behavior of `OpenHashSet` more in line with the semantics of `java.util.HashSet`, and fixes its behavior when comparing values for which `equals` and `==` return different results, like 0.0/-0.0 and NaN/NaN.

For example, in certain cases where both 0.0 and -0.0 are provided as keys to the set, lookups of one or the other key may return the [wrong position][wrong] in the set. This leads to the bug described in SPARK-45599 and summarized in [this comment][1].

[wrong]: https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R277-R283
[1]: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954

### Does this PR introduce _any_ user-facing change?

Yes, it resolves the bug described in SPARK-45599. `OpenHashSet` is used widely under the hood, including in:

- `OpenHashMap`, which itself backs:
  - `TypedImperativeAggregate`
  - aggregate functions like `percentile` and `mode`
  - many algorithms in ML and MLlib
- `SQLOpenHashSet`, which backs array functions like `array_union` and `array_distinct`

However, the user-facing changes should be limited to the kind of edge case described in SPARK-45599.

### How was this patch tested?

New and existing unit tests. Of the new tests added in this PR, some simply validate that we have not changed existing SQL semantics, while others confirm that we have fixed the specific bug reported in SPARK-45599 along with any related incorrect behavior.

New tests failing on `master` but passing on this branch:

- [Handling 0.0 and -0.0 in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R273)
- [Handling NaN in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R302)
- [Handling 0.0 and -0.0 in `OpenHashMap`](https://github.com/apache/spark/pull/45036/files#diff-09400ec633b1f1322c5f7b39dc4e87073b0b0435b60b9cff93388053be5083b6R253)
- [Handling 0.0 and -0.0 when computing percentile](https://github.com/apache/spark/pull/45036/files#diff-bd3d5c79ede5675f4bf10d2efb313db893d57443d6d6d67b1f8766e6ce741271R1092)

New tests passing both on `master` and this branch:

- [Handling 0.0, -0.0, and NaN in `array_union`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R793)
- [Handling 0.0, -0.0, and NaN in `array_distinct`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R801)
- [Handling 0.0, -0.0, and NaN in `GROUP BY`](https://github.com/apache/spark/pull/45036/files#diff-496edb8b03201f078c3772ca81f7c7f80002acc11dff00b1d06d288b87855264R1107)
- [Normalizing -0 and -0.0](https://github.com/apache/spark/pull/45036/files#diff-4bdd04d06a2d88049dd5c8a67715c5566dd68a1c4ebffc689dc74b6b2e0b3b04R782)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45036 from nchammas/SPARK-45599-plus-and-minus-zero.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fcc5dbc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Hi, @cloud-fan and @nchammas. This broke Apache Spark branch-3.5.
Sorry but I reverted this from `branch-3.5`. Please make a backporting PR to branch-3.5 again after fixing the above annotation compilation issue. Thank you in advance.
@dongjoon-hyun - I think this warning category was added in Scala 2.13 and then backported to Scala 2.12. Trying to confirm right now over on scala/scala#8120 and scala/scala#10446. I see that Spark 3.5 supports both Scala 2.12 and 2.13, so I will figure out some method that is compatible across both versions and open a PR against `branch-3.5`.
Backport to 3.5: #45296
### What changes were proposed in this pull request?

This is a backport of fcc5dbc / #45036 with a tweak so that it works on Scala 2.12.

### Why are the changes needed?

This is a correctness bug fix. The original fix against `master` suppresses a warning category that doesn't exist on certain versions of Scala 2.13 and 2.12, and the exact versions are [not documented anywhere][1]. To be safe, this backport simply suppresses all warnings instead of just `other-non-cooperative-equals`. It would be interesting to see if `-Wconf:nowarn` complains, since the warning about non-cooperative equals itself is also not thrown on all versions of Scala, but I don't think that's a priority.

[1]: scala/scala#8120 (comment)

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

CI + added tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45296 from nchammas/SPARK-45599-OpenHashSet.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>