[SPARK-45599][CORE] Use object equality in OpenHashSet #45036
nchammas wants to merge 19 commits into apache:master from nchammas:SPARK-45599-plus-and-minus-zero
Conversation
Build looks good. @revans2 and @peter-toth - What do you think? (Tagging you since I noticed you are watching the linked JIRA ticket.)
Yes, I was watching this ticket because it is a correctness issue. cc @cloud-fan and @HeartSaVioR as it would be great to include the fix in 3.5.1.
I have no expertise in this area, so I will leave the decisions on which version(s) to merge the change into, and on whether we want to safeguard the behavioral change behind an escape hatch, to others. I can hold cutting the tag of RC1 for this.
```scala
test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
```
This is a bit tricky and it's better if we can find a reference system that defines this semantic. In Spark, 0.0 == -0.0, and in GROUP BY, 0.0 and -0.0 are considered to be in the same group and normalized to 0.0.
Yes, it probably makes sense to just fix this particular issue, as OpenHashSet is used in many other places.
> This is a bit tricky and it's better if we can find a reference system that defines this semantic.
```scala
scala> import java.util.HashSet
import java.util.HashSet

scala> val h = new HashSet[Double]()
val h: java.util.HashSet[Double] = []

scala> h.add(0.0)
val res0: Boolean = true

scala> h.add(-0.0)
val res1: Boolean = true

scala> h.size()
val res2: Int = 2
```

The doc for `HashSet.add` states:
> More formally, adds the specified element e to this set if this set contains no element e2 such that Objects.equals(e, e2). If this set already contains the element, the call leaves the set unchanged and returns false.
In other words, `java.util.HashSet` uses `equals` and not `==`, and therefore it considers 0.0 and -0.0 distinct elements.
So this PR brings `OpenHashSet` more in line with the semantics of `java.util.HashSet`.
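To make the contrast concrete, here is a small self-contained sketch (plain Scala, no Spark) of the two equality notions in play. `java.util.Objects.equals` is the check that `HashSet` documents itself against:

```scala
import java.util.Objects

object EqualitySemantics {
  def main(args: Array[String]): Unit = {
    // Cooperative (primitive) equality: IEEE 754 semantics.
    println(0.0 == -0.0)               // true
    println(Double.NaN == Double.NaN)  // false

    // Object equality on boxed values, as used by java.util.HashSet.
    println(Objects.equals(Double.box(0.0), Double.box(-0.0)))              // false
    println(Objects.equals(Double.box(Double.NaN), Double.box(Double.NaN))) // true
  }
}
```

Boxed `java.lang.Double` compares via `doubleToLongBits`, which is why the two notions disagree on exactly these values.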
> In Spark, 0.0 == -0.0, and in GROUP BY, 0.0 and -0.0 are considered to be in the same group and normalized to 0.0.
This PR does not change this behavior. I noticed, however, that we do not have any tests currently to check that -0.0 is normalized and grouped as you describe, so I went ahead and added such a test in 2bfc605.
Does this address your concern? Or are you suggesting that we should normalize -0.0 to 0.0 across the board?
Consider another interesting case where java.util.HashSet and OpenHashSet differ:
```scala
scala> val h = new HashSet[Double]()
val h: java.util.HashSet[Double] = []

scala> h.add(Double.NaN)
val res9: Boolean = true

scala> h.add(Double.NaN)
val res10: Boolean = false

scala> h.contains(Double.NaN)
val res11: Boolean = true

scala> h.size()
val res12: Int = 1
```

On master, `OpenHashSet` does something obviously wrong:

```scala
val set = new OpenHashSet[Double]()
set.add(Double.NaN)
set.add(Double.NaN)
set.size                 // returns 2
set.contains(Double.NaN) // returns false
```

This could possibly lead to a bug like the one reported in SPARK-45599 but in reverse, where a new NaN row is added rather than dropped. I will see if I can construct such a scenario as a demonstration. But regardless, I think this behavior is incorrect by itself.
Note also that the docstring for OpenHashSet seems to imply that it is meant to be a faster but semantically equivalent alternative to java.util.HashSet:
If that's true, then we should perhaps add property based tests to ensure alignment between the two implementations, but I'll leave that as a potential future improvement.
Spark's OpenHashSet does not have to match java.util.HashSet. What matters is the SQL semantic. Can you highlight which functions/operators are using this OpenHashSet and what is the impact of this change to the SQL semantic?
> Spark's `OpenHashSet` does not have to match `java.util.HashSet`. What matters is the SQL semantic.
Whether or not OpenHashSet matches java.util.HashSet, I want to emphasize for the record that OpenHashSet mishandles 0.0/-0.0 and NaN. Its behavior is simply incorrect. These tests fail on master in ways that can only be described as bugs, regardless of whatever SQL semantics we want to preserve.
This comment explains the root cause. Basically, it is a mistake to combine hash code-based lookups with cooperative equality, at least in the way we are doing it in OpenHashSet.
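To sketch that root cause (a hypothetical illustration, not Spark's actual hashing code): the hash-table invariant "keys that compare equal must hash equal" breaks down under cooperative equality, and NaN additionally fails to match itself:

```scala
object BrokenInvariant {
  // Hash of the IEEE 754 bit pattern, as java.lang.Double.hashCode defines it.
  def bitsHash(d: Double): Int = java.lang.Double.hashCode(d)

  def main(args: Array[String]): Unit = {
    // Cooperative equality says 0.0 and -0.0 are equal...
    println(0.0 == -0.0)                                  // true
    // ...but their bit-pattern hashes differ, so a lookup can probe the wrong slot.
    println(bitsHash(0.0) == bitsHash(-0.0))              // false
    // NaN hashes consistently, yet a probe comparing keys with == can never match it.
    println(bitsHash(Double.NaN) == bitsHash(Double.NaN)) // true
    println(Double.NaN == Double.NaN)                     // false
  }
}
```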
But I understand what you are saying. Fixing bugs in OpenHashSet doesn't help us if it also breaks users' SQL.
> Can you highlight which functions/operators are using this `OpenHashSet`
I've updated the PR description with a summary of what uses OpenHashSet.
As a side note, I believe that if we accept the change proposed here, we should be able to eliminate SQLOpenHashSet. SQLOpenHashSet was created specifically to work around the bugs in OpenHashSet that we are addressing in this PR. See #33955 and #33993.
> and what is the impact of this change to the SQL semantic?
I've updated the PR description with a diff of what tests pass or fail on master vs. this branch. Please take a look and let me know if you think we need any more tests. I know we are touching a sensitive code path and I appreciate the need for caution.
Thanks @HeartSaVioR. BTW I don't think this should be a blocker for 3.5.1 as this is not a regression, but if we can find a quick fix for this issue it would be good to include it in the release.
core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala
core/src/test/scala/org/apache/spark/util/collection/OpenHashSetSuite.scala
```scala
/**
 * Check if a key exists at the provided position using object equality rather than
 * cooperative equality. Otherwise, hash sets will mishandle values for which `==`
 * and `equals` return different results, like 0.0/-0.0 and NaN/NaN.
```
I was told that in Scala `==` is the same as `equals`, but `eq` is a different operator. I need to refresh my knowledge now :)
Yes, the differences are subtle:
```scala
scala> 0.0 == -0.0
val res0: Boolean = true

scala> 0.0 equals -0.0
val res1: Boolean = false

scala> Double.NaN == Double.NaN
val res2: Boolean = false

scala> Double.NaN equals Double.NaN
val res3: Boolean = true
```

There is a long discussion on the Scala forums from 2017 about this difference and some of the problems it causes:
core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
core/src/test/scala/org/apache/spark/util/collection/OpenHashMapSuite.scala
```scala
set.add(Double.NaN)
set.add(Double.NaN)
assert(set.contains(Double.NaN))
assert(set.size == 1)
```
do we actually waste space in OpenHashSet to store all the NaN values?
Yes sir. On master, this is the actual behavior of `OpenHashSet`:

```scala
// ...OpenHashSet$mcD$sp@21b327e6 did not contain NaN
assert(set.contains(Double.NaN))
// ...OpenHashSet$mcD$sp@1f09db1e had size 2 instead of expected size 1
assert(set.size == 1)
```

Every NaN will get its own entry in `OpenHashSet` on master. So if we add 1,000,000 NaNs to the set, NaN will have 1,000,000 entries in there. And `.contains()` will still return false. :D
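The slot-per-NaN behavior is easy to reproduce with a toy open-addressing table (a hypothetical sketch, not Spark's actual `OpenHashSet` code): because `data(pos) == k` is false whenever `k` is NaN, the probe loop never recognizes a previously stored NaN and always claims a fresh slot.

```scala
object NaNProbe {
  val capacity = 8
  val data = new Array[Double](capacity)
  val occupied = new Array[Boolean](capacity)

  def add(k: Double): Unit = {
    var pos = java.lang.Double.hashCode(k) & (capacity - 1)
    // `data(pos) == k` never holds when k is NaN, so the probe walks past every slot.
    while (occupied(pos) && !(data(pos) == k)) pos = (pos + 1) & (capacity - 1)
    data(pos) = k
    occupied(pos) = true
  }

  def size: Int = occupied.count(identity)

  def main(args: Array[String]): Unit = {
    add(Double.NaN)
    add(Double.NaN)
    add(Double.NaN)
    println(size) // 3: one slot per NaN inserted
  }
}
```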
```scala
test("SPARK-45599: 0.0 and -0.0 should count distinctly") {
  // Exactly these elements provided in roughly this order trigger a condition where lookups of
  // 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
  // and inconsistently if `==` is used to check for key equality.
```
shall we mention the NaN behavior as well? All NaN values are the same.
I tweaked the test name. Is that what you had in mind?
This comment explains why we need exactly the following elements to trigger the 0.0/-0.0 miscount. It doesn't always happen (which is part of what kept this bug hidden for so long).
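For intuition, the collision-then-merge can be reproduced in a toy open-addressing table (a hypothetical sketch, not Spark's actual code): when 0.0 and -0.0 happen to probe from the same slot, a `==` key check merges them, while object equality keeps them distinct.

```scala
object DistinctCount {
  // Insert keys into a tiny linear-probing table, counting distinct entries
  // according to the supplied equality check.
  def countDistinct(eq: (Double, Double) => Boolean, keys: Seq[Double]): Int = {
    val capacity = 8
    val data = new Array[Double](capacity)
    val occupied = new Array[Boolean](capacity)
    var n = 0
    for (k <- keys) {
      var pos = java.lang.Double.hashCode(k) & (capacity - 1)
      while (occupied(pos) && !eq(data(pos), k)) pos = (pos + 1) & (capacity - 1)
      if (!occupied(pos)) { data(pos) = k; occupied(pos) = true; n += 1 }
    }
    n
  }

  def main(args: Array[String]): Unit = {
    // In this toy table both keys probe from the same starting slot.
    println(countDistinct(_ == _, Seq(0.0, -0.0)))      // 1: merged by cooperative equality
    println(countDistinct(_.equals(_), Seq(0.0, -0.0))) // 2: kept apart by object equality
  }
}
```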
@revans2 - Just to confirm, have you rerun your tests against this branch (the ones that exposed this bug) and do they pass now?
If we go this direction and change …
Yes, you are right. Probably the …
Is there anything more we're waiting on here? (Perhaps just an extra confirmation from @revans2 that the original fuzz test he ran now passes?)
thanks, merging to master/3.5!
### What changes were proposed in this pull request?

Change `OpenHashSet` to use object equality instead of cooperative equality when looking up keys.

### Why are the changes needed?

This brings the behavior of `OpenHashSet` more in line with the semantics of `java.util.HashSet`, and fixes its behavior when comparing values for which `equals` and `==` return different results, like 0.0/-0.0 and NaN/NaN.

For example, in certain cases where both 0.0 and -0.0 are provided as keys to the set, lookups of one or the other key may return the [wrong position][wrong] in the set. This leads to the bug described in SPARK-45599 and summarized in [this comment][1].

[wrong]: https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R277-R283
[1]: https://issues.apache.org/jira/browse/SPARK-45599?focusedCommentId=17806954&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17806954

### Does this PR introduce _any_ user-facing change?

Yes, it resolves the bug described in SPARK-45599. `OpenHashSet` is used widely under the hood, including in:

- `OpenHashMap`, which itself backs:
  - `TypedImperativeAggregate`
  - aggregate functions like `percentile` and `mode`
  - many algorithms in ML and MLlib
- `SQLOpenHashSet`, which backs array functions like `array_union` and `array_distinct`

However, the user-facing changes should be limited to the kind of edge case described in SPARK-45599.

### How was this patch tested?

New and existing unit tests. Of the new tests added in this PR, some simply validate that we have not changed existing SQL semantics, while others confirm that we have fixed the specific bug reported in SPARK-45599 along with any related incorrect behavior.

New tests failing on `master` but passing on this branch:

- [Handling 0.0 and -0.0 in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R273)
- [Handling NaN in `OpenHashSet`](https://github.com/apache/spark/pull/45036/files#diff-894198a5fea34e5b7f07d0a4641eb09995315d5de3e0fded3743c15a3c8af405R302)
- [Handling 0.0 and -0.0 in `OpenHashMap`](https://github.com/apache/spark/pull/45036/files#diff-09400ec633b1f1322c5f7b39dc4e87073b0b0435b60b9cff93388053be5083b6R253)
- [Handling 0.0 and -0.0 when computing percentile](https://github.com/apache/spark/pull/45036/files#diff-bd3d5c79ede5675f4bf10d2efb313db893d57443d6d6d67b1f8766e6ce741271R1092)

New tests passing both on `master` and this branch:

- [Handling 0.0, -0.0, and NaN in `array_union`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R793)
- [Handling 0.0, -0.0, and NaN in `array_distinct`](https://github.com/apache/spark/pull/45036/files#diff-9e18a5ccf83ac94321e3a0ee8c5acf104c45734f3b35f1a0d4c15c4daa315ad5R801)
- [Handling 0.0, -0.0, and NaN in `GROUP BY`](https://github.com/apache/spark/pull/45036/files#diff-496edb8b03201f078c3772ca81f7c7f80002acc11dff00b1d06d288b87855264R1107)
- [Normalizing -0 and -0.0](https://github.com/apache/spark/pull/45036/files#diff-4bdd04d06a2d88049dd5c8a67715c5566dd68a1c4ebffc689dc74b6b2e0b3b04R782)

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45036 from nchammas/SPARK-45599-plus-and-minus-zero.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fcc5dbc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Hi, @cloud-fan and @nchammas. This broke Apache Spark branch-3.5.
Sorry but I reverted this from `branch-3.5`. Please make a backporting PR to branch-3.5 again after fixing the above annotation compilation issue. Thank you in advance.
@dongjoon-hyun - I think this warning category was added in Scala 2.13 and then backported to Scala 2.12. Trying to confirm right now over on scala/scala#8120 and scala/scala#10446. I see that Spark 3.5 supports both Scala 2.12 and 2.13, so I will figure out some method that is compatible across both versions and open a PR against `branch-3.5`.
Backport to 3.5: #45296
### What changes were proposed in this pull request?

This is a backport of fcc5dbc / #45036 with a tweak so that it works on Scala 2.12.

### Why are the changes needed?

This is a correctness bug fix. The original fix against `master` suppresses a warning category that doesn't exist on certain versions of Scala 2.13 and 2.12, and the exact versions are [not documented anywhere][1]. To be safe, this backport simply suppresses all warnings instead of just `other-non-cooperative-equals`. It would be interesting to see if `-Wconf:nowarn` complains, since the warning about non-cooperative equals itself is also not thrown on all versions of Scala, but I don't think that's a priority.

[1]: scala/scala#8120 (comment)

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

CI + added tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45296 from nchammas/SPARK-45599-OpenHashSet.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>