
[SPARK-45599][CORE] Use object equality in OpenHashSet #45036

Closed
nchammas wants to merge 19 commits into apache:master from nchammas:SPARK-45599-plus-and-minus-zero

Conversation

@nchammas
Contributor

@nchammas nchammas commented Feb 6, 2024

What changes were proposed in this pull request?

Change OpenHashSet to use object equality instead of cooperative equality when looking up keys.

Why are the changes needed?

This brings the behavior of OpenHashSet more in line with the semantics of java.util.HashSet, and fixes its behavior when comparing values for which equals and == return different results, like 0.0/-0.0 and NaN/NaN.

For example, in certain cases where both 0.0 and -0.0 are provided as keys to the set, lookups of one or the other key may return the wrong position in the set. This leads to the bug described in SPARK-45599 and summarized in this comment.
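The mismatch can be seen with plain JVM primitives, independent of Spark: 0.0 and -0.0 have different hash codes yet compare equal under `==`, while `equals` distinguishes them. A hash-based lookup that probes with `==` therefore mixes the two behaviors (illustrative check only):

```scala
// 0.0 and -0.0 hash differently but compare equal under ==;
// equals (as used by java.util.HashSet) keeps them distinct.
val hPlus  = java.lang.Double.hashCode(0.0)   // 0
val hMinus = java.lang.Double.hashCode(-0.0)  // Integer.MIN_VALUE
assert(hPlus != hMinus)          // placed in different buckets...
assert(0.0 == -0.0)              // ...but == cannot tell them apart
assert(!(0.0: Any).equals(-0.0)) // while equals can
```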

Does this PR introduce any user-facing change?

Yes, it resolves the bug described in SPARK-45599.

OpenHashSet is used widely under the hood, including in:

  • OpenHashMap, which itself backs:
    • TypedImperativeAggregate
    • aggregate functions like percentile and mode
    • many algorithms in ML and MLlib
  • SQLOpenHashSet, which backs array functions like array_union and array_distinct

However, the user-facing changes should be limited to the kind of edge case described in SPARK-45599.

How was this patch tested?

New and existing unit tests. Of the new tests added in this PR, some simply validate that we have not changed existing SQL semantics, while others confirm that we have fixed the specific bug reported in SPARK-45599 along with any related incorrect behavior.

New tests failing on master but passing on this branch:

  • Handling 0.0 and -0.0 in OpenHashSet
  • Handling NaN in OpenHashSet
  • Handling 0.0 and -0.0 in OpenHashMap
  • Handling 0.0 and -0.0 when computing percentile

New tests passing both on master and this branch:

  • Handling 0.0, -0.0, and NaN in array_union
  • Handling 0.0, -0.0, and NaN in array_distinct
  • Handling 0.0, -0.0, and NaN in GROUP BY
  • Normalizing -0 and -0.0

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the CORE label Feb 6, 2024
@nchammas
Contributor Author

nchammas commented Feb 6, 2024

core/testOnly org.apache.spark.util.collection.* passes on my machine, but I haven't run the full test suite yet. I will monitor the full test results here on GitHub.

@nchammas
Contributor Author

nchammas commented Feb 6, 2024

Build looks good.

@revans2 and @peter-toth - What do you think? (Tagging you since I noticed you are watching the linked JIRA ticket.)

Contributor

@revans2 revans2 left a comment


Looks good to me

@peter-toth
Contributor

Yes, I was watching this ticket because it is a correctness issue.
The fix looks good to me.

cc @cloud-fan and @HeartSaVioR as it would be great to include the fix in 3.5.1.

@HeartSaVioR
Contributor

I have no expertise in this area, so I will leave to others the decision about which version(s) to merge the change into, and whether we want to safeguard the behavioral change behind an escape hatch. I can hold off on cutting the RC1 tag for this.


test("SPARK-45599: 0.0 and -0.0 are equal but not the same") {
Contributor


This is a bit tricky and it's better if we can find a reference system that defines this semantic. In Spark, 0.0 == -0.0, and in GROUP BY, 0.0 and -0.0 are considered to be in the same group and normalized to 0.0.
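One way to picture the GROUP BY normalization described above (a sketch only, not Spark's actual NormalizeFloatingNumbers rule; `normalize` is a hypothetical helper):

```scala
// Collapse -0.0 into 0.0 before grouping, as the GROUP BY semantics require.
// Works because -0.0 == 0.0 is true under IEEE 754 comparison.
def normalize(d: Double): Double = if (d == 0.0) 0.0 else d

assert(java.lang.Double.doubleToRawLongBits(normalize(-0.0)) ==
       java.lang.Double.doubleToRawLongBits(0.0)) // sign bit is gone
assert(normalize(1.5) == 1.5)                     // other values untouched
```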

Contributor


Yes, it probably makes sense to just fix this particular issue, as OpenHashSet is used in many other places.

Contributor Author


This is a bit tricky and it's better if we can find a reference system that defines this semantic.

scala> import java.util.HashSet
import java.util.HashSet

scala> val h = new HashSet[Double]()
val h: java.util.HashSet[Double] = []

scala> h.add(0.0)
val res0: Boolean = true

scala> h.add(-0.0)
val res1: Boolean = true

scala> h.size()
val res2: Int = 2

The doc for HashSet.add states:

More formally, adds the specified element e to this set if this set contains no element e2 such that Objects.equals(e, e2). If this set already contains the element, the call leaves the set unchanged and returns false.

In other words, java.util.HashSet uses equals and not ==, and therefore it considers 0.0 and -0.0 distinct elements.

So this PR brings OpenHashSet more in line with the semantics of java.util.HashSet.
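The `Objects.equals` contract quoted above can be checked directly on boxed doubles (explicit boxing shown for clarity):

```scala
import java.util.Objects

val zero: java.lang.Double = 0.0
val negZero: java.lang.Double = -0.0
val nan: java.lang.Double = Double.NaN

assert(!Objects.equals(zero, negZero)) // distinct elements to a HashSet
assert(Objects.equals(nan, nan))       // NaN deduplicates under equals
```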

Contributor Author


In Spark, 0.0 == -0.0, and in GROUP BY, 0.0 and -0.0 are considered to be in the same group and normalized to 0.0.

This PR does not change this behavior. I noticed, however, that we do not have any tests currently to check that -0.0 is normalized and grouped as you describe, so I went ahead and added such a test in 2bfc605.

Does this address your concern? Or are you suggesting that we should normalize -0.0 to 0.0 across the board?

Contributor Author

@nchammas nchammas Feb 7, 2024


Consider another interesting case where java.util.HashSet and OpenHashSet differ:

scala> val h = new HashSet[Double]()
val h: java.util.HashSet[Double] = []

scala> h.add(Double.NaN)
val res9: Boolean = true

scala> h.add(Double.NaN)
val res10: Boolean = false

scala> h.contains(Double.NaN)
val res11: Boolean = true

scala> h.size()
val res12: Int = 1

On master, OpenHashSet does something obviously wrong:

val set = new OpenHashSet[Double]()
set.add(Double.NaN)
set.add(Double.NaN)
set.size  // returns 2
set.contains(Double.NaN)  // returns false

This could possibly lead to a bug like the one reported in SPARK-45599 but in reverse, where a new NaN row is added rather than dropped. I will see if I can construct such a scenario as a demonstration. But regardless, I think this behavior is incorrect by itself.

Contributor Author


Note also that the docstring for OpenHashSet seems to imply that it is meant to be a faster but semantically equivalent alternative to java.util.HashSet:

* storage for four primitive types (Long, Int, Double, and Float). It is much faster than Java's
* standard HashSet while incurring much less memory overhead. This can serve as building blocks

If that's true, then we should perhaps add property-based tests to ensure alignment between the two implementations, but I'll leave that as a potential future improvement.

Contributor


Spark's OpenHashSet does not have to match java.util.HashSet. What matters is the SQL semantic. Can you highlight which functions/operators are using this OpenHashSet and what is the impact of this change to the SQL semantic?

Contributor Author


Spark's OpenHashSet does not have to match java.util.HashSet. What matters is the SQL semantic.

Whether or not OpenHashSet matches java.util.HashSet, I want to emphasize for the record that OpenHashSet mishandles 0.0/-0.0 and NaN. Its behavior is simply incorrect. These tests fail on master in ways that can only be described as bugs, regardless of whatever SQL semantics we want to preserve.

This comment explains the root cause. Basically, it is a mistake to combine hash code-based lookups with cooperative equality, at least in the way we are doing it in OpenHashSet.
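To make that concrete, here is a deliberately tiny open-addressed set that probes with `==`, reproducing the same failure mode in miniature (an illustration only, NOT Spark's actual OpenHashSet):

```scala
// Toy open-addressed set whose probe loop uses cooperative equality (`==`).
final class ToySet(capacity: Int) { // capacity must be a power of two
  private val data = new Array[Double](capacity)
  private val used = new Array[Boolean](capacity)
  private val mask = capacity - 1
  var size = 0

  private def slot(k: Double): Int = java.lang.Double.hashCode(k) & mask

  def add(k: Double): Unit = {
    var i = slot(k)
    // NaN == NaN is false, so a second NaN never matches the first
    // and claims a fresh slot instead.
    while (used(i) && !(data(i) == k)) i = (i + 1) & mask
    if (!used(i)) { used(i) = true; data(i) = k; size += 1 }
  }

  def contains(k: Double): Boolean = {
    var i = slot(k)
    var found = false
    while (used(i) && !found) {
      if (data(i) == k) found = true else i = (i + 1) & mask
    }
    found
  }
}

val s = new ToySet(8)
s.add(Double.NaN)
s.add(Double.NaN)
assert(s.size == 2)             // each NaN takes its own slot...
assert(!s.contains(Double.NaN)) // ...and none of them can ever be found
```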

But I understand what you are saying. Fixing bugs in OpenHashSet doesn't help us if it also breaks users' SQL.

Can you highlight which functions/operators are using this OpenHashSet

I've updated the PR description with a summary of what uses OpenHashSet.

As a side note, I believe that if we accept the change proposed here, we should be able to eliminate SQLOpenHashSet. SQLOpenHashSet was created specifically to work around the bugs in OpenHashSet that we are addressing in this PR. See #33955 and #33993.

and what is the impact of this change to the SQL semantic?

I've updated the PR description with a diff of what tests pass or fail on master vs. this branch. Please take a look and let me know if you think we need any more tests. I know we are touching a sensitive code path and I appreciate the need for caution.

@peter-toth
Contributor

I can hold cutting the tag of RC1 for this.

Thanks @HeartSaVioR. BTW I don't think this should be a blocker for 3.5.1, as this is not a regression, but if we can find a quick fix for this issue it would be good to include it in the release.

/**
* Check if a key exists at the provided position using object equality rather than
* cooperative equality. Otherwise, hash sets will mishandle values for which `==`
* and `equals` return different results, like 0.0/-0.0 and NaN/NaN.
Contributor


I was told that in Scala == is the same as equals, but eq is a different operator. I need to refresh my knowledge now :)

Contributor Author


Yes, the differences are subtle:

scala> 0.0 == -0.0
val res0: Boolean = true

scala> 0.0 equals -0.0
val res1: Boolean = false

scala> Double.NaN == Double.NaN
val res2: Boolean = false

scala> Double.NaN equals Double.NaN
val res3: Boolean = true

There is a long discussion on the Scala forums from 2017 about this difference and some of the problems it causes:

Can we get rid of cooperative equality?
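In terms of the fix, the essential change is that the probe comparison behaves like `equals` rather than `==` (the name below is illustrative, not Spark's actual code):

```scala
// Object equality as used after the fix: box and delegate to equals.
def sameKey(a: Double, b: Double): Boolean = (a: Any).equals(b)

assert(sameKey(Double.NaN, Double.NaN)) // NaN now finds itself
assert(!sameKey(0.0, -0.0))             // 0.0 and -0.0 stay distinct
assert(sameKey(1.5, 1.5))               // ordinary values unaffected
```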

set.add(Double.NaN)
set.add(Double.NaN)
assert(set.contains(Double.NaN))
assert(set.size == 1)
Contributor


do we actually waste space in OpenHashSet to store all the NaN values?

Contributor Author


Yes sir. On master, this is the actual behavior of OpenHashSet:

// ...OpenHashSet$mcD$sp@21b327e6 did not contain NaN
assert(set.contains(Double.NaN))

// ...OpenHashSet$mcD$sp@1f09db1e had size 2 instead of expected size 1
assert(set.size == 1)

Every NaN will get its own entry in OpenHashSet on master. So if we add 1,000,000 NaNs to the set, NaN will have 1,000,000 entries in there. And .contains() will still return false. :D

test("SPARK-45599: 0.0 and -0.0 should count distinctly") {
// Exactly these elements provided in roughly this order trigger a condition where lookups of
// 0.0 and -0.0 in the bitset happen to collide, causing their counts to be merged incorrectly
// and inconsistently if `==` is used to check for key equality.
Contributor


shall we mention the NaN behavior as well? All NaN values are all the same.

Contributor Author


I tweaked the test name. Is that what you had in mind?

This comment explains why we need exactly the following elements to trigger the 0.0/-0.0 miscount. It doesn't always happen (which is part of what kept this bug hidden for so long).
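Whether 0.0 and -0.0 actually collide depends on the table size and the hash mixing in use, which is why the miscount is order- and size-dependent. With plain power-of-two masking, for instance, the two keys always share slot 0, because the low bits of Double.hashCode(-0.0) (= Integer.MIN_VALUE) are all zero (a sketch, independent of Spark's murmur-mixed hashing):

```scala
// Slot assignment under plain power-of-two masking (illustrative only).
def slot(d: Double, mask: Int): Int = java.lang.Double.hashCode(d) & mask

for (capacity <- Seq(4, 16, 64, 1024)) {
  val mask = capacity - 1
  assert(slot(0.0, mask) == 0)
  assert(slot(-0.0, mask) == 0) // 0x80000000 has no low bits set
}
```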

@nchammas
Contributor Author

@revans2 - Just to confirm, have you rerun your tests against this branch (the ones that exposed this bug) and do they pass now?

@peter-toth
Contributor

If we go this direction and change OpenHashSet do we still need SQLOpenHashSet?

@cloud-fan
Contributor

SQLOpenHashSet also handles null differently. Not sure if OpenHashSet already covers it.

@peter-toth
Contributor

SQLOpenHashSet also handles null differently. Not sure if OpenHashSet already covers it.

Yes, you are right. Probably the NaN special handling can be removed, though.

@nchammas
Contributor Author

Is there anything more we're waiting on here? (Perhaps just an extra confirmation from @revans2 that the original fuzz test he ran now passes?)

@cloud-fan
Contributor

thanks, merging to master/3.5!

@cloud-fan cloud-fan closed this in fcc5dbc Feb 26, 2024
cloud-fan pushed a commit that referenced this pull request Feb 26, 2024
Closes #45036 from nchammas/SPARK-45599-plus-and-minus-zero.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit fcc5dbc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@nchammas nchammas deleted the SPARK-45599-plus-and-minus-zero branch February 27, 2024 02:52
@dongjoon-hyun
Member

Hi, @cloud-fan and @nchammas .

This broke Apache Spark branch-3.5.


[error] /home/runner/work/spark/spark/core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala:136:4: Invalid message filter:
[error] Unknown category: `other-non-cooperative-equals`
[error]   @annotation.nowarn("cat=other-non-cooperative-equals")
[error]    ^
[error] one error found

@dongjoon-hyun
Member

Sorry but I reverted this from branch-3.5 to unblock the community PRs.

Please make a backporting PR to branch-3.5 again after fixing the above annotation compilation. Thank you in advance.

@nchammas
Contributor Author

@dongjoon-hyun - I think this warning category was added in Scala 2.13 and then backported to Scala 2.12. Trying to confirm right now over on scala/scala#8120 and scala/scala#10446.

I see that Spark 3.5 supports both Scala 2.12 and 2.13, so I will figure out some method that is compatible across both versions and open a PR against branch-3.5.
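The approach the backport takes is to suppress all warnings at that site rather than name a category that only some compiler versions recognize. A minimal sketch of the idea (the method below is hypothetical, not Spark's actual code):

```scala
import scala.annotation.nowarn

// A bare @nowarn is understood by recent 2.12 and 2.13 compilers, unlike a
// category filter such as "cat=other-non-cooperative-equals", which only
// some compiler versions know about.
@nowarn
def keyEq(a: Double, b: Double): Boolean = (a: Any).equals(b)

assert(keyEq(Double.NaN, Double.NaN))
assert(!keyEq(0.0, -0.0))
```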

nchammas added a commit to nchammas/spark that referenced this pull request Feb 27, 2024
@nchammas
Contributor Author

Backport to 3.5: #45296

dongjoon-hyun pushed a commit that referenced this pull request Feb 28, 2024
### What changes were proposed in this pull request?

This is a backport of fcc5dbc / #45036 with a tweak so that it works on Scala 2.12.

### Why are the changes needed?

This is a correctness bug fix.

The original fix against `master` suppresses a warning category that doesn't exist on certain versions of Scala 2.13 and 2.12, and the exact versions are [not documented anywhere][1].

To be safe, this backport simply suppresses all warnings instead of just `other-non-cooperative-equals`. It would be interesting to see if `-Wconf:nowarn` complains, since the warning about non-cooperative equals itself is also not thrown on all versions of Scala, but I don't think that's a priority.

[1]: scala/scala#8120 (comment)

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

CI + added tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #45296 from nchammas/SPARK-45599-OpenHashSet.

Authored-by: Nicholas Chammas <nicholas.chammas@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
ericm-db pushed a commit to ericm-db/spark that referenced this pull request Mar 5, 2024