[C++] hash_mean overflows if numeric sum is larger than int64 max #38833
Closed
Labels
Component: C++, Component: Python, Critical Fix (Bugfixes for security vulnerabilities, crashes, or invalid data.), Type: bug
Description
Describe the bug, including details regarding any error messages, version, and platform.
>>> import pandas as pd
>>> import pyarrow as pa
>>> from pyarrow import compute
>>> df = pd.DataFrame({"A": pd.Series([True, True, True], dtype="bool[pyarrow]"), "B": pd.Series([9223372036854775805, 9223372036854775806, 9223372036854775807], dtype="int64[pyarrow]")})
>>> pa_table = pa.Table.from_pandas(df)
>>> pa.TableGroupBy(pa_table, ["A"]).aggregate([("B", "mean")])
pyarrow.Table
A: bool
B_mean: double
----
A: [[true]]
B_mean: [[3.0744573456182584e+18]]
>>>
I would expect B_mean to be 9.223372036854776e+18. This looks similar to #34909.
The scalar aggregate works as expected:
>>> compute.mean(pa_table["B"])
<pyarrow.DoubleScalar: 9.223372036854776e+18>
So I would expect the vector aggregate with a single group to produce the same result.
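The reported value is consistent with the grouped kernel summing in int64 and wrapping around before dividing by the count. The following pure-Python sketch reproduces that arithmetic under that assumption; it does not call Arrow, and `wrap_int64` is a hypothetical helper that only emulates two's-complement overflow.

```python
INT64_MIN = -(2**63)

def wrap_int64(n: int) -> int:
    """Reduce a Python int to the value a two's-complement int64 would hold."""
    return (n - INT64_MIN) % 2**64 + INT64_MIN

values = [9223372036854775805, 9223372036854775806, 9223372036854775807]

# Assumed behaviour of the grouped mean: accumulate in int64, wrapping on overflow.
wrapped_sum = 0
for v in values:
    wrapped_sum = wrap_int64(wrapped_sum + v)

# Reference behaviour: accumulate without overflow (Python ints are unbounded).
exact_sum = sum(values)

wrapped_mean = wrapped_sum / len(values)
exact_mean = exact_sum / len(values)

print(wrapped_mean)  # 3.0744573456182584e+18, the value reported for B_mean
print(exact_mean)    # 9.223372036854776e+18, the expected value
```

The wrapped sum is 9223372036854775802, which still fits in int64, so the overflow is silent and only shows up as a mean that is roughly one third of the expected value.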
>>> pa.__version__
'14.0.0'
>>>
Component(s)
C++, Python