[C++] hash_mean overflows if numeric sum is larger than int64 max #38833

@rohanjain101

Describe the bug, including details regarding any error messages, version, and platform.

>>> import pandas as pd
>>> import pyarrow as pa
>>> from pyarrow import compute
>>> df = pd.DataFrame({"A": pd.Series([True, True, True], dtype="bool[pyarrow]"), "B": pd.Series([9223372036854775805, 9223372036854775806, 9223372036854775807], dtype="int64[pyarrow]")})
>>> pa_table = pa.Table.from_pandas(df)
>>> pa.TableGroupBy(pa_table, ["A"]).aggregate([("B", "mean")])
pyarrow.Table
A: bool
B_mean: double
----
A: [[true]]
B_mean: [[3.0744573456182584e+18]]
>>>

I would expect B_mean to be 9.223372036854776e+18. This looks similar to #34909.
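
For reference, the reported value is consistent with the sum wrapping around in a signed 64-bit integer before the division. The following is a minimal pure-Python sketch of that arithmetic. It assumes the grouped kernel accumulates the sum as int64 (a guess based on the output, not something verified against the C++ source), and `wrap_int64` is a hypothetical helper written for this illustration, not a pyarrow function:

```python
# Same three values as column B in the reproduction above.
INT64_MAX = 2**63 - 1
values = [INT64_MAX - 2, INT64_MAX - 1, INT64_MAX]


def wrap_int64(n):
    """Reduce an arbitrary Python int to its two's-complement int64 value.

    Hypothetical helper: Python ints never overflow, so the wraparound
    has to be simulated explicitly.
    """
    return (n + 2**63) % 2**64 - 2**63


# Assumed behaviour: accumulate in int64, wrapping after every addition.
wrapped_sum = 0
for v in values:
    wrapped_sum = wrap_int64(wrapped_sum + v)

wrapped_mean = wrapped_sum / len(values)

# Exact arithmetic for comparison (Python ints are arbitrary precision).
exact_mean = sum(values) / len(values)

print("wrapped sum: ", wrapped_sum)
print("wrapped mean:", wrapped_mean)
print("exact mean:  ", exact_mean)
```

Under that assumption the wrapped mean comes out at roughly 3.07e+18, in line with the B_mean shown above, while the exact mean is 9.223372036854776e+18.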

The scalar aggregate works as expected:

>>> compute.mean(pa_table["B"])
<pyarrow.DoubleScalar: 9.223372036854776e+18>

So I would expect the vector aggregate with a single group to produce the same result.

>>> pa.__version__
'14.0.0'
>>>

Component(s)

C++, Python
