[C++][Python] pyarrow table group_by/aggregate results in multiple rows with the same group_by key #42231

@FreekPaans

Describe the bug, including details regarding any error messages, version, and platform.

Originally posted here

I'm doing a simple group_by/aggregate on multiple keys, one of which contains null values. This sometimes produces multiple result rows with identical values for the group_by keys, which I don't expect. Tested on pyarrow 16.1.0.

Repro case:

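import pyarrow as pa


def try_repro(size):
    # Constant key column "a" plus an all-null date32 key column "g".
    table = pa.table(
        {"a": [0] * size, "g": [None] * size},
        schema=pa.schema([pa.field("a", "uint8"), pa.field("g", "date32")]),
    )
    # Every row has the same (a, g) key, so exactly one group is expected.
    repro = table.group_by(["a", "g"]).aggregate([([], "count_all")])

    # Report any size at which the result has more than one row.
    if len(repro) != 1:
        print(f"{size} => {len(repro)}")
    return repro


for i in range(1, 50):
    r = try_repro(i)

print()
print(r)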

Output without AVX2 (expected):

$ ARROW_USER_SIMD_LEVEL=AVX python repro.py

pyarrow.Table
a: uint8
g: date32[day]
count_all: int64
----
a: [[0]]
g: [[null]]
count_all: [[49]]

Output with AVX2 (not expected):

$ ARROW_USER_SIMD_LEVEL=AVX2 python repro.py
33 => 2
...
40 => 2
41 => 3
...
48 => 3
49 => 4

pyarrow.Table
a: uint8
g: date32[day]
count_all: int64
----
a: [[0,0,0,0]]
g: [[null,null,null,null]]
count_all: [[32,8,8,1]]
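
The unexpected result above repeats the key (0, null) in four rows. A minimal sketch of a check for that condition follows, written in plain Python so that it works on the list of row dicts that `Table.to_pylist()` returns. The helper name `duplicate_group_keys` is hypothetical and not part of pyarrow.

```python
from collections import Counter


def duplicate_group_keys(rows, key_columns):
    """Return the key tuples that occur in more than one result row."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Rows equivalent to the AVX2 output for size 49 (null shown as None).
rows = [
    {"a": 0, "g": None, "count_all": 32},
    {"a": 0, "g": None, "count_all": 8},
    {"a": 0, "g": None, "count_all": 8},
    {"a": 0, "g": None, "count_all": 1},
]
print(duplicate_group_keys(rows, ["a", "g"]))  # {(0, None): 4}
```

With the repro script, the same check would be `duplicate_group_keys(r.to_pylist(), ["a", "g"])`; an empty dict means every key is unique.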

Some observations:

  • Grouping on g alone does not show the problem.
  • Swapping the order of a and g in the group_by also removes the issue.
  • The problem appears as soon as the table reaches 33 rows, and we then get an extra group for every 8 rows added (so at 33, 41, 49).
  • With g as an int type the problem does not occur; with a float type it does.
  • Non-null values don't have the issue.
  • A MacBook Pro M2 is also fine.
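
The row-count observation, together with the counts [32, 8, 8, 1] at size 49, suggests a simple pattern: the first 32 rows land in one group and each further block of up to 8 rows gets its own group. The sketch below only encodes that observed pattern. `observed_group_sizes` is a hypothetical name, and the extrapolation to sizes other than 49 is an assumption, not something taken from Arrow's implementation.

```python
def observed_group_sizes(size, first=32, step=8):
    """Group sizes implied by the observed pattern (assumption, not Arrow code)."""
    if size <= first:
        return [size]
    sizes = [first]
    remaining = size - first
    # Each further block of up to `step` rows shows up as a separate group.
    while remaining > 0:
        chunk = min(step, remaining)
        sizes.append(chunk)
        remaining -= chunk
    return sizes


for n in (32, 33, 40, 41, 48, 49):
    print(n, "=>", len(observed_group_sizes(n)), observed_group_sizes(n))
```

The group counts this yields (2 at 33 to 40, 3 at 41 to 48, 4 at 49) agree with the `size => rows` lines printed under AVX2.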

Component(s)

Python
