[C++] Kernel to select subset of fields of a StructArray #31101

@asfimport

Triggered by https://stackoverflow.com/questions/71035754/pyarrow-drop-a-column-in-a-nested-structure. I thought there was already an issue about this, but I can't find one right away.

Assume you have a struct array with some fields:

>>> arr = pa.StructArray.from_arrays([[1, 2, 3]]*3, names=['a', 'b', 'c'])
>>> arr.type
StructType(struct<a: int64, b: int64, c: int64>)

We have a kernel to select a single child field:

>>> pc.struct_field(arr, [0])
<pyarrow.lib.Int64Array object at 0x7ffa9e229940>
[
  1,
  2,
  3
]

But if you want to subset the StructArray to some of its fields, resulting in a new StructArray, that's not possible with struct_field, and doing this manually is a bit cumbersome:

>>> fields = ['a', 'c']
>>> arrays = [arr.field(n) for n in fields]
>>> arr_subset = pa.StructArray.from_arrays(arrays, names=fields)
>>> arr_subset.type
StructType(struct<a: int64, c: int64>)

(This is still OK for a single array, but if you had a ChunkedArray you would have to repeat it for every chunk, which certainly gets annoying.)

One option could be to expand the existing struct_field to allow selecting multiple fields, although that probably gets ambiguous/confusing given how you currently select a recursively nested field: [0, 1] currently means "first child, second subchild" and not "first and second child".
Another option would be a new kernel like "struct_subset" (or some other name).

This might also overlap with general projection functionality? (cc @westonpace)

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Dhruv Vats / @dhruv9vats

Note: This issue was originally created as ARROW-15643. Please see the migration documentation for further details.
