RecordArray with duplicated field names causes issues with to_buffers, unpickling and loading from arrow #3247

@nikoladze

Version of Awkward Array

2.6.8

Description and code to reproduce

It seems RecordArray allows duplicated field names, e.g. when constructing one via the layout API:

>>> import awkward as ak
>>> array = ak.Array(ak.contents.RecordArray([ak.contents.NumpyArray([1, 2, 3]), ak.contents.NumpyArray([1, 2, 3])], ["a", "a"]))
>>> array
<Array [{a: 1, a: 1}, {...}, {a: 3, a: 3}] type='3 * {a: int64, a: int64}'>

Another way this can happen is if one (like me, accidentally) repeats a field name when selecting multiple record fields:

>>> array = ak.zip({"a": [1, 2, 3]})[["a", "a"]]
>>> array
<Array [{a: 1, a: 1}, {...}, {a: 3, a: 3}] type='3 * {a: int64, a: int64}'>
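
One way to guard against the accidental repetition above is to drop repeated names from the selection before indexing. This is only a sketch: `unique_fields` is a hypothetical helper, not part of Awkward Array, and the usage line in the final comment assumes Awkward Array is installed.

```python
def unique_fields(fields):
    """Return the field names with repeats dropped, keeping first-seen order."""
    # dicts preserve insertion order (Python 3.7+), so this deduplicates stably.
    return list(dict.fromkeys(fields))


selection = unique_fields(["a", "a"])
print(selection)  # ['a']

# Hypothetical usage with an Awkward Array:
#   array = ak.zip({"a": [1, 2, 3]})[unique_fields(["a", "a"])]
```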

Such arrays cause problems in the following cases:

  1. exploding via to_buffers:
>>> ak.to_buffers(array)
(RecordForm([NumpyForm('int64', form_key='node1'), NumpyForm('int64', form_key='node2')], ['a', 'a'], form_key='node0'), 3, {'node1-data': array([1, 2, 3])})

One can see that the node2-data buffer is missing from the returned container.

  2. which consequently leads to problems unpickling:
>>> import pickle
>>> pickle.loads(pickle.dumps(array))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/_pickle.py", line 107, in unpickle_array_schema_1
    return _impl(
           ^^^^^^
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/operations/ak_from_buffers.py", line 150, in _impl
    out = _reconstitute(form, length, container, getkey, backend, byteorder, simplify)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/operations/ak_from_buffers.py", line 405, in _reconstitute
    _reconstitute(
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/operations/ak_from_buffers.py", line 196, in _reconstitute
    raw_array = container[getkey(form, "data")]
                ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'node2-data'

  3. also, a round trip to and from Arrow no longer works:
>>> ak.from_arrow(ak.to_arrow(array))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/_dispatch.py", line 39, in dispatch
    gen_or_result = func(*args, **kwargs)
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/operations/ak_from_arrow.py", line 45, in from_arrow
    return _impl(array, generate_bitmasks, highlevel, behavior, attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/operations/ak_from_arrow.py", line 55, in _impl
    out = awkward._connect.pyarrow.handle_arrow(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/_connect/pyarrow/conversions.py", line 757, in handle_arrow
    out = popbuffers(
          ^^^^^^^^^^^
  File "/home/nikolai/.local/lib/python3.12/site-packages/awkward/_connect/pyarrow/conversions.py", line 370, in popbuffers
    paarray.field(field_name), a, b, buffers, generate_bitmasks
    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/array.pxi", line 3913, in pyarrow.lib.StructArray.field
KeyError: 'a'

This error occurred while calling

    ak.from_arrow(
        AwkwardArrowArray-instance
    )
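
The to_buffers problem can be spotted before unpickling by comparing the buffer keys the form implies with the keys actually present in the container. The following is a minimal sketch with the keys hard-coded from the to_buffers output shown earlier; `missing_buffers` is a hypothetical helper, and the `<form_key>-data` naming is taken from that output rather than derived from the form.

```python
def missing_buffers(expected_keys, container):
    """Return the expected buffer keys that are absent from the container."""
    return [key for key in expected_keys if key not in container]


# Values copied from the to_buffers output above; a plain list stands in
# for the NumPy array so the example has no third-party dependencies.
expected = ["node1-data", "node2-data"]
container = {"node1-data": [1, 2, 3]}

print(missing_buffers(expected, container))  # ['node2-data']
```

The missing key is the same one that the unpickling traceback reports as a KeyError.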

Probably one could simply disallow arrays with duplicated field names. I'm not sure there is any useful application for them. When I discovered this in my code, it was also not something I intended to do (I had just accidentally repeated a field name), so writing this minimal reproducer was already worth it :)
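
To illustrate what such a restriction might look like, here is a rough sketch of a check that rejects duplicated field names. `check_unique_fields` is a hypothetical function written for this issue; it is not how Awkward Array validates fields, just one possible shape for the idea.

```python
from collections import Counter


def check_unique_fields(fields):
    """Raise ValueError if any field name occurs more than once."""
    counts = Counter(fields)
    duplicates = [name for name, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError(f"duplicated field names: {duplicates}")
    return list(fields)


print(check_unique_fields(["a", "b"]))  # ['a', 'b']

try:
    check_unique_fields(["a", "a"])
except ValueError as err:
    print(err)  # duplicated field names: ['a']
```

A check like this could run wherever field names are supplied by the user, for example before a multi-field selection, so the mistake surfaces immediately instead of at serialization time.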

Labels

bug: The problem described is something that must be fixed