Skip to content

GH-39855: [Python] ListView support for pa.array()#40160

Merged
jorisvandenbossche merged 8 commits intoapache:mainfrom
danepitkin:danepitkin/python-lv-array
Mar 1, 2024
Merged

GH-39855: [Python] ListView support for pa.array()#40160
jorisvandenbossche merged 8 commits intoapache:mainfrom
danepitkin:danepitkin/python-lv-array

Conversation

@danepitkin
Copy link
Copy Markdown
Member

@danepitkin danepitkin commented Feb 20, 2024

Rationale for this change

Add pa.array() instantiation support for ListView and LargeListView formats.

What changes are included in this PR?

  • pa.array() supports creating ListView and LargeListView types
  • ListArray, LargeListArray now have their size initialized before adding elements during python-to-arrow conversion. This allows these types to be convertible to ListViewArray and LargeListViewArray types.

Are these changes tested?

Yes, unit tested.

Are there any user-facing changes?

Yes, new feature added.

@github-actions
Copy link
Copy Markdown

⚠️ GitHub issue #39855 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions bot added the awaiting review Awaiting review label Feb 20, 2024
Copy link
Copy Markdown
Member Author

@danepitkin danepitkin Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ListArray.Append() was being indirectly invoked with Append(true, 0), which is a simple way to dynamically build a ListArray with unknown size. However ListViewArray needs the size included in the Append() API. In order for ListArray to be convertible to ListViewArray, it also needs the size assigned in Append(). That change is included here for ListArray and ListViewArray (MapArray and FixedSizeListArray don't support a parameterized Append() method). See this comment for more details:

/// [Large]ListView arrays, the list slot length will be exactly list_length, but if

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Maybe add a brief comment about this? (why those AppendTo methods are defined here)

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Feb 20, 2024
@danepitkin
Copy link
Copy Markdown
Member Author

>>> import pyarrow as pa
>>> pa.array([[1, 2], [3, 4], None], type=pa.list_view(pa.int8()))
<pyarrow.lib.ListViewArray object at 0x13df6e740>
[
  [
    1,
    2
  ],
  [
    3,
    4
  ],
  null
]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose at the moment we get here, we know for sure that the object has a size? (eg a generator is iterable, but has no length. We do support generators as the main value passed to pa.array(..) (through ConvertToSequenceAndInferSize), but not nested in a list array, I think)

Copy link
Copy Markdown
Member Author

@danepitkin danepitkin Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, AppendIterable() is only called for Sets and Dictionary values views. I confirmed generators are routed through AppendSequence() instead. I did give it a try as well:

>>> def gen():
...     yield [0]
...     yield [1]
...     yield [2]
...
>>> pa.array(gen(), type=pa.list_view(pa.int32()))
<pyarrow.lib.ListViewArray object at 0x12c96e6e0>
[
  [
    0
  ],
  [
    1
  ],
  [
    2
  ]
]
>>> pa.array([gen()], type=pa.list_view(pa.int32()))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/array.pxi", line 343, in pyarrow.lib.array
    result = _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)
  File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
    chunked = GetResultValue(
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
    raise convert_status(status)
pyarrow.lib.ArrowTypeError: Could not convert <generator object gen at 0x12bed75e0> with type generator: was not a sequence or recognized null for conversion to list type

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Maybe add a brief comment about this? (why those AppendTo methods are defined here)

Comment on lines 630 to 632
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect that this test needs some slight changes to work for the list view types? Because those use 3 instead of 2 buffers for the main array (apart from the child values)?

@github-actions github-actions bot added awaiting changes Awaiting changes Component: Python awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Feb 21, 2024
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked through and didn't see anything out of place! I appreciate the clarity of the tests and commitment to similarity with surrounding code. I can't guarantee that there isn't an offset or value out of place somewhere; however, I did by best to check them!

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Feb 22, 2024
Copy link
Copy Markdown
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just two small comments on the tests

Comment on lines 3690 to 3692
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
arr = pa.array(arr, type=list_type_factory(pa.int8()))
reconstructed_arr = list_array_type.from_arrays(
arr.offsets, arr.sizes, arr.values, mask=arr.is_null())
result = pa.array(arr, type=list_type_factory(pa.int8()))
assert result.to_pylist() == arr

Maybe something like this instead? Because right now it doesn't really guarantee it was constructed correctly

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although this test might also not add really value compared to what is already tested in test_convert_builtin.py?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove it, since I also think it is a redundant test. I duplicated this test from list/largelist unit tests unnecessarily.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you keep the case where type is None (inferred)? Eg add None to the parametrzation, and then after this line set ty to the default inferred one if it was None

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Will update

@danepitkin danepitkin force-pushed the danepitkin/python-lv-array branch from c47ebce to a289fb4 Compare February 29, 2024 22:12
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 29, 2024
@jorisvandenbossche jorisvandenbossche merged commit 7c4f4c2 into apache:main Mar 1, 2024
@jorisvandenbossche jorisvandenbossche removed the awaiting change review Awaiting change review label Mar 1, 2024
@github-actions github-actions bot added the awaiting merge Awaiting merge label Mar 1, 2024
@conbench-apache-arrow
Copy link
Copy Markdown

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 7c4f4c2.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

pitrou pushed a commit that referenced this pull request Mar 7, 2024
#40381)

### Rationale for this change

Fixing the hypothesis tests:

- fixup untested changes to the strategies from #40160
- fix a bug in the `byte_width` attribute discovered by hypothesis (introduced by #39592)

* GitHub Issue: #40379

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Python] Add pa.array() support for ListView and LargeListView

3 participants