GH-39855: [Python] ListView support for pa.array() by danepitkin · Pull Request #40160 · apache/arrow

danepitkin · 2024-02-20T17:01:25Z

Rationale for this change

Add pa.array() instantiation support for ListView and LargeListView formats.

What changes are included in this PR?

pa.array() supports creating ListView and LargeListView types
ListArray, LargeListArray now have their size initialized before adding elements during python-to-arrow conversion. This allows these types to be convertible to ListViewArray and LargeListViewArray types.

Are these changes tested?

Yes, unit tested.

Are there any user-facing changes?

Yes, new feature added.

github-actions · 2024-02-20T17:02:06Z

⚠️ GitHub issue #39855 has been automatically assigned in GitHub to PR creator.

danepitkin · 2024-02-20T17:14:52Z

python/pyarrow/src/arrow/python/python_to_arrow.cc

ListArray.Append() was being indirectly invoked with Append(true, 0), which is a simple way to dynamically build a ListArray with unknown size. However ListViewArray needs the size included in the Append() API. In order for ListArray to be convertible to ListViewArray, it also needs the size assigned in Append(). That change is included here for ListArray and ListViewArray (MapArray and FixedSizeListArray don't support a parameterized Append() method). See this comment for more details:

arrow/cpp/src/arrow/array/builder_nested.h

Line 105 in 47f15b0

/// [Large]ListView arrays, the list slot length will be exactly list_length, but if

Sounds good. Maybe add a brief comment about this? (why those AppendTo methods are defined here)

danepitkin · 2024-02-20T20:00:02Z

>>> import pyarrow as pa
>>> pa.array([[1, 2], [3, 4], None], type=pa.list_view(pa.int8()))
<pyarrow.lib.ListViewArray object at 0x13df6e740>
[
  [
    1,
    2
  ],
  [
    3,
    4
  ],
  null
]

jorisvandenbossche · 2024-02-21T09:25:01Z

python/pyarrow/src/arrow/python/python_to_arrow.cc

I suppose at the moment we get here, we know for sure that the object has a size? (eg a generator is iterable, but has no length. We do support generators as the main value passed to pa.array(..) (through ConvertToSequenceAndInferSize), but not nested in a list array, I think)

At the moment, AppendIterable() is only called for Sets and Dictionary values views. I confirmed generators are routed through AppendSequence() instead. I did give it a try as well:

>>> def gen(): ... yield [0] ... yield [1] ... yield [2] ... >>> pa.array(gen(), type=pa.list_view(pa.int32())) <pyarrow.lib.ListViewArray object at 0x12c96e6e0> [ [ 0 ], [ 1 ], [ 2 ] ] >>> pa.array([gen()], type=pa.list_view(pa.int32())) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "pyarrow/array.pxi", line 343, in pyarrow.lib.array result = _sequence_to_array(obj, mask, size, type, pool, c_from_pandas) File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array chunked = GetResultValue( File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status return check_status(status) File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status raise convert_status(status) pyarrow.lib.ArrowTypeError: Could not convert <generator object gen at 0x12bed75e0> with type generator: was not a sequence or recognized null for conversion to list type

jorisvandenbossche · 2024-02-21T09:27:50Z

python/pyarrow/src/arrow/python/python_to_arrow.cc

Sounds good. Maybe add a brief comment about this? (why those AppendTo methods are defined here)

jorisvandenbossche · 2024-02-21T09:31:32Z

python/pyarrow/tests/test_array.py

I would expect that this test needs some slight changes to work for the list view types? Because those use 3 instead of 2 buffers for the main array (apart from the child values)?

paleolimbot

I looked through and didn't see anything out of place! I appreciate the clarity of the tests and commitment to similarity with surrounding code. I can't guarantee that there isn't an offset or value out of place somewhere; however, I did by best to check them!

jorisvandenbossche

Looks good! Just two small comments on the tests

jorisvandenbossche · 2024-02-29T15:12:21Z

python/pyarrow/tests/test_array.py

Suggested change

arr = pa.array(arr, type=list_type_factory(pa.int8()))

reconstructed_arr = list_array_type.from_arrays(

arr.offsets, arr.sizes, arr.values, mask=arr.is_null())

result = pa.array(arr, type=list_type_factory(pa.int8()))

assert result.to_pylist() == arr

Maybe something like this instead? Because right now it doesn't really guarantee it was constructed correctly

Although this test might also not add really value compared to what is already tested in test_convert_builtin.py?

I will remove it, since I also think it is a redundant test. I duplicated this test from list/largelist unit tests unnecessarily.

jorisvandenbossche · 2024-02-29T15:15:46Z

python/pyarrow/tests/test_scalars.py

Can you keep the case where type is None (inferred)? Eg add None to the parametrzation, and then after this line set ty to the default inferred one if it was None

Good catch! Will update

…y() input

conbench-apache-arrow · 2024-03-01T12:05:49Z

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 7c4f4c2.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

#40381) ### Rationale for this change Fixing the hypothesis tests: - fixup untested changes to the strategies from #40160 - fix a bug in the `byte_width` attribute discovered by hypothesis (introduced by #39592) * GitHub Issue: #40379 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>

github-actions bot added the awaiting review Awaiting review label Feb 20, 2024

danepitkin commented Feb 20, 2024

View reviewed changes

github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Feb 20, 2024

jorisvandenbossche reviewed Feb 21, 2024

View reviewed changes

github-actions bot added awaiting changes Awaiting changes Component: Python awaiting change review Awaiting change review and removed awaiting committer review Awaiting committer review awaiting changes Awaiting changes labels Feb 21, 2024

paleolimbot reviewed Feb 22, 2024

View reviewed changes

github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Feb 22, 2024

danepitkin requested a review from jorisvandenbossche February 27, 2024 17:46

jorisvandenbossche reviewed Feb 29, 2024

View reviewed changes

Dane Pitkin added 8 commits February 29, 2024 11:13

ListView support for pa.array()

b0cbc5a

Fix tests

25418a0

Add comments

2090474

bugfix

5bab67a

Fix linter

124af4f

Fix convert_builtin tests

8310879

Fix linter again

3110b2c

Remove test_list_view_from_factory(), add None to test_list_from_nump…

a289fb4

…y() input

danepitkin force-pushed the danepitkin/python-lv-array branch from c47ebce to a289fb4 Compare February 29, 2024 22:12

github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Feb 29, 2024

jorisvandenbossche approved these changes Mar 1, 2024

View reviewed changes

jorisvandenbossche merged commit 7c4f4c2 into apache:main Mar 1, 2024

jorisvandenbossche removed the awaiting change review Awaiting change review label Mar 1, 2024

github-actions bot added the awaiting merge Awaiting merge label Mar 1, 2024

This was referenced Mar 6, 2024

[Python][CI] Hypothesis tests are failing (test-conda-python-3.11-hypothesis crossbow build) #40379

Closed

GH-40379: [Python] Fix byte_width for binary(0) + fix hypothesis tests #40381

Merged

-    arr = pa.array(arr, type=list_type_factory(pa.int8()))
-    reconstructed_arr = list_array_type.from_arrays(
-        arr.offsets, arr.sizes, arr.values, mask=arr.is_null())
+    result = pa.array(arr, type=list_type_factory(pa.int8()))
+    assert result.to_pylist() == arr

Conversation

danepitkin commented Feb 20, 2024 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

github-actions bot commented Feb 20, 2024

Uh oh!

danepitkin Feb 20, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danepitkin commented Feb 20, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

danepitkin Feb 21, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

paleolimbot left a comment

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

conbench-apache-arrow bot commented Mar 1, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

danepitkin commented Feb 20, 2024 •

edited by github-actions bot

Loading

danepitkin Feb 20, 2024 •

edited

Loading

danepitkin Feb 21, 2024 •

edited

Loading