GH-39855: [Python] ListView support for pa.array()#40160
GH-39855: [Python] ListView support for pa.array()#40160jorisvandenbossche merged 8 commits intoapache:mainfrom
Conversation
|
|
There was a problem hiding this comment.
ListArray.Append() was being indirectly invoked with Append(true, 0), which is a simple way to dynamically build a ListArray with unknown size. However ListViewArray needs the size included in the Append() API. In order for ListArray to be convertible to ListViewArray, it also needs the size assigned in Append(). That change is included here for ListArray and ListViewArray (MapArray and FixedSizeListArray don't support a parameterized Append() method). See this comment for more details:
arrow/cpp/src/arrow/array/builder_nested.h
Line 105 in 47f15b0
There was a problem hiding this comment.
Sounds good. Maybe add a brief comment about this? (why those AppendTo methods are defined here)
|
There was a problem hiding this comment.
I suppose at the moment we get here, we know for sure that the object has a size? (eg a generator is iterable, but has no length. We do support generators as the main value passed to pa.array(..) (through ConvertToSequenceAndInferSize), but not nested in a list array, I think)
There was a problem hiding this comment.
At the moment, AppendIterable() is only called for Sets and Dictionary values views. I confirmed generators are routed through AppendSequence() instead. I did give it a try as well:
>>> def gen():
... yield [0]
... yield [1]
... yield [2]
...
>>> pa.array(gen(), type=pa.list_view(pa.int32()))
<pyarrow.lib.ListViewArray object at 0x12c96e6e0>
[
[
0
],
[
1
],
[
2
]
]
>>> pa.array([gen()], type=pa.list_view(pa.int32()))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/array.pxi", line 343, in pyarrow.lib.array
result = _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)
File "pyarrow/array.pxi", line 42, in pyarrow.lib._sequence_to_array
chunked = GetResultValue(
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
return check_status(status)
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
raise convert_status(status)
pyarrow.lib.ArrowTypeError: Could not convert <generator object gen at 0x12bed75e0> with type generator: was not a sequence or recognized null for conversion to list type
There was a problem hiding this comment.
Sounds good. Maybe add a brief comment about this? (why those AppendTo methods are defined here)
python/pyarrow/tests/test_array.py
Outdated
There was a problem hiding this comment.
I would expect that this test needs some slight changes to work for the list view types? Because those use 3 instead of 2 buffers for the main array (apart from the child values)?
paleolimbot
left a comment
There was a problem hiding this comment.
I looked through and didn't see anything out of place! I appreciate the clarity of the tests and commitment to similarity with surrounding code. I can't guarantee that there isn't an offset or value out of place somewhere; however, I did by best to check them!
jorisvandenbossche
left a comment
There was a problem hiding this comment.
Looks good! Just two small comments on the tests
python/pyarrow/tests/test_array.py
Outdated
There was a problem hiding this comment.
| arr = pa.array(arr, type=list_type_factory(pa.int8())) | |
| reconstructed_arr = list_array_type.from_arrays( | |
| arr.offsets, arr.sizes, arr.values, mask=arr.is_null()) | |
| result = pa.array(arr, type=list_type_factory(pa.int8())) | |
| assert result.to_pylist() == arr |
Maybe something like this instead? Because right now it doesn't really guarantee it was constructed correctly
There was a problem hiding this comment.
Although this test might also not add really value compared to what is already tested in test_convert_builtin.py?
There was a problem hiding this comment.
I will remove it, since I also think it is a redundant test. I duplicated this test from list/largelist unit tests unnecessarily.
python/pyarrow/tests/test_scalars.py
Outdated
There was a problem hiding this comment.
Can you keep the case where type is None (inferred)? Eg add None to the parametrzation, and then after this line set ty to the default inferred one if it was None
There was a problem hiding this comment.
Good catch! Will update
c47ebce to
a289fb4
Compare
|
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 7c4f4c2. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them. |
#40381) ### Rationale for this change Fixing the hypothesis tests: - fixup untested changes to the strategies from #40160 - fix a bug in the `byte_width` attribute discovered by hypothesis (introduced by #39592) * GitHub Issue: #40379 Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>
Rationale for this change
Add pa.array() instantiation support for ListView and LargeListView formats.
What changes are included in this PR?
Are these changes tested?
Yes, unit tested.
Are there any user-facing changes?
Yes, new feature added.