Skip to content

Conversation

@pitrou
Copy link
Member

@pitrou pitrou commented Aug 1, 2019

No description provided.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kszucs Does this look right?

@pitrou pitrou force-pushed the ARROW-6084-py-large-list branch from 8ccd61f to b5ae30c Compare August 1, 2019 12:31
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not possible to have this subclass ListValue to reduce code duplication? (getitem, iter, as_py look the same)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the C++ types are different (e.g. ListArray vs. LargeListArray), and those need to be compile-time constants for Cython.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code like this makes me wish for some kind of macro system in Cython.

@jorisvandenbossche
Copy link
Member

Again, while trying it out this branch, I noticed some oddities. But now they are certainly not related to the code in this PR, as they also hold for non-large ListType.

When creating an array from offsets and values in python, there is no validation of the offsets that it starts with 0 and ends with the length of the array (but is that required? the docs seem to indicate that: https://github.com/apache/arrow/blob/master/docs/source/format/Layout.rst#list-type ("The first value in the offsets array is 0, and the last element is the length of the values array.").
The array you get "seems" ok (the repr), but on conversion to python or flattened arrays, things go wrong:

In [61]: a = pa.ListArray.from_arrays([1,3,10], np.arange(5)) 

In [62]: a
Out[62]: 
<pyarrow.lib.ListArray object at 0x7fdd9c468678>
[
  [
    1,
    2
  ],
  [
    3,
    4
  ]
]

In [63]: a.flatten()
Out[63]: 
<pyarrow.lib.Int64Array object at 0x7fdd9cbfe9e8>
[
  0,   # <--- includes the 0
  1,
  2,
  3,
  4
]

In [64]: a.to_pylist()
Out[64]: [[1, 2], [3, 4, 1121, 1, 64, 93969433636432, 13]]  # <--includes more elements as garbage

I understand that in C++ the main constructors are not safe, and as the caller you need to ensure that the data is correct or call a safe (slower) constructor. But do we want to use the unsafe / fast constructors without validation in Python as default as well?

@jorisvandenbossche
Copy link
Member

Ah, I see that in Python a validation function is exposed, and this actually raises:

In [65]: a.validate()                                                                                                                                                                                              
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
<ipython-input-65-b377de7e0b22> in <module>
----> 1 a.validate()

~/scipy/repos/arrow/python/pyarrow/array.pxi in pyarrow.lib.Array.validate()

~/scipy/repos/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: Final offset invariant not equal to values length: 10!=5

Do we want to call that by default in the creation method?
A quick search seems to indicate that pa.Array.from_buffers does this, but other from_arrays method don't seem to explicitly do this. But eg in DictionaryArray.from_arrays, if you pass values that is too short for the max of the indices, you do get an error.

@pitrou
Copy link
Member Author

pitrou commented Aug 5, 2019

On the C++ side indeed, calling validation for each constructor would be expensive. We can adopt a different strategy for Python. Can you open a JIRA about that?

@jorisvandenbossche
Copy link
Member

@pitrou pitrou force-pushed the ARROW-6084-py-large-list branch 2 times, most recently from e87eef5 to 3c38574 Compare August 5, 2019 17:34
@pitrou
Copy link
Member Author

pitrou commented Aug 6, 2019

@wesm Do you want to take a look here or can I merge as-is? (assuming the R AppVeyor failure isn't related)

@pitrou pitrou force-pushed the ARROW-6084-py-large-list branch from 3c38574 to 4266ea2 Compare August 6, 2019 12:20
@pitrou
Copy link
Member Author

pitrou commented Aug 6, 2019

Rebased.

@codecov-io
Copy link

Codecov Report

Merging #4979 into master will increase coverage by 1.6%.
The diff coverage is 87.4%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4979      +/-   ##
==========================================
+ Coverage   87.57%   89.17%    +1.6%     
==========================================
  Files        1005      727     -278     
  Lines      143560   103008   -40552     
  Branches     1418        0    -1418     
==========================================
- Hits       125720    91862   -33858     
+ Misses      17478    11146    -6332     
+ Partials      362        0     -362
Impacted Files Coverage Δ
python/pyarrow/__init__.py 67.04% <ø> (ø) ⬆️
python/pyarrow/tests/test_compute.py 100% <ø> (ø) ⬆️
python/pyarrow/lib.pyx 100% <100%> (ø) ⬆️
cpp/src/arrow/python/python_to_arrow.cc 91.29% <100%> (+0.07%) ⬆️
python/pyarrow/tests/test_convert_builtin.py 97.24% <100%> (+0.03%) ⬆️
python/pyarrow/public-api.pxi 59.67% <100%> (+0.43%) ⬆️
python/pyarrow/types.py 97.84% <100%> (+0.04%) ⬆️
python/pyarrow/tests/test_scalars.py 100% <100%> (ø) ⬆️
python/pyarrow/tests/test_types.py 97.11% <100%> (+0.11%) ⬆️
python/pyarrow/tests/test_array.py 96.33% <100%> (+0.03%) ⬆️
... and 285 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 42f4f34...4266ea2. Read the comment docs.

@wesm
Copy link
Member

wesm commented Aug 6, 2019

Taking a look

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code like this makes me wish for some kind of macro system in Cython.

@wesm wesm closed this in 2774cfb Aug 6, 2019
@pitrou pitrou deleted the ARROW-6084-py-large-list branch August 6, 2019 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants