GH-17682: [C++][Python] Bool8 Extension Type Implementation#43488
felipecrv merged 21 commits into apache:main
Conversation
westonpace
left a comment
Just one question, but looks good otherwise.
python/pyarrow/array.pxi
Outdated
I'm a little confused by _pc().not_equal(self.storage, 0). Isn't this creating a copy? Wasn't the purpose of bool8 to allow zero-copy with numpy?
Hi @westonpace. Yes, the default path for the to_numpy() method enforces zero-copy behavior, which is achieved by the line return self.storage.to_numpy().view(np.bool_). The zero_copy_only kwarg can optionally be set to False, which relaxes this requirement.
The line you indicated does create a copy, but it is only reached if zero_copy_only is False AND the original attempt at a zero-copy view failed.
And in practice, this code path gets reached if there are missing values?
Yes, correct. The outcomes of taking the various paths are demonstrated in this test.
This also matches the existing semantics of converting a normal boolean array to numpy, which currently performs a copy to an array of dtype=np.object_ if there are any missing values.
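The two paths described in this thread can be sketched with NumPy alone. This is only an illustration under stated assumptions, not the pyarrow implementation: the helper name bool8_storage_to_numpy and its valid mask argument are hypothetical stand-ins for the int8 storage array and its validity bitmap.

```python
import numpy as np


def bool8_storage_to_numpy(storage, valid=None, zero_copy_only=True):
    """Hypothetical helper mimicking the behavior discussed above."""
    if valid is None or valid.all():
        # No missing values: reinterpret the int8 bytes as bool, no copy.
        return storage.view(np.bool_)
    if zero_copy_only:
        raise ValueError("zero-copy conversion is not possible with missing values")
    # Fallback: copy into an object array so that None can mark missing slots.
    out = np.empty(len(storage), dtype=object)
    out[valid] = (storage[valid] != 0).tolist()
    out[~valid] = None
    return out


storage = np.array([1, 0, 1], dtype=np.int8)
view = bool8_storage_to_numpy(storage)  # shares memory with storage
valid = np.array([True, False, True])
copied = bool8_storage_to_numpy(storage, valid, zero_copy_only=False)  # object dtype
```

The view path returns an array backed by the same buffer, while the fallback allocates a new object array, which is why it is only taken when zero_copy_only is False.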
Got it. Thanks for the explanation!
Thank you for this, this is such an excellent addition ❤️
jorisvandenbossche
left a comment
@joellubi added some quick comments, but generally looking good! Still need to check the tests
python/pyarrow/array.pxi
Outdated
This would lose track of the buffer owner (the numpy array object), so you would need to pass that to the foreign_buffer function as the base argument.
However, I think we could also simplify this by first creating a pyarrow storage array of int8, and then using self.from_storage() instead of from_buffers()?
I gave this a try and it works if the numpy array has dtype=np.int8:
np_arr = np.array([1, 0, 1], dtype=np.int8)
pa_storage_arr = pa.array(np_arr, type=pa.int8())
pa_bool8_arr = pa.ExtensionArray.from_storage(pa.bool8(), pa_storage_arr)
This does not produce any copies. The existing approach of using foreign_buffer also works with np_arr = np.array([True, False, True], dtype=np.bool_) without making a copy.
However, using the pa.array() constructor currently does make a copy when going bool -> int8. I think this would require a zero-copy casting kernel to be added to C++. That seems like it would be a better approach; I just have to wrap my head around that part of the code.
CC: @felipecrv does this sound right ^?
Actually, now that I think about it, I don't think a casting kernel is what's needed in this specific scenario, since casting goes between Arrow types and we're not trying to convert Arrow Boolean to Arrow Int8. I think what we need is to reinterpret the numpy bool array as a numpy int8 array, then continue the same way as above for the int8 Arrow array. I'll give that a try now.
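A minimal sketch of the reinterpretation step, using only NumPy (the variable names are illustrative). It assumes a contiguous np.bool_ array, where each element occupies one byte, so view(np.int8) reuses the same buffer. The resulting int8 array could then be handed to pa.array(..., type=pa.int8()) and from_storage() as in the earlier snippet.

```python
import numpy as np

# A boolean NumPy array stores one byte per element (0 or 1).
np_bool = np.array([True, False, True], dtype=np.bool_)

# view() reinterprets the same buffer as int8 without copying any data.
np_int8 = np_bool.view(np.int8)

# Both arrays share memory, so a write through one is visible in the other.
shares = np.shares_memory(np_bool, np_int8)
np_bool[1] = True
```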
Ok I pushed up the change, let me know what you think.
Yes, that looks good!
@pitrou I'll update that table in a follow-up PR. I made edits to it in #43679, so the addition will be easier once that PR has merged.
@pitrou @jorisvandenbossche Any more comments on the C++ or Python sides respectively, or does this look ok to merge?
  return ss.str();
}

std::string Bool8Type::Serialize() const { return ""; }
This is what's specified in "description of the serialization" for Bool8.
This method is generally used to encode type parameters, but for bool8 there are no parameters. The type is fully defined by its name and storage type.
jorisvandenbossche
left a comment
Looks good!
I added a bunch more comments, but they are all just minor formatting / testing nits
python/pyarrow/types.pxi
Outdated
unknown_col: [[True, False, True, True, null]]
unknown_col: [[-1,0,1,2,null]]
Side note: this is a good illustration of why we should ideally have a way to let the extension type control this string representation.
This is a great point and certainly something I would have liked to have when going through this implementation. I'll open an issue for it.
def test_bool8_scalar():
    assert pa.ExtensionScalar.from_storage(pa.bool8(), -1).as_py()
Something I didn't think about in the previous round, but it might be better to test the value explicitly in this case, instead of relying on python's general truthiness:
- assert pa.ExtensionScalar.from_storage(pa.bool8(), -1).as_py()
+ assert pa.ExtensionScalar.from_storage(pa.bool8(), -1).as_py() is True
Otherwise this test doesn't actually ensure that the result is True or False. If we were still returning the underlying storage values (0, 1, 2, etc.), these tests would also pass in their current form.
(same for the lines below)
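The difference between the two assertion styles can be shown in plain Python. This is a sketch only: as_py_storage and as_py_bool are hypothetical stand-ins for a conversion that leaks the raw storage value and one that returns a real bool.

```python
def as_py_storage(value):
    # Hypothetical buggy conversion: returns the raw int8 storage value.
    return value


def as_py_bool(value):
    # Hypothetical correct conversion: any non-zero storage value is True.
    return value != 0


# A truthiness check passes for both conversions, so it cannot tell them apart.
assert as_py_storage(-1)
assert as_py_bool(-1)

# An identity check against True only passes for the real bool.
buggy_is_true = as_py_storage(-1) is True
correct_is_true = as_py_bool(-1) is True
```

Here buggy_is_true is False because -1 is an int, so a test written with is True would catch the regression that the truthiness check misses.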
Good idea, it reads a lot clearer now too.
Thanks for adding that support!
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 5258819. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 26 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change
C++ and Python implementations of #43234
What changes are included in this PR?
Bool8Type, Bool8Array, Bool8Scalar, and tests
Are these changes tested?
Yes
Are there any user-facing changes?
Bool8 extension type will be available in C++ and Python libraries