ARROW-1673: [Python] Add support for numpy 'bool' type#1199
ARROW-1673: [Python] Add support for numpy 'bool' type#1199pcmoritz wants to merge 4 commits intoapache:masterfrom
Conversation
cpp/src/arrow/type.h
Outdated
There was a problem hiding this comment.
I'm not at all sure if this is the right fix; maybe we need a separate field for the width if the type is contained in a tensor? Standardizing around numpy for tensors seems the way to go.
There was a problem hiding this comment.
Boolean in Arrow is 1 bit, so don’t make this change. We may need to get creative about dealing with NumPy’s metadata
There was a problem hiding this comment.
Fair enough, for our Python serialization we could hack around it by defining a custom serializer. However, for Tensor.from_numpy() we can't do that because the type needs to be fully encoded in the Tensor type. Would it be acceptable to introduce a new type for this? Let me know which solution you prefer.
There was a problem hiding this comment.
In particular I'm thinking of introducing a "bool8" type, which is a bool that is encoded as a single byte.
There was a problem hiding this comment.
I created https://issues.apache.org/jira/browse/ARROW-1674. This is probably the right way to handle this at the format level. We can separately add a data type in C++. It will be useful to be able to receive numpy.bool_ data with zero copy in Arrow
097a78b to
1de11a4
Compare
1de11a4 to
14943a0
Compare
|
This just passes bool arrays to the custom serializer, right? Does it make sense to register the custom serializer in the default serialization context or no? |
|
Magically, this is already taken care of. The custom serializer we already have is generic, it will convert the array to nested lists and the custom deserializer will make a numpy array out of it. Not the most efficient solution but it fixes the problem until we have the proper solution :) |
|
Oh, I see. Could be made efficient by having the custom serializer special case bool arrays. But if this is just temporary then no need to. |
|
+1 this is ready to merge as a workaround for ray-project/ray#1121 |
|
I haven't looked too deeply, but could you explain how this fix works? |
|
Yeah, the switch case I removed makes it fall back to the default, which uses the custom serializer. This will fall back to the function arrow/python/pyarrow/serialization.py Line 81 in a043018 which converts the array to a nested list upon serialization and back upon deserialization. |
|
Got it, so the workaround is slower / not zero copy. No big deal. I will work to get this fixed more properly + zero copy reads in time for 0.8.0 |
This is currently a workaround until the Arrow tensor supports zero copy of byte-length booleans.