ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector #7514
Conversation
fsaintjacques
left a comment
Looks fine to me, but I think it would be good for 1.0 to also support the other binary types, at least from Arrow -> R. I'm not making this a "request changes", just a strong suggestion.
r/src/array_to_vector.cpp
Outdated
Since this is in your mental cache, we can accept FIXED_BINARY and LARGE_BINARY with minimal effort by using GetView instead (plus some template dispatching, since they don't share the same base class). LARGE_BINARY will require bound-checking each value against INT32_MAX.
Fixed types: https://issues.apache.org/jira/browse/ARROW-9291
nealrichardson
left a comment
Thanks. Could you please also update the Arrow <-> R conversion tables in vignettes/arrow.Rmd to indicate this support?
r/src/array_from_vector.cpp
Outdated
If I read this correctly, you're introducing general support for the vctrs::list_of ptype here? IIUC, it would then make sense for the ListArray-to-R-vector conversion to set that class and attribute too, right? (If so, feel free to add that here or defer to a followup.)
Oh yeah, that makes sense. This only looks at the attribute, not specifically whether it is a vctrs_list_of, but on the arrow -> R conversion I think it does not hurt to make it a vctrs_list_of.
Done. I had to modify the roundtrip checks to use expect_equivalent() because a roundtrip might add information:
list() -> ListArray -> list_of(ptype = <from list array value type>)
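For illustration, here is a minimal sketch of that roundtrip picking up a ptype (the exact class and attribute layout are whatever vctrs uses; the printed output below is assumed rather than copied from a run):

library(arrow, warn.conflicts = FALSE)
x <- list(1:3, integer())   # plain list, no ptype
a <- Array$create(x)        # inferred as list<item: int32>
y <- a$as_vector()
class(y)
#> [1] "vctrs_list_of" "vctrs_vctr"    "list"
attr(y, "ptype")
#> integer(0)
# Same contents as x but with extra class/attribute information, so
# expect_equivalent() passes where expect_identical() would not.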
r/src/array_to_vector.cpp
Outdated
wesm
left a comment
Seems like this file is going to need to get the BitBlockCounter treatment soon
@wesm I don't know what you mean by the BitBlockCounter treatment
Also deals with LargeBinary now, e.g. https://issues.apache.org/jira/browse/ARROW-6543

library(arrow, warn.conflicts = FALSE)
a <- Array$create(list(raw()), type = large_binary())
a
#> Array
#> <large_binary>
#> [
#>
#> ]
a$as_vector()
#> <list_of<raw>[1]>
#> [[1]]
#> raw(0)

Created on 2020-06-26 by the reprex package (v0.3.0.9001)
Ah sorry, welcome back =) We created some facilities to improve the performance of processing validity bitmaps: https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bit_block_counter.h
The idea is that you can scan forward 64 bits at a time and determine efficiently (using hardware popcount) whether a "block" of 64 values has any nulls or not (or whether the block is all null, too). For those 64 values you can skip null checking altogether and use the fast path. This speeds up processing significantly for data that is mostly non-null or mostly null.
Added support for LargeList:

library(arrow, warn.conflicts = FALSE)
a <- Array$create(list(integer()), type = large_list_of(int32()))
a
#> LargeListArray
#> <large_list<item: int32>>
#> [
#> []
#> ]
a$as_vector()
#> <list_of<integer>[1]>
#> [[1]]
#> integer(0)

Created on 2020-06-26 by the reprex package (v0.3.0.9001)
nealrichardson
left a comment
Looks pretty good, just some suggestions on the tests. Thanks for adding the Large* types. There's still FixedSizeBinary and FixedSizeList to do, but we can defer them to a followup.
Please do also fill in the vignette table now that these types are supported.
Force-pushed from 299a801 to cb0721b
r/src/array_from_vector.cpp
Outdated
This new builder is breaking the (new) UTF-8 tests. The previous converter code is https://github.com/apache/arrow/blob/master/r/src/array_from_vector.cpp#L147-L171 and it is apparently no longer being called.
I wonder if this whole code block isn't possible right now as is. The "Reserve enough space before appending" block would also need to convert to UTF-8 in order to get the size right, and I wonder if converting/asserting everything to UTF-8 twice would outweigh the benefits of Reserving space. Or maybe we can take the size as is and overcommit bytes?
Tangentially related, would a "reserve" check like this be the way to solve https://issues.apache.org/jira/browse/ARROW-3308, where we need to switch to Large types if there's more than 2GB?
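For reference, a minimal sketch of the behaviour those UTF-8 tests presumably exercise (the exact expectations and output here are assumed, not copied from the test file): non-UTF-8 input has to be translated when building the array, which is why a size-reserving pass would also have to convert first.

library(arrow, warn.conflicts = FALSE)
x <- iconv("fran\u00e7ais", to = "latin1")
Encoding(x)
#> [1] "latin1"
a <- Array$create(x)        # Arrow string data is always UTF-8
Encoding(a$as_vector())
#> [1] "UTF-8"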
I'll have a look once I'm back on this. Overall there seem to be two concurrent systems for no real reason, and I believe we should only keep the one powered by VectorToArrayConverter.
I'm on my dplyr week now, but I'll still try to make some space for this.
I fixed this; will deal with efficiency/redundancy in a followup.
Added a note to https://issues.apache.org/jira/browse/ARROW-7798 about the redundancy. I checked code coverage and it appears that both paths are being called, so that should be addressed (my guess is that factor levels are still going through the old path).
…he value_type() of the list.
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Force-pushed from cb0721b to ff10327
Going with list of raw vectors because strings can't hold nulls:
Created on 2020-06-22 by the reprex package (v0.3.0.9001)
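For context, a small sketch (not the original reprex) of why character vectors can't carry arbitrary binary data: R strings cannot contain embedded nul bytes, whereas raw vectors can hold any bytes.

try(rawToChar(as.raw(c(0x61, 0x00, 0x62))))
#> Error in rawToChar(as.raw(c(0x61, 0x00, 0x62))) : embedded nul in string: 'a\0b'
as.raw(c(0x61, 0x00, 0x62))   # raw vectors have no such restriction
#> [1] 61 00 62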