Conversation

@romainfrancois (Contributor)

Going with a list of raw vectors, because R strings can't hold embedded nul bytes (a short illustration of that limitation follows the reprex):

library(arrow, warn.conflicts = FALSE)

raws <- vctrs::list_of(as.raw(0:2), as.raw(0:255), .ptype = raw())
a <- Array$create(raws)
a
#> Array
#> <binary>
#> [
#>   000102,
#>   000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F404142434445464748494A4B4C4D4E4F505152535455565758595A5B5C5D5E5F606162636465666768696A6B6C6D6E6F707172737475767778797A7B7C7D7E7F808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9FA0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBFC0C1C2C3C4C5C6C7C8C9CACBCCCDCECFD0D1D2D3D4D5D6D7D8D9DADBDCDDDEDFE0E1E2E3E4E5E6E7E8E9EAEBECEDEEEFF0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF
#> ]
a$as_vector()
#> <list_of<raw>[2]>
#> [[1]]
#> [1] 00 01 02
#> 
#> [[2]]
#>   [1] 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18
#>  [26] 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31
#>  [51] 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f 40 41 42 43 44 45 46 47 48 49 4a
#>  [76] 4b 4c 4d 4e 4f 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f 60 61 62 63
#> [101] 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72 73 74 75 76 77 78 79 7a 7b 7c
#> [126] 7d 7e 7f 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95
#> [151] 96 97 98 99 9a 9b 9c 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae
#> [176] af b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf c0 c1 c2 c3 c4 c5 c6 c7
#> [201] c8 c9 ca cb cc cd ce cf d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df e0
#> [226] e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef f0 f1 f2 f3 f4 f5 f6 f7 f8 f9
#> [251] fa fb fc fd fe ff

# does not have to be a list_of, it just needs a ptype attribute
raws <- structure(list(as.raw(0:2), as.raw(0:255)), ptype = raw())
a <- Array$create(raws)
a
#> Array
#> <binary>
#> [
#>   000102,
#>   000102030405060708090A0B0C0D0E0F101112131415161718191A1B1C1D1E1F202122232425262728292A2B2C2D2E2F303132333435363738393A3B3C3D3E3F404142434445464748494A4B4C4D4E4F505152535455565758595A5B5C5D5E5F606162636465666768696A6B6C6D6E6F707172737475767778797A7B7C7D7E7F808182838485868788898A8B8C8D8E8F909192939495969798999A9B9C9D9E9FA0A1A2A3A4A5A6A7A8A9AAABACADAEAFB0B1B2B3B4B5B6B7B8B9BABBBCBDBEBFC0C1C2C3C4C5C6C7C8C9CACBCCCDCECFD0D1D2D3D4D5D6D7D8D9DADBDCDDDEDFE0E1E2E3E4E5E6E7E8E9EAEBECEDEEEFF0F1F2F3F4F5F6F7F8F9FAFBFCFDFEFF
#> ]

# but it makes one on the way out
a$as_vector()
#> <list_of<raw>[2]>
#> [[1]]
#> [1] 00 01 02
#> 
#> [[2]]
#>   [1] 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18
#>  [26] 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31
#>  [51] 32 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f 40 41 42 43 44 45 46 47 48 49 4a
#>  [76] 4b 4c 4d 4e 4f 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f 60 61 62 63
#> [101] 64 65 66 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72 73 74 75 76 77 78 79 7a 7b 7c
#> [126] 7d 7e 7f 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d 8e 8f 90 91 92 93 94 95
#> [151] 96 97 98 99 9a 9b 9c 9d 9e 9f a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aa ab ac ad ae
#> [176] af b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf c0 c1 c2 c3 c4 c5 c6 c7
#> [201] c8 c9 ca cb cc cd ce cf d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 da db dc dd de df e0
#> [226] e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed ee ef f0 f1 f2 f3 f4 f5 f6 f7 f8 f9
#> [251] fa fb fc fd fe ff

Created on 2020-06-22 by the reprex package (v0.3.0.9001)
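
Aside (not part of the reprex above): a minimal sketch of the "strings can't hold nulls" point. R character strings cannot contain an embedded nul byte, so binary values with a 0x00 byte are unrepresentable as strings; the error message is paraphrased.

# a byte sequence containing 0x00 cannot become an R string
rawToChar(as.raw(c(0x61, 0x00, 0x62)))
# errors with "embedded nul in string"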

@fsaintjacques changed the title from "ARROW-6235 : [R] Conversion from arrow::BinaryArray to R character vector not implemented" to "ARROW-6235 : [R] Implement conversion from arrow::BinaryArray to R character vector" on Jun 22, 2020
@fsaintjacques (Contributor) left a comment

Looks fine to me, but I think it would be good for 1.0 to also support the other binary types, at least in the arrow -> R direction. I'm not marking this as "request changes", but it's a strong suggestion.

Contributor

Since this is in your mental cache: we can accept FIXED_BINARY and LARGE_BINARY with minimal effort by using GetView instead (plus some template dispatching, because they don't share the same base class). LARGE_BINARY will require bounds-checking each value against INT32_MAX.

@fsaintjacques changed the title from "ARROW-6235 : [R] Implement conversion from arrow::BinaryArray to R character vector" to "ARROW-6235: [R] Implement conversion from arrow::BinaryArray to R character vector" on Jun 22, 2020
@nealrichardson (Member) left a comment

Thanks. Could you please also update the Arrow <-> R conversion tables in vignettes/arrow.Rmd to indicate this support?

Member

If I read this correctly, you're introducing general support for the vctrs::list_of ptype here? If so, it would make sense for the ListArray-to-R-vector conversion to set that class and attribute too, right? (Feel free to add that here or defer to a followup.)
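
For illustration, roughly the behaviour being suggested (a hypothetical sketch, not output from this branch):

library(arrow, warn.conflicts = FALSE)

a <- Array$create(list(1:2, integer()))
# with the suggested change, converting back to R would yield a
# vctrs list_of<integer> (class plus ptype attribute) rather than a bare list
a$as_vector()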

Contributor Author

Oh yeah, that makes sense. This only looks at the ptype attribute, not specifically whether the input is a vctrs_list_of; but on the arrow -> R conversion, I think it doesn't hurt to make the result a vctrs_list_of.

Contributor Author

Done. I had to modify the roundtrip checks to use expect_equivalent() because a roundtrip might add information:

list() -> ListArray -> list_of(ptype = <from list array value type>)
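
Roughly, for illustration (a hypothetical sketch, not the actual test code):

library(arrow, warn.conflicts = FALSE)
library(testthat)

x <- list(1:3, integer())
roundtrip <- Array$create(x)$as_vector()

# same data, but `roundtrip` now carries the list_of class and ptype
# attribute, so expect_identical() would fail on the extra attributes
expect_equivalent(roundtrip, x)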

@wesm (Member) left a comment

Seems like this file is going to need to get the BitBlockCounter treatment soon

@romainfrancois (Contributor Author)

@wesm I don't know what you mean by the BitBlockCounter treatment.

@romainfrancois (Contributor Author)

Also deals with LargeBinary now; see https://issues.apache.org/jira/browse/ARROW-6543:

library(arrow, warn.conflicts = FALSE)

a <- Array$create(list(raw()), type = large_binary())
a
#> Array
#> <large_binary>
#> [
#>   
#> ]
a$as_vector()
#> <list_of<raw>[1]>
#> [[1]]
#> raw(0)

Created on 2020-06-26 by the reprex package (v0.3.0.9001)

@wesm (Member) commented on Jun 26, 2020

> @wesm I don't know what you mean by the BitBlockCounter treatment.

Ah sorry, welcome back =) We created some facilities to improve the performance of processing validity bitmaps:

https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/bit_block_counter.h

The idea is that you can scan forward 64 bits at a time and determine efficiently (using hardware popcount) whether a "block" of 64 values has any nulls or not (or whether the block is all null). For those 64 values you can then skip null checking altogether and use the fast path. This speeds up processing significantly for data that is mostly non-null or mostly null.

@romainfrancois (Contributor Author)

Added support for LargeList:

library(arrow, warn.conflicts = FALSE)

a <- Array$create(list(integer()), type = large_list_of(int32()))
a
#> LargeListArray
#> <large_list<item: int32>>
#> [
#>   []
#> ]
a$as_vector()
#> <list_of<integer>[1]>
#> [[1]]
#> integer(0)

Created on 2020-06-26 by the reprex package (v0.3.0.9001)

@nealrichardson (Member) left a comment

Looks pretty good, just some suggestions on the tests. Thanks for adding the Large* types. There are still FixedSizeBinary and FixedSizeList to do, but we can defer those to a followup.

Please do also fill in the vignette table now that these types are supported.

Member

This new builder is breaking the (new) UTF-8 tests. The previous converter code is https://github.com/apache/arrow/blob/master/r/src/array_from_vector.cpp#L147-L171 and it is apparently no longer being called.

I wonder if this whole code block is even workable right now as is. The "Reserve enough space before appending" block would also need to convert to UTF-8 in order to get the size right, and I wonder whether converting/asserting everything to UTF-8 twice would outweigh the benefit of reserving space. Or maybe we can take the size as is and overcommit bytes?

Tangentially related, would a "reserve" check like this be the way to solve https://issues.apache.org/jira/browse/ARROW-3308, where we need to switch to Large types if there's more than 2GB?

Contributor Author

I'll have a look once I'm back on this. Overall there seem to be two concurrent systems for no real reason, and I believe we should keep only the one powered by VectorToArrayConverter.

I'm on my dplyr week now, but I'll still try to make some space for this.

Member

I fixed this; will deal with efficiency/redundancy in a followup.

Member

Added a note to https://issues.apache.org/jira/browse/ARROW-7798 about the redundancy. I checked code coverage and it appears that both paths are still being called, so that should be addressed (my guess is that factor levels are still going through the old path).

@fsaintjacques self-assigned this on Jun 29, 2020