-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-13164: [R] altrep vectors from Array with nulls #10730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-13164: [R] altrep vectors from Array with nulls #10730
Conversation
|
follow up after #10593 |
0c68a38 to
498614d
Compare
This comment has been minimized.
This comment has been minimized.
b961c89 to
54b743c
Compare
|
Thanks for the reviews. I'll apply the relevant changes. We now have |
|
I finally settled for a single altrep (template) class for both cases (the array has no nulls or the array has nulls). The altrep object uses:
Materialization is needed when:
Get_region() does not need to materialize the full vector. |
pitrou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to say that the AltrepArrayPrimitive seems very difficult to read and understand to me.
I would recommend:
- write extensive comments and explanations (including hyperlinks to R-related docs where necessary)
- choose naming wisely
- make sure no unused functionality or code paths are left
- make internal helper methods private, to better stress what is the "public" API
- also, make sure we're not doing any logic errors (if the user requests a writable array, do we really want to return a pointer to the immutable Arrow data?)
|
Changed |
|
🤔 I think now that means that if the data has been materialized, a |
We can write tests for this behavior, yeah? |
422c2fb to
ef20698
Compare
|
The However, if we make this an error, this fails in many many places, including |
…table. Sliding in the R sentinel.
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
2145dc4 to
83dcad9
Compare
|
@github-actions crossbow submit -g r |
|
Revision: 763781c Submitted crossbow builds: ursacomputing/crossbow @ actions-806 |
|
I think that's all clean. @jonkeane do you want to do any benchmarking before we merge this? |
|
@ursabot please benchmark lang=R |
|
Benchmark runs are scheduled for baseline = f60bc12 and contender = 763781c. Results will be available as each benchmark for each run completes. |
|
Yeah, we should run these manually. I'll try to get to that tomorrow. None of the array conversion benchmarks have been wired up in the benchmarks package, so we don't have them available on conbench right now. |
|
OK, I just ran Array to R benchmarks locally, and the results are very impressive. I've added in other types (even though they are not (yet) backed by altrep). I've also tested both chunked and non-chunked arrays, and with nulls and without. TL;DR: For floats + ints we see huge speed ups (~10x faster) in both arrays with and without nulls as well as in arrays and chunked arrays (at least in the case where there is a single chunk*). I'm still investigating what's going on with the two fannie parquet file reads that show up as regressions. If anything, this should have sped them up.
library(arrow)
nyctaxi <- read_csv_arrow("~/repos/ab_store/data/nyctaxi_2010-01.csv.gz", as_data_frame = FALSE)
chunked_array <- as.vector(nyctaxi[[5]])
.Internal(inspect(chunked_array))
#> @7fb218000000 14 REALSXP g0c7 [REF(4)] (len=14863778, tl=0) 0.75,5.9,4,4.7,0.6,...
array <- as.vector(nyctaxi[[5]]$chunk(1))
.Internal(inspect(array))
#> @7fb20dc3a6e0 14 REALSXP g0c0 [REF(65535)] arrow::Array<double, NONULL> len=5738, Array=<0x7fb25d5080d8>
#> @7fb20dc3a670 22 EXTPTRSXP g0c0 [REF(4)] |
|
And here's the report: |
That's ARROW-13111. ARROW-13112 is for supporting other types. If I understand correctly, we're conceptually close on those since we're implementing the get_region and elt methods for integer and float here, just ("just"!) need to [something about templates] and do the same for other types and for ChunkedArrays using our existing conversion code. |
Closes apache#10730 from romainfrancois/ARROW-13164_altrep_with_nulls Lead-authored-by: Romain Francois <romain@rstudio.com> Co-authored-by: Romain François <romain@rstudio.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
No description provided.