Skip to content

Conversation

@romainfrancois
Copy link
Contributor

No description provided.

@github-actions
Copy link

@romainfrancois
Copy link
Contributor Author

follow up after #10593

@romainfrancois romainfrancois force-pushed the ARROW-13164_altrep_with_nulls branch from 0c68a38 to 498614d Compare July 16, 2021 14:49
@romainfrancois

This comment has been minimized.

@romainfrancois romainfrancois marked this pull request as ready for review July 19, 2021 11:16
@romainfrancois romainfrancois force-pushed the ARROW-13164_altrep_with_nulls branch from b961c89 to 54b743c Compare August 17, 2021 09:54
@romainfrancois
Copy link
Contributor Author

Thanks for the reviews. I'll apply the relevant changes.

We now have AltrepArrayNoNulls<INTSXP|REALSXP> and AltrepArrayWithNulls<INTSXP|REALSXP> which are not all that different. I believe they can be merged into one template, which would perhaps slightly more involved Dataptr() ... methods...

@romainfrancois
Copy link
Contributor Author

I finally settled for a single altrep (template) class for both cases (the array has no nulls or the array has nulls).

The altrep object uses:

  • data1: an external pointer to the Array (well to a shared pointer to the Array)
  • data2: NULL until materialization is needed, then data2 is a standard (and immutable) R vector with the same data

Materialization is needed when:

  • we need the DATAPTR and the Array has nulls
  • we need a duplicate

Get_region() does not need to materialize the full vector.
Creating the altrep object does not alter the data to impose the R sentinel.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to say that the AltrepArrayPrimitive seems very difficult to read and understand to me.

I would recommend:

  1. write extensive comments and explanations (including hyperlinks to R-related docs where necessary)
  2. choose naming wisely
  3. make sure no unused functionality or code paths are left
  4. make internal helper methods private, to better stress what is the "public" API
  5. also, make sure we're not doing any logic errors (if the user requests a writable array, do we really want to return a pointer to the immutable Arrow data?)

@romainfrancois
Copy link
Contributor Author

Changed Dataptr(writeable = TRUE) so that the data is materialized.

@romainfrancois
Copy link
Contributor Author

🤔 I think now that means that if the data has been materialized, a Dataptr(writeable = TRUE) might have happened, and maybe the data has been written into. To be on the safe side, perhaps we should now treat the Array as obsolete and only ever use data2

@nealrichardson
Copy link
Member

maybe the data has been written into

We can write tests for this behavior, yeah?

@romainfrancois romainfrancois force-pushed the ARROW-13164_altrep_with_nulls branch from 422c2fb to ef20698 Compare August 30, 2021 15:52
@romainfrancois
Copy link
Contributor Author

The AltrepArrayPrimitive class now marks the altrep object as immutable. This makes this "a bad idea" to call Dataptr(writeable = TRUE) on it, and it should rather be accessed read-only.

However, if we make this an error, this fails in many many places, including identical() so we can't do that.

@romainfrancois romainfrancois force-pushed the ARROW-13164_altrep_with_nulls branch from 2145dc4 to 83dcad9 Compare August 31, 2021 08:40
@nealrichardson
Copy link
Member

@github-actions crossbow submit -g r

@github-actions
Copy link

Revision: 763781c

Submitted crossbow builds: ursacomputing/crossbow @ actions-806

Task Status
conda-linux-gcc-py36-cpu-r40 Azure
conda-linux-gcc-py37-cpu-r41 Azure
conda-osx-clang-py36-r40 Azure
conda-osx-clang-py37-r41 Azure
conda-win-vs2017-py36-r40 Azure
conda-win-vs2017-py37-r41 Azure
homebrew-r-autobrew Github Actions
test-r-depsource-auto Azure
test-r-depsource-system Github Actions
test-r-devdocs Github Actions
test-r-gcc-11 Github Actions
test-r-install-local Github Actions
test-r-linux-as-cran Github Actions
test-r-linux-rchk Github Actions
test-r-linux-valgrind Azure
test-r-minimal-build Azure
test-r-rhub-debian-gcc-devel-lto-latest Azure
test-r-rhub-ubuntu-gcc-release-latest Azure
test-r-rocker-r-base-latest Azure
test-r-rstudio-r-base-3.6-bionic Azure
test-r-rstudio-r-base-3.6-centos7-devtoolset-8 Azure
test-r-rstudio-r-base-3.6-centos8 Azure
test-r-rstudio-r-base-3.6-opensuse15 Azure
test-r-rstudio-r-base-3.6-opensuse42 Azure
test-r-ubuntu-21.04 Github Actions
test-r-version-compatibility Github Actions
test-r-versions Github Actions
test-ubuntu-18.04-r-sanitizer Azure

@nealrichardson
Copy link
Member

I think that's all clean. @jonkeane do you want to do any benchmarking before we merge this?

@nealrichardson
Copy link
Member

@ursabot please benchmark lang=R

@ursabot
Copy link

ursabot commented Aug 31, 2021

Benchmark runs are scheduled for baseline = f60bc12 and contender = 763781c. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Provided benchmark filters do not have any benchmark groups to be executed on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Finished ⬇️2.19% ⬆️1.46%] ursa-i9-9960x
[Skipped ⚠️ Only ['C++', 'Java'] langs are supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Supported benchmarks:
ursa-i9-9960x: langs = Python, R, JavaScript
ursa-thinkcentre-m75q: langs = C++, Java
ec2-t3-xlarge-us-east-2: cloud = True

@jonkeane
Copy link
Member

Yeah, we should run these manually. I'll try to get to that tomorrow. None of the array conversion benchmarks have been wired up in the benchmarks package, so we don't have them available on conbench right now.

@jonkeane
Copy link
Member

jonkeane commented Sep 1, 2021

OK, I just ran Array to R benchmarks locally, and the results are very impressive. I've added in other types (even though they are not (yet) backed by altrep). I've also tested both chunked and non-chunked arrays, and with nulls and without.

TL;DR:

For floats + ints we see huge speed ups (~10x faster) in both arrays with and without nulls as well as in arrays and chunked arrays (at least in the case where there is a single chunk*).

I'm still investigating what's going on with the two fannie parquet file reads that show up as regressions. If anything, this should have sped them up.

  • – we don't yet have fixtures with chunked arrays with multiple chunks. But I simulated this with a csv (which does come in with multiple chunks) and multi-chunk arrays aren't (yet) supported:
library(arrow)

nyctaxi <- read_csv_arrow("~/repos/ab_store/data/nyctaxi_2010-01.csv.gz", as_data_frame = FALSE)

chunked_array <- as.vector(nyctaxi[[5]])
.Internal(inspect(chunked_array))
#> @7fb218000000 14 REALSXP g0c7 [REF(4)] (len=14863778, tl=0) 0.75,5.9,4,4.7,0.6,...

array <- as.vector(nyctaxi[[5]]$chunk(1))
.Internal(inspect(array))
#> @7fb20dc3a6e0 14 REALSXP g0c0 [REF(65535)] arrow::Array<double, NONULL> len=5738, Array=<0x7fb25d5080d8>
#>   @7fb20dc3a670 22 EXTPTRSXP g0c0 [REF(4)]

@jonkeane
Copy link
Member

jonkeane commented Sep 1, 2021

And here's the report:

altvec-with-nulls.html.zip

@nealrichardson
Copy link
Member

  • multi-chunk arrays aren't (yet) supported

That's ARROW-13111. ARROW-13112 is for supporting other types. If I understand correctly, we're conceptually close on those since we're implementing the get_region and elt methods for integer and float here, just ("just"!) need to [something about templates] and do the same for other types and for ChunkedArrays using our existing conversion code.

@romainfrancois romainfrancois deleted the ARROW-13164_altrep_with_nulls branch September 9, 2021 06:49
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
Closes apache#10730 from romainfrancois/ARROW-13164_altrep_with_nulls

Lead-authored-by: Romain Francois <romain@rstudio.com>
Co-authored-by: Romain François <romain@rstudio.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants