Skip to content

Conversation

@nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented Apr 28, 2021

Discussing with @bkietz on #10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats:

  • You can't dictionary_encode a dataset column. Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})} (ARROW-12632). I will remove the as.factor method and leave a TODO to restore it after that JIRA is resolved.
  • with the existing array_expressions, you could supply an additional Array (or R data convertible to an Array) when doing mutate(); this is not implemented for Datasets and that's ok. For Tables/RecordBatches, the behavior in this PR is to pull the data into R, which is fine.

There are a lot of changes here, which means the diff is big, but I've tried to group into distinct commits the main action. Highlights:

  • 5b501c5 is the main switch to use InMemoryDataset
  • b31fb5e deletes array_expression
  • 0d31938 simplifies the interface for adding functions to the dplyr data_mask; definitely check this one out and see what you think of the new way--I hope it's much simpler to add new functions
  • 2e6374f improves the print method for queries by showing both the expression and the expected type of the output column, per suggestion from @bkietz
  • d12f584 just splits up dplyr.R into many files; 34dc1e6 deletes tests that are duplicated between test-dplyr*.R and test-dataset.R (since they're now going through a common C++ interface).
  • a0914f6 + eee491a contain ARROW-12696

@ianmcook
Copy link
Member

ianmcook commented May 3, 2021

The purpose of this is to simplify the dplyr implementation by having only one API (the Dataset API) that we are coding against, correct?

@nealrichardson
Copy link
Member Author

The purpose of this is to simplify the dplyr implementation by having only one API (the Dataset API) that we are coding against, correct?

Correct. This also puts us in better shape to use the C++ query engine (which will use Expressions, and which Datasets will feed) when it becomes available 🔜 . But the most immediate effect for us is that we get to simplify our code (less FUN passed around) and delete duplicate tests.

@nealrichardson nealrichardson force-pushed the dplyr-in-memory branch 2 times, most recently from eb6848f to 4f854b8 Compare May 6, 2021 17:52
@nealrichardson nealrichardson force-pushed the dplyr-in-memory branch 2 times, most recently from 27c62f5 to 795e1f9 Compare May 8, 2021 00:07
@nealrichardson nealrichardson marked this pull request as ready for review May 10, 2021 19:44
@nealrichardson nealrichardson changed the title [R] [WIP] Use InMemoryDataset for Table/RecordBatch in dplyr code [R] Use InMemoryDataset for Table/RecordBatch in dplyr code May 10, 2021
@nealrichardson nealrichardson changed the title [R] Use InMemoryDataset for Table/RecordBatch in dplyr code ARROW-12371: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code May 10, 2021
@westonpace
Copy link
Member

I think you snagged the wrong Jira :). Or else I haven't been following this issue closely enough.

@nealrichardson
Copy link
Member Author

Yeah I may have transposed the numbers, will confirm

@nealrichardson nealrichardson changed the title ARROW-12371: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code ARROW-12731: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code May 11, 2021
@github-actions
Copy link

@apache apache deleted a comment from github-actions bot May 12, 2021
@apache apache deleted a comment from github-actions bot May 12, 2021
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, a few (small) comments

@nealrichardson
Copy link
Member Author

@github-actions crossbow submit -g r

@github-actions
Copy link

Revision: bc6e3563c91b45e15432804b75ba8f0b3a1b0575

Submitted crossbow builds: ursacomputing/crossbow @ actions-407

Task Status
conda-linux-gcc-py36-cpu-r36 Azure
conda-linux-gcc-py37-cpu-r40 Azure
conda-osx-clang-py36-r36 Azure
conda-osx-clang-py37-r40 Azure
conda-win-vs2017-py36-r36 Azure
conda-win-vs2017-py37-r40 Azure
homebrew-r-autobrew Github Actions
test-r-devdocs Github Actions
test-r-install-local Github Actions
test-r-linux-as-cran Github Actions
test-r-linux-valgrind Azure
test-r-minimal-build Azure
test-r-rhub-ubuntu-gcc-release-latest Azure
test-r-rocker-r-base-latest Azure
test-r-rstudio-r-base-3.6-bionic Azure
test-r-rstudio-r-base-3.6-centos7-devtoolset-8 Azure
test-r-rstudio-r-base-3.6-centos8 Azure
test-r-rstudio-r-base-3.6-opensuse15 Azure
test-r-rstudio-r-base-3.6-opensuse42 Azure
test-r-version-compatibility Github Actions
test-r-versions Github Actions
test-r-without-arrow Azure
test-ubuntu-18.04-r-sanitizer Azure

@ianmcook
Copy link
Member

ianmcook commented May 13, 2021

This looks great.

The increased dependency on Dataset means we will have to skip many more tests to make test-r-minimal-build pass.

@nealrichardson
Copy link
Member Author

@github-actions crossbow submit test-r-minimal-build

@github-actions
Copy link

Revision: fa5731e

Submitted crossbow builds: ursacomputing/crossbow @ actions-410

Task Status
test-r-minimal-build Azure

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants