-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-12731: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code #10191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3bbe342 to
9640cff
Compare
|
The purpose of this is to simplify the dplyr implementation by having only one API (the Dataset API) that we are coding against, correct? |
Correct. This also puts us in better shape to use the C++ query engine (which will use Expressions, and which Datasets will feed) when it becomes available 🔜 . But the most immediate effect for us is that we get to simplify our code (less |
eb6848f to
4f854b8
Compare
27c62f5 to
795e1f9
Compare
|
I think you snagged the wrong Jira :). Or else I haven't been following this issue closely enough. |
|
Yeah I may have transposed the numbers, will confirm |
jonkeane
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, a few (small) comments
|
@github-actions crossbow submit -g r |
|
Revision: bc6e3563c91b45e15432804b75ba8f0b3a1b0575 Submitted crossbow builds: ursacomputing/crossbow @ actions-407 |
|
This looks great. The increased dependency on Dataset means we will have to skip many more tests to make |
bc6e356 to
fa5731e
Compare
|
@github-actions crossbow submit test-r-minimal-build |
|
Revision: fa5731e Submitted crossbow builds: ursacomputing/crossbow @ actions-410
|
Discussing with @bkietz on #10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats:
Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}(ARROW-12632). I will remove theas.factormethod and leave a TODO to restore it after that JIRA is resolved.mutate(); this is not implemented for Datasets and that's ok. For Tables/RecordBatches, the behavior in this PR is to pull the data into R, which is fine.There are a lot of changes here, which means the diff is big, but I've tried to group into distinct commits the main action. Highlights:
array_expression