-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-11929: [C++][Dataset][Compute] Promote expression to the compute namespace #10166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. One nit about something that was recently introduced (I assume it just got lost in the rebasing).
I'll hold off on merging datasets stuff until this is through so we're not stuck in an endless rebase cycle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll follow up and adjust the line numbers in the corresponding reST file (and see if I can figure out a better way to excerpt code snippets than hardcoding line numbers).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Ah, looks like clang-format/cmake-format are unhappy, and the CMake magic that makes the new header gets installed is missing. |
|
This just moves the code? There's no change in how or where expressions are evaluated yet, correct? I.e. they're still only for datasets here, they're just in a different namespace? |
|
TODO: update bindings |
5ecf4e2 to
874b9fe
Compare
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't notice this got rebased yesterday (feel free to ping me). This looks good, just needs to be linted/formatted, and it looks like compute/expression_benchmark.cc needs to be fixed. The AppVeyor failure is unrelated.
nealrichardson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
R changes LGTM
lidavidm
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
|
@bkietz This caused a failure in the |
|
@ianmcook I think we just need to add Expression to compute/type_fwd.h. I'll do this now. |
Discussing with @bkietz on #10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats: * You can't dictionary_encode a dataset column. `Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}` (ARROW-12632). I will remove the `as.factor` method and leave a TODO to restore it after that JIRA is resolved. * with the existing array_expressions, you could supply an additional Array (or R data convertible to an Array) when doing `mutate()`; this is not implemented for Datasets and that's ok. For Tables/RecordBatches, the behavior in this PR is to pull the data into R, which is fine. There are a lot of changes here, which means the diff is big, but I've tried to group into distinct commits the main action. Highlights: * 5b501c5 is the main switch to use InMemoryDataset * b31fb5e deletes `array_expression` * 0d31938 simplifies the interface for adding functions to the dplyr data_mask; definitely check this one out and see what you think of the new way--I hope it's much simpler to add new functions * 2e6374f improves the print method for queries by showing both the expression and the expected type of the output column, per suggestion from @bkietz * d12f584 just splits up dplyr.R into many files; 34dc1e6 deletes tests that are duplicated between test-dplyr*.R and test-dataset.R (since they're now going through a common C++ interface). * a0914f6 + eee491a contain ARROW-12696 Closes #10191 from nealrichardson/dplyr-in-memory Authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Moves Expression and its test and benchmark into the compute/exec/ directory. I haven't introduced an exec namespace.