-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-13740: [R] summarize() should not eagerly evaluate #10992
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
3f4e7af to
f946b51
Compare
|
@bkietz @lidavidm could either of you check out this debug check failure: https://github.com/apache/arrow/pull/10992/checks?check_run_id=3437174950#step:9:12942 ? |
|
@jonkeane can you help me figure out what's up with https://github.com/apache/arrow/pull/10992/checks?check_run_id=3437175048#step:8:15613? |
47a2bff to
29f5103
Compare
It looks like two targets are being passed in but only one aggregation. |
Thanks, it looks like I've fixed this (unintentionally) |
30727ae to
0326808
Compare
|
I've fixed the second of the two metadata failures (that had hard-coded reordering, which I changed to arrange.) The one about the missing warning is trickier / I haven't fully solved it yet. What's going on there is that the dataset has metadata associated with it, but when we run https://github.com/apache/arrow/pull/10992/files#diff-fee018727db3ca05257dd7755794b240a3974e7ee4d6c1803a2d9efeb9d692a9R39
|
I think what's happening is that the Dataset's schema metadata isn't propagated through the ExecPlan, while it was through the Scanner. I changed how collect() works in this PR to use ExecPlan so that it could also handle aggregations and sorting. I'll skip the test for now and make a JIRA. It's not a problem in this test (because actually we want the metadata to be dropped) but it would be a problem for other contexts where we have |
2419435 to
8407070
Compare
|
Re: the metadata issue, the Jira is https://issues.apache.org/jira/browse/ARROW-13852 FTR here. I do worry a little bit that this could break folk's workflows in surprising and frustrating ways. I'm going to bump the priority on that Jira to blocker since we definitely don't want to release without it done. |
|
Looks like 32-bit rtools35 just doesn't work with the async scanning stuff (if I recall past tests that we've had to skip). |
13081d9 to
d796288
Compare
Co-authored-by: Ian Cook <ianmcook@gmail.com> Co-authored-by: Jonathan Keane <jkeane@gmail.com>
ee6e666 to
a63acb9
Compare
jonkeane
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
- [x] collect() uses ExecPlan - [x] arrange() uses an OrderBySink - [x] .data inside of arrow_dplyr_query can itself be arrow_dplyr_query - [x] can build more query after calling summarize() - [x] handle non-deterministic dataset collect() tests - [x] fix group_by-expression behavior - [x] make official collapse() method with more testing of faithful behavior after collapsing - [x] make sort after summarize be configurable by option (default FALSE, though local_options TRUE in the tests) - [x] add print method for collapsed query - [x] Skip 32-bit rtools35 dataset tests/examples ~~- [ ] should queries on in-memory data evaluate eagerly (like dplyr)?~~ Followups: * ARROW-13777: [R] mutate after group_by should be ok as long as there are only scalar functions * ARROW-13778: [R] Handle complex summarize expressions * ARROW-13779: [R] Disallow expressions that depend on order after arrange() * ARROW-13852: [R] Handle Dataset schema metadata in ExecPlan * ARROW-13854: [R] More accurately determine output type of an aggregation expression * ARROW-13893: [R] Improve head/tail/[ methods on Dataset and queries Closes apache#10992 from nealrichardson/subquery Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com> Co-authored-by: Jonathan Keane <jkeane@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
- [ ] should queries on in-memory data evaluate eagerly (like dplyr)?Followups: