Skip to content

Conversation

@nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented Aug 24, 2021

  • collect() uses ExecPlan
  • arrange() uses an OrderBySink
  • .data inside of arrow_dplyr_query can itself be arrow_dplyr_query
  • can build more query after calling summarize()
  • handle non-deterministic dataset collect() tests
  • fix group_by-expression behavior
  • make official collapse() method with more testing of faithful behavior after collapsing
  • make sort after summarize be configurable by option (default FALSE, though local_options TRUE in the tests)
  • add print method for collapsed query
  • Skip 32-bit rtools35 dataset tests/examples
    - [ ] should queries on in-memory data evaluate eagerly (like dplyr)?

Followups:

  • ARROW-13777: [R] mutate after group_by should be ok as long as there are only scalar functions
  • ARROW-13778: [R] Handle complex summarize expressions
  • ARROW-13779: [R] Disallow expressions that depend on order after arrange()
  • ARROW-13852: [R] Handle Dataset schema metadata in ExecPlan
  • ARROW-13854: [R] More accurately determine output type of an aggregation expression
  • ARROW-13893: [R] Improve head/tail/[ methods on Dataset and queries

@github-actions
Copy link

@nealrichardson
Copy link
Member Author

@bkietz @lidavidm could either of you check out this debug check failure: https://github.com/apache/arrow/pull/10992/checks?check_run_id=3437174950#step:9:12942 ?

@nealrichardson
Copy link
Member Author

@lidavidm
Copy link
Member

@bkietz @lidavidm could either of you check out this debug check failure: https://github.com/apache/arrow/pull/10992/checks?check_run_id=3437174950#step:9:12942 ?

It looks like two targets are being passed in but only one aggregation. some_grouping probably shouldn't go into the targets?

(gdb) p aggs.size()
$2 = 1
(gdb) p aggregate_options.targets.size()
$3 = 2
(gdb) p aggregate_options.targets[0].ToString()
$5 = {static npos = 18446744073709551615, 
  _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
    _M_p = 0x55555a939db0 "FieldRef.Name(total)"}, _M_string_length = 20, {
    _M_local_buf = "\036\000\000\000\000\000\000\000\277j\366\377\377\177\000", _M_allocated_capacity = 30}}
(gdb) p aggregate_options.targets[1].ToString()
$6 = {static npos = 18446744073709551615, 
  _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
    _M_p = 0x55555a939de0 "FieldRef.Name(some_grouping)"}, _M_string_length = 28, {
    _M_local_buf = "\036\000\000\000\000\000\000\000\277j\366\377\377\177\000", _M_allocated_capacity = 30}}

@nealrichardson
Copy link
Member Author

@bkietz @lidavidm could either of you check out this debug check failure: https://github.com/apache/arrow/pull/10992/checks?check_run_id=3437174950#step:9:12942 ?

It looks like two targets are being passed in but only one aggregation. some_grouping probably shouldn't go into the targets?

Thanks, it looks like I've fixed this (unintentionally)

@jonkeane
Copy link
Member

I've fixed the second of the two metadata failures (that had hard-coded reordering, which I changed to arrange.)

The one about the missing warning is trickier / I haven't fully solved it yet. What's going on there is that the dataset has metadata associated with it, but when we run https://github.com/apache/arrow/pull/10992/files#diff-fee018727db3ca05257dd7755794b240a3974e7ee4d6c1803a2d9efeb9d692a9R39

collect.Dataset <- function(x, ...) dplyr::collect(as_adq(x), ...)

as_adq(x) results in an object that does not have an element named metadata anymore (x$.data$metadata) still exists, of course.

@nealrichardson
Copy link
Member Author

The one about the missing warning is trickier / I haven't fully solved it yet.

I think what's happening is that the Dataset's schema metadata isn't propagated through the ExecPlan, while it was through the Scanner. I changed how collect() works in this PR to use ExecPlan so that it could also handle aggregations and sorting.

I'll skip the test for now and make a JIRA. It's not a problem in this test (because actually we want the metadata to be dropped) but it would be a problem for other contexts where we have r metadata.

@jonkeane
Copy link
Member

jonkeane commented Sep 1, 2021

Re: the metadata issue, the Jira is https://issues.apache.org/jira/browse/ARROW-13852 FTR here.

I do worry a little bit that this could break folk's workflows in surprising and frustrating ways. I'm going to bump the priority on that Jira to blocker since we definitely don't want to release without it done.

@nealrichardson
Copy link
Member Author

Looks like 32-bit rtools35 just doesn't work with the async scanning stuff (if I recall past tests that we've had to skip).

@nealrichardson nealrichardson marked this pull request as ready for review September 2, 2021 15:30
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nealrichardson nealrichardson deleted the subquery branch September 3, 2021 17:03
ViniciusSouzaRoque pushed a commit to s1mbi0se/arrow that referenced this pull request Oct 20, 2021
- [x] collect() uses ExecPlan
- [x] arrange() uses an OrderBySink
- [x] .data inside of arrow_dplyr_query can itself be arrow_dplyr_query
- [x] can build more query after calling summarize()
- [x] handle non-deterministic dataset collect() tests
- [x] fix group_by-expression behavior
- [x] make official collapse() method with more testing of faithful behavior after collapsing
- [x] make sort after summarize be configurable by option (default FALSE, though local_options TRUE in the tests)
- [x] add print method for collapsed query
- [x] Skip 32-bit rtools35 dataset tests/examples
~~- [ ] should queries on in-memory data evaluate eagerly (like dplyr)?~~

Followups:

* ARROW-13777: [R] mutate after group_by should be ok as long as there are only scalar functions
* ARROW-13778: [R] Handle complex summarize expressions
* ARROW-13779: [R] Disallow expressions that depend on order after arrange()
* ARROW-13852: [R] Handle Dataset schema metadata in ExecPlan
* ARROW-13854: [R] More accurately determine output type of an aggregation expression
* ARROW-13893: [R] Improve head/tail/[ methods on Dataset and queries

Closes apache#10992 from nealrichardson/subquery

Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants