ARROW-13740: [R] summarize() should not eagerly evaluate #10992

nealrichardson · 2021-08-24T19:01:34Z

Followups:

ARROW-13777: [R] mutate after group_by should be ok as long as there are only scalar functions
ARROW-13778: [R] Handle complex summarize expressions
ARROW-13779: [R] Disallow expressions that depend on order after arrange()
ARROW-13852: [R] Handle Dataset schema metadata in ExecPlan
ARROW-13854: [R] More accurately determine output type of an aggregation expression
ARROW-13893: [R] Improve head/tail/[ methods on Dataset and queries

github-actions · 2021-08-24T19:03:01Z

https://issues.apache.org/jira/browse/ARROW-13740

nealrichardson · 2021-08-27T17:29:42Z

@bkietz @lidavidm could either of you check out this debug check failure: https://github.com/apache/arrow/pull/10992/checks?check_run_id=3437174950#step:9:12942 ?

nealrichardson · 2021-08-27T17:30:45Z

@jonkeane can you help me figure out what's up with https://github.com/apache/arrow/pull/10992/checks?check_run_id=3437175048#step:8:15613?

lidavidm · 2021-08-27T17:53:40Z

> @bkietz @lidavidm could either of you check out this debug check failure: https://github.com/apache/arrow/pull/10992/checks?check_run_id=3437174950#step:9:12942 ?

It looks like two targets are being passed in but only one aggregation. some_grouping probably shouldn't go into the targets?

(gdb) p aggs.size()
$2 = 1
(gdb) p aggregate_options.targets.size()
$3 = 2
(gdb) p aggregate_options.targets[0].ToString()
$5 = {static npos = 18446744073709551615, 
  _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
    _M_p = 0x55555a939db0 "FieldRef.Name(total)"}, _M_string_length = 20, {
    _M_local_buf = "\036\000\000\000\000\000\000\000\277j\366\377\377\177\000", _M_allocated_capacity = 30}}
(gdb) p aggregate_options.targets[1].ToString()
$6 = {static npos = 18446744073709551615, 
  _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, 
    _M_p = 0x55555a939de0 "FieldRef.Name(some_grouping)"}, _M_string_length = 28, {
    _M_local_buf = "\036\000\000\000\000\000\000\000\277j\366\377\377\177\000", _M_allocated_capacity = 30}}
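The mismatch in the session above can be restated as an invariant: each aggregation should have exactly one target, and grouping columns should be supplied separately as keys. The following is a minimal base R sketch of that check, not the actual Arrow code. `check_aggregate_spec()` and its argument names are hypothetical, and the aggregation name `"sum"` is an assumption; only the sizes and field names come from the gdb output.

```r
# Hypothetical helper: validates that every aggregation has exactly one target
# and that no grouping key has leaked into the targets.
check_aggregate_spec <- function(aggregations, targets, keys = character()) {
  if (length(aggregations) != length(targets)) {
    stop(
      "expected one target per aggregation, got ",
      length(aggregations), " aggregation(s) and ",
      length(targets), " target(s)"
    )
  }
  leaked <- intersect(targets, keys)
  if (length(leaked) > 0) {
    stop("grouping key(s) passed as targets: ", paste(leaked, collapse = ", "))
  }
  invisible(TRUE)
}

# The failing shape from the debug session: one aggregation, two targets.
# The error is captured as a message string instead of aborting the script.
bad <- tryCatch(
  check_aggregate_spec("sum", c("total", "some_grouping"), keys = "some_grouping"),
  error = function(e) conditionMessage(e)
)

# The shape suggested in the comment: the grouping column is only a key.
ok <- check_aggregate_spec("sum", "total", keys = "some_grouping")
```

Keeping the keys out of the targets makes the two vectors line up one-to-one, which is what the debug check appears to require.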

nealrichardson · 2021-08-27T18:46:15Z

> > @bkietz @lidavidm could either of you check out this debug check failure: https://github.com/apache/arrow/pull/10992/checks?check_run_id=3437174950#step:9:12942 ?
>
> It looks like two targets are being passed in but only one aggregation. some_grouping probably shouldn't go into the targets?

Thanks, it looks like I've fixed this (unintentionally)

jonkeane · 2021-08-31T14:43:54Z

I've fixed the second of the two metadata failures (that had hard-coded reordering, which I changed to arrange.)

The one about the missing warning is trickier / I haven't fully solved it yet. What's going on there is that the dataset has metadata associated with it, but when we run https://github.com/apache/arrow/pull/10992/files#diff-fee018727db3ca05257dd7755794b240a3974e7ee4d6c1803a2d9efeb9d692a9R39

collect.Dataset <- function(x, ...) dplyr::collect(as_adq(x), ...)

as_adq(x) results in an object that no longer has an element named metadata (x$.data$metadata still exists, of course).
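To illustrate why the top-level element disappears, here is a base R sketch that uses plain lists in place of the real Dataset and query classes. `as_query_sketch()` is a hypothetical stand-in for `as_adq()`, and the field names and values are assumptions; the real objects have more structure than this.

```r
# Stand-in for a Dataset that carries schema metadata (assumed shape).
ds <- list(
  files = c("part-0.parquet"),
  metadata = list(r = "attributes for R")
)

# Hypothetical stand-in for as_adq(): wraps the source under `.data`
# without copying its metadata to the top level of the query object.
as_query_sketch <- function(x) {
  structure(
    list(.data = x, selected_columns = character()),
    class = "query_sketch"
  )
}

q <- as_query_sketch(ds)

top_level <- q$metadata     # no such element on the wrapper
nested <- q$.data$metadata  # still reachable through the wrapped source
```

Any code that looks for `metadata` directly on the wrapper would therefore find nothing, even though the wrapped source still holds it.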

nealrichardson · 2021-09-01T19:03:48Z

> The one about the missing warning is trickier / I haven't fully solved it yet.

I think what's happening is that the Dataset's schema metadata isn't propagated through the ExecPlan, while it was through the Scanner. I changed how collect() works in this PR to use ExecPlan so that it could also handle aggregations and sorting.

I'll skip the test for now and make a JIRA. It's not a problem in this test (because actually we want the metadata to be dropped) but it would be a problem for other contexts where we have r metadata.

jonkeane · 2021-09-01T20:35:09Z

Re: the metadata issue, the Jira is https://issues.apache.org/jira/browse/ARROW-13852 FTR here.

I do worry a little bit that this could break folks' workflows in surprising and frustrating ways. I'm going to bump the priority on that Jira to blocker since we definitely don't want to release without it done.

nealrichardson · 2021-09-02T01:44:32Z

Looks like 32-bit rtools35 just doesn't work with the async scanning stuff (if I recall past tests that we've had to skip).

Co-authored-by: Ian Cook <ianmcook@gmail.com> Co-authored-by: Jonathan Keane <jkeane@gmail.com>

jonkeane

LGTM

- [x] collect() uses ExecPlan
- [x] arrange() uses an OrderBySink
- [x] .data inside of arrow_dplyr_query can itself be arrow_dplyr_query
- [x] can build more query after calling summarize()
- [x] handle non-deterministic dataset collect() tests
- [x] fix group_by-expression behavior
- [x] make official collapse() method with more testing of faithful behavior after collapsing
- [x] make sort after summarize be configurable by option (default FALSE, though local_options TRUE in the tests)
- [x] add print method for collapsed query
- [x] Skip 32-bit rtools35 dataset tests/examples

~~- [ ] should queries on in-memory data evaluate eagerly (like dplyr)?~~

Followups:

* ARROW-13777: [R] mutate after group_by should be ok as long as there are only scalar functions
* ARROW-13778: [R] Handle complex summarize expressions
* ARROW-13779: [R] Disallow expressions that depend on order after arrange()
* ARROW-13852: [R] Handle Dataset schema metadata in ExecPlan
* ARROW-13854: [R] More accurately determine output type of an aggregation expression
* ARROW-13893: [R] Improve head/tail/[ methods on Dataset and queries

Closes apache#10992 from nealrichardson/subquery

Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Jonathan Keane <jkeane@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
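As a rough illustration of the checklist items on building more query after summarize() and on sort order, the sketch below assumes the arrow and dplyr packages are installed. The data are hypothetical; the column names simply echo the debug session earlier in the thread.

```r
library(arrow)
library(dplyr)

# Hypothetical in-memory data.
tab <- Table$create(
  data.frame(some_grouping = c(1, 2, 1), total = c(10, 20, 30))
)

# summarize() returns a query object rather than evaluating immediately ...
q <- tab %>%
  group_by(some_grouping) %>%
  summarize(total = sum(total))

# ... so further steps can be added before anything is computed.
# An explicit arrange() gives a stable row order in the collected result.
res <- q %>%
  arrange(some_grouping) %>%
  collect()

# Alternatively, opt in to sorting summarize() output on the grouping columns:
# options(arrow.summarise.sort = TRUE)
```

Sorting is left off by default per the checklist, so code that depends on row order should either call arrange() or set the option.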

github-actions bot added the Component: R label Aug 24, 2021

ianmcook reviewed Aug 25, 2021

r/R/query-engine.R

ianmcook reviewed Aug 25, 2021

r/R/query-engine.R

ianmcook reviewed Aug 25, 2021

r/tests/testthat/test-dplyr-aggregate.R

nealrichardson force-pushed the subquery branch from 3f4e7af to f946b51 Compare August 26, 2021 19:51

thisisnic mentioned this pull request Aug 27, 2021

ARROW-13772: [R] Binding for median aggregation #11018

Closed

nealrichardson force-pushed the subquery branch from 47a2bff to 29f5103 Compare August 27, 2021 17:32

nealrichardson force-pushed the subquery branch from 30727ae to 0326808 Compare August 30, 2021 18:02

jonkeane reviewed Aug 30, 2021

r/R/dplyr-collect.R

jonkeane force-pushed the subquery branch from 9ea1d83 to 9f5a5bc Compare August 31, 2021 22:01

nealrichardson force-pushed the subquery branch from 2419435 to 8407070 Compare September 1, 2021 19:10

nealrichardson force-pushed the subquery branch from 13081d9 to d796288 Compare September 2, 2021 15:20

nealrichardson marked this pull request as ready for review September 2, 2021 15:30

jonkeane reviewed Sep 2, 2021

r/tests/testthat/test-dplyr-summarize.R

jonkeane reviewed Sep 2, 2021

r/tests/testthat/test-metadata.R

jonkeane reviewed Sep 2, 2021

r/NEWS.md

ianmcook reviewed Sep 2, 2021

r/tests/testthat/test-dplyr-summarize.R

ianmcook reviewed Sep 2, 2021

r/R/dplyr.R

nealrichardson added 3 commits September 3, 2021 08:34

Refactor ExecPlan building; use it in collect()

867e147

Implement order_by_sink and sort results of summarize

1bf8a07

Cleanup

d75f4bd

nealrichardson and others added 19 commits September 3, 2021 08:37

Make dataset tests not assume row order

9a2cde5

Add support for derived grouping columns in summarize

a1cd90f

summarize() collapses the query and we can do things on it after

90612b5

Rename test file

f6cf638

Refactor and fix tests

bd6e363

Clarify comments and add todos for the collapse() work

bcea9c8

Add collapse()

11c7066

Style and unskip test

cc2f0d7

use arrange instead of hardcoding

2ea6d04

Skip column metadata warning test

9e26457

Note breaking changes before I forget

92d8d3f

Add options(arrow.summarise.sort), default FALSE

88d07bb

Skip all dataset tests on 32-bit windows rtools35

b7d6313

Correct but not super satisfying print method

31ec558

sort more tests

bd25135

More sort

f07b420

Apply suggestions from code review

be2499e

Co-authored-by: Ian Cook <ianmcook@gmail.com> Co-authored-by: Jonathan Keane <jkeane@gmail.com>

Cleanups

f7e3e54

Improve test verbosity on windows

a63acb9

nealrichardson force-pushed the subquery branch from ee6e666 to a63acb9 Compare September 3, 2021 12:45

nealrichardson added 2 commits September 3, 2021 09:44

Skip all dataset tests on old 32-bit windows

3462b24

Final final tweaks

4fa2684

jonkeane approved these changes Sep 3, 2021

Fix python skip

ceecc8f

nealrichardson closed this in e9251b0 Sep 3, 2021

nealrichardson deleted the subquery branch September 3, 2021 17:03

This was referenced Sep 3, 2021

[R] summarize() should not eagerly evaluate #29373

Closed

[R] Handle Dataset schema metadata in ExecPlan #29473

Closed
