Closed
Description
ARROW-13344 enabled the dplyr verb summarise() to use the Arrow engine but kept this off by default, controlled by the arrow.debug option.
Before this can be turned on by default, we should ensure that the following are all implemented:
- a sufficient set of hash aggregate kernels and R aggregate function mappings to them, covering the vast majority of all aggregate functions that dplyr users call in summarise() (add any additional required ones to ARROW-13339)
- support for a sufficient set of data types in aggregates
- support for a sufficient set of data types in grouping columns
- handling of NA and NaN values in aggregates and the na.rm option consistent with base R and dplyr (ARROW-13497 and possibly other issues)
- handling of NA and NaN values in grouping columns consistent with dplyr
- handling empty or bad input to summarise() (ARROW-13543)
- many new tests to confirm equivalent results from a variety of group_by() %>% summarise() queries on data frames and on Arrow data
- resolution of various related bugs
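To make the NA/NaN and equivalence-testing items above concrete, here is a minimal sketch of the kind of check they imply. It is not taken from the Arrow test suite: the data frame `df` and the helper `summarise_by_group()` are hypothetical names introduced only for illustration. The Arrow comparison assumes the arrow package is installed and that the grouped aggregation is supported on a Table, so it is guarded and its result is printed rather than asserted.

```r
library(dplyr)

# Base R reference behaviour that an Arrow-backed summarise() would need to match.
x <- c(1, NA, 3)
y <- c(1, NaN, 3)

sum(x)                 # NA: a missing value propagates by default
sum(x, na.rm = TRUE)   # 4: na.rm = TRUE drops NA before aggregating
mean(y)                # NaN: NaN also propagates by default
mean(y, na.rm = TRUE)  # 2: na.rm = TRUE drops NaN as well

# Small input with NA in both the grouping column and the aggregated column.
df <- data.frame(
  g = c("a", "a", "b", NA),
  x = c(1, NA, 3, 4)
)

# Hypothetical helper: run the same query on any dplyr-compatible input.
# collect() is a no-op for a data frame and materialises an Arrow query.
# arrange() fixes the row order, which should not be relied on otherwise.
summarise_by_group <- function(data) {
  data %>%
    group_by(g) %>%
    summarise(total = sum(x, na.rm = TRUE)) %>%
    collect() %>%
    arrange(g)
}

# Reference result from dplyr on a plain data frame.
# dplyr keeps NA as its own group and sorts it last.
expected <- summarise_by_group(df)

# Same query on Arrow data, compared against the reference result.
if (requireNamespace("arrow", quietly = TRUE)) {
  actual <- summarise_by_group(arrow::Table$create(df))
  print(all.equal(as.data.frame(expected), as.data.frame(actual)))
}
```

A real test would repeat this comparison across aggregate functions, data types, and na.rm settings, which is what the "many new tests" item asks for.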
Reporter: Ian Cook / @ianmcook
Assignee: Ian Cook / @ianmcook
Related issues:
- [C++][R] FunctionOptions not used by aggregation nodes (is blocked by)
- [R] Aggregation on expression doesn't NSE correctly (is blocked by)
- [R] Handle summarize() with 0 arguments or no aggregate functions (is blocked by)
- [R] Bindings for min/max aggregation (is blocked by)
- [R] Support .groups argument to dplyr::summarize() (is blocked by)
- [R] Bindings for count aggregation (is blocked by)
- [R] Bindings for mean, var, sd aggregation (is blocked by)
- [R] Binding for median() and quantile() aggregation functions (is blocked by)
- [R] mutate after group_by should be ok as long as there are only scalar functions (is blocked by)
- [R] Handle complex summarize expressions (is blocked by)
- [C++] Add option to handle NAs to VarianceOptions (is blocked by)
- [R] summarize() should not eagerly evaluate (is blocked by)
- [C++] Implement ScalarAggregateOptions for count_distinct (grouped) (is blocked by)
- [R] Initial bindings for ExecPlan/ExecNode (relates to)
Note: This issue was originally created as ARROW-13618. Please see the migration documentation for further details.