[R] [C++] Implement SQL-alike distinct() for dplyr queries #18714

@asfimport

Description

Hi

It would be useful to be able to obtain the unique combinations of selected columns directly from a dataset query, before collecting it into a data frame. For example:

open_dataset("sitc-rev2/parquet/",
             partitioning = c("Year", "Trade Flow", "Reporter ISO")) %>%
  select(Year, `Reporter ISO`) %>%
  filter(Year >= 1988 & Year <= 1994) %>% 
  distinct() %>% 
  collect()

However, with the current development version of the arrow package (installed from GitHub), the distinct() step fails with this error:

Error in UseMethod("distinct") : 
  no applicable method for 'distinct' applied to an object of class "arrow_dplyr_query"

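Calling collect() first and then distinct() on the resulting data frame works, but it pulls every matching row into memory before deduplicating: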

reporters_1 <- open_dataset("sitc-rev2/parquet/",
             partitioning = c("Year", "Trade Flow", "Reporter ISO")) %>%
  select(Year, `Reporter ISO`) %>%
  filter(Year >= 1988 & Year <= 1994) %>% 
  collect() %>% 
  distinct()

Reporter: Mauricio 'Pachá' Vargas Sepúlveda / @pachadotdev

Note: This issue was originally created as ARROW-13107. Please see the migration documentation for further details.
