[R] Unable to filter a factor column in a Dataset using %in% #43440

@spencerpease

Describe the bug, including details regarding any error messages, version, and platform.

Hello,

Is it possible to filter a factor column using %in% in an Arrow Dataset?

I naively expected that if I saved an Arrow IPC file with a factor column, I would be able to filter that column using %in% when the file is opened as an Arrow Dataset. Instead, I get the error: Type error: Array type doesn't match type of values set: string vs dictionary<values=string, indices=int8, ordered=0>. Arrow does seem to be aware of factors, though: filtering the same column with == or != works, and collecting the dataset without filtering returns a factor. I was able to reproduce this error on both Windows and Linux; please see the reprex below for details.

Thank you in advance for your help!

# Create a simple data.frame and save as an arrow IPC file
temp_file <- tempfile()
d1 <- data.frame(x = factor(c("a", "b", "c")))
arrow::write_feather(d1, temp_file)

# Filtering using == (or !=) works
d2 <- arrow::open_dataset(temp_file, format = "arrow") |>
  dplyr::filter(x == "a") |>
  dplyr::collect()

# Filtering using %in% does not work (for single or multiple values)
d3 <- arrow::open_dataset(temp_file, format = "arrow") |>
  dplyr::filter(x %in% "a") |>
  dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! Type error: Array type doesn't match type of values set: string vs dictionary<values=string, indices=int8, ordered=0>

# Collecting the dataset before filtering also works and returns a factor
d4 <- arrow::open_dataset(temp_file, format = "arrow") |>
  dplyr::collect() |>
  dplyr::filter(x %in% c("a"))

is.factor(d4$x)
#> [1] TRUE

Created on 2024-07-26 with reprex v2.1.1
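
The error message suggests the mismatch is between the string values passed to %in% and the dictionary-encoded column. A possible workaround, sketched here under the assumption that arrow's dplyr bindings translate as.character() into a cast from the dictionary type to string (this has not been checked against every arrow version), is to cast the column inside the filter so both sides of %in% are strings. The name d5 is only illustrative.

```r
# Workaround sketch (assumption: as.character() is pushed down to Arrow as a
# cast from dictionary<string> to string, so %in% compares string with string)
temp_file <- tempfile()
d1 <- data.frame(x = factor(c("a", "b", "c")))
arrow::write_feather(d1, temp_file)

# Cast only inside the filter expression; the stored column x is not modified
d5 <- arrow::open_dataset(temp_file, format = "arrow") |>
  dplyr::filter(as.character(x) %in% c("a", "b")) |>
  dplyr::collect()
```

Because the cast appears only in the filter predicate, the column itself is left untouched in the query, so the collected result should still carry x in its original factor form.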

Session info

sessionInfo()
#> R version 4.4.1 (2024-06-14 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/Los_Angeles
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.6.5       cli_3.6.3         knitr_1.48        rlang_1.1.4      
#>  [5] xfun_0.45         purrr_1.0.2       generics_0.1.3    assertthat_0.2.1 
#>  [9] glue_1.7.0        bit_4.0.5         htmltools_0.5.8.1 fansi_1.0.6      
#> [13] rmarkdown_2.27    tibble_3.2.1      evaluate_0.24.0   tzdb_0.4.0       
#> [17] fastmap_1.2.0     yaml_2.3.9        lifecycle_1.0.4   compiler_4.4.1   
#> [21] dplyr_1.1.4       fs_1.6.4          pkgconfig_2.0.3   rstudioapi_0.16.0
#> [25] digest_0.6.36     R6_2.5.1          utf8_1.2.4        reprex_2.1.1     
#> [29] tidyselect_1.2.1  pillar_1.9.0      magrittr_2.0.3    tools_4.4.1      
#> [33] withr_3.0.0       bit64_4.0.5       arrow_16.1.0

Component(s)

R
