[R] Unable to filter a factor column in a Dataset using %in% #43440
Closed
Description
Describe the bug, including details regarding any error messages, version, and platform.
Hello,
Is it possible to filter a factor column using %in% in an Arrow Dataset?
I naively expected that if I save an arrow IPC file with a factor column, I would then be able to filter that column using %in% when the file is loaded as an Arrow Dataset. Instead, I get the error Type error: Array type doesn't match type of values set: string vs dictionary<values=string, indices=int8, ordered=0>. Arrow seems aware of factors though, since I can filter that same column using == or != and collecting the dataset without filtering returns a factor. I was able to recreate this error on both Windows and Linux, please see the attached reprex for details.
Thank you in advance for your help!
# Create a simple data.frame and save as an arrow IPC file
temp_file <- tempfile()
d1 <- data.frame(x = factor(c("a", "b", "c")))
arrow::write_feather(d1, temp_file)
# Filtering using == (or !=) works
d2 <- arrow::open_dataset(temp_file, format = "arrow") |>
  dplyr::filter(x == "a") |>
  dplyr::collect()
# Filtering using %in% does not work (for single or multiple values)
d3 <- arrow::open_dataset(temp_file, format = "arrow") |>
  dplyr::filter(x %in% "a") |>
  dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! Type error: Array type doesn't match type of values set: string vs dictionary<values=string, indices=int8, ordered=0>
# Collecting the dataset before filtering also works and returns a factor
d4 <- arrow::open_dataset(temp_file, format = "arrow") |>
  dplyr::collect() |>
  dplyr::filter(x %in% c("a"))
is.factor(d4$x)
#> [1] TRUE
Created on 2024-07-26 with reprex v2.1.1
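Since == works on the factor column in the reprex above, one possible stopgap is to spell the membership test out as a chain of == comparisons joined with |. This is only a sketch under that assumption, not a fix for %in% itself. The object name d5 and the choice of levels "a" and "b" are made up for illustration.

```r
# Recreate the same one-column IPC file used in the reprex
temp_file <- tempfile()
d1 <- data.frame(x = factor(c("a", "b", "c")))
arrow::write_feather(d1, temp_file)

# Instead of x %in% c("a", "b"), compare against each value with ==
# and combine the comparisons with | (assumes == keeps working on
# dictionary columns, as observed above)
d5 <- arrow::open_dataset(temp_file, format = "arrow") |>
  dplyr::filter(x == "a" | x == "b") |>
  dplyr::collect()

# Inspect the kept values
as.character(d5$x)
```

This gets unwieldy for long value lists, so it only helps when the set of levels to keep is small.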
Session info
sessionInfo()
#> R version 4.4.1 (2024-06-14 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 22631)
#>
#> Matrix products: default
#>
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8
#> [2] LC_CTYPE=English_United States.utf8
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.utf8
#>
#> time zone: America/Los_Angeles
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] vctrs_0.6.5 cli_3.6.3 knitr_1.48 rlang_1.1.4
#> [5] xfun_0.45 purrr_1.0.2 generics_0.1.3 assertthat_0.2.1
#> [9] glue_1.7.0 bit_4.0.5 htmltools_0.5.8.1 fansi_1.0.6
#> [13] rmarkdown_2.27 tibble_3.2.1 evaluate_0.24.0 tzdb_0.4.0
#> [17] fastmap_1.2.0 yaml_2.3.9 lifecycle_1.0.4 compiler_4.4.1
#> [21] dplyr_1.1.4 fs_1.6.4 pkgconfig_2.0.3 rstudioapi_0.16.0
#> [25] digest_0.6.36 R6_2.5.1 utf8_1.2.4 reprex_2.1.1
#> [29] tidyselect_1.2.1 pillar_1.9.0 magrittr_2.0.3 tools_4.4.1
#> [33] withr_3.0.0 bit64_4.0.5 arrow_16.1.0
Component(s)
R