Description
ARROW-14167 added support for factors in coalesce(), but the factors that are returned will not necessarily retain the factor levels like coalesce() does when used on an R data frame.
For example, compare these, noticing the difference in the levels:
# Packages the examples rely on
library(dplyr)
library(arrow)

# R data frame
tibble(x = factor(c("a", NA_character_)), y = factor(c("b", "c"))) %>%
  mutate(y = coalesce(x, y)) %>%
  pull(y)
#> [1] a c
#> Levels: a b c

# Arrow Table
tibble(x = factor(c("a", NA_character_)), y = factor(c("b", "c"))) %>%
  Table$create() %>%
  mutate(y = coalesce(x, y)) %>%
  pull(y)
#> [1] a c
#> Levels: a c

Similarly, ARROW-13358 and ARROW-14659 added support for factors in if_else(), but the returned factors will not always retain the levels like if_else() does when used on an R data frame.
I'm not sure if it is practical to make Arrow return the factors with the unused levels included like R does. If so, we should do it.
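Until that is decided, the level set that R produces can be restored on the R side after the result has been pulled into memory. The following is only a minimal base R sketch under stated assumptions: `arrow_like` is a hypothetical stand-in for the factor Arrow returns in the example above, and `all_levels` and `restored` are illustrative names. It does not show how Arrow implements coalesce() or how a fix would work.

```r
# Hypothetical illustration in base R; not Arrow's implementation.
x <- factor(c("a", NA_character_))
y <- factor(c("b", "c"))

# Levels carried by the R data frame result in the example above:
# the levels of x, followed by any new levels from y.
all_levels <- union(levels(x), levels(y))

# Stand-in for the factor Arrow returns (the unused level "b" is missing).
arrow_like <- factor(c("a", "c"))

# Re-declare the full level set; the values themselves are unchanged.
restored <- factor(as.character(arrow_like), levels = all_levels)

levels(restored)
#> [1] "a" "b" "c"
```

This only patches the result after the fact. It assumes the input factors are still available in R so that their levels can be combined.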
See the tests in test-dplyr-funcs-conditional.R that refer to this Jira.
Reporter: Ian Cook / @ianmcook
Related issues:
- [R] Remove warning about factor conversion to string in if_else() (relates to)
- [C++] Optimize dictionary support in kernels/Support nulls in DictionaryUnifier (relates to)
- [C++] Extend type support for if_else kernel (depends upon)
- [C++] Support dictionaries directly in coalesce kernel (depends upon)
Note: This issue was originally created as ARROW-14649. Please see the migration documentation for further details.