Overlapping levels produce unexpected splits #6
Description
Problem
In the pre-dev version (0.3.1 and earlier), split_by() can produce incorrect splits when splitting by a variable stored as a double, if the upper bound of one level is also the lower bound of another.
Minimal example: this fails to mark values between 2.5 and 3.2 as not belonging to any level, and instead adds a new level between levels 2 and 3. This results in a total of 4 levels, rather than the desired 3.
# v0.3.1
dat <- data.frame(
a = c(1.0, 1.1, 1.3, 1.9, 2, 2.1, 2.2, 2.9, 3, 3.1, 3.3)
)
dat |>
split_by(a, 1:2 ~ 2:2.5 ~ 3.2:4, filter=FALSE) |>
with(df)
# A tibble: 11 × 3
a LexOPS_splitCond_A tmp
<dbl> <fct> <chr>
1 1 A1 A1
2 1.1 A1 A1
3 1.3 A1 A1
4 1.9 A1 A1
5 2 A2 A2
6 2.1 A2 A2
7 2.2 A2 A2
8 2.9 A3 A3
9 3 A3 A3
10 3.1 A3 A3
11 3.3 A4 A4
This was due to the way that the breaks were put through unique() before being given to the cut() function:
Line 206 in 205400e
cuts <- unique(unlist(breaks))
This could result in the gap between 2.5 and 3.2 in the example being ignored:
breaks <- unique(c(1, 2, 2, 2.5, 3.2, 4))
cut(dat$a, breaks, labels=1:4)
 [1] <NA> 1 1 1 1 2 2 3 3 3 4
Levels: 1 2 3 4
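To see why the gap disappears, it can help to inspect the intervals that cut() builds from the deduplicated breaks. The following is a small illustration using the same break values as above; it is not taken from the package source, and `gap_levels` is a name introduced here for illustration.

```r
# Deduplicated breaks, as in the example above
breaks <- unique(c(1, 2, 2, 2.5, 3.2, 4))

# cut() treats every pair of adjacent breaks as an interval, so the
# intended gap between 2.5 and 3.2 becomes a level of its own
gap_levels <- levels(cut(c(1.5, 3), breaks))
print(gap_levels)
```

The printed levels include an interval running from 2.5 to 3.2, which is the unwanted fourth level.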
Fix
The issue was addressed by ff8b8ab. Instead of cut(), a more transparent method of subsetting is used:
Line 163 in ff8b8ab
cuts_mat <- sapply(breaks, function(b) df[[column]]>=b[[1]] & df[[column]]<=b[[2]])
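As a rough sketch of how such a membership matrix can be turned into level assignments, the following standalone function tests each value against closed intervals and returns NA for values that fall in a gap. The function name `assign_levels`, its arguments, and the non-overlapping example intervals are assumptions made for illustration; this is not the package's implementation.

```r
# Hypothetical helper (not from the package): assign each value to the
# level whose closed interval [lower, upper] contains it.
assign_levels <- function(x, breaks) {
  # One logical column per level: TRUE where x lies inside that interval
  in_level <- vapply(
    breaks,
    function(b) x >= b[[1]] & x <= b[[2]],
    logical(length(x))
  )
  # vapply() returns a plain vector when length(x) == 1; restore the matrix
  in_level <- matrix(in_level, nrow = length(x))
  # Keep the level index only when exactly one interval matches
  apply(in_level, 1, function(row) {
    if (sum(row) == 1) which(row) else NA_integer_
  })
}

a <- c(1.0, 1.1, 1.3, 1.9, 2, 2.1, 2.2, 2.9, 3, 3.1, 3.3)
# Assumed non-overlapping levels, chosen so that no value is ambiguous
lvl <- assign_levels(a, list(c(1, 1.9), c(2, 2.5), c(3.2, 4)))
print(lvl)
# 1 1 1 1 2 2 2 NA NA NA 3
```

Because each interval is tested independently, values between 2.5 and 3.2 match no level and stay NA instead of being absorbed into an extra level.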
In addition, the dev version now correctly detects that a value of 2 could belong to either group, and gives an error to avoid any ambiguity in pipelines.
# v0.3.1.9000 (Commit 372df3b)
dat <- data.frame(
a = c(1.0, 1.1, 1.3, 1.9, 2, 2.1, 2.2, 2.9, 3, 3.1, 3.3)
)
dat |>
split_by(a, 1:2 ~ 2:2.5 ~ 3.2:4, filter=FALSE) |>
with(df)
Error in check_overlapping(breaks) :
overlapping levels - ensure that no value could fall into multiple levels
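One way such an overlap check could work is sketched below. This is only an illustration under the assumption that each level is a closed interval given as a lower and upper bound; the function `levels_overlap` is hypothetical and is not the package's check_overlapping().

```r
# Hypothetical sketch (not the package's check_overlapping()):
# TRUE if any two closed intervals share at least one value.
levels_overlap <- function(breaks) {
  lower <- vapply(breaks, function(b) b[[1]], numeric(1))
  upper <- vapply(breaks, function(b) b[[2]], numeric(1))
  n <- length(lower)
  if (n < 2) return(FALSE)
  # Sort intervals by lower bound
  ord <- order(lower)
  lower <- lower[ord]
  upper <- upper[ord]
  # An overlap exists if an interval starts at or before the largest
  # upper bound among the intervals that precede it
  any(lower[-1] <= cummax(upper)[-n])
}

overlap_shared <- levels_overlap(list(c(1, 2), c(2, 2.5), c(3.2, 4)))
overlap_apart <- levels_overlap(list(c(1, 1.9), c(2, 2.5), c(3.2, 4)))
print(c(overlap_shared, overlap_apart))
# TRUE FALSE
```

With the levels from the example, the shared bound of 2 is flagged, while moving the first upper bound to 1.9 removes the ambiguity.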