Overlapping levels produce unexpected splits #6

@JackEdTaylor

Problem

In version 0.3.1 and earlier (before the current dev version), split_by() can produce incorrect levels when splitting by a variable stored as a double, if the upper bound of one level is also the lower bound of another.

Minimal example: values between 2.5 and 3.2 should not belong to any level, but split_by() instead assigns them to a new level inserted between levels 2 and 3. This results in a total of 4 levels, rather than the desired 3.

# v0.3.1
dat <- data.frame(
  a = c(1.0, 1.1, 1.3, 1.9, 2, 2.1, 2.2, 2.9, 3, 3.1, 3.3)
)

dat |>
  split_by(a, 1:2 ~ 2:2.5 ~ 3.2:4, filter=FALSE) |>
  with(df)
# A tibble: 11 × 3
       a LexOPS_splitCond_A tmp  
   <dbl> <fct>              <chr>
 1   1   A1                 A1   
 2   1.1 A1                 A1   
 3   1.3 A1                 A1   
 4   1.9 A1                 A1   
 5   2   A2                 A2   
 6   2.1 A2                 A2   
 7   2.2 A2                 A2   
 8   2.9 A3                 A3   
 9   3   A3                 A3   
10   3.1 A3                 A3   
11   3.3 A4                 A4   

This was caused by the breaks being passed through unique() before being given to the cut() function:

cuts <- unique(unlist(breaks))

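Because the duplicated bound (2) is dropped, the five remaining breaks define four contiguous intervals, so the gap between 2.5 and 3.2 in the example is treated as a level of its own rather than being ignored: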

breaks <- unique(c(1, 2, 2, 2.5, 3.2, 4))
cut(dat$a, breaks, labels=1:4)
 [1] <NA> 1    1    1    1    2    2    3    3    3    4   
Levels: 1 2 3 4

Fix

The issue was addressed by ff8b8ab. Instead of cut(), a more transparent method of subsetting is used:

cuts_mat <- sapply(breaks, function(b) df[[column]]>=b[[1]] & df[[column]]<=b[[2]])
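
To illustrate how this subsetting treats gaps, here is a minimal standalone sketch. It assumes breaks is a list of c(lower, upper) pairs (inferred from the b[[1]] and b[[2]] indexing above); dat, column, the bounds, and level_idx are hypothetical stand-ins rather than the package's internals, and the first level is shortened to end at 1.9 so that no levels overlap.

```r
# Hypothetical stand-ins for the objects used inside split_by()
dat <- data.frame(
  a = c(1.0, 1.1, 1.3, 1.9, 2, 2.1, 2.2, 2.9, 3, 3.1, 3.3)
)
column <- "a"

# Assumed representation: one c(lower, upper) pair per level, with no overlap
breaks <- list(c(1, 1.9), c(2, 2.5), c(3.2, 4))

# One logical column per level: TRUE where the value lies within [lower, upper]
cuts_mat <- sapply(breaks, function(b) dat[[column]] >= b[[1]] & dat[[column]] <= b[[2]])

# Index of the matching level for each row, or NA if the value is in no level
level_idx <- apply(cuts_mat, 1, function(r) if (any(r)) which(r)[1] else NA_integer_)
level_idx
```

In this sketch the values 2.9, 3 and 3.1 match no level and get NA, so only three levels are produced and the gap between 2.5 and 3.2 is left unassigned.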

In addition, the dev version now correctly detects that a value of 2 could belong to either of the first two levels, and throws an error to avoid any ambiguity in pipelines.

# v0.3.1.9000 (Commit 372df3b)
dat <- data.frame(
  a = c(1.0, 1.1, 1.3, 1.9, 2, 2.1, 2.2, 2.9, 3, 3.1, 3.3)
)

dat |>
  split_by(a, 1:2 ~ 2:2.5 ~ 3.2:4, filter=FALSE) |>
  with(df)
Error in check_overlapping(breaks) : 
  overlapping levels - ensure that no value could fall into multiple levels
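
For reference, one way such an overlap check could work is sketched below. This is not the package's check_overlapping(), whose implementation is not shown in this issue; has_overlap is a hypothetical helper, and it again assumes each level is a c(lower, upper) pair with inclusive bounds.

```r
# Hypothetical helper (not the package's check_overlapping()):
# returns TRUE if any two closed intervals [lower, upper] share a value
has_overlap <- function(breaks) {
  lower <- vapply(breaks, function(b) b[[1]], numeric(1))
  upper <- vapply(breaks, function(b) b[[2]], numeric(1))

  # Sort the intervals by their lower bound
  ord <- order(lower)
  lower <- lower[ord]
  upper <- upper[ord]

  n <- length(lower)
  if (n < 2) return(FALSE)

  # An interval overlaps an earlier one if its lower bound does not exceed
  # the largest upper bound seen so far (bounds are inclusive, hence <=)
  any(lower[-1] <= cummax(upper)[-n])
}

has_overlap(list(c(1, 2), c(2, 2.5), c(3.2, 4)))    # levels 1 and 2 share the bound 2
has_overlap(list(c(1, 1.9), c(2, 2.5), c(3.2, 4)))  # disjoint levels
```

The first call returns TRUE because a value of exactly 2 satisfies both of the first two levels; the second returns FALSE. A check along these lines could be followed by stop() to produce an error like the one above.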
