Intro
I'm currently involved in a research project called EFFECT. EFFECT is a multicentre, cluster-randomised, placebo-controlled cross-over trial evaluating antiseptic body wash of patients on intensive care units (ICU). The trial is to test whether daily antiseptic body wash reduces the risk of intensive care unit (ICU)-acquired primary bacteraemia and ICU-acquired multidrug-resistant organisms. EFFECT requires two types of data: (1) The patients' individual ward-movement history and
(2) microbiological test results (see Meissner 2017).
According to the study protocol, positive blood tests do count as infection unless there is a negative blood test within 48 hours after the positive blood test.
In this blog post, I show how to solve this problem on a computational level.
The Problem
The following code chunk provides an hypothetical example of the microbiological data I have to deal with. The data frame df.mibi contains 4 variables:
- ID: Patient id (only 1 patient in this example);
- ORGANISM: name of skin commensal organism found in some blood sample,
- RESULT: laboratory test result (POS vs. NEG);
- DATE: date of laboratory test
library(tidyverse)
library(lubridate)
df.mibi <- tibble(
ID = paste0("ID_", rep(1, 11)),
ORGANISM = c(rep('Propionibacterium acnes', 2),
rep('Staphylococcus epidermidis', 2),
rep('Staphylococcus capitis', 2),
rep('', 5)),
RESULT = c(rep('POS', 6), rep('NEG', 5)),
DATE = ymd(c(
"2018-02-07", "2018-02-12", "2018-02-13", "2018-02-20",
"2018-02-21", "2018-03-18", "2018-02-01", "2018-02-06",
"2018-02-10", "2018-02-21", "2018-04-05")
)
)
My Idea
In a first step, I separated df.mibi into two data frames:
df.POS: containing positive blood tests onlydf.NEG: containing negative blood tests only
df.POS <- df.mibi %>% filter(RESULT == 'POS') df.NEG <- df.mibi %>% filter(RESULT == 'NEG')
In a second step, I removed two variables from df.NEG (RESULT, ORGANISM), grouped the data frame by ID, and put all dates belonging to one ID into the list column data using the nest() function of the tidyr package
df.NEG <- df.NEG %>%
select(ID, DATE) %>%
group_by(ID) %>%
nest()
This is how both data frames look like:
df.POS
## # A tibble: 6 x 4 ## ID ORGANISM RESULT DATE ## <chr> <chr> <chr> <date> ## 1 ID_1 Propionibacterium acnes POS 2018-02-07 ## 2 ID_1 Propionibacterium acnes POS 2018-02-12 ## 3 ID_1 Staphylococcus epidermidis POS 2018-02-13 ## 4 ID_1 Staphylococcus epidermidis POS 2018-02-20 ## 5 ID_1 Staphylococcus capitis POS 2018-02-21 ## 6 ID_1 Staphylococcus capitis POS 2018-03-18
df.NEG
## # A tibble: 1 x 2 ## ID data ## <chr> <list> ## 1 ID_1 <tibble [5 x 1]>
In a third step, I tried to check whether one of the negative test (stored in the list variable data) lies within the time interval positive test + 48 hours (TIME).
I did the mapping using the map2() function of the purrr package:
# merging and mapping
df.TOTAL <- df.POS %>%
left_join(df.NEG, by = 'ID') %>%
mutate(TIME = interval(DATE, DATE + days(2)),
RESULT = map2(data, "DATE", TIME, ~ .x %within% .y))
Unfortunaltely, my code did not work. The RESULT variable should be logical and return TRUE in case of a negative test result up to 2 days after the positive test. Instead it is a list and returns NULL.
df.TOTAL
## # A tibble: 6 x 6 ## ID ORGANISM RESULT DATE data TIME ## <chr> <chr> <list> <date> <list> <S4: Interval> ## 1 ID_1 Propionib~ <NULL> 2018-02-07 <tibb~ 2018-02-07 UTC--2018-02-09 UTC ## 2 ID_1 Propionib~ <NULL> 2018-02-12 <tibb~ 2018-02-12 UTC--2018-02-14 UTC ## 3 ID_1 Staphyloc~ <NULL> 2018-02-13 <tibb~ 2018-02-13 UTC--2018-02-15 UTC ## 4 ID_1 Staphyloc~ <NULL> 2018-02-20 <tibb~ 2018-02-20 UTC--2018-02-22 UTC ## 5 ID_1 Staphyloc~ <NULL> 2018-02-21 <tibb~ 2018-02-21 UTC--2018-02-23 UTC ## 6 ID_1 Staphyloc~ <NULL> 2018-03-18 <tibb~ 2018-03-18 UTC--2018-03-20 UTC
The Solution
Not even one hour after I posted my question to StackOverflow, a user who calles himself “utubun” found the following solution:
df.TOTAL <- df.POS %>%
left_join(df.NEG, by = 'ID') %>%
mutate(TIME = interval(DATE, DATE + days(2)),
RESULT = map2_lgl(data, TIME, ~ any(.x$DATE %within% .y)))
df.TOTAL
## # A tibble: 6 x 6 ## ID ORGANISM RESULT DATE data TIME ## <chr> <chr> <lgl> <date> <list> <S4: Interval> ## 1 ID_1 Propionib~ FALSE 2018-02-07 <tibb~ 2018-02-07 UTC--2018-02-09 UTC ## 2 ID_1 Propionib~ FALSE 2018-02-12 <tibb~ 2018-02-12 UTC--2018-02-14 UTC ## 3 ID_1 Staphyloc~ FALSE 2018-02-13 <tibb~ 2018-02-13 UTC--2018-02-15 UTC ## 4 ID_1 Staphyloc~ TRUE 2018-02-20 <tibb~ 2018-02-20 UTC--2018-02-22 UTC ## 5 ID_1 Staphyloc~ TRUE 2018-02-21 <tibb~ 2018-02-21 UTC--2018-02-23 UTC ## 6 ID_1 Staphyloc~ FALSE 2018-03-18 <tibb~ 2018-03-18 UTC--2018-03-20 UTC
It works!!! Thank you very much! 🙂