Mastering Multiple Imputations using R. Part I: Grouped imputations

Intro

Multiple imputation is one of the great ideas in statistical science. The technique is simple, elegant and powerful. It is simple because it fills the holes in the data with plausible values. It is elegant because the uncertainty about the unknown data is coded in the data itself. And it is powerful because it can solve “other” problems that are actually missing data problems in disguise (Stef van Buuren).

Multiple imputation has become a standard for the treatment of missing data. Unlike conventional imputation methods (mean, median imputation etc.) that merely replace the missing values with a single value, “Rubin’s (1987) multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute” (see Yang C. Yuan). According to Wikipedia, “the primary method of multiple imputation is multiple imputation by chained equations (MICE).” In R, the MICE method is implemented by the mice package.

In this blog post, I show how to conduct multiple imputation on a numeric variable grouped by another variable indicating the meaning of the numeric values. The “grouping” is done using a function of the miceadds package and works similarly to the group_by() function of the dplyr package.

Data and packages

library(tidyverse)
library(mice)
library(miceadds)
library(labelled)

The dataset we use in this blog post ships with the mice package and is called brandsma. The dataset contains “data from 4106 pupils attending 216 schools. This dataset includes all pupils and schools with missing data.” (see mice package description). With the following code, we select the variables we need for multiple imputation and for fitting statistical models. The variables are:

  • sch (School number)
  • pup (Pupil ID)
  • sex (Sex of pupil)
  • min (Minority member)
  • iqv (IQ verbal)
  • iqp (IQ performal)
  • lpr (language score PRE)
  • lpo (language score POST)
  • apr (Arithmetic score PRE)
  • apo (Arithmetic score POST)
df.RAW <- mice::brandsma %>%
  select(sch, pup, sex, min, iqv, iqp, lpr, lpo, apr, apo) %>%
  filter(!is.na(sex)) %>%
  set_variable_labels(.labels = c(
    "School number", "Pupil ID",
    "Sex of pupil", "Minority member",
    "IQ verbal", "IQ performal",
    "language score PRE", "language score POST",
    "Arithmetic score PRE", "Arithmetic score POST"
  ))
glimpse(df.RAW)
## Observations: 4,096
## Variables: 10
## $ sch <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ pup <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, …
## $ sex <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ min <int> 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0,…
## $ iqv <dbl> -1.3535094, 2.1464906, 3.1464906, 2.6464906, -2.3535094, -0.8535094, -3.8535094, -2…
## $ iqp <dbl> -3.72274979, 3.27725021, 1.27391680, -1.05608313, -0.05608313, -1.05608313, -4.3900…
## $ lpr <dbl> 33, 44, 36, 36, 33, 29, 19, 22, 20, 44, 34, 31, 18, 36, 31, 34, 31, 32, 23, 20, 27,…
## $ lpo <dbl> NA, 50, 46, 45, 33, 46, 20, 30, 30, 57, 36, 36, 29, 40, 41, 47, 33, 37, 29, 26, 37,…
## $ apr <dbl> 10, 18, 14, 12, 10, 13, 8, 8, 7, 17, 10, 14, 11, 10, 10, 12, 9, 13, 9, 9, 7, 16, 16…
## $ apo <dbl> 12, 30, 24, 19, 24, 26, 9, 13, 13, 30, 23, 22, 19, 23, 18, 22, 15, 21, 13, 12, 11, …

Using the pivot_longer() function of the tidyr package, we transform the original data into a long table. The values of the quantitative variables are stored in the new variable VALUE. The ENDPOINT variable indicates to which variable each value belongs. Furthermore, we create a new variable (IMPUTED) which indicates missing values that need to be imputed. In addition, we assign value labels to the variables sex and min.

df.RAW <- df.RAW %>%
  pivot_longer(names_to = 'ENDPOINT', values_to = 'VALUE', cols = c(iqv:apo)) %>%
  mutate(IMPUTED = is.na(VALUE),
         sex = factor(sex, levels = c(0, 1), labels = c('f', 'm')),
         min = factor(min, levels = c(0, 1), labels = c('n', 'y')))
glimpse(df.RAW)
## Observations: 24,576
## Variables: 7
## $ sch      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pup      <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, …
## $ sex      <fct> m, m, m, m, m, m, m, m, m, m, m, m, f, f, f, f, f, f, f, f, f, f, f, f, f, f, …
## $ min      <fct> y, y, y, y, y, y, y, y, y, y, y, y, n, n, n, n, n, n, y, y, y, y, y, y, n, n, …
## $ ENDPOINT <chr> "iqv", "iqp", "lpr", "lpo", "apr", "apo", "iqv", "iqp", "lpr", "lpo", "apr", "…
## $ VALUE    <dbl> -1.35350942, -3.72274979, 33.00000000, NA, 10.00000000, 12.00000000, 2.1464905…
## $ IMPUTED  <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…

Specifying the imputation model

init <- mice(df.RAW, maxit = 0)
meth <- init$method
predM <- init$predictorMatrix

# groupwise imputation of variable VALUE
meth["VALUE"] <- "bygroup"

# remove as predictor
predM[, c('IMPUTED')] <- 0
# specify name of the grouping variable ('ENDPOINT') and imputation method ('pmm')
group <- list( "VALUE" = "ENDPOINT" )
imputationFunction <- list("VALUE" = "pmm" )
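To make the idea behind the "bygroup" method more tangible, here is a minimal base R sketch of imputing within groups. It is a deliberately simplified stand-in and not what miceadds does internally: instead of predictive mean matching, each missing value is replaced by a random draw from the observed values of the same ENDPOINT group. The toy data and the helper name impute_within_group() are made up for illustration.

```r
set.seed(2020)

# toy long-format data: two endpoints, one missing value in each (made-up values)
toy <- data.frame(
  ENDPOINT = rep(c("iqv", "lpo"), each = 4),
  VALUE    = c(-1.4, 2.1, NA, 3.1, 50, NA, 46, 45)
)

# hypothetical helper: fill NAs with random draws from the same group
impute_within_group <- function(value, group) {
  for (g in unique(group)) {
    idx  <- which(group == g)            # rows belonging to this group
    miss <- idx[is.na(value[idx])]       # rows to be imputed
    obs  <- value[idx][!is.na(value[idx])]  # observed donor values
    if (length(miss) > 0 && length(obs) > 0) {
      value[miss] <- obs[sample.int(length(obs), length(miss), replace = TRUE)]
    }
  }
  value
}

toy$VALUE_IMP <- impute_within_group(toy$VALUE, toy$ENDPOINT)
toy
```

The point of the grouping is visible in the result: the missing iqv value can only receive an iqv value, and the missing lpo value only an lpo value, so scores on very different scales are never mixed.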

Running multiple imputations

## grouped imputation

mids_endpoint <- df.RAW %>%
  mice(.,
       method = meth,
       predictorMatrix = predM,
       m = 3,
       seed = 2020,
       printFlag = FALSE,
       group = group,
       imputationFunction = imputationFunction 
  )

Creating complete data

df.COMPLETE <- mice::complete(mids_endpoint, include = FALSE) %>% 
  mutate(ENDPOINT = factor(ENDPOINT, levels = c('iqv', 'iqp', 'apr', 'apo', 'lpr', 'lpo')))

Plotting observed and imputed values

ggplot(df.COMPLETE, aes(x = ENDPOINT, y = VALUE)) +
  geom_jitter(aes(colour = IMPUTED, alpha = IMPUTED), size = 0.5) +
  geom_violin(alpha=0) +
  scale_colour_manual(values = c('grey', 'red3')) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(colour = NULL, alpha = NULL, x = NULL, y = NULL,
       title = 'The distribution of observed and imputed values',
       subtitle = 'Red dots indicate imputed values') 

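[Figure: violin plots with jittered points showing the distribution of observed (grey) and imputed (red) values for each ENDPOINT]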

Formatting p-values: A curated list of R functions

Intro

Reporting the results of statistical analyses usually involves reporting p-values, which indicate the probability, under the null hypothesis, of obtaining a test statistic at least as extreme as the one observed.

R offers quite a lot of options to format p-values. This blog post will give an overview (which is by no means comprehensive).

First, we create a vector with six p-values and one missing value.

p <- c(0.50, 0.12, 0.045, 0.011, 0.009, 0.0000234, NA)
options(scipen = 9999) # suppress scientific notation

Formatting p-values

base package (Base R)

The first function I'm going to introduce is part of the base package, which comes with the default R installation. Thus, neither install.packages() nor library() is required to use it.

Using the digits option, the number of significant digits can be specified.

format.pval(p)
## [1] "0.500"   "0.120"   "0.045"   "0.011"   "0.009"   "0.00002" "NA"
format.pval(p, 
            digits = 2) # how many significant digits are to be used
## [1] "0.50" "0.12" "0.04" "0.01" "0.01" "0"    "NA"

Hmisc package

With nsmall, the format.pval() function of the popular Hmisc package includes an option to specify the minimum number of digits to the right of the decimal point.

library(Hmisc)
Hmisc::format.pval(p)
## [1] "0.500"   "0.120"   "0.045"   "0.011"   "0.009"   "0.00002" "NA"
Hmisc::format.pval(p,
                   nsmall=3, # the minimum number of digits to the right of the decimal point 
                   digits = 2) # how many significant digits are to be used
## [1] "0.500" "0.120" "0.040" "0.010" "0.010" "0.000" "NA"

scales package

In addition, the pvalue() function of the scales package has an option (accuracy) to specify a threshold for rounding the p-value according to a given significance level.

library(scales)
scales::pvalue(p)
## [1] "0.500"  "0.120"  "0.045"  "0.011"  "0.009"  "<0.001" "NA"
scales::pvalue(p,
               accuracy = 0.05, # Number to round to
               decimal.mark = ".", # The character to be used to indicate the numeric decimal point
               add_p = TRUE) # Add "p=" before the value?
## [1] "p=0.50" "p=0.10" "p<0.05" "p<0.05" "p<0.05" "p<0.05" "p=NA"

finalfit package

The p_tidy() function of the finalfit package doesn't have an option to specify the number of significant digits. Instead, the digits option specifies the number of decimal places the p-value is rounded to.

library(finalfit)
finalfit::p_tidy(p, digits = 2)
## [1] "=0.50" "=0.12" "=0.04" "=0.01" "=0.01" "<0.01" "=NA"
finalfit::p_tidy(p, 
                 digits = 3, # value to round to, no default
                 prefix = NULL) # suppress prefix
## [1] "0.500"  "0.120"  "0.045"  "0.011"  "0.009"  "<0.001" "NA"

psycho package

The format_p() function of the psycho package formats the p-values according to predefined significance levels (< .05, < .01, < .001). Significance stars are added by default and can be switched off.

library(psycho)
psycho::format_p(p)
## [1] "> .1"      "> .1"      "< .05*"    "< .05*"    "< .01**"   "< .001***"
## [7] NA
psycho::format_p(p,
                 stars = FALSE) # remove significance stars
## [1] "> .1"   "> .1"   "< .05"  "< .05"  "< .01"  "< .001" NA
psycho::format_p(p,
                 stars_only = TRUE) # return only significance stars
## [1] ""    ""    "*"   "*"   "**"  "***" NA
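If none of the packages above fits, a small base R helper can be put together in a few lines. The following is only a sketch under my own assumptions: the name format_p_simple() and its digits argument are made up, and the rule (fixed number of decimal places, with a "<" label below the smallest representable value) is just one of several reasonable conventions.

```r
# hypothetical helper: fixed decimals, "<0.001"-style label for tiny values
format_p_simple <- function(p, digits = 3) {
  eps <- 10^(-digits)                                  # smallest value that can be shown
  shown <- formatC(p, digits = digits, format = "f")   # e.g. 0.5 -> "0.500"
  below <- paste0("<", formatC(eps, digits = digits, format = "f"))
  out <- ifelse(p < eps, below, shown)
  out[is.na(p)] <- NA_character_                       # keep missing values missing
  out
}

p <- c(0.50, 0.12, 0.045, 0.011, 0.009, 0.0000234, NA)
format_p_simple(p)
```

Unlike rounding alone, this never prints a p-value as 0, which is the main reason to handle the smallest values separately.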

How to Vectorize a Function in R

Last year I came across the base R function Vectorize(). Vectorize() vectorizes the action of a non-vectorized function. Let’s give an example.

In one of my current research projects, I need to hash patient IDs to fulfill the requirements of data privacy protection. With sha1(), the digest package contains a function to calculate a hash of an object. Let’s see what the function does when we apply it to a column of the mtcars data frame.

First, we write the row names (names of the cars) into a new variable (‘NAME’):

library(dplyr)
library(tibble)

data("mtcars")
mtcars <- mtcars %>%
  tibble::rownames_to_column('NAME')

Now, we assume that ‘NAME’ is the id variable we want to hash:

library(digest)

mtcars <- mtcars %>%
  mutate(HASH = sha1(NAME)) %>%
  select(NAME, HASH, mpg)

head(mtcars)
##                NAME                                     HASH  mpg
## 1         Mazda RX4 cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 21.0
## 2     Mazda RX4 Wag cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 21.0
## 3        Datsun 710 cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 22.8
## 4    Hornet 4 Drive cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 21.4
## 5 Hornet Sportabout cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 18.7
## 6           Valiant cbe2ae3f7e5a2c558d4c36cad5e27a906e8aef8d 18.1

As we can see, different car names received the same hash. This is not exactly what we want. It happened because the sha1() function is not vectorized.

In the final step, we vectorize the sha1() function and apply it once again to the mtcars data frame:

sha1_vectorized <- Vectorize(digest::sha1)
mtcars <- mtcars %>%
  mutate(HASH = sha1_vectorized(NAME)) %>%
  select(NAME, HASH, mpg)

head(mtcars)
##                NAME                                     HASH  mpg
## 1         Mazda RX4 b22967895db5fb044febfaad31d34ccfc95f4440 21.0
## 2     Mazda RX4 Wag 45464747af0f4df66ee253bfef89d4b106cfb713 21.0
## 3        Datsun 710 785ba328b314246358feec3166fafa71bb724793 22.8
## 4    Hornet 4 Drive e1265538639ccf3f772038fe3db16aaaa28a4dd9 21.4
## 5 Hornet Sportabout 0b3f30b312e17c7c610399bf204ea9de2c71b96e 18.7
## 6           Valiant fe5206e3d182bff5748e295f9f78dba99ed0ec7f 18.1

Bingo! The vectorized version of sha1() did the job!

PS: Vectorizing a function makes the function perform the same operation on every entry in a data structure (but with different values) (see Win-Vector Blog). The non-vectorized sha1() function seems to treat the variable NAME as a single object. Thus, it does not hash each entry of the variable separately, but all elements of the variable as a whole.
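The same effect can be reproduced without the digest package. In the base R sketch below, total_chars() is a made-up function that, like sha1(), summarises its whole input as one value; wrapping it in Vectorize() makes it work element by element. Setting USE.NAMES = FALSE is my own choice here, since otherwise the result of a character input is named after the input strings.

```r
# hypothetical helper: treats its input as one object and returns a single value
total_chars <- function(x) sum(nchar(x))

cars <- c("Mazda RX4", "Valiant")

# one result for the whole vector
total_chars(cars)

# element-wise version built with Vectorize()
chars_each <- Vectorize(total_chars, USE.NAMES = FALSE)

# one result per element
chars_each(cars)
```

Internally, Vectorize() builds a wrapper around mapply(), so it is a convenience rather than a speed-up: the wrapped function is still called once per element.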

R Markdown Inline Code: Adding a Conjunction to Listings

In my last blog post, I wrote a couple of lines about EFFECT, a clinical trial I'm currently involved in. EFFECT is a cross-over trial with two wash-out and two study phases. After each of the four phases, the participating hospitals receive a summary of some study results. When I write these summary reports using R Markdown, I put a character string with the names of the ICUs into an inline R expression.

I usually extract the character vector containing the names of the ICUs from one of the data frames I initially load into the document. Here, I create it manually:

names.icu <- LETTERS[1:3]

When I put this vector into an inline R expression, e.g.

  • Hospital X takes part with the following ICUs: `r names.icu`.

I get the following output:

“Hospital X takes part with the following ICUs: A, B, C.”

This is not exactly what I want, because by convention the last element of a listing should be preceded by a conjunction (in English: and). While some languages require the next-to-last element of a listing to be followed by a comma, other languages don't. Since the hospitals taking part in the trial have between one and five intensive care units (ICUs), I needed to write a function to cover the following cases:

  • If the string vector has one element only, it must not be followed by a conjunction;
  • If the string vector has got more than one element, the next-to-last word of the listing must be followed by:
    • both comma and conjunction or
    • conjunction only.

The add_and() Function

The function I wrote can be found in the following code chunk:

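add_and <- function(x, conj = "and") {
  l <- length(x)
  if (l > 1) {
    # merge the last two elements, joined by the conjunction
    x[l] <- paste(x[l - 1], conj, x[l])
    x <- x[-(l - 1)]
    # drop the blank that paste() puts before a leading comma (", and")
    x <- sub("\\s,", ",", x)
  }
  # vectors with a single element are returned unchanged
  x
}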

The function has got two input parameters:

  • x: the character vector and
  • conj: the conjunction (default = “and”) which may be preceded by comma and white space (“, and”) or not (“and”).

First, the function checks whether the character vector has more than one element. If not, it is returned as is. If so, the conjunction is put before the last element of the vector. If the next-to-last listing element is followed by a comma, the sub() function (sub("\\s,", ",", x)) removes the white space preceding this comma.
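For comparison, here is a variant that returns one finished string instead of a shortened vector, so it does not rely on the inline expression collapsing the elements with commas. This is only a sketch: the name join_with_conj() and the oxford argument are my own inventions, not part of the function above.

```r
# hypothetical variant: returns a single string such as "A, B and C"
join_with_conj <- function(x, conj = "and", oxford = FALSE) {
  n <- length(x)
  if (n < 2) return(x)                              # nothing to join
  if (n == 2) return(paste(x[1], conj, x[2]))       # "A and B", never a comma
  # separator before the last element: ", and " or " and "
  last_sep <- if (oxford) paste0(", ", conj, " ") else paste0(" ", conj, " ")
  paste0(paste(x[-n], collapse = ", "), last_sep, x[n])
}

join_with_conj(LETTERS[1:3])                  # English, no comma before "and"
join_with_conj(LETTERS[1:3], "und")           # German
join_with_conj(LETTERS[1:3], oxford = TRUE)   # comma before the conjunction
```

Handling the two-element case separately avoids output such as "A, and B" when the comma variant is requested.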

Examples

The following examples show how my function works.

Example 1: Vector with 1 element

  • Hospital X takes part with the following ICU(s): `r add_and(names.icu[1])`.

returns:

“Hospital X takes part with the following ICU(s): A.”

Example 2: Vector with 3 elements with no conjunction specified

  • Hospital X takes part with the following ICU(s): `r add_and(names.icu)`.

returns:

“Hospital X takes part with the following ICU(s): A, B and C.”

Example 3: Vector with 3 elements with conjunction specified (German)

  • Krankenhaus X nimmt mit den folgenden ITS teil: `r add_and(names.icu, 'und')`.

returns:

“Krankenhaus X nimmt mit den folgenden ITS teil: A, B und C.”

Example 4: Vector with 3 elements with conjunction preceded by comma

  • Hospital X takes part with the following ICU(s): `r add_and(names.icu, ', and')`.

returns:

“Hospital X takes part with the following ICU(s): A, B, and C.”

Nothing New under the Sun

A former professor of mine sometimes said that reading prevents you from discovering “new” things. He was right: a couple of weeks ago, I discovered that the knitr package includes a function (combine_words()) with similar functionality. 🙂

How to Check if a Date is Within a List of Intervals in R

Intro

I'm currently involved in a research project called EFFECT. EFFECT is a multicentre, cluster-randomised, placebo-controlled cross-over trial evaluating antiseptic body wash of patients on intensive care units (ICUs). The trial tests whether daily antiseptic body wash reduces the risk of ICU-acquired primary bacteraemia and ICU-acquired multidrug-resistant organisms. EFFECT requires two types of data: (1) the patients' individual ward-movement history and (2) microbiological test results (see Meissner 2017).

According to the study protocol, positive blood tests do count as infection unless there is a negative blood test within 48 hours after the positive blood test.

In this blog post, I show how to solve this problem on a computational level.

The Problem

The following code chunk provides a hypothetical example of the microbiological data I have to deal with. The data frame df.mibi contains four variables:

  • ID: patient ID (only one patient in this example);
  • ORGANISM: name of the skin commensal organism found in a blood sample;
  • RESULT: laboratory test result (POS vs. NEG);
  • DATE: date of the laboratory test.

library(tidyverse)
library(lubridate)

df.mibi <- tibble(
  ID = paste0("ID_", rep(1, 11)),
  ORGANISM = c(rep('Propionibacterium acnes', 2), 
               rep('Staphylococcus epidermidis', 2),
               rep('Staphylococcus capitis', 2),
               rep('', 5)),
  RESULT = c(rep('POS', 6), rep('NEG', 5)),
  DATE = ymd(c(
    "2018-02-07", "2018-02-12", "2018-02-13", "2018-02-20",
    "2018-02-21", "2018-03-18", "2018-02-01", "2018-02-06",
    "2018-02-10", "2018-02-21", "2018-04-05")
  )
)

My Idea

In a first step, I separated df.mibi into two data frames:

  • df.POS: containing positive blood tests only
  • df.NEG: containing negative blood tests only
df.POS <- df.mibi %>%
  filter(RESULT == 'POS')
df.NEG <- df.mibi %>%
  filter(RESULT == 'NEG')

In a second step, I removed two variables from df.NEG (RESULT, ORGANISM), grouped the data frame by ID, and put all dates belonging to one ID into the list column data using the nest() function of the tidyr package:

df.NEG <- df.NEG %>%
  select(ID, DATE) %>%
    group_by(ID) %>%
      nest()

This is what both data frames look like:

df.POS
## # A tibble: 6 x 4
##   ID    ORGANISM                   RESULT DATE      
##   <chr> <chr>                      <chr>  <date>    
## 1 ID_1  Propionibacterium acnes    POS    2018-02-07
## 2 ID_1  Propionibacterium acnes    POS    2018-02-12
## 3 ID_1  Staphylococcus epidermidis POS    2018-02-13
## 4 ID_1  Staphylococcus epidermidis POS    2018-02-20
## 5 ID_1  Staphylococcus capitis     POS    2018-02-21
## 6 ID_1  Staphylococcus capitis     POS    2018-03-18
df.NEG
## # A tibble: 1 x 2
##   ID    data            
##   <chr> <list>          
## 1 ID_1  <tibble [5 x 1]>

In a third step, I tried to check whether one of the negative tests (stored in the list column data) lies within the time interval from the positive test to 48 hours later (TIME). I did the mapping using the map2() function of the purrr package:

# merging and mapping
df.TOTAL <- df.POS %>%
  left_join(df.NEG, by = 'ID') %>%
    mutate(TIME = interval(DATE, DATE + days(2)),
           RESULT = map2(data, "DATE", TIME, ~ .x %within% .y)) 

Unfortunately, my code did not work. The RESULT variable should be logical and return TRUE if there is a negative test result up to 2 days after the positive test. Instead, it is a list and returns NULL.

df.TOTAL
## # A tibble: 6 x 6
##   ID    ORGANISM   RESULT DATE       data   TIME                          
##   <chr> <chr>      <list> <date>     <list> <S4: Interval>                
## 1 ID_1  Propionib~ <NULL> 2018-02-07 <tibb~ 2018-02-07 UTC--2018-02-09 UTC
## 2 ID_1  Propionib~ <NULL> 2018-02-12 <tibb~ 2018-02-12 UTC--2018-02-14 UTC
## 3 ID_1  Staphyloc~ <NULL> 2018-02-13 <tibb~ 2018-02-13 UTC--2018-02-15 UTC
## 4 ID_1  Staphyloc~ <NULL> 2018-02-20 <tibb~ 2018-02-20 UTC--2018-02-22 UTC
## 5 ID_1  Staphyloc~ <NULL> 2018-02-21 <tibb~ 2018-02-21 UTC--2018-02-23 UTC
## 6 ID_1  Staphyloc~ <NULL> 2018-03-18 <tibb~ 2018-03-18 UTC--2018-03-20 UTC

The Solution

Not even one hour after I posted my question to StackOverflow, a user who calls himself “utubun” found the following solution:

df.TOTAL <- df.POS %>%
  left_join(df.NEG, by = 'ID') %>%
    mutate(TIME = interval(DATE, DATE + days(2)),
           RESULT = map2_lgl(data, TIME, ~ any(.x$DATE %within% .y)))
df.TOTAL
## # A tibble: 6 x 6
##   ID    ORGANISM   RESULT DATE       data   TIME                          
##   <chr> <chr>      <lgl>  <date>     <list> <S4: Interval>                
## 1 ID_1  Propionib~ FALSE  2018-02-07 <tibb~ 2018-02-07 UTC--2018-02-09 UTC
## 2 ID_1  Propionib~ FALSE  2018-02-12 <tibb~ 2018-02-12 UTC--2018-02-14 UTC
## 3 ID_1  Staphyloc~ FALSE  2018-02-13 <tibb~ 2018-02-13 UTC--2018-02-15 UTC
## 4 ID_1  Staphyloc~ TRUE   2018-02-20 <tibb~ 2018-02-20 UTC--2018-02-22 UTC
## 5 ID_1  Staphyloc~ TRUE   2018-02-21 <tibb~ 2018-02-21 UTC--2018-02-23 UTC
## 6 ID_1  Staphyloc~ FALSE  2018-03-18 <tibb~ 2018-03-18 UTC--2018-03-20 UTC

It works!!! Thank you very much! 🙂
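For readers without lubridate and purrr at hand, the same check can be sketched in base R, because Date objects support comparison and adding days directly. This is only an illustration for the single patient of the example; the object names pos, neg and neg_within_2_days are my own, and data with several patients would additionally have to be split by ID.

```r
# dates of the positive and negative tests from the example above
pos <- as.Date(c("2018-02-07", "2018-02-12", "2018-02-13",
                 "2018-02-20", "2018-02-21", "2018-03-18"))
neg <- as.Date(c("2018-02-01", "2018-02-06", "2018-02-10",
                 "2018-02-21", "2018-04-05"))

# for each positive test: is any negative test within [date, date + 2 days]?
neg_within_2_days <- vapply(
  seq_along(pos),
  function(i) any(neg >= pos[i] & neg <= pos[i] + 2),
  logical(1)
)
neg_within_2_days
```

Both interval bounds are included, which mirrors the behaviour of %within% on a closed interval, and vapply() with logical(1) plays the role of map2_lgl() by guaranteeing a logical vector.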

Drawing a Fish Curve using R and ggplot2

Intro

Recently, I wondered whether there is a way to draw a fish shape using a mathematical function. Since I did not find a ready-made R function, I tried to write one myself. The equations I used for this function can be found on Wolfram MathWorld.

The function

The fish_curve() function requires the ggplot2 and dplyr packages. It creates a data frame with two variables (x and y) and 10,000 observations. Finally, the data points are plotted using ggplot2.

fish_curve <- function(colour='black', size = 5){
  library(ggplot2)
  library(dplyr)
  data.frame(
    x = cos(1:10000) - sin(1:10000)^2 / sqrt(2),
    y = cos(1:10000) * sin(1:10000)
  ) %>%
    ggplot(., aes(x, y)) +
    geom_point(colour = colour, size = size) +
    theme_void()
}
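The function above evaluates the parametric equations x = cos(t) - sin(t)^2 / sqrt(2) and y = cos(t) * sin(t) at the integers t = 1, ..., 10000 and draws the resulting points. As a sketch of an alternative, the curve can also be traced as a line by letting t run over an evenly spaced grid from 0 to 2*pi. The helper name fish_points(), the grid size n and the scale factor a are my own assumptions; the sketch uses base graphics only.

```r
# hypothetical helper: points on the fish curve for t in [0, 2*pi]
fish_points <- function(n = 500, a = 1) {
  t <- seq(0, 2 * pi, length.out = n)   # evenly spaced parameter values
  data.frame(
    x = a * cos(t) - a * sin(t)^2 / sqrt(2),
    y = a * cos(t) * sin(t)
  )
}

pts <- fish_points()

# type = "l" joins consecutive points; asp = 1 keeps the proportions
plot(pts$x, pts$y, type = "l", asp = 1, axes = FALSE, xlab = "", ylab = "")
```

Because consecutive grid values are neighbours on the curve, a few hundred points are enough for a smooth outline, whereas the integer values of t scatter across the curve and need many points to fill it.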

Function call with default parameters

The fish_curve() function lets the user specify two parameters: the colour and the size of the plotted points. The default values are black for colour and 5 for size.

(p1 <- fish_curve())

[Figure: fish curve drawn with the default settings (black points, size 5)]

Customization

In the following example, we customize colour and size of the fish shape:

(p2 <- fish_curve(colour = 'blue', size = 1))

[Figure: fish curve drawn with blue points of size 1]

And finally, we place the two plots side by side using the patchwork package:

library(patchwork)
p1 + p2

[Figure: both fish curve plots side by side]

How to order factors by level frequency and level name

Intro

Quite frequently, factor variables are ordered by level frequency. However, factor levels having only a few observations are sometimes collapsed into one level usually named “others”. Since this level is usually not of particular interest, it may be a good idea to put this level in the last position of the plot rather than ordering it by level frequency. In this blog post, I’m going to show how to order a factor variable by level frequency and level name.

To replicate the R code I’m going to use in this post, four R packages must be loaded:

library(dplyr) # for data manipulation
library(stringr) # for string manipulation (str_extract(), str_replace())
library(ggplot2) # for plotting data
library(gghighlight) # ggplot2 extension for highlighting values

The dataset I’m going to use in this post (mtcars) is part of the datasets package.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

In the first code chunk, we:

  • extract the first word of each car name and write it into a new variable called “brand”,
  • rename all car brands starting with “M” (Mazda, Merc, Maserati) to “Others” and
  • calculate the median miles per gallon (mpg) for each car brand.
df.mtcars <- mtcars %>%
  mutate(name = str_extract(rownames(.), "^\\w+\\b"),
         brand = str_replace(name, "^M\\w+", 'Others')) %>%
  group_by(brand) %>%
  summarize(mpg = median(mpg))
df.mtcars$brand
##  [1] "AMC"      "Cadillac" "Camaro"   "Chrysler" "Datsun"   "Dodge"   
##  [7] "Duster"   "Ferrari"  "Fiat"     "Ford"     "Honda"    "Hornet"  
## [13] "Lincoln"  "Lotus"    "Others"   "Pontiac"  "Porsche"  "Toyota"  
## [19] "Valiant"  "Volvo"

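The following code chunk reorders the levels of the brand variable using the reorder() function. The levels are sorted by the value plotted for each brand (median mpg), which takes the place of the level frequency in this example.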

df.mtcars <- df.mtcars %>%
  mutate(brand = as.factor(brand),
         brand = reorder(brand, mpg))
levels(df.mtcars$brand)
##  [1] "Cadillac" "Lincoln"  "Camaro"   "Duster"   "Chrysler" "AMC"     
##  [7] "Dodge"    "Ford"     "Valiant"  "Others"   "Pontiac"  "Ferrari" 
## [13] "Hornet"   "Volvo"    "Datsun"   "Porsche"  "Toyota"   "Fiat"    
## [19] "Honda"    "Lotus"

As we can see, the bar representing the “Others” level is roughly in the middle of the plot.

ggplot(df.mtcars, aes(brand, mpg, fill = brand)) +
  coord_flip() +
  geom_col(width = 0.5) +
  gghighlight(brand == 'Others', unhighlighted_colour = "cornflowerblue") +
  scale_fill_manual(values = c("grey")) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(x = NULL, 
       y = 'Miles per Gallon',
       title = "Factor variable ordered by level frequency")

[Figure: horizontal bar chart of median mpg per brand, with the highlighted “Others” bar roughly in the middle]

To put the bar representing the “Others” level at the bottom of the plot, we have to set “Others” as reference category using the relevel() function.

df.mtcars <- df.mtcars %>%
  mutate(brand = relevel(brand, ref = "Others"))
levels(df.mtcars$brand)
##  [1] "Others"   "Cadillac" "Lincoln"  "Camaro"   "Duster"   "Chrysler"
##  [7] "AMC"      "Dodge"    "Ford"     "Valiant"  "Pontiac"  "Ferrari" 
## [13] "Hornet"   "Volvo"    "Datsun"   "Porsche"  "Toyota"   "Fiat"    
## [19] "Honda"    "Lotus"

Finally, the bar representing the “Others” level appears at the desired position.

ggplot(df.mtcars, aes(brand, mpg, fill = brand)) +
  coord_flip() +
  geom_col(width = 0.5) +
  gghighlight(brand == 'Others', unhighlighted_colour = "cornflowerblue") +
  scale_fill_manual(values = c("grey")) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(x = NULL, 
       y = 'Miles per Gallon',
       title = "Factor variable ordered by level frequency and level name")

[Figure: the same bar chart with the highlighted “Others” bar at the bottom]

PS: In both plots, the gghighlight() function of the gghighlight package was used to highlight the desired factor level.
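The two steps, reorder() followed by relevel(), can be tried out on a tiny factor without any plotting. The data below are made up purely for illustration; score stands in for the value that determines the order of the bars.

```r
# toy data (made up): three levels, one of them the catch-all "Others"
toy <- data.frame(
  brand = c("A", "Others", "B"),
  score = c(3, 2, 1)
)

# step 1: sort the levels by score (ascending)
f <- reorder(factor(toy$brand), toy$score)
levels(f)

# step 2: move "Others" to the first position, keep the rest as is
f <- relevel(f, ref = "Others")
levels(f)
```

With coord_flip(), the first factor level is drawn at the bottom of the plot, which is why making "Others" the reference level puts its bar there.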

Formatting Postal Codes with a Leading Zero Correctly

Intro

When postal codes are imported into R from other data formats (e.g. Excel, Access), they are often stored internally as actual numbers. As a look at a dataset freely available on the internet shows, this can lead to the following problem:

library(dplyr)
library(readxl)
mydata <- readxl::read_xlsx("Liste-der-PLZ-in-Excel-Karte-Deutschland-Postleitzahlen.xlsx")
head(mydata)
## # A tibble: 6 x 4
##     PLZ Bundesland Kreis   Typ  
##   <dbl> <chr>      <chr>   <chr>
## 1  1067 Sachsen    Dresden Stadt
## 2  1069 Sachsen    Dresden Stadt
## 3  1097 Sachsen    Dresden Stadt
## 4  1099 Sachsen    Dresden Stadt
## 5  1108 Sachsen    Dresden Stadt
## 6  1109 Sachsen    Dresden Stadt

The postal codes of the towns and municipalities in the federal states of Saxony, Saxony-Anhalt and Thuringia are robbed of their leading zero and displayed as four-digit numbers.

To give the affected postal codes (PLZ) their leading zero back, I wrote the function plz_repair().

plz_repair <- function(x) {
  # pad four-digit codes with a leading zero; return everything as character
  ifelse(nchar(x) == 4, paste0('0', x), as.character(x))
}

The function first checks whether the postal code has four digits. If so, a leading zero is prepended, resulting in a five-digit postal code. Postal codes that already have five digits remain unchanged. The repaired PLZ variable is stored as a character string.

mydata <- mydata %>%
  mutate(PLZ = plz_repair(PLZ))
head(mydata)
## # A tibble: 6 x 4
##   PLZ   Bundesland Kreis   Typ  
##   <chr> <chr>      <chr>   <chr>
## 1 01067 Sachsen    Dresden Stadt
## 2 01069 Sachsen    Dresden Stadt
## 3 01097 Sachsen    Dresden Stadt
## 4 01099 Sachsen    Dresden Stadt
## 5 01108 Sachsen    Dresden Stadt
## 6 01109 Sachsen    Dresden Stadt
tail(mydata)
## # A tibble: 6 x 4
##   PLZ   Bundesland Kreis                 Typ  
##   <chr> <chr>      <chr>                 <chr>
## 1 99986 Thüringen  Unstrut-Hainich-Kreis Kreis
## 2 99988 Thüringen  Unstrut-Hainich-Kreis Kreis
## 3 99991 Thüringen  Unstrut-Hainich-Kreis Kreis
## 4 99994 Thüringen  Unstrut-Hainich-Kreis Kreis
## 5 99996 Thüringen  Unstrut-Hainich-Kreis Kreis
## 6 99998 Thüringen  Unstrut-Hainich-Kreis Kreis
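As an aside, base R can do the padding in a single call. The following sketch uses formatC() with a fixed width and zero padding; the three example values are taken from the output above, and the object names are my own.

```r
# numeric postal codes as they arrive after the import
plz <- c(1067, 1109, 99998)

# width = 5 and flag = "0" pad with leading zeros; format = "d" prints whole numbers
plz_chr <- formatC(plz, width = 5, format = "d", flag = "0")
plz_chr
```

Compared with the nchar() check, this approach would also pad codes with fewer than four digits, should they ever occur in other data.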

R Markdown: How to place two tables side by side using ‘knitr’ and ‘kableExtra’

Intro

When I was recently writing a report using R Markdown, I wanted to place two rather small tables side by side. Since I usually use the kable() function of the knitr package and the kableExtra package to print tables, I tried to find a solution for my problem using both packages.

Since my Google search (“two tables side by side with kableExtra” or something similar) did not return a helpful result, I experimented with some table formatting options provided by the kableExtra package. Here is my solution.

Packages and data

For printing the tables, we need to install and load two packages: knitr and kableExtra. The dplyr package is required for some data manipulation. The data we want to put into the tables stem from the bundesligR package, which contains final tables of Germany's highest football (soccer) league. We want to place the final tables of two seasons (1985/86 and 2015/16) side by side.

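library(dplyr)       # for data manipulation
library(knitr)       # for kable()
library(kableExtra)  # for kable_styling()

df <- bundesligR::bundesligR 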
table.1985 <- df %>%
  filter(Season == 1985) %>%
    select(Position, Team, Points)
table.2015 <- df %>%
  filter(Season == 2015) %>%
    select(Position, Team, Points)

Now, we place both tables side by side using some functionality of the kableExtra package:

table.1985 %>%
  kable("html", align = 'clc', caption = 'Bundesliga, Season 1985/86') %>%
    kable_styling(full_width = F, position = "float_left")

table.2015 %>%
  kable("html", align = 'clc', caption = 'Bundesliga, Season 2015/16') %>%
    kable_styling(full_width = F, position = "right")

Bundesliga, Season 1985/86

Position  Team                       Points
1         FC Bayern Muenchen         70
2         Werder Bremen              69
3         FC Bayer 05 Uerdingen      64
4         Borussia Moenchengladbach  57
5         VfB Stuttgart              58
6         TSV Bayer 04 Leverkusen    55
7         Hamburger SV               56
8         SV Waldhof Mannheim        44
9         VfL Bochum                 46
10        FC Schalke 04              41
11        1. FC Kaiserslautern       40
12        1. FC Nuernberg            41
13        1. FC Koeln                38
14        Fortuna Duesseldorf        40
15        Eintracht Frankfurt        35
16        Borussia Dortmund          38
17        1. FC Saarbruecken         27
18        Hannover 96                23

Bundesliga, Season 2015/16

Position  Team                       Points
1         FC Bayern Muenchen         88
2         Borussia Dortmund          78
3         Bayer 04 Leverkusen        60
4         Borussia Moenchengladbach  55
5         FC Schalke 04              52
6         1. FSV Mainz 05            50
7         Hertha BSC                 50
8         VfL Wolfsburg              45
9         1. FC Koeln                43
10        Hamburger SV               41
11        FC Ingolstadt 04           40
12        FC Augsburg                38
13        Werder Bremen              38
14        SV Darmstadt 98            38
15        TSG 1899 Hoffenheim        37
16        Eintracht Frankfurt        36
17        VfB Stuttgart              33
18        Hannover 96                25

The trick is to set the position argument to float_left (left table) and right (right table). Furthermore, the argument full_width must be set to FALSE in both tables.

To Do

Unfortunately, the given example only works for rendering HTML documents. Does anyone know how to place two tables side by side when the output format is PDF/LaTeX?

How to Assign Variable Labels in R

Intro

Defining variable labels is a useful way to describe and document datasets. Unlike SPSS, which makes it very easy to define variable labels using the data editor, base R doesn't provide any function to define variable labels (as far as I know).

However, Daniel Luedecke's R package sjlabelled fills this gap. Let's give an example.

Defining variable labels

First, we load the mtcars data frame and define variable labels for all of the 11 variables:

data(mtcars)
labs <- c("Miles/(US) gallon", "Number of cylinders", "Displacement (cu.in.)", 
    "Gross horsepower", "Rear axle ratio", "Weight (1000 lbs)", "1/4 mile time", 
    "V/S", "Transmission", "Number of forward gears", "Number of carburetors")

Assigning labels to variables

Second, we assign the variable labels to the variables of the mtcars data frame:

library(sjlabelled)
mtcars <- set_label(mtcars, label = labs)

When we have a look at the mtcars data frame using RStudio's data viewer, we find the variable labels placed right underneath the variable names:


Moreover, we may as well save both variable names and labels into a data frame:

library(dplyr) # for data manipulation
library(knitr) # for printing tables
df <- get_label(mtcars) %>%
        data.frame() %>%
          rename(var.labs = 1) %>%  # rename the first column; funs() is deprecated in dplyr
            mutate(var.names = colnames(mtcars)) 
kable(df, align = 'lc')

var.labs                 var.names
Miles/(US) gallon        mpg
Number of cylinders      cyl
Displacement (cu.in.)    disp
Gross horsepower         hp
Rear axle ratio          drat
Weight (1000 lbs)        wt
1/4 mile time            qsec
V/S                      vs
Transmission             am
Number of forward gears  gear
Number of carburetors    carb
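
As a rough sketch of what happens under the hood: to my understanding, sjlabelled stores each variable label in an attribute called "label" on the column, so labels can also be set and read with base R's attr(). The small data frame and the object names below are made up for illustration.

```r
# toy data frame (made up) and a named vector of labels
df.toy <- data.frame(mpg = c(21, 22.8), cyl = c(6, 4))
labs.toy <- c(mpg = "Miles/(US) gallon", cyl = "Number of cylinders")

# attach each label to its column as a "label" attribute
for (v in names(labs.toy)) {
  attr(df.toy[[v]], "label") <- labs.toy[[v]]
}

# read the labels back as a named character vector
vapply(df.toy, function(col) attr(col, "label"), character(1))
```

Keep in mind that plain attributes are not guaranteed to survive every data manipulation step, so it may be worth checking the labels again after subsetting or merging.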