Statistik in Leipzig: Hexagonaler Stadtplan

Als ich vor einigen Jahren zum ersten Mal eine Tile Grid Map der USA sah, kam bei mir die Idee auf, eine solche im Deutschen als “Kachelkarte” oder “hexagonaler Karte” bezeichnete Karte für meine Heimatstadt Leipzig zu programmieren. Vor einigen Tagen habe ich mit dem R-Paket {tilemaps} eine Möglichkeit entdeckt, die insgesamt 63 Leipziger Ortsteile als Hexagone darzustellen. Die 63 Ortsteile verteilen sich auf 10 Stadtbezirke, welche in der nachfolgenden Karte farblich markiert sind.

Hexagonaler Stadtplan der Stadt Leipzig

Hexagonaler Stadtplan der Stadt Leipzig

“Kachelkarten” lassen sich u.a. für die Visualisierung statistischer Angaben verwenden. Die folgende Abbildung zeigt die Verteilung von Straftaten (ingesamt) auf die einzelnen Ortsteile (absolute Zahlen) aus dem Jahr 2023. Die Daten habe ich mit dem von mir entwickelten R-Paket {lisr} aus dem “Leipzig Informations-System” importiert.

Leipzig: Anzahl Straftaten im Jahr 2023

Leipzig: Anzahl Straftaten im Jahr 2023

ANHANG: R-Code

Im Anhang dokumentiere ich den R-Code, mit dem die Daten importiert, umgeformt und visualisiert wurden.

knitr::opts_chunk$set(echo = FALSE,
                      message = FALSE,
                      warning = FALSE
                      )

# remotes::install_github("nrkoehler/lisr")
library(lisr)
library(tilemaps)
library(sf)
library(ggfittext)
library(tidyverse)


colsbz <- c(
  "gold", # Nordwest
  "#fe8f24", # Nord
  "#ffc126", # Nordost
  "steelblue1", # Altwest
  "#4481fd", # West
  "#f6dcb4", # Zentrum
  "#dcafa9", # Ost
  "seagreen3", # Suedwest
  "yellowgreen", # Sued
  "limegreen" # Suedost
)

df.OT <- get_lis_shapefile(shape = "Ortsteile")

df.CRIME <- get_lis_kd(kategorie_nr = 12) %>% 
  filter(SACHMERKMAL == "Straftaten insgesamt") %>% 
  select(Name = GEBIET, '2023') %>% 
  rename(Crime_Total = 2)

df.OT <- df.OT %>% 
  inner_join(df.CRIME, by = "Name")

set.seed(2011) #2022, 2017, 2011
data <- df.OT %>%
  mutate(NR = as.numeric(str_sub(OT, 1, 1)),
         BZ = factor(NR, levels = c(8, 9, 1, 7, 6, 0, 2, 5, 4, 3),
                     labels = c("Nordwest ", "Nord", "Nordost" , "Alt-West", "West", "Mitte", "Ost", "S\u00fcdwest ", 'S\u00fcd', "S\u00fcdost")
         )) %>% 
  mutate(tile_map = generate_map(geometry, square = FALSE, flat_topped = FALSE, prop = 0.2,
                                 interpolate = 0))

df.cent <- st_boundary(data$tile_map) %>% 
  st_coordinates() %>% 
  as_tibble() %>% 
  group_by(Name = L1) %>% 
  mutate(
    XMIN = min(X),
    XMAX = max(X),
    YMIN = min(Y),
    YMAX = max(Y)
  ) %>% 
  distinct(Name, XMIN, XMAX, YMIN, YMAX) %>% 
  ungroup() %>% 
  mutate(Name = str_replace(df.OT$Name, "-", "-\n"))

ggplot(data) +
  geom_sf(aes(geometry = tile_map, fill = BZ), alpha = 1, color = "white", linewidth = 0.4) +
  geom_fit_text(data = df.cent, aes(xmin = XMIN, xmax = XMAX, ymin = YMIN, ymax = YMAX, label = Name), reflow = F) +
  scale_fill_manual(values = colsbz) +
  theme_void() +
  theme(plot.title = element_text(color="white", face = "bold",
                                  hjust = 0.05, vjust=-2),
        plot.subtitle = element_text(color="grey", hjust = 0.06, vjust=-4),
        plot.caption = element_text(color="grey", vjust = 7, hjust = 0.98),
        legend.position = "bottom",
        legend.byrow = TRUE,
        legend.text=element_text(color="white"),
        plot.background = element_rect(fill="black", colour = "white")) +
  labs(fill = NULL, 
       title = "Stadt Leipzig",
       subtitle = "Stadtbezirke (10) und Ortsteile (63)",
       caption = "Quelle: LIS"
  )



ggplot(data) +
  geom_sf(aes(geometry = tile_map, fill = Crime_Total), alpha = 1, color = "grey", linewidth = 0.4) +
  geom_fit_text(data = df.cent, aes(xmin = XMIN, xmax = XMAX, ymin = YMIN, ymax = YMAX, label = Name), reflow = F) +
  scale_fill_gradient(low="white", high="red2") +
  theme_void() +
  theme(plot.title = element_text(color="black", face = "bold",
                                  hjust = 0.05, vjust=-2),
        plot.subtitle = element_text(hjust = 0.06, vjust=-4),
        plot.caption = element_text(vjust = 7, hjust = 0.98),
        legend.position = "top",
        legend.byrow = TRUE) +
  labs(fill = "Anzahl Straftaten im Jahr 2023", 
       title = "Stadt Leipzig",
       caption = "Quelle: LIS"
  )

Statistik in Leipzig: Geburten und Einwohner

In meiner zweiten Blogpost zum Themengebiet “Statistik in Leipzig” beschäftige ich mich mit der Bevölkerungsentwicklung in Leipzig. Im Besonderen geht es um die Anzahl der Geburten und die Geburtenrate für den Zeitraum 2000 bis 2025. Die Daten habe ich mit dem R-Paket {lisr} aus dem “Leipzig-Informationssystem” (LIS) abgefragt.

Abbildung 1 zeigt, dass die Anzahl der Einwohner in den Jahren 2000 bis 2025 stetig gestiegen ist. Lebten im Jahr 2000 noch 479,996 Menschen in der Stadt an der Pleiße, so waren es im Jahr 2025 633,592. Dies Entspricht einem Zugewinn von 153,596 Einwohnern. Im gesamten Zeitraum überstieg die Anzahl der Frauen die Anzahl der Männer, wobei sich beide Zahlen tendenziell annäherten.

plot of chunk population-vis

Die Anzahl der Geburten stieg von 2001 (3,773 Neugeborene) bis 2016 (6,982 Neugeborene) kontinuierlich an. Seitdem ist die Anzahl der Geburten rückläufig, wobei der Rückgang in den Jahren 2022 und 2023 besonders stark war. Im Jahr 2024 wurden in Leipzig nur noch 4,811 Kinder geboren. Ingesamt wurden im Zeitraum der Jahre 2001 bis 2024 in Leipzig 63,098 Mädchen und 66,780 Jungen geboren.

plot of chunk births-vis

Die Geburtenrate gibt an, wie viele Kinder in einem bestimmten Zeitraum (z.B. Jahr) pro 1.000 Einwohner geboren wurden. Sie kann mit der folgenden Formel berechnet werden:

Anzahl Neugeborene / Anzahl Einwohner * 1.000

Im Zeitraum der Jahre 2001 bis 2024 lag die durchschnittliche Geburtenrate in Leipzig bei etwa 10 Neugeborenen pro 1.000 Einwohner. Während die Geburtenrate von 2001 bis 2016 fast kontinuierlich anstieg, ist seitdem ein stetiger Rückgang zu verzeichnen, der nach Beginn der COVID-19-Pandemie besonders stark ausgeprägt war.

plot of chunk birthrate-vis

ANHANG: R-Code

Im Anhang dokumentiere ich den R-Code, mit dem die Daten importiert, umgeformt und visualisiert wurden.

knitr::opts_chunk$set(echo = FALSE)

# remotes::install_github('nrkoehler/lisr')
library(lisr)
library(scales)
library(htmlTable)
library(geomtextpath)
library(tidyverse)

# import birth data
data.birth <- get_lis_move(rubrik_nr = 1, periode = "y")

# import population data

data.pop <- get_lis_pop(rubrik_nr = 1, periode = "y")


dfr.pop <- data.pop %>%
    slice(1:3) %>%
    select(KENNZIFFER, starts_with("20")) %>%
    mutate(KENNZIFFER = case_when(str_detect(KENNZIFFER, "^B") ~ "Total", str_detect(KENNZIFFER, "^M") ~ "Male", str_detect(KENNZIFFER, "^F") ~ "Female")) %>%
    pivot_longer(cols = starts_with("20"), values_to = "Number", values_transform = list(Number = as.integer), names_to = "Year") %>%
    mutate(KENNZIFFER = factor(KENNZIFFER, levels = c("Total", "Male", "Female")))


lst.pop <- tibble::lst(nmb.2000 = dfr.pop %>%
    filter(Year == "2000", KENNZIFFER == "Total") %>%
    pull(Number), nmb.2025 = dfr.pop %>%
    filter(Year == "2025", KENNZIFFER == "Total") %>%
    pull(Number), nmb.plus = nmb.2025 - nmb.2000)


ggplot(dfr.pop, aes(x = Year, y = Number, colour = KENNZIFFER)) + geom_point() + geom_line(aes(group = KENNZIFFER)) + scale_color_manual(values = c("black",
    "#1B5583", "#FA9BCB")) + scale_y_continuous(labels = comma, breaks = seq(0, 7e+05, 50000)) + theme_bw() + theme(legend.position = "bottom", axis.text.x = element_text(angle = 45,
    vjust = 0.5, hjust = 1), strip.text = element_text(colour = "white", face = "bold", size = 12), strip.background = element_rect(fill = "#4c4a4a")) + labs(x = NULL,
    y = NULL, colour = NULL, subtitle = "Abb. 1: Number of inhabitants in Leipzig, 2000-2025")


dfr.birth <- data.birth %>%
    filter(MERKMAL_1 == "Lebendgeborene") %>%
    select(KENNZIFFER, starts_with("20")) %>%
    mutate(KENNZIFFER = case_when(str_detect(KENNZIFFER, "^I") ~ "Total", str_detect(KENNZIFFER, "^M") ~ "Male", str_detect(KENNZIFFER, "^F") ~ "Female")) %>%
    pivot_longer(cols = starts_with("20"), values_to = "Number", names_to = "Year") %>%
    mutate(KENNZIFFER = factor(KENNZIFFER, levels = c("Total", "Male", "Female")))

dfr.birth.sum <- dfr.birth %>%
    group_by(KENNZIFFER) %>%
    summarise(BIRTH = sum(Number))


lst.birth <- tibble::lst(nmb.2001 = dfr.birth %>%
    filter(Year == "2001", KENNZIFFER == "Total") %>%
    pull(Number), nmb.2024 = dfr.birth %>%
    filter(Year == "2024", KENNZIFFER == "Total") %>%
    pull(Number), nmb.max = dfr.birth %>%
    filter(KENNZIFFER == "Total") %>%
    pull(Number) %>%
    max(), yr.max = dfr.birth %>%
    filter(KENNZIFFER == "Total", Number == max(Number)) %>%
    pull(Year), nmb.min = dfr.birth %>%
    filter(KENNZIFFER == "Total") %>%
    pull(Number) %>%
    min(), yr.min = dfr.birth %>%
    filter(KENNZIFFER == "Total") %>%
    filter(Number == min(Number)) %>%
    pull(Year), nmb.total = dfr.birth.sum[1, 2] %>%
    as.numeric(), nmb.boys = dfr.birth.sum[2, 2] %>%
    as.numeric(), nmb.girls = dfr.birth.sum[3, 2] %>%
    as.numeric(), nmb.diff = nmb.boys - nmb.girls)


ggplot(dfr.birth, aes(x = Year, y = Number, colour = KENNZIFFER)) + geom_point() + geom_line(aes(group = KENNZIFFER)) + scale_color_manual(values = c("black",
    "#1B5583", "#FA9BCB")) + theme_bw() + theme(legend.position = "bottom", axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 1), strip.text = element_text(colour = "white",
    face = "bold", size = 12), strip.background = element_rect(fill = "#4c4a4a")) + labs(x = NULL, y = NULL, colour = NULL, subtitle = "Abb. 2: Number of births in Leipzig, 2001-2024")



dfr.birthrate <- dfr.birth %>%
    filter(KENNZIFFER == "Total") %>%
    rename(nmb_birth = Number) %>%
    left_join(dfr.pop, by = c("KENNZIFFER", "Year")) %>%
    rename(nmb_pop = Number) %>%
    mutate(Birthrate = nmb_birth/nmb_pop * 1000)

lst.br <- tibble::lst(br.mean = sum(dfr.birthrate$nmb_birth)/sum(dfr.birthrate$nmb_pop) * 1000)



ggplot(dfr.birthrate, aes(x = Year, y = Birthrate)) + geom_point() + geom_line(aes(group = 1), colour = "red3") + geom_texthline(yintercept = lst.br$br.mean,
    label = "Mean birthrate", colour = "#4c4a4a") + scale_y_continuous(limits = c(4, 16), breaks = seq(0, 20, 2)) + theme_bw() + theme(legend.position = "bottom",
    axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 1), strip.text = element_text(colour = "white", face = "bold", size = 12), strip.background = element_rect(fill = "#4c4a4a")) +
    labs(x = NULL, y = "birthrate", subtitle = "Abb. 3: Birthrate in Leipzig, 2001 - 2024")

Statistik in Leipzig: Vornamen 2014 – 2025

Das Amt für Statistik und Wahlen der Stadt Leipzig stellt auf seinen Internetseiten zahlreiche Statistiken zu verschiedenen Aspekten der städtischen Entwicklung zur Verfügung. Mit dem “Leipzig-Informationssystem” (LIS) ist zudem eine Vielzahl interessanter Daten über eine Schnittstelle (API) programmatisch erreichbar. Mit Hilfe meines R-Pakets {lisr} lassen sich die im LIS erhältlichen Daten bequem in das Statistikprogramm R importieren. In diesem Beitrag stelle ich die Leipziger Vornahmensstatistik der Jahre 2014 bis 2025 vor. In dieser Statistik werden die Vornamen aller in den entsprechenden Jahren in Leipzig geboren Kinder erfasst.

Im Folgenden visualisiere ich die Daten als Tabelle, Balkendiagramm und als Poster.

Gesamtstatistik

Ich beginne mit einer Tabelle, in welcher die 20 beliebtesten Babynames in Leipzig im Zeitraum 2014 bis 2025 unterteilt nach Geschlecht dargestellt werden. Die drei beliebtesten Mädchennamen waren Emma, Charlotte und Emilia und die drei beliebtesten Jungennamen waren Emil, Oskar und Paul.

Die 20 beliebtesten Maedchen- und Jungennamen, Leipzig, 2014-2025
Rank Girls Number Boys Number
1 Emma 588 Emil 649
2 Charlotte 488 Oskar 548
3 Emilia 478 Paul 527
4 Mia 457 Noah 491
5 Johanna 428 Karl 488
6 Clara 419 Ben 482
7 Ella 386 Anton 476
8 Hannah 357 Theo 439
9 Anna 349 Elias 385
10 Mila 346 Leon 374
11 Lina 309 Felix 368
12 Lea 303 Moritz 367
13 Ida 302 Jonas 367
14 Frieda 302 Theodor 362
15 Mathilda 296 Arthur 357
16 Hanna 293 Finn 333
17 Marie 279 Luca 317
18 Lena 277 Jakob 315
19 Sophia 272 Bruno 294
20 Leni 269 Maximilian 290

Namen neugeborener Mädchen

Die folgenden Balkendiagramme zeigen die Anzahl der Namen neugeborener Mädchen unterteilt nach Geburtsjahr.

Namen neugeborener Mädchen in Leipzig

Namen neugeborener Mädchen in Leipzig

Bei genauerer Betrachtung der Balendiagramme ist zu erkennen, dass in jedem Jahr mindestens zwei Vornamen mit derselben Häufigkeit vergeben wurden. Auf dem folgenden Poster werden diese Namen untereinander dargestellt (z.B. Clara und Lea im Jahr 2014).

Namen neugeborener Mädchen in Leipzig

Namen neugeborener Mädchen in Leipzig

Namen neugeborener Jungen

Die Namen neugeborener Jungen habe ich analog zu den Mädchennamen dargestellt.

Namen neugeborener Jungen in Leipzig

Namen neugeborener Jungen in Leipzig

Namen neugeborener Jungen in Leipzig

Namen neugeborener Jungen in Leipzig

Die Balkendiagramme deuten an, dass die Anzahl der Geburten in den letzten Jahren rückläufig war. Die Leipziger Geburtenstatistik werde ich in meinem nächsten Blog-Beitrag darstellen.

ANHANG: R-Code

Im Anhang dokumentiere ich den R-Code, mit dem die Daten importiert, umgeformt und visualisiert wurden. Die Tabelle wurde mit dem {huxtable}-Paket erzeugt, andere Pakete ({kableExtra}, {flextable}, {tinytable}) scheinen mit WordPress nicht zu funktionieren.

knitr::opts_chunk$set(echo = FALSE, cache = FALSE)

# remotes::install_github("nrkoehler/lisr")
library(lisr)
library(huxtable)
library(tidyverse)
library(tidytext)

data <- map_df(2014:2025, get_lis_babynames) 

df.top.10_boys <- data %>%
  filter(GESCHLECHT == 'm') %>%
  select(-GESCHLECHT) %>% 
  group_by(YEAR) %>% 
  mutate(RANK = rank(-ANZAHL, ties.method = "min")) %>% 
  ungroup() %>% 
  filter(RANK <= 10) %>% 
  arrange(YEAR, RANK) %>% 
  group_by(YEAR, RANK) %>% 
  nest() %>% 
  mutate(VORNAME = map_chr(data, ~toString(.x$VORNAME)),
         VORNAME = str_replace_all(VORNAME, ",", "\n"),
         ANZAHL = map_int(data, ~max(.x$ANZAHL))) %>% 
  select(VORNAME, YEAR, ANZAHL, RANK)

df.top.10_girls <- data %>%
  filter(GESCHLECHT == 'w') %>%
  select(-GESCHLECHT) %>% 
  group_by(YEAR) %>% 
  mutate(RANK = rank(-ANZAHL, ties.method = "min")) %>% 
  ungroup() %>% 
  filter(RANK <= 10) %>% 
  arrange(YEAR, RANK) %>% 
  group_by(YEAR, RANK) %>% 
  nest() %>% 
  mutate(VORNAME = map_chr(data, ~toString(.x$VORNAME)),
         VORNAME = str_replace_all(VORNAME, ",", "\n"),
         ANZAHL = map_int(data, ~max(.x$ANZAHL))) %>% 
  select(VORNAME, YEAR, ANZAHL, RANK)
df.girls <- data %>%
  filter(GESCHLECHT == "w") %>%
  select(-GESCHLECHT) %>%
  summarise(.by = c(VORNAME), NMB = sum(ANZAHL)) %>%
  top_n(20) %>%
  arrange(desc(NMB))

df.boys <- data %>%
  filter(GESCHLECHT == "m") %>%
  select(-GESCHLECHT) %>%
  summarise(.by = c(VORNAME), NMB = sum(ANZAHL)) %>%
  top_n(20) %>%
  arrange(desc(NMB))

df.total <- bind_cols(tibble(A = 1:20), df.girls, df.boys) %>% 
  set_names(., c("Rank", "Girls", "Number", "Boys", "Number"))


df.total %>% 
  as_huxtable() %>% 
  set_width(1) %>% 
  set_bold(row = 1, col = everywhere) %>% 
  set_align("centre") %>% 
  set_caption("Die 20 beliebtesten Maedchen- und Jungennamen, Leipzig, 2014-2025") %>% 
  huxtable::theme_grey()

# datatable(df.total, 
#           filter = 'none',
#           autoHideNavigation = TRUE,
#           options = list(pageLength = 20),
#           colnames = c("Girls", "Number", "Boys", "Number"),
#           caption = "Die 20 beliebtesten Maedchen- und Jungennamen, Leipzig, 2014-2025"
#           )



data %>%
  filter(GESCHLECHT == 'w') %>%
  select(-GESCHLECHT) %>% 
  summarise(.by = c(VORNAME, YEAR), NMB = sum(ANZAHL)) %>% 
  group_by(YEAR) %>% 
  top_n(10) %>% 
  arrange(NMB) %>% 
  ungroup() %>% 
  ggplot(., aes(x = reorder_within(VORNAME, NMB, YEAR), y = NMB)) +
  coord_flip() +
  geom_col(fill = "#FA9BCB", width = 0.8, colour = 'white') +
  scale_fill_distiller(palette = 3) +
  geom_text(aes(label = NMB), color = 'black', hjust = -0.6, size = 2) +
  scale_x_reordered() +
  facet_wrap(~YEAR, ncol = 2, scales = "free_y") +
  labs(x = NULL, y = NULL, fill = NULL) +
  tidyquant::theme_tq() +
  theme(legend.position = 'none') 
  

ggplot(df.top.10_girls,
       aes(x = YEAR, y = RANK, colour = as.character(YEAR))) +
  coord_flip() +
  scale_y_continuous(breaks = seq(1, 12, 1), sec.axis = dup_axis()) +
  scale_x_reverse(breaks = rev(seq(2014, 2025, 1))) +
  geom_text(aes(label = VORNAME), hjust = 'center') +
  hrbrthemes::theme_modern_rc() +
  theme(legend.position = 'none',
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  expand_limits(x = c(2013.8, 2025.1)) +
    labs(title = 'Most popular female babynames',
       subtitle = 'Leipzig, Germany')

data %>%
  filter(GESCHLECHT == 'm') %>%
  select(-GESCHLECHT) %>% 
  summarise(.by = c(VORNAME, YEAR), NMB = sum(ANZAHL)) %>% 
  group_by(YEAR) %>% 
  top_n(10) %>% 
  arrange(NMB) %>% 
  ungroup() %>% 
  ggplot(., aes(x = reorder_within(VORNAME, NMB, YEAR), y = NMB)) +
  coord_flip() +
  geom_col(fill = "#1B5583", width = 0.8, colour = 'white') +
  geom_text(aes(label = NMB), color = 'black', hjust = -0.6, size = 2) +
  scale_x_reordered() +
  facet_wrap(~YEAR, ncol = 2, scales = "free_y") +
  labs(x = NULL, y = NULL, fill = NULL) +
  tidyquant::theme_tq() +
  theme(legend.position = 'none') 
  
ggplot(df.top.10_boys,
       aes(x = YEAR, y = RANK, colour = as.character(YEAR))) +
  coord_flip() +
  scale_y_continuous(breaks = seq(1, 12, 1), sec.axis = dup_axis()) +
  scale_x_reverse(breaks = rev(seq(2014, 2025, 1))) +
  geom_text(aes(label = VORNAME), hjust = 'center') +
  hrbrthemes::theme_modern_rc() +
  theme(legend.position = 'none',
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  expand_limits(x = c(2013.8, 2025.1)) +
  labs(title = 'Most popular male babynames',
       subtitle = 'Leipzig, Germany')

How to use lists in RMarkdown and Quarto

RMarkdown and Quarto documents are “dynamic analysis documents that combine code, rendered output (such as figures), and prose”. Since everything in R is an object, lots of R objects (numbers, charts, tables, statistical models, etc.) are created while rendering the document. In large documents, it can be difficult to keep track with where and when these objects are created which sometimes makes debugging a cumbersome and time consuming job.

However, in this blog post, I will show how to use lists to improve clarity throughout the document. Let me give you an example:

Creating a list with named objects

In R, a “list” itself is an object and can be created using the native list() function. In the following code snippet, we save list as an object (“lst.a”) and add one named object which is numeric (‘nmb.1’). The str()function shows all objects our list contains.

lst.a <- list(
  nmb.1 = 10
)

str(lst.a)
## List of 1
##  $ nmb.1: num 10

Reusing objects

Next, we try to create a new list and check whether it’s possible to use one object (‘nmb.1’) to compute another object (‘nmb.2’):

lst.b <- list(
  nmb.1 = 10,
  nmb.2 = nmb.1 + 10
)
## Error:
## ! object 'nmb.1' not found

As we can see, the list() function cannot create objects and reuse them at the same time. Fortunately, the {tibble} package contains the lst() function which is able to do so:

lst.b <- tibble::lst(
  nmb.1 = 10,
  nmb.2 = nmb.1 + 10
)
str(lst.b)
## List of 2
##  $ nmb.1: num 10
##  $ nmb.2: num 20

Saving plots in lists

Unfortunately, base R plots cannot be saved as object and, thus, cannot be put into a list. When we evaluate the following line of Rcode, we see that the plot() function is evaluated immediately (as a side effect). The object it was assigned to remains empty (“NULL”).

lst.c <- list(
  plt = plot(1:10)
)

plot of chunk lst-plot

lst.c$plt
## NULL

However, the {ggplot2} package produces plots which can be assigned to objects and, thus, can be put into lists.

lst.d <- list(
  ggplt = data.frame(x = 1:10, y = 1:10) %>% ggplot(aes(x, y)) + geom_point()
)
lst.d$ggplt

plot of chunk lst-ggplot

Statistical models, data.frames etc.

It probably doesn’t come as a surprise that many more objects can be put into a list. In the following example, we us the ‘mtcars’ data.frame to build a simple linear model explaining horsepower (hp) by miles per gallon (mpg). At the same time, we put this model into a list (‘lst.e’) both as a list (‘mdl.lm’) and a tibble.

put a linear model explaining horsepower (hp) by miles per gallon (mpg) both as a list and a tibble (‘tib.lm’).

lst.e <- lst(
  mdl.lm = lm(hp ~ mpg, data = mtcars),
  tib.lm = broom::tidy(mdl.lm)
)

str(lst.e,  list.len = 5)
## List of 2
##  $ mdl.lm:List of 12
##   ..$ coefficients : Named num [1:2] 324.08 -8.83
##   .. ..- attr(*, "names")= chr [1:2] "(Intercept)" "mpg"
##   ..$ residuals    : Named num [1:32] -28.7 -28.7 -29.8 -25.1 16 ...
##   .. ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##   ..$ effects      : Named num [1:32] -829.8 296.3 -23.6 -20 19.3 ...
##   .. ..- attr(*, "names")= chr [1:32] "(Intercept)" "mpg" "" "" ...
##   ..$ rank         : int 2
##   ..$ fitted.values: Named num [1:32] 139 139 123 135 159 ...
##   .. ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##   .. [list output truncated]
##   ..- attr(*, "class")= chr "lm"
##  $ tib.lm: tibble [2 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ term     : chr [1:2] "(Intercept)" "mpg"
##   ..$ estimate : num [1:2] 324.08 -8.83
##   ..$ std.error: num [1:2] 27.43 1.31
##   ..$ statistic: num [1:2] 11.81 -6.74
##   ..$ p.value  : num [1:2] 8.25e-13 1.79e-07

Using lists with RMarkdown and Quarto

Finally, we are going to save everything into a single list and put the list elements into a text.

lst.final <- lst(
  nmb.1 = nrow(mtcars),
  nmb.2 = ncol(mtcars),
  ggplt = mtcars %>% ggplot(aes(x = hp, y = mpg)) +
    geom_point() + geom_smooth(method='lm', se = FALSE) + ggthemes::theme_clean(),
  mdl.lm = lm(hp ~ mpg, data = mtcars),
  tib.lm = broom::tidy(mdl.lm, conf.int = TRUE)
)

Putting text and R code together

This is just an example text showing how to put text and code into:

The ‘mtcars’ dataset consists of lst.final$nmb.1 rows and lst.final$nmb.2 columns.

Rendering to:

The ‘mtcars’ dataset consists of 32 rows and 11 columns.

The relation between horsepower and miles per gallon is visualised in the following figure:

lst.final$ggplt

Horsepower and miles per gallon

Statistical models can be easily put into a table using the {gtsummary} package:

library(gtsummary)

tbl_regression(lst.final$mdl.lm, intercept = TRUE) %>% 
  as_hux_table()

Characteristic Beta 95% CI p-value
(Intercept) 324 268, 380 <0.001
mpg -8.8 -12, -6.2 <0.001
Abbreviation: CI = Confidence Interval

How to create a descriptive summary table (‘table 1’) using R

Intro

“Table 1”, that is a table providing the sample characteristics of an empirical study or clinical trial is an obligatory part of scientific publications. Since I started using {R} some ten years ago, I have come across a couple of packages and functions aiming to create such a table. In this blog post, I’ll give an overview about two of these packages and show how to use them.

Data

The data.frame I use is named ‘trial’ and is part of the {gtsummary} package. It contains the following variables:

head(gtsummary::trial)
## # A tibble: 6 x 8
##   trt      age marker stage grade response death ttdeath
##   <chr>  <dbl>  <dbl> <fct> <fct>    <int> <int>   <dbl>
## 1 Drug A    23  0.16  T1    II           0     0    24  
## 2 Drug B     9  1.11  T2    I            1     0    24  
## 3 Drug A    31  0.277 T1    II           0     0    24  
## 4 Drug A    NA  2.07  T3    III          1     1    17.6
## 5 Drug A    51  2.77  T4    III          1     1    16.4
## 6 Drug B    39  0.613 T4    I            0     1    15.6

Apparently, the data.frame contains a treatment variable (“trt”) with two categories (“Drug A” vs. “Drug B”) and several categorical and numerical variables.

{finalfit}

According to the package description, the {finalfit} package has the following purposes:

Generate regression results tables and plots in final format for publication. Explore models and export directly to PDF and ‘Word’ using ‘RMarkdown’.

However, with summary_factorlist(), the package also includes a function to create a table with summary statistics.

library(finalfit)
## Error in `library()`:
## ! there is no package called 'finalfit'
library(dplyr)
tab.ff <- gtsummary::trial %>%
  mutate(across(c(response, death),
    factor,
    levels = c(1, 0), labels = c("yes", "no")
  )) %>%
  summary_factorlist(
    dependent = "trt", # name of grouping / treatment variable
    explanatory = c("age", "", "marker", "stage", "response", "death", "ttdeath"),
    total_col = TRUE, # add column with statistics for the whole sample
    add_row_total = TRUE, # add column with number of valid cases
    include_row_missing_col = FALSE,
    na_include = TRUE # make variables' missing data explicit
  )
## Error in `summary_factorlist()`:
## ! could not find function "summary_factorlist"
tab.ff
## Error:
## ! object 'tab.ff' not found

Rather than printing a table, the summary_factorlist() function returns a data.frame which must be further processed and piped to a printing function. The following example shows how these steps may be done using the {labelled} and the {kableExtra} package.

library(labelled)
library(dplyr)
library(kableExtra, exclude = "group_rows")
gtsummary::trial %>%
  mutate(across(c(response, death),
    factor,
    levels = c(1, 0), labels = c("yes", "no")
  )) %>%
  # Add variable labels
  set_variable_labels(
    age = "Age [yrs]",
    marker = "Marker Level [ng/mL]",
    stage = "T Stage",
    grade = "Grade",
    response = "Tumor Response",
    death = "Patient Died",
    ttdeath = "Months to Death/Censor"
  ) %>%
  summary_factorlist(
    dependent = "trt", # name of grouping / treatment variable
    explanatory = c("age", "", "marker", "stage", "response", "death", "ttdeath"),
    total_col = TRUE, # add column with statistics for the whole sample
    add_row_total = TRUE, # add column with number of valid cases
    include_row_missing_col = FALSE,
    na_include = TRUE # make variables' missing data explicit
  ) %>%
  kbl(
    caption = "Baseline characteristics",
    booktabs = TRUE,
    col.names = c(
      " ", "Total N", " ",
      "Drug A", "Drug B", "Total"
    ),
    align = "lrlrrr",
  ) %>%
  kable_classic(full_width = FALSE)
## Error in `summary_factorlist()`:
## ! could not find function "summary_factorlist"

{gtsummary}

The description of the {gtsummary} package gives the following information:

The package creates presentation-ready tables summarizing data sets, regression models, and more. The code to create the tables is concise and highly customizable. Data frames can be summarized with any function, e.g. mean(), median(), even user-written functions. Regression models are summarized and include the reference rows for categorical variables. Common regression models, such as logistic regression and Cox proportional hazards regression, are automatically dentified and the tables are pre-filled with appropriate column headers.

For creating a table with summary statistics, the tbl_summary() function is required. In addition, the {dplyr} package should be loaded.

library(gtsummary)
library(dplyr)

The creation of the table works best using the pipe operator:

tbl.gts <- trial %>%
  # categorical variables must be factors or character strings
  mutate(across(c(response, death),
    factor,
    levels = c(1, 0), labels = c("yes", "no")
  )) %>%
  # apply the tbl_summary() function
  tbl_summary(
    by = trt, # Treatment variable
    label = list(
      age ~ "Age [yrs]",
      marker ~ "Marker Level [ng/mL]",
      stage ~ "T Stage",
      grade ~ "Grade",
      response ~ "Tumor Response",
      death ~ "Patient Died",
      ttdeath ~ "Months to Death/Censor"
    ),
    statistic = list(all_continuous() ~ "{mean} ({sd})"),
    # may be also ...
    # statistic = list(all_continuous() ~ "{median} ({p25}, {p75})"),
    digits = all_continuous() ~ 1,
    missing_text = "(Missing)",
    include = everything() # select variables to be included into the table
  ) %>%
  add_overall() %>% # add column with statistics for the whole sample
  add_n() # add column with number of valid cases
tbl.gts

#tllrekyvog table {
font-family: system-ui, ‘Segoe UI’, Roboto, Helvetica, Arial, sans-serif, ‘Apple Color Emoji’, ‘Segoe UI Emoji’, ‘Segoe UI Symbol’, ‘Noto Color Emoji’;
-webkit-font-smoothing: antialiased;
-moz-osx-font-smoothing: grayscale;
}

#tllrekyvog thead, #tllrekyvog tbody, #tllrekyvog tfoot, #tllrekyvog tr, #tllrekyvog td, #tllrekyvog th {
border-style: none;
}

#tllrekyvog p {
margin: 0;
padding: 0;
}

#tllrekyvog .gt_table {
display: table;
border-collapse: collapse;
line-height: normal;
margin-left: auto;
margin-right: auto;
color: #333333;
font-size: 16px;
font-weight: normal;
font-style: normal;
background-color: #FFFFFF;
width: auto;
border-top-style: solid;
border-top-width: 2px;
border-top-color: #A8A8A8;
border-right-style: none;
border-right-width: 2px;
border-right-color: #D3D3D3;
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #A8A8A8;
border-left-style: none;
border-left-width: 2px;
border-left-color: #D3D3D3;
}

#tllrekyvog .gt_caption {
padding-top: 4px;
padding-bottom: 4px;
}

#tllrekyvog .gt_title {
color: #333333;
font-size: 125%;
font-weight: initial;
padding-top: 4px;
padding-bottom: 4px;
padding-left: 5px;
padding-right: 5px;
border-bottom-color: #FFFFFF;
border-bottom-width: 0;
}

#tllrekyvog .gt_subtitle {
color: #333333;
font-size: 85%;
font-weight: initial;
padding-top: 3px;
padding-bottom: 5px;
padding-left: 5px;
padding-right: 5px;
border-top-color: #FFFFFF;
border-top-width: 0;
}

#tllrekyvog .gt_heading {
background-color: #FFFFFF;
text-align: center;
border-bottom-color: #FFFFFF;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
}

#tllrekyvog .gt_bottom_border {
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
}

#tllrekyvog .gt_col_headings {
border-top-style: solid;
border-top-width: 2px;
border-top-color: #D3D3D3;
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
}

#tllrekyvog .gt_col_heading {
color: #333333;
background-color: #FFFFFF;
font-size: 100%;
font-weight: normal;
text-transform: inherit;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
vertical-align: bottom;
padding-top: 5px;
padding-bottom: 6px;
padding-left: 5px;
padding-right: 5px;
overflow-x: hidden;
}

#tllrekyvog .gt_column_spanner_outer {
color: #333333;
background-color: #FFFFFF;
font-size: 100%;
font-weight: normal;
text-transform: inherit;
padding-top: 0;
padding-bottom: 0;
padding-left: 4px;
padding-right: 4px;
}

#tllrekyvog .gt_column_spanner_outer:first-child {
padding-left: 0;
}

#tllrekyvog .gt_column_spanner_outer:last-child {
padding-right: 0;
}

#tllrekyvog .gt_column_spanner {
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
vertical-align: bottom;
padding-top: 5px;
padding-bottom: 5px;
overflow-x: hidden;
display: inline-block;
width: 100%;
}

#tllrekyvog .gt_spanner_row {
border-bottom-style: hidden;
}

#tllrekyvog .gt_group_heading {
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
color: #333333;
background-color: #FFFFFF;
font-size: 100%;
font-weight: initial;
text-transform: inherit;
border-top-style: solid;
border-top-width: 2px;
border-top-color: #D3D3D3;
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
vertical-align: middle;
text-align: left;
}

#tllrekyvog .gt_empty_group_heading {
padding: 0.5px;
color: #333333;
background-color: #FFFFFF;
font-size: 100%;
font-weight: initial;
border-top-style: solid;
border-top-width: 2px;
border-top-color: #D3D3D3;
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
vertical-align: middle;
}

#tllrekyvog .gt_from_md > :first-child {
margin-top: 0;
}

#tllrekyvog .gt_from_md > :last-child {
margin-bottom: 0;
}

#tllrekyvog .gt_row {
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
margin: 10px;
border-top-style: solid;
border-top-width: 1px;
border-top-color: #D3D3D3;
border-left-style: none;
border-left-width: 1px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 1px;
border-right-color: #D3D3D3;
vertical-align: middle;
overflow-x: hidden;
}

#tllrekyvog .gt_stub {
color: #333333;
background-color: #FFFFFF;
font-size: 100%;
font-weight: initial;
text-transform: inherit;
border-right-style: solid;
border-right-width: 2px;
border-right-color: #D3D3D3;
padding-left: 5px;
padding-right: 5px;
}

#tllrekyvog .gt_stub_row_group {
color: #333333;
background-color: #FFFFFF;
font-size: 100%;
font-weight: initial;
text-transform: inherit;
border-right-style: solid;
border-right-width: 2px;
border-right-color: #D3D3D3;
padding-left: 5px;
padding-right: 5px;
vertical-align: top;
}

#tllrekyvog .gt_row_group_first td {
border-top-width: 2px;
}

#tllrekyvog .gt_row_group_first th {
border-top-width: 2px;
}

#tllrekyvog .gt_summary_row {
color: #333333;
background-color: #FFFFFF;
text-transform: inherit;
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
}

#tllrekyvog .gt_first_summary_row {
border-top-style: solid;
border-top-color: #D3D3D3;
}

#tllrekyvog .gt_first_summary_row.thick {
border-top-width: 2px;
}

#tllrekyvog .gt_last_summary_row {
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
}

#tllrekyvog .gt_grand_summary_row {
color: #333333;
background-color: #FFFFFF;
text-transform: inherit;
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
}

#tllrekyvog .gt_first_grand_summary_row {
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
border-top-style: double;
border-top-width: 6px;
border-top-color: #D3D3D3;
}

#tllrekyvog .gt_last_grand_summary_row_top {
padding-top: 8px;
padding-bottom: 8px;
padding-left: 5px;
padding-right: 5px;
border-bottom-style: double;
border-bottom-width: 6px;
border-bottom-color: #D3D3D3;
}

#tllrekyvog .gt_striped {
background-color: rgba(128, 128, 128, 0.05);
}

#tllrekyvog .gt_table_body {
border-top-style: solid;
border-top-width: 2px;
border-top-color: #D3D3D3;
border-bottom-style: solid;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
}

#tllrekyvog .gt_footnotes {
color: #333333;
background-color: #FFFFFF;
border-bottom-style: none;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
border-left-style: none;
border-left-width: 2px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 2px;
border-right-color: #D3D3D3;
}

#tllrekyvog .gt_footnote {
margin: 0px;
font-size: 90%;
padding-top: 4px;
padding-bottom: 4px;
padding-left: 5px;
padding-right: 5px;
}

#tllrekyvog .gt_sourcenotes {
color: #333333;
background-color: #FFFFFF;
border-bottom-style: none;
border-bottom-width: 2px;
border-bottom-color: #D3D3D3;
border-left-style: none;
border-left-width: 2px;
border-left-color: #D3D3D3;
border-right-style: none;
border-right-width: 2px;
border-right-color: #D3D3D3;
}

#tllrekyvog .gt_sourcenote {
font-size: 90%;
padding-top: 4px;
padding-bottom: 4px;
padding-left: 5px;
padding-right: 5px;
}

#tllrekyvog .gt_left {
text-align: left;
}

#tllrekyvog .gt_center {
text-align: center;
}

#tllrekyvog .gt_right {
text-align: right;
font-variant-numeric: tabular-nums;
}

#tllrekyvog .gt_font_normal {
font-weight: normal;
}

#tllrekyvog .gt_font_bold {
font-weight: bold;
}

#tllrekyvog .gt_font_italic {
font-style: italic;
}

#tllrekyvog .gt_super {
font-size: 65%;
}

#tllrekyvog .gt_footnote_marks {
font-size: 75%;
vertical-align: 0.4em;
position: initial;
}

#tllrekyvog .gt_asterisk {
font-size: 100%;
vertical-align: 0;
}

#tllrekyvog .gt_indent_1 {
text-indent: 5px;
}

#tllrekyvog .gt_indent_2 {
text-indent: 10px;
}

#tllrekyvog .gt_indent_3 {
text-indent: 15px;
}

#tllrekyvog .gt_indent_4 {
text-indent: 20px;
}

#tllrekyvog .gt_indent_5 {
text-indent: 25px;
}

#tllrekyvog .katex-display {
display: inline-flex !important;
margin-bottom: 0.75em !important;
}

#tllrekyvog div.Reactable > div.rt-table > div.rt-thead > div.rt-tr.rt-tr-group-header > div.rt-th-group:after {
height: 0px !important;
}

Characteristic N Overall
N = 200
1
Drug A
N = 98
1
Drug B
N = 102
1
Age [yrs] 189 47.2 (14.3) 47.0 (14.7) 47.4 (14.0)
    (Missing) 11 7 4
Marker Level [ng/mL] 190 0.9 (0.9) 1.0 (0.9) 0.8 (0.8)
    (Missing) 10 6 4
T Stage 200
    T1 53 (27%) 28 (29%) 25 (25%)
    T2 54 (27%) 25 (26%) 29 (28%)
    T3 43 (22%) 22 (22%) 21 (21%)
    T4 50 (25%) 23 (23%) 27 (26%)
Grade 200
    I 68 (34%) 35 (36%) 33 (32%)
    II 68 (34%) 32 (33%) 36 (35%)
    III 64 (32%) 31 (32%) 33 (32%)
Tumor Response 193 61 (32%) 28 (29%) 33 (34%)
    (Missing) 7 3 4
Patient Died 200 112 (56%) 52 (53%) 60 (59%)
Months to Death/Censor 200 19.6 (5.3) 20.2 (5.0) 19.0 (5.5)
1 Mean (SD); n (%)


Since the {gtsummary} packages contains functions to convert the {gtsummary} object to object types required by other popular table-specific R packages, e.g. as_kable_extra(), as_flextable() etc., {gtsummary} tables can be easily rendered as .docx, .pdf or .html. The following example shows how to print a {gtsummary} object using the {kableExtra} package.

library(kableExtra)
tbl.gts %>%
  as_kable_extra(
    caption = "Baseline characteristics",
    booktabs = TRUE,
    align = "lcccc",
  ) %>%
  kable_classic(full_width = FALSE)
Baseline characteristics
Characteristic N Overall
N = 200
Drug A
N = 98
Drug B
N = 102
Age [yrs] 189 47.2 (14.3) 47.0 (14.7) 47.4 (14.0)
(Missing) 11 7 4
Marker Level [ng/mL] 190 0.9 (0.9) 1.0 (0.9) 0.8 (0.8)
(Missing) 10 6 4
T Stage 200
T1 53 (27%) 28 (29%) 25 (25%)
T2 54 (27%) 25 (26%) 29 (28%)
T3 43 (22%) 22 (22%) 21 (21%)
T4 50 (25%) 23 (23%) 27 (26%)
Grade 200
I 68 (34%) 35 (36%) 33 (32%)
II 68 (34%) 32 (33%) 36 (35%)
III 64 (32%) 31 (32%) 33 (32%)
Tumor Response 193 61 (32%) 28 (29%) 33 (34%)
(Missing) 7 3 4
Patient Died 200 112 (56%) 52 (53%) 60 (59%)
Months to Death/Censor 200 19.6 (5.3) 20.2 (5.0) 19.0 (5.5)
1 Mean (SD); n (%)

Other packages

Other packages I have used in the past are {qwraps2} and {tableone}. However, the {gtsummary} package seems to offer the most convenient way to create a “table one”.

Mastering Multiple Imputations using R. Part I: Grouped imputations

Intro

Multiple imputation is one of the great ideas in statistical science. The technique is simple, elegant and powerful. It is simple because it fills the holes in the data with plausible values. It is elegant because the uncertainty about the unknown data is coded in the data itself. And it is powerful because it can solve “other” problems that are actually missing data problems in disguise (Stef van Buuren).

Multiple imputation has become a standard for the treatment of missing data. Unlike conventional imputation methods (mean, median imputation etc.) that merely replace the missing values with a single value, “Rubin’s (1987) multiple imputation procedure replaces each missing value with a set of plausible values that represent the uncertainty about the right value to impute” (see Yang C. Yuan). According to Wikipedia, “the primary method of multiple imputation is multiple imputation by chained equations (MICE).” In R, the MICE method is implemented by the mice package.

In this blog post, I show how to conduct multiple imputation on a numeric variable grouped by another variable indicating the meaning of the numeric values. The “grouping” is done using a function of the miceadds package and works similar to the group_by() function of the dplyr package.

Data and packages

library(tidyverse)
library(mice)
library(miceadds)
library(labelled)

The dataset we use in this blog post comes along with the mice package and is called brandsma. the dataset contains “data from 4106 pupils attending 216 schools. This dataset includes all pupils and schools with missing data.” (see mice package description). With the following code, we select the variables we need for multiple imputations and for calculating statistical models. The variables are:

  • sch (School number)
  • pup (Pupil ID)
  • sex (Sex of pupil)
  • min (Minority member)
  • iqv (IQ verbal)
  • iqp (IQ performal)
  • lpr (language score PRE)
  • lpo (language score POST)
  • apr (Arithmetic score PRE)
  • apo (Arithmetic score POST)
df.RAW <- mice::brandsma %>%
  select(sch, pup, sex, min, iqv, iqp, lpr, lpo, apr, apo) %>%
  filter(!is.na(sex)) %>%
  set_variable_labels(.labels = c(
    "School number", "Pupil ID",
    "Sex of pupil", "Minority member",
    "IQ verbal", "IQ performal",
    "language score PRE", "language score POST",
    "Arithmetic score PRE", "Arithmetic score POST"
  ))
glimpse(df.RAW)
## Observations: 4,096
## Variables: 10
## $ sch <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ pup <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, …
## $ sex <int> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ min <int> 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0,…
## $ iqv <dbl> -1.3535094, 2.1464906, 3.1464906, 2.6464906, -2.3535094, -0.8535094, -3.8535094, -2…
## $ iqp <dbl> -3.72274979, 3.27725021, 1.27391680, -1.05608313, -0.05608313, -1.05608313, -4.3900…
## $ lpr <dbl> 33, 44, 36, 36, 33, 29, 19, 22, 20, 44, 34, 31, 18, 36, 31, 34, 31, 32, 23, 20, 27,…
## $ lpo <dbl> NA, 50, 46, 45, 33, 46, 20, 30, 30, 57, 36, 36, 29, 40, 41, 47, 33, 37, 29, 26, 37,…
## $ apr <dbl> 10, 18, 14, 12, 10, 13, 8, 8, 7, 17, 10, 14, 11, 10, 10, 12, 9, 13, 9, 9, 7, 16, 16…
## $ apo <dbl> 12, 30, 24, 19, 24, 26, 9, 13, 13, 30, 23, 22, 19, 23, 18, 22, 15, 21, 13, 12, 11, …

Using the pivot_longer() function of the tidyr package, we transform the original data into a long table. The values of the quantitative variables are stored in the new variable VALUE. The ENDPOINT variable indicates to which variable each value belongs. Furthermore, we create a new variable (IMPUTED) which indicates missing values that need to be imputed. In addition, we assign value labels to the variables sex and min.

df.RAW <- df.RAW %>%
  pivot_longer(names_to = 'ENDPOINT', values_to = 'VALUE', cols = c(iqv:apo)) %>%
  mutate(IMPUTED = is.na(VALUE),
         sex = factor(sex, levels = c(0, 1), labels = c('f', 'm')),
         min = factor(min, levels = c(0, 1), labels = c('n', 'y')))
glimpse(df.RAW)
## Observations: 24,576
## Variables: 7
## $ sch      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ pup      <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, …
## $ sex      <fct> m, m, m, m, m, m, m, m, m, m, m, m, f, f, f, f, f, f, f, f, f, f, f, f, f, f, …
## $ min      <fct> y, y, y, y, y, y, y, y, y, y, y, y, n, n, n, n, n, n, y, y, y, y, y, y, n, n, …
## $ ENDPOINT <chr> "iqv", "iqp", "lpr", "lpo", "apr", "apo", "iqv", "iqp", "lpr", "lpo", "apr", "…
## $ VALUE    <dbl> -1.35350942, -3.72274979, 33.00000000, NA, 10.00000000, 12.00000000, 2.1464905…
## $ IMPUTED  <lgl> FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…

Specifying the imputation model

init <- mice(df.RAW, maxit = 0)
meth <- init$method
predM <- init$predictorMatrix

# groupwise imputation of variable VALUE
meth["VALUE"] <- "bygroup"

# remove as predictor
predM[, c('IMPUTED')] <- 0
# specify name of the grouping variable ('ENDPOINT') and imputation method ('pmm')
group <- list( "VALUE" = "ENDPOINT" )
imputationFunction <- list("VALUE" = "pmm" )

Running multiple imputations

## grouped imputation

mids_endpoint <- df.RAW %>%
  mice(.,
       method = meth,
       predictorMatrix = predM,
       m = 3,
       seed = 2020,
       printFlag = FALSE,
       group = group,
       imputationFunction = imputationFunction 
  )

Creating complete data

df.COMPLETE <- mice::complete(mids_endpoint, include = FALSE) %>% 
  mutate(ENDPOINT = factor(ENDPOINT, levels = c('iqv', 'iqp', 'apr', 'apo', 'lpr', 'lpo')))

Plotting observed and imputed values

ggplot(df.COMPLETE, aes(x = ENDPOINT, y = VALUE)) +
  geom_jitter(aes(colour = IMPUTED, alpha = IMPUTED), size = 0.5) +
  geom_violin(alpha=0) +
  scale_colour_manual(values = c('grey', 'red3')) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(colour = NULL, alpha = NULL, x = NULL, y = NULL,
       title = 'The distribution of observed and imputed values',
       subtitle = 'Red dots indicate imputed values') 

plot of chunk imp-plot

Scoring Questionnaires: QSCORER version 0.0.10 has been released

A new release of the qscorer package is now available on GitHub.

Scoring procedures

The package (version 0.0.10) provides procedures for scoring the following health-related questionnaires:

  • Atrial Fibrillation Effect on QualiTy-of-Life Questionnaire (AFEQT)
  • Beck Depression Inventory (BDI, BDI-II)
  • Behavior Rating Inventory of Executuve Function, adult version (BRIEF-A)
  • Dutch Eating Behavior Questionnaire, German version (DEBQ)
  • Eating Disorder Examination Questionnaire, short form (EDE-Q8)
  • EORTC QLQ-C30 Quality of Life Questionnaire
  • Epworth Sleepiness Scale (ESS)
  • European Quality of Life Five Dimension Three Level Scale Questionnaire (EQ-5D-3L)
  • General Self-Efficacy Scale (GSES)
  • Hospital Anxiety and Depression Scale (HADS)
  • Impact of Weight on Quality of Life-Lite Questionnaire (IWQOL-Lite)
  • Internaltional Index of Erectile Function, short form (IIEF-5)
  • International Physical Activity Questionnaire, short form (IPAQ)
  • Patient Health Questionnaire-9 (PHQ-9)
  • Patient Health Questionnaire-15 (PHQ-15)
  • Rosenberg Self-Esteem Scale (RSES)
  • Severe Respiratory Questionnaire (SRI)
  • Skala zur Selbstregulation (REG) (German)
  • Social Support Questionnaire, short form (F-SozU-K-7) (German)
  • Weight Bias Internalization Scale (WBIS)
  • Weight Self-Stigma Questionnaire (WSSQ)
  • Yale Food Addiction Scale, Version 2.0 (YFAS V2.0)

The scoring functions usually have the following arguments:

  • data: A data frame containing the items of the questionnaire in a pre-specified order. The data.frame may contain further variables.
  • items: A character vector with the item names in a pre-specified order, or a numeric vector indicating the column numbers of the items in data.
  • keep: Logical, whether to keep the single items and whether to return variables containing the number of non-missing items on each scale for each respondent. The default is TRUE.
  • nvalid: A numeric value indicating the number of non-missing items required for score calculations. A default is pre-specified for each questionnaire.
  • digits: Integer of length one: value to round to. No rounding by default.

If there are items that need to be scored reversely:

  • reverse: Items to be scored reversely. These items can be specified either by name or by index. Default is pre-specified.

Data

Furthermore, the package contains real-life data of the following questionnaires:

  • Beck Depression Inventory II (df.test.bdi)
  • Dutch Eating Behavior Questionnaire, German version (df.test.debq)
  • EORTC QLQ-C30 Quality of Life Questionnaire (df.test.eortcc30)
  • European Quality of Life Five Dimension Three Level Scale Questionnaire (df.test.eq5d3l)
  • Hospital Anxiety and Depression Scale (df.test.hads)
  • Patient Health Questionnaire-9 (df.test.phq9)

These real-life data are publicly available; sources can be found in the package documentation. The following output shows the df.test.phq9 data set (data of the PHQ-9 (a.k.a. PHQ-D) questionnaire).

dplyr::glimpse(qscorer::df.test.phq9)
## Observations: 1,337
## Variables: 11
## $ age    <dbl> 79, 62, 71, 65, 63, 68, 52, 88, 71, 77, 69, 55, 71, 65, 69, 65, 71, 53, 86, 58, 69, 70, 69, 69, 74, 74, 60…
## $ gender <fct> f, f, m, f, f, f, m, m, f, f, f, m, f, f, f, f, f, m, f, f, f, m, f, m, m, m, f, f, m, f, m, m, f, f, f, f…
## $ phq9_1 <dbl> 1, 3, 2, 0, 0, 0, 1, 0, 0, 2, 1, 1, 0, 3, 0, 0, 0, 2, 0, 0, 2, 3, 0, 0, 0, 3, 1, 0, 1, 1, 1, 1, 0, 1, 1, 2…
## $ phq9_2 <dbl> 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 1, 0, 1…
## $ phq9_3 <dbl> 3, 2, 2, 2, 1, 0, 1, 3, 1, 0, 1, 1, 0, 3, 1, 0, 0, 0, 0, 0, 1, 3, 1, 0, 0, 3, 0, 0, 1, 1, 1, 1, 3, 3, 1, 1…
## $ phq9_4 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 2, 1, 3, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 3, 0, 0, 1, 1, 1, 2, 2, 2, 1, 2…
## $ phq9_5 <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 2, 0, 1, 0, 0, 0, 0, 0, 0, 1, 3, 0, 1, 0, 2, 0, 0, 1, 0, 0, 1, 3, 1, 1, 0…
## $ phq9_6 <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0…
## $ phq9_7 <dbl> 0, 1, 1, 1, 0, 1, 0, 0, 0, 3, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 3, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1…
## $ phq9_8 <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1…
## $ phq9_9 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Questionnaires without real-life data may be simulated using the simulate_items() function. The following code simulates data for the PHQ-15 questionnaire:

df.phq15 <- qscorer::simulate_items(num_cols = 15, item_name = 'PHQ15', item_range = 0:2)

Two more arguments have default values and don't need to be explicitly specified:

  • num_rows (Number of rows; default = 100)
  • prop_mis (Proportion of missing values to be put into each item; default = 0.05)

The function returns a data frame including the items of the questionnaire and an id variable.

dplyr::glimpse(df.phq15)
## Observations: 100
## Variables: 16
## $ id       <chr> "001", "002", "003", "004", "005", "006", "007", "008", "009", "010", "011", "012", "013", "014", "015",…
## $ PHQ15_1  <int> 2, 2, 2, 1, 2, 1, 1, 1, 2, 0, 1, 1, 0, 1, 2, 0, 2, 2, 0, 0, 0, 0, 2, NA, 2, 1, NA, 1, 2, 1, 0, 2, 2, 0, …
## $ PHQ15_2  <int> 2, 1, 1, 2, 0, 0, 2, NA, 1, 1, 1, NA, 1, 2, 1, 0, 1, 1, 1, 2, 1, 2, 1, 2, 0, 2, 2, 2, 0, 2, 2, 1, 1, 1, …
## $ PHQ15_3  <int> 2, 1, 2, 1, NA, 0, 0, 0, 0, 2, 1, 0, 1, 0, 2, 2, 0, 1, 2, 0, 2, 1, 0, 0, 0, 0, 0, 2, 1, 1, 1, 2, 0, 1, 1…
## $ PHQ15_4  <int> 1, 0, 1, 1, 1, 1, 2, 2, NA, 2, 0, 2, 2, 2, 1, 2, 2, 2, 0, NA, 2, 1, 2, 1, 0, NA, 1, 1, NA, 0, 0, 2, 2, 0…
## $ PHQ15_5  <int> 0, 0, 2, 0, 1, 1, 2, 1, 0, 2, 0, 0, 2, 1, 0, 0, 1, 1, 1, 1, 0, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2,…
## $ PHQ15_6  <int> 1, 1, 2, 1, 2, 2, 1, 2, 2, 0, NA, 0, 1, 0, 2, 2, 1, 1, 0, 1, 2, 2, 2, 1, 0, 1, 2, 1, 2, 2, 0, 1, 1, 1, 1…
## $ PHQ15_7  <int> 1, 2, 0, 0, 2, 0, 1, 1, 2, 0, 2, 0, 2, NA, 0, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 0, 0, 2, 0, 1, 0, 1, 1, 0, 0…
## $ PHQ15_8  <int> 2, 1, 1, 1, 2, 2, 1, 0, 0, NA, 1, 0, 1, 1, 2, 2, 2, 2, 1, 1, 1, 2, 2, 2, 0, 2, 0, 2, 1, 0, 0, 0, 2, 0, 1…
## $ PHQ15_9  <int> 0, 2, 2, 1, 1, 1, 0, 2, 0, 1, 2, 0, 0, 1, 2, 0, 1, 0, 0, 2, 2, NA, 2, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 0…
## $ PHQ15_10 <int> 0, 2, 2, 0, 2, 0, 0, 1, 0, 0, NA, NA, 0, 0, 2, 2, 0, NA, 2, 1, 1, 2, 2, 1, 2, 2, 0, 1, 1, 1, 1, 0, 0, 1,…
## $ PHQ15_11 <int> 1, 2, NA, 0, 2, 0, 2, 2, 1, 2, 2, 1, 2, 0, 1, 1, 0, 0, 2, 2, 2, 1, 2, 2, 1, 2, 0, 2, 2, 2, 1, 1, 0, 0, 2…
## $ PHQ15_12 <int> 2, 1, 2, 2, 0, 0, 0, 0, 1, 1, 2, 2, 1, 0, 0, 2, 1, 0, 0, 1, 2, 0, 1, 0, 1, 2, 1, 0, 2, NA, 0, 1, 0, 0, 0…
## $ PHQ15_13 <int> 2, 0, 0, 1, 1, 0, 0, 2, 2, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 2, 1, 0, 1, 2, 0, 1, 2, NA, 1, 2, 0, 1, 0, 2…
## $ PHQ15_14 <int> 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 2, 1, 0, 0, 0, 1, 0, 0, 2, 1, 2, 2, 2, 1, 0, 2, 1, 0, 2, 1, 2, 2, 2,…
## $ PHQ15_15 <int> NA, 1, 0, 1, 2, 0, 1, 1, 2, 0, 2, 0, 0, 1, 2, 2, 0, 1, 1, 2, 1, 1, 0, 0, 0, 0, 0, 0, NA, 0, 2, NA, 0, NA…

Now, the corresponding scoring function (scoring_phq15()) can be apllied to this data:

library(dplyr)
df.phq15 <- df.phq15 %>% 
  qscorer::scoring_phq15(.,items = 2:16, keep = FALSE)

The function returns a data frame including the PHQ-15 score and the corresponding cut-off values (severe, moderate, mild, …):

dplyr::glimpse(df.phq15)
## Observations: 100
## Variables: 3
## $ id           <chr> "001", "002", "003", "004", "005", "006", "007", "008", "009", "010", "011", "012", "013", "014", "0…
## $ score.phq15  <dbl> 17.142857, 17.000000, 19.285714, 13.000000, 20.357143, 8.000000, 14.000000, 16.071429, 15.000000, 11…
## $ cutoff.phq15 <fct> Severe, Severe, Severe, Moderate, Severe, Mild, Moderate, Severe, Severe, Moderate, Severe, Mild, Se…

Documentation

The package documentation is hosted on GitHub Pages.

Formatting p-values: A curated list of R functions

Intro

Reporting results of statistical analyses usually goes along with reporting p-values which indicate the probability under the null hypothesis of sampling a test statistic at least as extreme as that which was observed.

R offers quite a lot of options to format p-values. This blog post will give an overview (which is by no means comprehensive).

First, we compute a vector with six p-values and one missing value.

p <- c(0.50, 0.12, 0.045, 0.011, 0.009, 0.0000234, NA)
options(scipen = 9999) # suppress scientific notation

Formatting p-values

base package (Base R)

The first function I'm going to introduce, is part of the base package, which is part of the default R installation. Thus, no install.packages() and library() function is required to use it.

Using the digits option, the number of significant digits can be specified.

format.pval(p)
## [1] "0.500"   "0.120"   "0.045"   "0.011"   "0.009"   "0.00002" "NA"
format.pval(p, 
            digits = 2) # how many significant digits are to be used
## [1] "0.50" "0.12" "0.04" "0.01" "0.01" "0"    "NA"

Hmisc package

With nsmall, the format.pval() function of the popular Hmisc package includes an option to specify the minimum number of digits to the right of the decimal point.

library(Hmisc)
Hmisc::format.pval(p)
## [1] "0.500"   "0.120"   "0.045"   "0.011"   "0.009"   "0.00002" "NA"
Hmisc::format.pval(p,
                   nsmall=3, # the minimum number of digits to the right of the decimal point 
                   digits = 2) # how many significant digits are to be used
## [1] "0.500" "0.120" "0.040" "0.010" "0.010" "0.000" "NA"

scales package

In addition, the pvalue() function of the scales package, has got an option to specify a threshold for rounding the p-value according to a given significance level.

library(scales)
scales::pvalue(p)
## [1] "0.500"  "0.120"  "0.045"  "0.011"  "0.009"  "<0.001" "NA"
scales::pvalue(p,
               accuracy = 0.05, # Number to round to
               decimal.mark = ".", # The character to be used to indicate the numeric decimal point
               add_p = TRUE) # Add "p=" before the value?
## [1] "p=0.50" "p=0.10" "p<0.05" "p<0.05" "p<0.05" "p<0.05" "p=NA"

finalfit package

The p_tidy function of the finalfit package doesn't have an option to specify the number of significant digits. With the digits option, a value for rounding the p-value can be specified.

library(finalfit)
finalfit::p_tidy(p, digits = 2)
## [1] "=0.50" "=0.12" "=0.04" "=0.01" "=0.01" "<0.01" "=NA"
finalfit::p_tidy(p, 
                 digits = 3, # value to round to, no default
                 prefix = NULL) # suppress prefix
## [1] "0.500"  "0.120"  "0.045"  "0.011"  "0.009"  "<0.001" "NA"

psycho package

The format_p function of the psycho package formats the p-values according to predefined significance levels (<0.5, <0.1, <0.01). In addition, stars may be added.

library(psycho)
psycho::format_p(p)
## [1] "> .1"      "> .1"      "< .05*"    "< .05*"    "< .01**"   "< .001***"
## [7] NA
psycho::format_p(p,
                 stars = FALSE) # remove significance stars
## [1] "> .1"   "> .1"   "< .05"  "< .05"  "< .01"  "< .001" NA
psycho::format_p(p,
                 stars_only = TRUE) # return only significance stars
## [1] ""    ""    "*"   "*"   "**"  "***" NA

Scoring questionnaires using the ‘qscorer’ package

Motivation

Health-related questionnaires are used in many psychological studies and clinical trials. I know from my own experience that information about scoring procedures are often scattered among numerous sources and, thus, hard to find. Since I frequently work with questionnaires, I have decided to write qscorer, an R package with scoring procedures for health-related questionnaires.

plot of chunk unnamed-chunk-1

Scoring procedures

The qscorer package (version 0.3.0) provides procedures for scoring the following health-related questionnaires:

  • Atrial Fibrillation Effect on QualiTy-of-Life Questionnaire (AFEQT)
  • Beck Depression Inventory (BDI, BDI-II)
  • Behavior Rating Inventory of Executuve Function, adult version (BRIEF-A)
  • Dutch Eating Behavior Questionnaire, German version (DEBQ)
  • Eating Disorder Examination Questionnaire, short form (EDE-Q8)
  • Epworth Sleepiness Scale (ESS)
  • European Quality of Life Five Dimension Three Level Scale Questionnaire (EQ-5D-3L)
  • General Self-Efficacy Scale (GSES)
  • Hospital Anxiety and Depression Scale (HADS)
  • Impact of Weight on Quality of Life-Lite Questionnaire (IWQOL-Lite)
  • International Physical Activity Questionnaire, short form (IPAQ)
  • Patient Health Questionnaire-9 (PHQ-9)
  • Patient Health Questionnaire-15 (PHQ-15)
  • Rosenberg Self-Esteem Scale (RSES)
  • Severe Respiratory Questionnaire (SRI)
  • Skala zur Selbstregulation (REG) (German)
  • Social Support Questionnaire, short form (F-SozU-K-7) (German)
  • Weight Bias Internalization Scale (WBIS)
  • Weight Self-Stigma Questionnaire (WSSQ)
  • Yale Food Addiction Scale, Version 2.0 (YFAS V2.0)

Documentation

The package documentation is hosted by GitHub Pages.

Installation

You can install the development version of qscorer from GitHub with:

devtools::install_github('nrkoehler/qscorer')

To Do

In the near future, I will add control structures to the scoring functions in order to prevent items with out-of-range values to be scored. qscorer is a growing package. Thus, further scoring procedures will be added.

Statistische Grundlagen mit R: Streuungsmaße

Nachdem ich mich im ersten Teil meiner Serie “Statistische Grundlagen mit R” mit Maßen der zentralen Tendenz beschäftigt habe, erkläre ich heute die Bedeutung verschiedener Steuungsmaße.

Streuungsmaße

Um einen Datenvektor möglichst treffend zu beschreiben, reicht die Kenntnis der Zentralitätsmaße in der Regel nicht aus. Stellen wir uns z.B. vor, dass an einer Grundschule ein Altpapiersammelwettbewerb durchgeführt wird, bei dem die drei Schulklassen, die am meisten Papier gesammelt haben Preise bekommen: die erstplatzierte Klasse 100€, die zweitplatzierte Klasse 50€ und die drittplatzierte Klasse 20€. Wenn wir uns vorstellen, dass die erstplatzierte Klasse insgesamt 800 kg, die zweitplatzierte Klasse 700 kg und die drittplatzierte Klasse 600 kg gesammelt habt, würde uns die Vergabe der Preise im Großen und Ganzen gerecht erscheinen. Stellen wir uns hingegen vor, dass die erstplatzierte Klasse 701 kg, die zweitplatzierte Klasse genau 700 kg und die drittplatzierte Klasse 699 kg Papier gesammelt hat, erscheint die Preisvergabe weniger angemessen zu sein, weil die Unterschiede zwischen den gesammelte Altpapiermengen äußerst klein sind und möglicherweise sogar auf Messungenauigkeiten (beim Wiegen des Papiers) zurückzuführen sind. Berechnen wir den Mittelwert für beide Szenarien, sehen wir jedoch keinen Unterschied; beide Mittelwerte sind identisch:

mean(c(600, 700, 800)) == mean(c(699, 700, 701))
## [1] TRUE

Mit Streuungsmaßen lässt sich, wie der Name schon sagt, die Streuung bzw. Entfernung der einzelnen Datenpunkte von einem Zentralitätsmaß beschreiben.

Spanne

Das wohl einfachste Streuungsmaß ist die Spanne (engl.: Range). Diese lässt sich berechnen, indem man den kleinsten Wert vom größen des Datenvektors abzieht:

spanne1 <- 800 - 600
spanne2 <- 701 - 699

Obwohl beide Datenvektoren denselben Mittelwert haben, unterscheiden sich die Spannen deutlich (200 vs. 2). Die Spanne ist als Streuungsmaß allerdngs nur sehr eingeschränkt zu gebrauchen, weil sie sehr stark durch Extremwerte beeinflusst wird. So könnte man sich vorstellen, dass bei dem besagten Altpapiersammelwettbewerb 8 von 10 Schulklassen zwischen 400 und 500 kg Altpapier sammeln, eine Klasse 1.000 kg und eine Klasse überhaupt kein Papier. Obwohl die meisten Datenpunkte zwischen 400 und 500 kg liegen, nimmt die Spanne den Wert 1.000 (1.000 – 0) an.

Abweichungen vom Mittelwert

Eine weitere Möglichkeit zur Beschreibung der Streuung eines Datenvektors ist die Abweichung der einzelnen Datenpunkte vom Mittelwert aller Datenpunkte. Bezugnehmend auf unser Beispiel, weichen im ersten Szenario zwei von drei Datenpunkten um jeweils 100 kg vom Mittelwert ab, während der dritte Wert mit dem Mittelpunkt identisch ist. Bildet man die Summe dieser Abweichungen, erhalten wir das folgende Ergebnis:

-100 + 0 + 100
## [1] 0

Berechnet man die Streuung nach dieser Gleichung, muss diese zwangsläufig den Wert 0 ergeben, da sich positive und negative Differenzen gegenseitig aufheben. Um dies zu verhindern, ließen sich z.B. die absoluten Beträge der Differenzen aufaddieren. In der Statistik entledigt man sich des Minuszeichens jedoch durch eine Quadrierung der einzelnen Differenzen.

Summe der quatrierten Abweichungen

Die einzelnen quatrierten Differenzen werden zu einer Quadratsumme zusammengefasst. Für unsere beiden Szenarien der Altpapiersammelaktion lassen sich die Quadratsummen folgendermaßen berechnen:

(qs1 <- (-100)^2 + 0^2 + 100^2)
## [1] 20000
(qs2 <- (-1)^2 + 0^2 + 1^2)
## [1] 2

Als Streuungsmaß hat die Quadratsumme jedoch den Nachteil, dass ihr Betrag mit steigender Anzahl an Datenpunkten zunimmt.

Varianz

Um eine Abhängikeit des Streungsmaßes von der Anzahl der Datenpunkte zu verhindern, lässt sich die Summe der quatrierten Abweichungen durch die Anzahl der Datenpunkte (Beobachtungen) teilen. Mit dieser Gleichung haben wir die Varianz eines Datenvektors definiert. Um zu demonstrieren, wie sich die Varianz eines Datenvektors mit R berechnen lässt, kommen wir wieder auf unser Beispiel vom Altpapiersammelwettbewerb zurück:

altpapier <- c(600, 700, 800) # vgl. Szenario 1
(altpapier - mean(altpapier)) ^ 2 # Liste quatrierter Abweichungen vom Mittelwert
## [1] 10000     0 10000
sum((altpapier - mean(altpapier)) ^ 2) # Summe der quatrierten Abweichungen
## [1] 20000
(varianz <- sum((altpapier - mean(altpapier)) ^ 2) / length(altpapier)) # Teile durch Anzahl der Beobachtungen
## [1] 6666.667

Standardabweichung

Die Interpretation der Varianz gestaltet sich jedoch schwierig, da sie die Einheiten des Datenvektors (in unserem Beispiel Kilogramm) quatriert widergibt. So hat der Vektor altpapier eine Varianz von etwa 6666.7 kg². Um die Standardabweichung zu berechnen, braucht man nun nur noch die Quatratwurzel der Varianz ziehen:

sqrt(varianz)
## [1] 81.64966

Im Gegensatz zur Varianz, hat die Standardabweichung dieselbe Einheit wie der Datenvektor, in unserem Fall etwa 81.6 kg (Szenario 1) und 0.816 kg (Szenario 2).

Empfohlene Literatur

  • Jeffrey M. Stanton, Reasoning with Data: An Introduction to Traditional and Bayesian Statistics Using R, Guilford Press: 2017.
Design a site like this with WordPress.com
Get started