ggplot2 – Scripts & Statistics

Drawing a Fish Curve using R and ggplot2

Intro

Recently, I wondered whether there is a way to draw a fish shape using a mathematical function. Since I did not find a ready-made R function, I tried to write the function by myself. The equations, I've used for writing this function can be found on WolframMathWorld.

The function

The fish_curve() function requires the ggplot2 and the dplyr package. It creates a data frame with two variables (x and y) and 10.000 observations. Finally, the data points are plotted using ggplot2.

fish_curve <- function(colour='black', size = 5){
  library(ggplot2)
  library(dplyr)
  data.frame(
    x = cos(1:10000) - sin(1:10000)^2 / sqrt(2),
    y = cos(1:10000) * sin(1:10000)
  ) %>%
    ggplot(., aes(x, y)) +
    geom_point(colour = colour, size = size) +
    theme_void()
}

Function call with default parameters

With colour and size the fish_curve() function allows the user to specify two parameters; that is colour and size of the plotted points. The default values are black for colour and 5 for size.

(p1 <- fish_curve())

plot of chunk fish-1

Customization

In the following example, we customize colour and size of the fish shape:

(p2 <- fish_curve(colour = 'blue', size = 1))

plot of chunk fish-2

And finally, we place the two plots side by side using the patchwork package:

library(patchwork)
p1 + p2

plot of chunk unnamed-chunk-1

How to order factors by level frequency and level name

Intro

Quite frequently, factor variables are ordered by level frequency. However, factor levels having only a few observations are sometimes collapsed into one level usually named “others”. Since this level is usually not of particular interest, it may be a good idea to put this level in the last position of the plot rather than ordering it by level frequency. In this blog post, I’m going to show how to order a factor variable by level frequency and level name.

To replicate the R code I’m going to use in this post, four R packages must be loaded:

library(dplyr) # for data manipulation
library(ggplot2) # for plotting data
library(gghighlight) # ggplot2 extension for highlighting values

The dataset I’m going to use in this post (mtcars) is part of the datasets package.

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

In the first code chunk, we:

extract the first word of each car name and write it into a new variable called “brand”,
rename all car brands starting with “M” (Mazda, Merc, Maserati) to “Others” and
calculate the median miles per gallon (mpg) for each car brand.

df.mtcars %
  mutate(name = str_extract(rownames(.), "^\\w+\\b"),
         brand = str_replace(name, "^M\\w+", 'Others')) %>%
  group_by(brand) %>%
  summarize(mpg = median(mpg))
df.mtcars$brand

##  [1] "AMC"      "Cadillac" "Camaro"   "Chrysler" "Datsun"   "Dodge"   
##  [7] "Duster"   "Ferrari"  "Fiat"     "Ford"     "Honda"    "Hornet"  
## [13] "Lincoln"  "Lotus"    "Others"   "Pontiac"  "Porsche"  "Toyota"  
## [19] "Valiant"  "Volvo"

The following code chunk is to reorder the brand variable by level frequency using the reorder() function.

df.mtcars %
  mutate(brand = as.factor(brand),
         brand = reorder(brand, mpg))
levels(df.mtcars$brand)

##  [1] "Cadillac" "Lincoln"  "Camaro"   "Duster"   "Chrysler" "AMC"     
##  [7] "Dodge"    "Ford"     "Valiant"  "Others"   "Pontiac"  "Ferrari" 
## [13] "Hornet"   "Volvo"    "Datsun"   "Porsche"  "Toyota"   "Fiat"    
## [19] "Honda"    "Lotus"

As we can see, the bar representing the “Others” level is roughly in the middle of the plot.

ggplot(df.mtcars, aes(brand, mpg, fill = brand)) +
  coord_flip() +
  geom_col(width = 0.5) +
  gghighlight(brand == 'Others', unhighlighted_colour = "cornflowerblue") +
  scale_fill_manual(values = c("grey")) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(x = NULL, 
       y = 'Miles per Gallon',
       title = "Factor variable ordered by level frequency")

plot of chunk unnamed-chunk-4

To put the bar representing the “Others” level at the bottom of the plot, we have to set “Others” as reference category using the relevel() function.

df.mtcars %
  mutate(brand = relevel(brand, ref = "Others"))
levels(df.mtcars$brand)

##  [1] "Others"   "Cadillac" "Lincoln"  "Camaro"   "Duster"   "Chrysler"
##  [7] "AMC"      "Dodge"    "Ford"     "Valiant"  "Pontiac"  "Ferrari" 
## [13] "Hornet"   "Volvo"    "Datsun"   "Porsche"  "Toyota"   "Fiat"    
## [19] "Honda"    "Lotus"

Finally, the bar representing the “Others” level appears at the desired position.

ggplot(df.mtcars, aes(brand, mpg, fill = brand)) +
  coord_flip() +
  geom_col(width = 0.5) +
  gghighlight(brand == 'Others', unhighlighted_colour = "cornflowerblue") +
  scale_fill_manual(values = c("grey")) +
  theme_bw() +
  theme(legend.position = 'none') +
  labs(x = NULL, 
       y = 'Miles per Gallon',
       title = "Factor variable ordered by level frequency and level name")

plot of chunk unnamed-chunk-6

PS: In both plots, the gghighlight() function of the gghighlight package was used to highlight the desired factor level.

Sample Size Calculation Using R

Preample

“Sample Size Calculation Using R” – the title of this blog post – sounds very comprehensive. However, this post only deals with the sample size collection for one particular test (two sample t-test). A collection of functions for basic sample size calculation can be found in the pwr package.

Introduction

A couple of months ago, I attended some statistics course about sample size calculation and sample size adjustment. Amongst others, we spoke about a randomized placebo-controlled trial with patients suffering of depression. The trial's primary end point was the difference between placebo and treatment group in the HAM-D total score (some depression score) between baseline and the end of therapy. During the course, the professor asked us to write a “two-liner” returning the sample size per group for a two-sided, two sample t-test with the following parameters:

alpha = 0.05,
sigma = 8 ,
delta = 4 and
power ≤ 0.90

Here is my solution:

In a first step, I wrote a loop returning the power rising from 0.1 to 0.9 and the corresponding sample sizes respectively. The loop may either be a repeat or a while loop.

While Loop

A while loop runs and repeats while a specified condition returns TRUE. The loop terminates only when the condition evaluates to FALSE. In our case, the loop stops evaluating as soon as p > 0.9.

p <- 0.05
while(p <= 0.9){
  p <- p + 0.05
  t <- power.t.test(sd = 8, delta = 4, sig.level=0.05, power = p, type="two.sample", alternative = "two.sided")
  print(paste(t$power, ceiling(t$n), sep = ' - '))
}

Repeat Loop

In a repeat loop it is required to define a so-called break declaration to stop repeating the code. Just like the while loop, our repeat loop stops evaluating as soon as p > 0.9.

Both, the while and the repeat loop return the same result.

p <- 0.05
repeat{
  p <- p + 0.05
  t <- power.t.test(sd = 8, delta = 4, sig.level=0.05, power = p, type="two.sample", alternative = "two.sided")
  print(paste(t$power, ceiling(t$n), sep = ' - '))
  if(p > 0.90){
    break
  }
}

## [1] "0.1 - 5"
## [1] "0.15 - 8"
## [1] "0.2 - 12"
## [1] "0.25 - 15"
## [1] "0.3 - 18"
## [1] "0.35 - 21"
## [1] "0.4 - 25"
## [1] "0.45 - 28"
## [1] "0.5 - 32"
## [1] "0.55 - 36"
## [1] "0.6 - 41"
## [1] "0.65 - 45"
## [1] "0.7 - 51"
## [1] "0.75 - 57"
## [1] "0.8 - 64"
## [1] "0.85 - 73"
## [1] "0.9 - 86"

For a very good introduction to loops, see Davies (2016): The Book of R. No Starch Press: San Francisco.

Storing the Results

In case we want to use the results for whatever purpose, we need to store them in a vector. The following example heavily borrows from StackOverflow. The loop returns a data frame containing two variables: pwr (power) and num (sample size).

p <- 0.05
pwr <- c()
num <- c()
while(p <= 0.9){
  p <- p + 0.05
  t <- power.t.test(sd = 8, delta = 4, sig.level=0.05, power = p, type="two.sample", alternative = "two.sided")
  pwr <- c(pwr, t$power)
  num <- c(num, ceiling(t$n))
  df <- data.frame(pwr = pwr, n = num)
}
rm(list = c(setdiff(ls(), c("df"))))
dplyr::glimpse(df)

## Observations: 17
## Variables: 2
## $ pwr <dbl> 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55...
## $ n   <dbl> 5, 8, 12, 15, 18, 21, 25, 28, 32, 36, 41, 45, 51, 57, 64, ...

Plotting Power and Sample Size

Finally, we plot the relation between power and sample size using the ggplot2 package. The theme I use, stems from the ggpubr package.

ggplot(df, aes(pwr, n)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.1)) +
  scale_y_continuous(breaks = seq(0, max(df$n)+10, 10)) +
  geom_text(aes(label = n), nudge_y = 4) +
  ggpubr::theme_pubr() +
  labs(x = 'Power', y = 'Sample Size', 
       title = 'Power and Sample Size', 
        subtitle = expression('Two-sided, two sample t-test with' ~ sigma ~ '= 8,' ~ delta ~ '= 4,' ~ alpha ~ '= 0.05'))

plot of chunk unnamed-chunk-4

As we see, a power of 0.90 requires a sample size of 86 patients per treatment group for a two-sided, two sample t-test with sd = 8, delta = 4, and alpha = 0.05.

How to Plot Venn Diagrams Using R, ggplot2 and ggforce

Intro

Venn diagrams – named after the English logician and philosopher John Venn – “illustrate the logical relationships between two or more sets of items” with overlapping circles.

In this tutorial, I'll show how to plot a three set venn diagram using R and the ggplot2 package.

Packages and Data

For the R code to run, we need to install and load three R packages. Unlike tidyverse and ggforce, the limma package must be installed from Bioconductor rather than from CRAN.

Moreover, we create a random data frame using the rbinom() function.

source("http://www.bioconductor.org/biocLite.R")
biocLite("limma")
library(limma)
library(tidyverse)
library(ggforce)
set.seed((123))
mydata <- data.frame(A = rbinom(100, 1, 0.8),
                     B = rbinom(100, 1, 0.7),
                     C = rbinom(100, 1, 0.6)) %>%
                       mutate_all(., as.logical)

Drawing the Circles

Next, we create a data frame defining the x and y coordinates for the three circles we want to draw and a variable defining the labels. For plotting the circles – the basic structure of our venn diagram – we need the geom_circle() function of the ggforce package. We employ the geom_circle()-function of the ggforce package to actually draw the circles. With the parameter r (= 1.5), we define the radius of the circles.

df.venn <- data.frame(x = c(0, 0.866, -0.866),
                      y = c(1, -0.5, -0.5),
                      labels = c('A', 'B', 'C'))
ggplot(df.venn, aes(x0 = x, y0 = y, r = 1.5, fill = labels)) +
    geom_circle(alpha = .3, size = 1, colour = 'grey') +
      coord_fixed() +
        theme_void()

plot of chunk unnamed-chunk-2

Furthermore, we need a data frame with the values we want the plot and the coordinates for plotting these values. The values can be obtained using the vennCounts() function of the limma package. Since ggplot2 requires data frames we need to first transform the vdc object (class VennCounts) into a matrix and then into a data frame. In addition, we need to add the x and y coordinates for plotting the values.

vdc <- vennCounts(mydata)
class(vdc) <- 'matrix'
df.vdc <- as.data.frame(vdc)[-1,] %>%
  mutate(x = c(0, 1.2, 0.8, -1.2, -0.8, 0, 0),
         y = c(1.2, -0.6, 0.5, -0.6, 0.5, -1, 0))

The final Plot

Finally, we'll customize the look of our venn diagram and plot the values.

ggplot(df.venn) +
  geom_circle(aes(x0 = x, y0 = y, r = 1.5, fill = labels), alpha = .3, size = 1, colour = 'grey') +
  coord_fixed() +
  theme_void() +
  theme(legend.position = 'bottom') +
  scale_fill_manual(values = c('cornflowerblue', 'firebrick',  'gold')) +
  scale_colour_manual(values = c('cornflowerblue', 'firebrick', 'gold'), guide = FALSE) +
  labs(fill = NULL) +
  annotate("text", x = df.vdc$x, y = df.vdc$y, label = df.vdc$Counts, size = 5)

plot of chunk unnamed-chunk-4

How to plot GPS data using R, ggplot2 and ggmap

Intro

A couple of months ago, I bought a mobile GPS navigation device. The device makes it possible to record the routes we travel with our car. Since the routes can by exported as .gpx files, it is rather easy to plot them using R.

Packages

To replicate this blog post, three R packages are required: plotKML to import the data into R and ggplot2 and ggmap to plot the data.

library(plotKML)
library(ggplot2)
library(ggmap)

As a prerequisite for the installation of the plotKML package on my Linux Mint 18.3 OS (based on Ubuntu 16.04), I needed to add the ubuntugis PPA.

sudo add-apt-repository ppa:ubuntugis/ppa
sudo apt-get update
sudo apt-get dist-upgrade

Import

All what's required for importing the .gpx file into R is the readGPX() function of the plotKML package. After importing the file into R, we receive a list (lst.rd) conatining some meta information and one data frame. To be able to plot the GPS data using ggplot2, we need to subset this list and save the data frame into a new object (df).

lst.rd <- readGPX('29 Dez. 1413.gpx')
df <- lst.rd$tracks[[1]][[1]]

Using the head function, we have a short look on the data:

head(df)

##        lon      lat                 time
## 1 12.98376 51.05915 2017-12-29T12:16:28Z
## 2 12.98377 51.05912 2017-12-29T12:16:29Z
## 3 12.98374 51.05912 2017-12-29T12:16:30Z
## 4 12.98372 51.05914 2017-12-29T12:16:31Z
## 5 12.98372 51.05914 2017-12-29T12:16:32Z
## 6 12.98372 51.05914 2017-12-29T12:16:33Z

As we can see, our data frame consists of three variables: longitude, latitude and time.

Plotting

Whithout much efford, we can plot our GPS data using ggplot2. Please note that GPS coordinates must be plotted as points.

ggplot(df, aes(x = lon, y = lat)) +
  coord_quickmap() +
  geom_point()

plot of chunk plot-1

In addition, the ggmap package offers some functionality to plot the data on maps.

The first example shows how to plot the data on a map provided by Google Maps.

mapImageData <- get_googlemap(center = c(lon = mean(df$lon), lat = mean(df$lat)),
                              zoom = 10,
                              color = 'bw',
                              scale = 1,
                              maptype = "terrain")
ggmap(mapImageData, extent = "device") + # removes axes, etc.
  geom_point(aes(x = lon,
                 y = lat),
             data = df,
             colour = "red3",
             alpha = .1,
             size = .1)

plot of chunk plot-2

The second example shows how to plot the data on a Stamen map.

mapImage <- get_map(location = c(lon = mean(df$lon) - 0.05, lat = mean(df$lat)),
                    source = "stamen",
                    maptype = "toner",
                    zoom = 10)

ggmap(mapImage, extent = "device") + 
  geom_point(aes(x = lon,
                 y = lat),
             data = df,
             colour = "red3",
             size = .2)

plot of chunk plot-3

References

This blog post heavily borrows from the tutorial Mapping GPS Tracks in R. Thanks very much!

How to install and use the hexSticker package

Intro

A couple of days ago, the hexSticker package was published on CRAN. The package provides some functions to plot hexagon stickers that may be used to promote R packages. As described on GitHub, the stickers can be plotted either using base R's plotting function, the lattice package or the ggplot2 package. Moreover, it is also possible to plot image files.

Since I found it quite demanding to install the hexSticker package on a current Linux os (Linux Mint 18.1), I decided to write a short tutorial explaining how to install and use the package on Linux Ubuntu-based operating systems.

Linux packages required

In a first step, we need to open the terminal to install the following software packages. While texinfo is required to build R packages from source, libudunits2-dev, fftw-dev and mffm-fftw1 are needed to install some R packages the hexSticker package depends on (ggforce, fftwtools).

sudo apt-get install texinfo libudunits2-dev fftw-dev mffm-fftw1 libfftw3-dev libtiff5-dev

R packages required

Recently, the fftwtools package was added to CRAN. Thus, it can be installed the usual way:

installed.packages('fftwtools', dep = TRUE)

Finally, the EBImage package must be installed from the Bioconductor repository and the packages ggimage, ggforce and hexSticker must be installed from CRAN.

source("https://bioconductor.org/biocLite.R")
biocLite("EBImage")
install.packages("ggimage")
install.packages("ggforce")
install.packages("hexSticker")

For plotting an example hexsticker we need some data provided by the streetsofle package which must be installed from GitHub:

if (!require("devtools")) install.packages("devtools")
devtools::install_github("nrkoehler/streetsofle")

Plotting

With the following code chunk we create two hexstickers for the streetsofle package. The colours chosen stem from the city flag of Leipzig. The arguments of the sticker() function are explained within the code's comments.

library(hexSticker)
library(ggplot2)
library(streetsofle)
data(streetsofle)

p.1 <- ggplot(aes(x = lon, y = lat), data = shape.ortsteile) + 
  theme_map_le() + 
  coord_quickmap() + 
  geom_polygon(aes(x = lon, y = lat, group = group), 
               fill = NA, 
               size = 0.2, 
               color = "#FFCB00") + 
  geom_polygon(aes(x = lon, y = lat, group = group),
               color = "#FFCB00", 
               size = 1, 
               fill = NA, 
               data = shape.bezirke) 

p.1 <- sticker(p.1,
               package="streetsofle", 
               s_x = 1, # horizontal position of subplot
               s_y = 1.1, # vertical position of subplot
               s_width = 1.4, # width of subplot
               s_height = 1.4, # height of subplot
               p_x = 1, # horizontal position of font
               p_y = .43, # vertical position of font
               p_size = 6, # font size
               p_color = "#FFCB00", # font colour
               h_size = 3, # hexagon border size
               h_fill = "#004CFF", # hexagon fill colour
               h_color = "#FFCB00") # hexagon border colour

p.2 <- ggplot(aes(x = lon, y = lat), data = shape.ortsteile) + 
  theme_map_le() + 
  coord_quickmap() + 
  geom_polygon(aes(x = lon, y = lat, group = group), 
               fill = NA, 
               size = 0.2, 
               color = "#004CFF") + 
  geom_polygon(aes(x = lon, y = lat, group = group),
               color = "#004CFF", 
               size = 1, 
               fill = NA, 
               data = shape.bezirke) 

p.2 <- sticker(p.2,
               package="streetsofle", 
               s_x = 1, # horizontal position of subplot
               s_y = 1.1, # vertical position of subplot
               s_width = 1.4, # width of subplot
               s_height = 1.4, # height of subplot
               p_x = 1, # horizontal position of font
               p_y = .43, # vertical position of font
               p_size = 6, # font size
               p_color = "#004CFF", # font color
               h_size = 3, # hexagon border size
               h_fill = "#FFCB00", # hexagon fill colour
               h_color = "#004CFF") # hexagon border colour

Finally, both plots are put into a grid layout using the grid.arrange() function of the gridExtra package.

library(gridExtra)
grid.arrange(p.1, p.2, ncol = 2, respect = TRUE)

plot of chunk unnamed-chunk-6

I'm not sure which sticker looks better. What do you think?

How to plot a companion planting guide using ggplot2

Intro

Recently my girl-friend asked me whether I could do some data project being useful for our family rather than just for myself. A couple of years ago – after our first son was born – we decided to rent a small garden (allotment garden) very close to our flat. In our garden we grow some fruit and vegetables, e.g. strawberries, peas, beans and beetroot.

When growing fruit and vegetables it is considered important to know that plants compete for resources. While some plants benefit from one another, other plants sould not be grown together. The internet provides loads of so-called companion planting guides or charts (see Link). For each plant, they define lists of companions (growing well together) and antogonists (not growing well together).

This blog post is to show how to visualize a companion planting guide using R and the ggplot2 package.

Packages and data

With readxl for importing data from Excel, dplyr for data wrangling and ggplot2 for data visualization three R packages are required to reproduce the results of this blog post.

library(readxl)
library(dplyr)
library(ggplot2)

The data I'm going to visualize stem from some companion planting guides published in German (see Link). I only selected the plants relevant for our garden and entered the data manually into an Excel worksheet.

In the first step, I saved the imported data into a data frame (mydata.1). The second data frame mydata.2 is a copy of mydata.1 with the first two variables in reverse order. In order to receive a matrix (with the same number of factor levels in each column), I merged mydata.1 and mydata.2 into mydata using the rbind() function.

mydata.1 <- readxl::read_excel('guide.xlsx', sheet = 1) 
mydata.2 <- mydata.1[, c(2, 1, 3)]
colnames(mydata.2) <- colnames(mydata.1)
mydata <- rbind(mydata.1, mydata.2)
rm(list = c(setdiff(ls(), c('mydata'))))
head(mydata, 10)

## # A tibble: 10 × 3
##          plant_1  plant_2    status
##            <chr>    <chr>     <chr>
## 1       cucumber     peas companion
## 2           corn     peas companion
## 3         radish     peas companion
## 4           peas cucumber companion
## 5           corn cucumber companion
## 6       beetroot cucumber companion
## 7           corn potatoes companion
## 8       potatoes  cabbage companion
## 9         radish  cabbage companion
## 10 strawberries   lettuce companion

Finally, I removed all objects from workspace not required for data visualization and printed the first ten lines of the tibble (slightly modified data frame) mydata.

Wrangling

With the following code snippet, all variables of mydata are transformed into factors. Furthermore, a parameter needed for plotting is saved as a vector named labs.n.

mydata <- dplyr::mutate_all(mydata, as.factor)
labs.n <- length(levels(mydata$plant_1)) + .5

Plotting

Finally, we plot the matrix using the ggplot2 package. In order to adjust the position of the grid lines, we remove to default grid lines with panel.grid.major = element_blank() and panel.grid.minor = element_blank() and draw new grid lines with geom_vline() and geom_hline(). The new grid lines are in accordance with the boundaries of the tiles.

ggplot(mydata, aes(x = plant_1, y = plant_2, fill = status)) + 
  theme_grey() +
  coord_equal() +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        legend.position = 'bottom') +
  geom_vline(xintercept = seq(0.5, labs.n, 1), color='white') +
  geom_hline(yintercept = seq(0.5, labs.n, 1), color='white') +
  scale_fill_manual('', values = c("red3", "green3")) +
  labs(x='', 
       y='',
       title = "Companion Planting Guide")

plot of chunk plotting

The interpretation of the companion planting guide is very easy: While green tiles signal companion plants that can be grown close to each other, red tiles flag antagonist plants that should not be grown together. The squares are gray when the plants are neither companions nor antagonists.

How to plot animated maps with gganimate

Intro

Leipzig is the largest city in the federal state of Saxony, Germany. Since about 2009, Leipzig is one of the fastest growing German cities. Between 2006 and 2016, the population has grown from about 506.000 to 580.000 inhabitants.The largest part of the population growth can be attributed to inward migration.

In this blog post, I show how to create an animated map of Leipzig visualizing the migration balances for each of Leipzig's 63 districts (“Ortsteile”).

The data

The data I will visualize in this blog post are provided by the local Statistical Bureau. They can be downloaded as .csv file from the following website. The shapefile I use, is not publicly available and can be purchased for about 25 Euro from the Statistical Bureau.

R packages

The following code will install load and / or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(readr, rgdal, ggplot2, reshape2)
pacman::p_load_current_gh("dgrtwo/gganimate")

While the packages readr and rgdal are required for importing the .csv and the shapefile, ggplot2 and gganimate are needed to plot and animate the data. The reshape2 package is required for transforming data frames from wide to long format.

After importing the csv file containing the migration data, we add a key variable (id) to identify each of Leipzig's 63 districts. Using the melt-function of the reshape2 package, we reformat the data frame from wide into long form.

# import csv file
df.migrat <- read_csv2("./Data_LE/Bevölkerungsbewegung_Wanderungen_Wanderungssaldo.csv",
                      col_types = cols(`2015` = col_number()),
                      n_max = 63,
                      locale = locale(encoding = "latin1"),
                      trim_ws = TRUE)
# add id variable
df.migrat$id <- c(1:63)
# drop variables not needed
df.migrat <- df.migrat[,c(1,18,2:17)]
# change name of first variable
colnames(df.migrat)[1] <- 'district'
# convert from wide to long
df.migrat <- reshape2::melt(df.migrat, id.vars=c("district","id"))
# change name of third variable
colnames(df.migrat)[3] <- 'year'

With the next code snippet, we import the shapefile using the readOGR-function of the rgdal package. The object we receive (sh.file) is of class SpatialPolygonsDataFrame. After adding a key variable, we transform this object into an ordinary data frame using the fortify-function of the ggplot2 package. Furthermore, we join both data frames by the key variable id. Before we plot the data, we create a new variable (co) categorizing the migration balance variable (value).

# import shapefile
sh.file <- readOGR("ot.shp", layer="ot")
# add id variable
sh.file$id <- c(1:63)
# Create a data frame
sh.file <- fortify(sh.file, region = 'id')
# Merge data frames by id variable
mydata <- merge(sh.file, df.migrat, by='id')
# create a categorized variable
mydata$co <- ifelse(mydata$value >= 1000, 4,
                   ifelse(mydata$value >= 0, 3,
                                ifelse(mydata$value >= -1000, 2, 1)))
# define levels and labels
mydata$co  <- factor(mydata$co, levels = c(4:1),
                     labels = c('> 1000', '0-1000', '-1000-0', '< -1000'))

Moreover, we define a theme suitable for plotting maps with ggpot2. The following theme is a customized version of a theme I found on Stack Overflow.

theme_opts <- list(theme(panel.grid.minor = element_blank(),
                         panel.grid.major = element_blank(),
                         panel.background = element_blank(),
                         plot.background = element_blank(),
                         panel.border = element_blank(),
                         axis.line = element_blank(),
                         axis.text.x = element_blank(),
                         axis.text.y = element_blank(),
                         axis.ticks = element_blank(),
                         axis.title.x = element_blank(),
                         axis.title.y = element_blank(),
                         legend.position="right",
                         plot.title = element_text(size=16)))

Plotting the data

Finally, we can plot the data. The first plot (static) visualizes the migration data for the whole period of time (2000–2015).

ggplot(mydata, aes(x = long, y = lat, group = group, fill=value)) +
  theme_opts + 
  coord_equal() +
  geom_polygon(color = 'grey') +
  labs(title = "Migration balance between 2000 and 2015",
       subtitle = "Static Map",
       caption = 'Data source: Stadt Leipzig, Amt für Statistik und Wahlen, 2015') +
  scale_fill_distiller(name='Migration\nbalance', direction = 1, palette='RdYlGn')

plot of chunk map

For creating an animated plot, we add the frame argument stemming from the gganimate package. With interval = 5 we define the time (in seconds) between the “slides”.

p <- ggplot(mydata, aes(x = long, y = lat, group = group, fill=value, frame=year)) +
  theme_opts + 
  coord_equal() +
  geom_polygon(color = 'grey') +
  labs(title = "Migration balance in year: ",
       subtitle = "Animated Map",
       caption = 'Data source: Stadt Leipzig, Amt für Statistik und Wahlen, 2015') +
  scale_fill_distiller(name='Migration\nbalance', direction = 1, palette='RdYlGn')

gg_animate(p, interval = 5)

Text mining PubMed using R – a case study

Intro

For a couple of years, I've been doing research in the field of psychooncology, which basically is “concerned both with the effects of cancer on a person's psychological health as well as the social and behavioral factors that may affect the disease process of cancer and/or the remission of it.” The most influential psycho-oncological journal is named “Psycho-Oncology”. The journal was founded in 1992 and is published monthly by Wiley-Blackwell.

Data

In this blog post, I'll play arround with some bibliographic data from the PubMed data base. With the following query, we receive bibliographical information for every paper published in Psycho-Oncology from 1997 up today.

("Psycho-oncology"[Jour])

As the following screen shot shows, our search query returned 2785 entries. The search results can be downloaded and saved into a .csv file by clicking on the arrow right beside Send to, choosing destination (File) and format (CSV).

Fig. 1: Screenshot from PubMed

R packages

The following code will install load and / or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(pacman::p_load(readr, stringr, ggplot2, qdap, qdapRegex))

Import

The data can be easily imported into R using the read_csv() function from the readr package.

# import .csv file
mydata <- read_csv("pubmed_result.csv")

Data manipulation

With the next code snippet, we do some data manipulation in order to make our data frame tidy. We start with removing the last row of the data frame containing data about a paper published in 1995, remove superfluous headlines, entries without author and select the variables we need for analysis. The str_detect() function stems from the stringr package.

# remove only article from 1995
mydata <- mydata[1:nrow(mydata)-1,]
# remove headlines
mydata <- subset(mydata, str_detect(URL, 'URL')==FALSE)
# remove entries without author
mydata <- subset(mydata, str_detect(Description, 'Author A.')==FALSE)
mydata <- subset(mydata, str_detect(Description, '\\[No authors listed\\]')==FALSE)
# Select variables
mydata <- mydata[, c('Title', 'Description', 'ShortDetails')]

In the next step, we extract the year of publication from the ShortDetails variable. Based on the year variable, we create a variable named time merging the years to four periods of time (five years sub-periods between 1998 and 2016). Furthermore, we calculate the number of authors for each paper using the str_count function of the stringr package. Since each author except for the last is followed by a comma, we just need to count the number of commas and add 1.

# Return Year of publication
mydata$Year <- as.numeric(str_sub(mydata$ShortDetails, 
                                  nchar(mydata$ShortDetails)-3, 
                                  nchar(mydata$ShortDetails)))
mydata$ShortDetails <- NULL
colnames(mydata) <- c('title', 'authors', 'year')
mydata$time <- ifelse(mydata$year <= 2001, 1,
                      ifelse(mydata$year >= 2002 & mydata$year <=2006, 2,
                             ifelse(mydata$year >= 2007 & mydata$year <=2011, 3, 4)))
mydata$year <- as.factor(mydata$year)
# Categorize years
mydata$time <- factor(mydata$time, levels = c(1:4),
                      labels = c('1997-2001', '2002-2006', '2007-2011', '2012-2016'))
# Calculate number of authors
mydata$auth.n <- str_count(mydata$authors, '\\,') + 1

Before we can plot these data, we need to create 4 more data frames:

df.pup containing the number of papers published each year between 1997 and 2006,
df.auth.1 containing the mean number of authors per paper and year,
df.auth.2 containing the mean number of authors per paper and time period,
df.auth.3 containing the maximum number of authors per paper and year,

df.pup <- data.frame(table(mydata$year))
df.auth.1 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$year), mean))
df.auth.2 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$time), mean))
df.auth.3 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$year), max))

Now, we can plot the data using the ggplot2 package.

ggplot(df.pup, aes(x=Var1, y=Freq, fill=0-Freq)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=Freq), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, max(df.pup$Freq)+10), breaks = seq(0, max(df.pup$Freq)+10, by=20)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 2: Number of papers published each year",
       caption = "Data source: PubMed")

plot of chunk ggplot1

Except for a few years, there has been a continuous rise in the number of publications. Since January 2013, Psycho-Oncology has been available exclusively online. Thus, the number of annual papers published is not restricted by requirements of the printed version anymore. Consequently, there was an enormous rise in the number of publications in 2013. However, in the following year (2014) the number of publications declined quite remarkably. The reason for this decline may be a lack of submissions.

ggplot(df.auth.1, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, 10), breaks = seq(0,10, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 3: Mean number of authors per paper and year",
       caption = "Data source: PubMed")

plot of chunk ggplot2

According to figure 3 and 4, the number of authorships per article has been rising over the whole time period. This is in accordance with the general trend towards a rise in the number of personal authors per citation.

ggplot(df.auth.2, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, 10), breaks = seq(0,10, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 4: Mean number of authors per paper and time period",
       caption = "Data source: PubMed")

plot of chunk ggplot3

Between 1997 and 2016, the mean number of authors per publication has increased from about four to six. The enormous rise in the maximum number of authors (see Fig. 5) supports the idea of an inflation in the number of authors per paper.

ggplot(df.auth.3, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, max(df.auth.3$x)+5), breaks = seq(0, max(df.auth.3$x)+5, by=5)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 5: Max number of authors per paper and year",
      caption = "Data source: PubMed")

plot of chunk ggplot4

Text mining

Before we analyze the data, some janitorial work needs to be done.
We start with (1) defining a list of stopwords, (2, 3) remove non-word characters, (4, 5) replace spaces in two word groups we want to be grouped together, and (6) remove the defined stopwords from the variable containing the citation titles.

sw <- c(Top200Words, 'among') # qdap
mydata$txt <- rm_non_words(mydata$title) # qdapRegex
mydata$txt <- str_replace_all(mydata$txt, "'", "") # stringr
keeps <- c('quality of life', 'health related')
mydata$txt <- space_fill(mydata$txt, keeps, sep = "~~") # qdap
mydata$txt <- rm_stop(mydata$txt, sw,
                      separate = FALSE) # qdap

In the next step, we generate a distribution table for all words of the citation titles. The resulting data frame (df.words) consits of three variables: interval (citation title terms), freq (term frequency), and percent (percentage of the term).

df.words <- dist_tab(breaker(mydata$txt)) # qdap
df.words <- df.words[,c(1, 2, 4)]
df.words <- subset(df.words, percent > 0.5)
df.words <- df.words[order(df.words$percent),] 
df.words$interval <- factor(df.words$interval, 
                            levels=df.words$interval,
                            labels=df.words$interval, 
                            ordered=TRUE) 
kable(head(df.words))

	interval	freq	percent
666	coping	141	0.51
2407	prostate	140	0.51
789	depression	143	0.52
1335	health	147	0.53
2678	review	150	0.54
2190	patient	154	0.56

With the next R code we create a list l.txt_time containing four sub lists.

# group titles by time period
l.txt_time <- by(mydata$txt, mydata$time, breaker)
str(l.txt_time)

## List of 4
##  $ 1997-2001: chr [1:2160] "relationship" "between" "trait" "anxiety" ...
##  $ 2002-2006: chr [1:3608] "measurement" "coping" "stress" "responses" ...
##  $ 2007-2011: chr [1:7310] "novel" "stop" "multidisciplinary" "clinic" ...
##  $ 2012-2016: chr [1:14606] "relationship" "between" "objectively" "measured" ...
##  - attr(*, "dim")= int 4
##  - attr(*, "dimnames")=List of 1
##   ..$ mydata$time: chr [1:4] "1997-2001" "2002-2006" "2007-2011" "2012-2016"
##  - attr(*, "call")= language by.default(data = mydata$txt, INDICES = mydata$time, FUN = breaker)
##  - attr(*, "class")= chr "by"

The following code snippet is to convert these four lists into separate data frames, each containing the most frequent citation title terms of the different periods of time.

# time span 1997-2001
df.t1 <- dist_tab(l.txt_time[[1]])
df.t1 <- df.t1[,c(1, 2, 4)]
df.t1 <- subset(df.t1, percent > 0.2)
df.t1 <- df.t1[order(df.t1$percent),] 
df.t1$interval <- factor(df.t1$interval, 
                         levels=df.t1$interval,
                         labels=df.t1$interval, 
                         ordered=TRUE) 
df.t1$year <- levels(mydata$time)[1]
# time span 2002-2006
df.t2 <- dist_tab(l.txt_time[[2]])
df.t2 <- df.t2[,c(1, 2, 4)]
df.t2 <- subset(df.t2, percent > 0.2)
df.t2 <- df.t2[order(df.t2$percent),] 
df.t2$interval <- factor(df.t2$interval, levels=df.t2$interval,
                         labels=df.t2$interval, ordered=TRUE) 
df.t2$year <- levels(mydata$time)[2]
# time span 2007-2011
df.t3 <- dist_tab(l.txt_time[[3]])
df.t3 <- df.t3[,c(1, 2, 4)]
df.t3 <- subset(df.t3, percent > 0.3)
df.t3 <- df.t3[order(df.t3$percent),] 
df.t3$interval <- factor(df.t3$interval, levels=df.t3$interval,
                         labels=df.t3$interval, ordered=TRUE) 
df.t3$year <- levels(mydata$time)[3]
# time span 2002-2016
df.t4 <- dist_tab(l.txt_time[[4]])
df.t4 <- df.t4[,c(1, 2, 4)]
df.t4 <- subset(df.t4, percent > 0.3)
df.t4 <- df.t4[order(df.t4$percent),] 
df.t4$interval <- factor(df.t4$interval, levels=df.t4$interval,
                         labels=df.t4$interval, ordered=TRUE) 
df.t4$year <- levels(mydata$time)[4]

Eventually, we create a data frame (df.time) containing the same variables as the data frame df.words and a variable (year) defining the period of time the paper was published.

# merging
df.time <- rbind(df.t1, df.t2, df.t3, df.t4)
df.time <- merge(df.time, df.words, by='interval')
df.time <- df.time[,1:4]
colnames(df.time) <- c('term', 'freq', 'percent', 'year')
df.time$year <- as.factor(df.time$year)

Finally, we plot the most frequent citation title terms over four periods of time.

ggplot(df.time, aes(x=term, y=percent, fill=year)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(percent,1)), size = 4, 
            fontface=2, hjust =  -0.3, vjust = 0.4) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', 
                     limits=c(0, max(df.time$percent)+1), 
                     breaks = seq(0, max(df.time$percent)+1, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  coord_flip() +
  facet_grid(. ~ year) +
  labs(title = "Fig. 6: Most frequent terms in time",
       caption = "Data source: PubMed")

plot of chunk ggplot5

Above all, the terms survivors and distress were increasingly used over time which reflects a growing interest in so-called psychological distress (for a critical discussion see) and cancer survivorship.

How to add a background image to ggplot2 graphs

When producing so called infographics, it is rather common to use images rather than a mere grid as background. In this blog post, I will show how to use a background image with ggplot2.

Packages required

The following code will install load and / or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(jpeg, png, ggplot2, grid, neuropsychology)

Choosing the data

The data set I will be using in this blog post is named diamonds and part of the ggplot2 package. It contains information about – surprise, surprise – diamonds, e.g. price and cut (Fair, Good, Very Good, Premium, Ideal). Using the tapply-function, we create a table returning the maximum prices per cut. Since we need the data to be organized in a data frame, we must transform the table using the data.frame-function.

mydata <- data.frame(price = tapply(diamonds$price, diamonds$cut, max))
mydata$cut <- rownames(mydata)

cut	price
Fair	18574
Good	18788
Very Good	18818
Premium	18823
Ideal	18806

Importing the background image

The file format of the background image we will be using in this blog post is JPG. Since the image imitates a blackboard, we name it “blackboard.jpg”. The image file must be imported using the readJPEG-function of the jpeg package. The imported image will be saved into an object named image.

imgage <- jpeg::readJPEG("blackboard.jpg")

To import other image file formats, different packages and functions must be used. The next code snippet shows how to import PNG images.

image <- png::readPNG("blackboard.png")

Drawing the plot

In the next step, we actually draw a bar chart with a backgriund image. To make blackboard.jpg the background image, we need to combine the annotation_custom-function of the ggplot2 package and the rasterGrob-function of the grid package.

ggplot(mydata, aes(cut, price, fill = -price)) +
  ggtitle("Bar chart with background image") +
  scale_fill_continuous(guide = FALSE) +
  annotation_custom(rasterGrob(imgage, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                               -Inf, Inf, -Inf, Inf) +
  geom_bar(stat="identity", position = "dodge", width = .75, colour = 'white') +
  scale_y_continuous('Price in $', limits = c(0, max(mydata$price) + max(mydata$price) / 4)) +
  scale_x_discrete('Cut') +
  geom_text(aes(label = round(price), ymax = 0), size = 7, fontface = 2, 
            colour = 'white', hjust = 0.5, vjust = -1)

plot of chunk plot1

Adding opacity

Using the specification alpha = 0.5, we add 50% opacity to the bars. alpha ranges between 0 and 1, with higher values indicating greater opacity.

ggplot(mydata, aes(cut, price, fill = -price)) +
  theme_neuropsychology() +
  ggtitle("Bar chart with background image") +
  scale_fill_continuous(guide = FALSE) +
  annotation_custom(rasterGrob(imgage, 
                               width = unit(1,"npc"), 
                               height = unit(1,"npc")), 
                               -Inf, Inf, -Inf, Inf) +
  geom_bar(stat="identity", position = "dodge", width = .75, colour = 'white', alpha = 0.5) +
  scale_y_continuous('Price in $', limits = c(0, max(mydata$price) + max(mydata$price) / 4)) +
  scale_x_discrete('Cut') +
  geom_text(aes(label = round(price), ymax = 0), size = 7, fontface = 2, 
            colour = 'white', hjust = 0.5, vjust = -1)

plot of chunk plot2

The recently published R package neuropsychology contains a theme named theme_neuropsychology(). This theme may be used to get bigger axis titles as well as bigger axis and legend text.

Intro

The function

Function call with default parameters

Customization

Share this:

Intro

Share this:

Preample

Introduction

While Loop

Repeat Loop

Storing the Results

Plotting Power and Sample Size

Share this:

Intro

Packages and Data

Drawing the Circles

The final Plot

Share this:

Intro

Packages

Import

Plotting

References

Share this:

Intro

Linux packages required

R packages required

Plotting

Share this:

Intro

Packages and data

Wrangling

Plotting

Share this:

Intro

The data

R packages

Plotting the data

Share this:

Intro

Data

Fig. 1: Screenshot from PubMed

R packages

Import

Data manipulation

Text mining

Share this:

Packages required

Choosing the data

Importing the background image

Drawing the plot

Adding opacity

Share this: