Scripts & Statistics

How to combine box and jitter plots using R and ggplot2

R makes it easy to combine different kinds of plots into one overall graph. This may be useful to visualize both basic measures of central tendency (median, quartiles etc.) and the distribution of a certain variable. Moreover, so-called cut-off values can be added to the graph.

In this blog post, I show how to combine box and jitter plots using the ggplot2 package.

First of all, we need to install and load the R packages required for the following steps. Since we want to do the installation and loading using the pacman package, we need to check whether this package has been installed already. If not, it will be installed and loaded. If yes, it will just be loaded (line 1). Furthermore we need the R packages ggplot2 and Hmisc. This time, the p_load function checks whether these packages have been installed already and either installs and loads or just loads them (line 2).

if (!require("pacman")) install.packages("pacman")
pacman::p_load(ggplot2, Hmisc)

In a second step, we create three random variables (var.scale, var.group, var.cutoff) with n=300.

var.scale is a numeric variable with a mean value of about 50 and a standard deviation of about 17.
var.group is a factor variable comprising the groups male and female.
var.cutoff was calculated based on var.scale using predefined cut-off values (0–40 = low, 41–60 = medium, >60 = high).

var.scale <- round(rnorm(300, 50, 17))
var.group <- rbinom(300, 1, .5)
var.group <- factor(var.group, 
                     levels = c(0:1), 
                     labels = c("male", "female"))

var.cutoff <- ifelse(var.scale <= 40, 1, 
                     ifelse(var.scale > 40 & var.scale <= 60, 2, 3))

var.cutoff <- factor(var.cutoff, 
                     levels = c(3:1), 
                     labels = c("high", "medium", "low"))
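As an aside, the nested ifelse() calls can also be written with base R's cut() function. The following is only a sketch, assuming the cut-off values listed above; the names scale.toy and cutoff.toy are made up for this illustration and are not part of the original script.

```r
set.seed(1)
scale.toy <- round(rnorm(300, 50, 17))

# Intervals are right-closed by default: (-Inf, 40], (40, 60], (60, Inf)
cutoff.toy <- cut(scale.toy,
                  breaks = c(-Inf, 40, 60, Inf),
                  labels = c("low", "medium", "high"))

# Reverse the level order so that it matches the factor built above
cutoff.toy <- factor(cutoff.toy, levels = c("high", "medium", "low"))

table(cutoff.toy)
```

With cut(), adding a further category only requires one more break and one more label.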

The describe() function of the Hmisc package returns some basic measures of central tendency.

Hmisc::describe(var.scale)

## var.scale 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     300       0      71       1   51.25   24.00   30.90   41.00   50.00 
##     .75     .90     .95 
##   63.25   70.00   76.00 
## 
## lowest :   8  10  14  16  17, highest:  85  97 100 102 104

Hmisc::describe(var.group)

## var.group 
##       n missing  unique 
##     300       0       2 
## 
## male (141, 47%), female (159, 53%)

Hmisc::describe(var.cutoff)

## var.cutoff 
##       n missing  unique 
##     300       0       3 
## 
## high (87, 29%), medium (141, 47%), low (72, 24%)

Since the ggplot2 package requires the variables to be in a data frame, we have to create a new data frame df comprising our predefined variables using the data.frame() function.

df <- data.frame(var.scale, var.cutoff, var.group)

Using the functions xlab(), ylab() and ggtitle(), axis labels and plot title will be defined.

Box plots will be created using the geom_boxplot() function, with width specifying the boxes' width :-).

Jitter plots will be created using the geom_jitter() function. In addition, specifications have been made for colour and position and size of the dots.

ggplot(df) +
  xlab("Group") +
  ylab("Scale") +
  ggtitle("Combination of Box and Jitter Plot") + 
  geom_boxplot(aes(var.group, var.scale), 
               width=0.5) + 
  geom_jitter(aes(var.group, var.scale, colour = var.cutoff), 
              position = position_jitter(width = .15, height = 0), # no vertical jitter
              size=2) +
  scale_y_continuous(limits=c(0, 101), 
                     breaks = seq(0, 110, 10)) +
  scale_color_manual(name="Legend", 
                     values=c("red", "blue3", "green3"))


Finally, both the y-axis and the legend are formatted using the functions scale_y_continuous() and scale_color_manual().

How to use R for matching samples (propensity score)

According to Wikipedia, propensity score matching (PSM) is a “statistical matching technique that attempts to estimate the effect of a treatment, policy, or other intervention by accounting for the covariates that predict receiving the treatment”. In a broader sense, propensity score analysis assumes that an unbiased comparison between samples can only be made when the subjects of both samples have similar characteristics. Thus, PSM can not only be used as “an alternative method to estimate the effect of receiving treatment when random assignment of treatments to subjects is not feasible” (Thavaneswaran 2008). It can also be used for the comparison of samples in epidemiological studies. Let's give an example:

Health-related quality of life (HRQOL) is considered an important outcome in cancer therapy. One of the most frequently used instruments to measure HRQOL in cancer patients is the core quality-of-life questionnaire of the European Organisation for Research and Treatment of Cancer (EORTC QLQ-C30). The EORTC QLQ-C30 is a 30-item instrument comprising five functioning scales, nine symptom scales and one scale measuring global quality of life (GQoL). All scales have a score range between 0 and 100. While high scores on the symptom scales indicate a high burden of symptoms, high scores on the functioning scales and on the GQoL scale indicate better functioning and better quality of life, respectively.

However, without having any reference point, it is difficult if not impossible to interpret the scores. Fortunately, the EORTC QLQ-C30 questionnaire was used in several general population surveys. Therefore, patient scores may be compared against scores of the general population. This makes it far easier to decide whether the burden of symptoms or functional impairments can be attributed to cancer (treatment) or not. PSM can be used to make both patient and population samples comparable by matching for relevant demographic characteristics like age and sex.

In this blog post, I show how to do PSM using R. A more comprehensive PSM guide can be found under: A Step-by-Step Guide to Propensity Score Matching in R.

Creating two random dataframes

Since we don't want to use real-world data in this blog post, we need to simulate the data. This can easily be done using the wakefield package.

In a first step, we create a dataframe named df.patients. We want the dataframe to contain specifications of age and sex for 250 patients. The patients' age shall be between 30 and 78 years. Furthermore, 70% of patients shall be male.

set.seed(1234)
df.patients <- r_data_frame(n = 250, 
                            age(x = 30:78, 
                                name = 'Age'), 
                            sex(x = c("Male", "Female"), 
                                prob = c(0.70, 0.30), 
                                name = "Sex"))
df.patients$Sample <- as.factor('Patients')

The summary-function returns some basic information about the dataframe created. As we can see, the mean age of the patient sample is 53.7 and roughly 70% of the patients are male (69.2%).

summary(df.patients)

##       Age            Sex           Sample   
##  Min.   :30.00   Male  :173   Patients:250  
##  1st Qu.:42.00   Female: 77                 
##  Median :54.00                              
##  Mean   :53.71                              
##  3rd Qu.:66.00                              
##  Max.   :78.00

In a second step, we create another dataframe named df.population. We want this dataframe to comprise the same variables as df.patients, but with different specifications. Ranging from 18 to 80 years, the age range of the population shall be wider than in the patient sample, and the proportions of women and men shall be equal.

set.seed(1234)
df.population <- r_data_frame(n = 1000, 
                              age(x = 18:80, 
                                  name = 'Age'), 
                              sex(x = c("Male", "Female"), 
                                  prob = c(0.50, 0.50), 
                                  name = "Sex"))
df.population$Sample <- as.factor('Population')

The following table shows the sample's mean age (49.5 years) and the proportion of men (48.5%) and women (51.5%).

summary(df.population)

##       Age            Sex             Sample    
##  Min.   :18.00   Male  :485   Population:1000  
##  1st Qu.:34.00   Female:515                    
##  Median :50.00                                 
##  Mean   :49.46                                 
##  3rd Qu.:65.00                                 
##  Max.   :80.00

Merging the dataframes

Before we match the samples, we need to merge both dataframes. Based on the variable Sample, we create a new variable named Group (type logical) and a further variable (Distress) containing information about the individuals' level of distress. The Distress variable is created using the age() function of the wakefield package. As the code shows, women are assigned higher levels of distress on average.

mydata <- rbind(df.patients, df.population)
mydata$Group <- as.logical(mydata$Sample == 'Patients')
mydata$Distress <- ifelse(mydata$Sex == 'Male', age(nrow(mydata), x = 0:42, name = 'Distress'),
                                                age(nrow(mydata), x = 15:42, name = 'Distress'))

When we compare the distribution of age and sex in both samples, we discover significant differences:

pacman::p_load(tableone)
table1 <- CreateTableOne(vars = c('Age', 'Sex', 'Distress'), 
                         data = mydata, 
                         factorVars = 'Sex', 
                         strata = 'Sample')
table1 <- print(table1, 
                printToggle = FALSE, 
                noSpaces = TRUE)
kable(table1[,1:3],  
      align = 'c', 
      caption = 'Table 1: Comparison of unmatched samples')

	Patients	Population	p
n	250	1000
Age (mean (sd))	53.71 (13.88)	49.46 (18.33)	0.001
Sex = Female (%)	77 (30.8)	515 (51.5)	<0.001
Distress (mean (sd))	22.86 (11.38)	25.13 (11.11)	0.004

Furthermore, the level of distress seems to be significantly higher in the population sample.

Matching the samples

Now that we have completed the preparation and inspection of the data, we are going to match the two samples using the matchit() function of the MatchIt package. The argument method = "nearest" specifies that the nearest neighbor method will be used. Other matching methods are exact matching, subclassification, optimal matching, genetic matching, and full matching (method = c("exact", "subclass", "optimal", "genetic", "full")). The argument ratio = 1 indicates a one-to-one matching approach. With regard to our example, each case in the patient sample will be matched with exactly one case in the population sample. Please also note that the Group variable needs to be logical (TRUE vs. FALSE).

set.seed(1234)
match.it <- matchit(Group ~ Age + Sex, data = mydata, method="nearest", ratio=1)
a <- summary(match.it)

For further data presentation, we save the output of the summary-function into a variable named a.
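To make the idea behind nearest neighbor matching more tangible, here is a rough base R sketch of the two steps involved: estimating propensity scores with a logistic regression and then greedily pairing each treated case with the closest untreated case. This is an illustration under simplifying assumptions, not the implementation used by MatchIt (which differs in details such as the order in which cases are matched). All object names (toy, ps.model, toy.match) and the simulated values are hypothetical.

```r
set.seed(1)
n.toy <- 200
toy <- data.frame(
  Group = rep(c(TRUE, FALSE), times = c(40, 160)),
  Age   = c(round(rnorm(40, 55, 10)), round(rnorm(160, 48, 15))),
  Sex   = factor(sample(c("Male", "Female"), n.toy, replace = TRUE))
)

# Step 1: estimate the propensity score (probability of belonging to the treated group)
ps.model <- glm(Group ~ Age + Sex, data = toy, family = binomial)
toy$distance <- fitted(ps.model)

# Step 2: greedy 1:1 matching without replacement on the propensity score
treated  <- which(toy$Group)
controls <- which(!toy$Group)
matched  <- integer(0)
for (i in treated) {
  gaps     <- abs(toy$distance[controls] - toy$distance[i])
  best     <- which.min(gaps)
  matched  <- c(matched, controls[best])
  controls <- controls[-best]   # each control can be used only once
}

toy.match <- toy[c(treated, matched), ]
table(toy.match$Group)
```

In practice, the matchit() call shown above should be preferred; the sketch merely shows where the distance values reported later come from.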

After matching the samples, the size of the population sample was reduced to the size of the patient sample (n=250; see table 2).

kable(a$nn, digits = 2, align = 'c', 
      caption = 'Table 2: Sample sizes')

	Control	Treated
All	1000	250
Matched	250	250
Unmatched	750	0
Discarded	0	0

The following output shows that the distributions of the variables Age and Sex are nearly identical after matching.

kable(a$sum.matched[c(1,2,4)], digits = 2, align = 'c', 
      caption = 'Table 3: Summary of balance for matched data')

	Means Treated	Means Control	Mean Diff
distance	0.23	0.23	0.00
Age	53.71	53.65	0.06
SexMale	0.69	0.69	0.00
SexFemale	0.31	0.31	0.00

The distributions of propensity scores can be visualized using the plot() function, which is part of the MatchIt package.

plot(match.it, type = 'jitter', interactive = FALSE)


Saving the matched samples

Finally, the matched samples will be saved into a new dataframe named df.match.

df.match <- match.data(match.it)[1:ncol(mydata)]
rm(df.patients, df.population)

Finally, we can check whether the differences in the level of distress between both samples are still significant.

pacman::p_load(tableone)
table4 <- CreateTableOne(vars = c('Age', 'Sex', 'Distress'), 
                         data = df.match, 
                         factorVars = 'Sex', 
                         strata = 'Sample')
table4 <- print(table4, 
                printToggle = FALSE, 
                noSpaces = TRUE)
kable(table4[,1:3],  
      align = 'c', 
      caption = 'Table 4: Comparison of matched samples')

	Patients	Population	p
n	250	250
Age (mean (sd))	53.71 (13.88)	53.65 (13.86)	0.961
Sex = Female (%)	77 (30.8)	77 (30.8)	1.000
Distress (mean (sd))	22.86 (11.38)	24.13 (11.88)	0.222

With a p-value of 0.222, Student's t-test no longer indicates a significant difference. Thus, PSM helped to avoid a type I error (alpha error).

PS 1: The packages used in this blog post can be loaded/installed using the following code:

pacman::p_load(knitr, wakefield, MatchIt, tableone, captioner)

PS 2: Thanks very much to my colleague Katharina Kuba for telling me about the MatchIt package.

How to parse Evernote export files (.enex) using R

Evernote is a “cross-platform […] app designed for note taking, organizing, and archiving” (Wikipedia). All notes can be tagged and exported. I'm using Evernote, above all, to save and tag interesting blog posts related to R.


In this blog post, I show how to import and parse an exported Evernote file with R.

Exporting the data from Evernote

In a first step, I've exported all of my notes tagged with 'R':

Open the Evernote client;
Select all notes to be exported;
Go to 'File' > 'Export';
Select option 'Export as a file in ENEX format (.enex)' from the format options box;
Name the file 'Evernote.enex' and save it into your RStudio project folder.

Importing the data into R

Since the '.enex' file is XML-based, the 'Evernote.enex' file can be imported using the XML package. Because of its structure, the imported file cannot be transformed into a dataframe right away. Instead, we need to transform it into a list (using the XML::xmlToList function).

library(XML)
xmlfile <- xmlParse("Evernote.enex")
xmllist <- xmlToList(xmlfile, addAttributes = FALSE)

In the following section, I show how to create a dataframe based on the xmllist object.

Building a data frame

First, we generate an empty data frame. The number of rows (262) is determined by the number of elements in the xmllist object and the number of columns is set to zero.

mydata <- data.frame(matrix(NA, ncol = 0, nrow = length(xmllist)))
dim(mydata)

[1] 262 0

Second, we read the note titles and save them into a variable called title, which is part of our data frame mydata.

for (i in 1:length(xmllist)){
  mydata$title[i] <- unlist(xmllist[[i]]['title'])
}

head(mydata$title, 10)

[1] "Network visualization in R with the igraph package | Rules of Reason"
[2] "More debate analysis with R"
[3] "Analyzing networks of characters in 'Love Actually' – Variance Explained"
[4] "Web scraping in R"
[5] "Color Quantization in R"
[6] "Zellingenach: A visual exploration of the spatial patterns in the endings of German town and village names in R | rud.is"
[7] "Waterfall plots – what and how?"
[8] "Sentiment Analysis on Donald Trump using R and Tableau | DataScience+"
[9] "Version 0.9 of timeline on CRAN"
[10] "Date Formats in R"

In the next step, we obtain the dates the notes were created. In order to get a variable of class Date, the variable 'created' must be formatted. Using the stringr package, we extract year, month and day and save them into the same variable.

for (i in 1:nrow(mydata)){
  mydata$created[i] <- xmllist[[i]]['created']
}


mydata$created <- as.Date(paste0(stringr::str_sub(mydata$created, 1, 4), 
                                 '-', 
                                 stringr::str_sub(mydata$created, 5, 6), 
                                 '-',
                                 stringr::str_sub(mydata$created, 7, 8)))

head(mydata$created, 5)

[1] "2016-01-06" "2016-01-06" "2016-01-05" "2016-01-05" "2016-01-04"
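The same conversion can probably be done without stringr by handing a format string to as.Date(). The sketch below assumes that the first eight characters of the timestamp hold year, month and day (as in the extraction above); the example strings in created.toy are made up.

```r
# Hypothetical timestamps with year, month and day in the first eight characters
created.toy <- c("20160106T101500Z", "20160105T184233Z")

# Keep the date part and let as.Date() parse it directly
dates.toy <- as.Date(substr(created.toy, 1, 8), format = "%Y%m%d")
dates.toy
```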

Furthermore, the http addresses of the notes can be read like this:

for (i in 1:nrow(mydata)){
  mydata$www[i] <- xmllist[[i]]['note-attributes']
}

mydata$www <- unlist(qdapRegex::ex_url(mydata$www,
                        trim=TRUE,
                        clean=TRUE,
                        extract=TRUE))

mydata$www <- stringr::str_sub(mydata$www, 1, nchar(mydata$www)-2)

head(mydata$www)

[1] "https://rulesofreason.wordpress.com/2012/11/05/network-visualization-in-r-with-the-igraph-package/"
[2] "http://www.r-bloggers.com/more-debate-analysis-with-r/"
[3] "http://varianceexplained.org/r/love-actually-network/"
[4] "http://cpsievert.github.io/slides/web-scraping/#1"
[5] "http://blog.ryanwalker.us/2016/01/color-quantization-in-r.html"
[6] "http://rud.is/b/2016/01/03/zellingenach-a-visual-exploration-of-the-spatial-patterns-in-the-endings-of-german-town-and-village-names-in-r/"

Finally, we want to read the tags and save them into a variable. Since the number of tags differs between the notes, we have to assess the number of tags for each note:

# number of tags
for (i in 1:nrow(mydata)){
  mydata$num.tag[i] <- length(which(names(xmllist[[i]])=="tag"))
}

head(mydata$num.tag, 20)

[1] 2 2 3 2 2 3 2 5 2 3 3 2 2 3 3 2 2 3 3 3

Since we want to save each tag into a single variable, we need to know the maximum number of tags.

tag.num <- max(mydata$num.tag)
tag.num

[1] 5

With the next code snippet we add three variables to our dataframe: both the position of the first and last tag as numeric variables and a variable (of class list) containing the positions of all tags.

# position of first tag
for (i in 1:nrow(mydata)){
  mydata$pos.1[i] <- which(names(xmllist[[i]])=="tag")[1]
}
# position of last tag
mydata$pos.2 <- mydata$pos.1 + mydata$num.tag - 1
# position of tags
for (i in 1:nrow(mydata)){
  mydata$pos.all[i] <- list(c(mydata$pos.1[i]:mydata$pos.2[i]))
}
# remove pos.1 and pos.2
mydata$pos.1 <- NULL
mydata$pos.2 <- NULL

Since we don't need the variables pos.1 and pos.2 for further processing, we remove them from our dataframe.

In the next step, we create 5 empty variables that will later on contain the tag names.

# create 5 new columns
num.col <- ncol(mydata) 
for (i in (ncol(mydata) + 1):(ncol(mydata) + tag.num)){
  mydata[, i] <- NA
  colnames(mydata)[i] <- paste0('tag.', i - num.col)
}

The following code snippet is intended to write the tag names into the variables tag.1 to tag.5.

for (j in (num.col + 1):ncol(mydata)){
  for (i in 1:nrow(mydata)){
    mydata[i, j]  <- xmllist[[i]][mydata$pos.all[[i]][j - num.col]][[1]]
  }}

However, evaluating the code returns the following error message:

Error in '[<-.data.frame'('*tmp*', i, j, value = NULL) : replacement has length zero

Has anybody got an idea how to get the preceding code snippet working? I'd appreciate every piece of advice.
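One possible explanation, offered as a sketch rather than a definitive answer: for notes with fewer than five tags, the position looked up in pos.all is NA, indexing a list with NA yields NULL, and NULL cannot be assigned to a data frame cell. The toy example below reproduces this and shows one way around it by padding each tag vector with NA before binding the rows. The list notes.toy is a made-up stand-in for xmllist, and it assumes that tags are stored as repeated elements named "tag".

```r
# Toy stand-in for xmllist (assumption: tags are repeated elements named "tag")
notes.toy <- list(
  list(title = "Note A", tag = "R", tag = "text mining"),
  list(title = "Note B", tag = "R"),
  list(title = "Note C", tag = "R", tag = "geo", tag = "network analysis")
)

# Indexing a list with NA gives NULL, which triggers "replacement has length zero"
is.null(notes.toy[[2]][NA_integer_][[1]])

# Collect the tags of each note as a character vector
tags.toy <- lapply(notes.toy, function(note) {
  as.character(unlist(note[names(note) == "tag"]))
})
tag.max <- max(lengths(tags.toy))

# Indexing beyond the end of a vector pads it with NA, so every row has tag.max entries
tag.mat <- do.call(rbind, lapply(tags.toy, function(x) x[seq_len(tag.max)]))
colnames(tag.mat) <- paste0("tag.", seq_len(tag.max))
tag.mat

# Alternatively, collapse the tags directly, which avoids the NA padding altogether
tags.collapsed <- sapply(tags.toy, paste, collapse = ", ")
tags.collapsed
```

The resulting matrix could then be attached to the data frame with cbind(), or the collapsed strings could be stored in a single column.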

Thus, I decided to write one loop for each of the five variables. This is definitely not best practice, but it works.

# 1st tag
for (i in 1:nrow(mydata)){
  mydata$tag.1[i]  <- xmllist[[i]][mydata$pos.all[[i]][1]][1]
}
# 2nd tag
for (i in 1:nrow(mydata)){
  mydata$tag.2[i]  <- xmllist[[i]][mydata$pos.all[[i]][2]][1]
}
# 3rd tag
for (i in 1:nrow(mydata)){
  mydata$tag.3[i]  <- xmllist[[i]][mydata$pos.all[[i]][3]][1]
}
# 4th tag
for (i in 1:nrow(mydata)){
  mydata$tag.4[i]  <- xmllist[[i]][mydata$pos.all[[i]][4]][1]
}
# 5th tag
for (i in 1:nrow(mydata)){
  mydata$tag.5[i]  <- xmllist[[i]][mydata$pos.all[[i]][5]][1]
}

In the following step, we define a function (source) replacing NULL by NA and apply this function to each of the five tag variables:

# define function
nullToNA <- function(x) {
  x[sapply(x, is.null)] <- NA
  return(x)
}

# apply function
for (i in (num.col+1):ncol(mydata)){
  for (j in 1:nrow(mydata)){
  mydata[j, i] <- nullToNA(mydata[j, i])
}}

Finally, we paste the values of the five tag variables into a single variable named tags. To do this, we use the paste2 function of the qdap package. Since we don't need the variables tag.1 to tag.5 for further processing, we remove them from the dataframe using the select function of the dplyr package.

mydata$tags <- qdap::paste2(mydata[(num.col+1):ncol(mydata)], 
                            sep = ", ", 
                            handle.na = TRUE, 
                            trim = TRUE)

mydata <- dplyr::select(mydata, -starts_with('tag.'))
mydata$pos.all <- NULL

The final dataframe consists of the following variables:

title containing the titles of the notes;
created containing the dates the notes were created;
www containing the notes' http addresses;
num.tag containing the number of tags for each note;
tags containing the tag names.

The following table gives an impression of what our final dataframe looks like.

knitr::kable(head(mydata), align = c('l', 'c', 'l', 'c', 'c'))

title	created	www	num.tag	tags
Network visualization in R with the igraph package \| Rules of Reason	2016-01-06	https://rulesofreason.wordpress.com/2012/11/05/network-visualization-in-r-with-the-igraph-package/	2	network analysis, R, NA, NA, NA
More debate analysis with R	2016-01-06	http://www.r-bloggers.com/more-debate-analysis-with-r/	2	text mining, R, NA, NA, NA
Analyzing networks of characters in 'Love Actually' – Variance Explained	2016-01-05	http://varianceexplained.org/r/love-actually-network/	3	network analysis, text mining, R, NA, NA
Web scraping in R	2016-01-05	http://cpsievert.github.io/slides/web-scraping/#1	2	webscraping, R, NA, NA, NA
Color Quantization in R	2016-01-04	http://blog.ryanwalker.us/2016/01/color-quantization-in-r.html	2	R, image processing, NA, NA, NA
Zellingenach: A visual exploration of the spatial patterns in the endings of German town and village names in R \| rud.is	2016-01-04	http://rud.is/b/2016/01/03/zellingenach-a-visual-exploration-of-the-spatial-patterns-in-the-endings-of-german-town-and-village-names-in-r/	3	text mining, geo, R, NA, NA

The packages used in this blog post can be loaded/installed using the following code:

pacman::p_load(XML, knitr, dplyr, qdap, stringr)

The xmllist object may be downloaded as an .RData file under the following link.

In one of my next blog posts, I will show how to analyse the tags.

RMarkdown: How to format tables and figures in .docx files

In research, we usually publish the most important findings in tables and figures. When writing research papers using Rmarkdown (*.Rmd), we have several options to format the output of the final MS Word document (.docx).
Tables can be formatted using either the knitr package’s kable() function or several functions of the pander package.
Figure sizes can be determined in the chunk options, e.g.

{r name_of_chunk, fig.height=8, fig.width=12}.

However, options for customizing tables and figures are rather limited in Rmarkdown. Thus, I usually customize tables and figures in the final MS Word document.

In this blog post, I show how to quickly format tables and figures in the final MS Word document using a macro. MS Word macros are written in VBA (Visual Basic for Applications) and can be accessed from a menu list or from the toolbar and run with a single click. There are loads of tutorials explaining how to write a macro for MS Word, e.g. http://www.addictivetips.com/microsoft-office/create-macros-in-word-2010/.

The following two macros are very helpful for formatting drafts. Since I want drafts to be as compact as possible, tables and figures should not be too space-consuming.

The first macro, called FormatTables, customizes the format of all tables in the active MS Word document. With wdTableFormatGrid2, we use a table style predefined in MS Word. A list of other table styles can be found under the following link. Furthermore, we define the font name (Arial) and font size (8 pt) as well as the space before (6 pt) and after (10 pt) the table. Finally, the row height is set to exactly 18 pt.

Sub FormatTables()

 Dim tbl As Table
    For Each tbl In ActiveDocument.Tables
         tbl.AutoFormat wdTableFormatGrid2
         tbl.Range.Font.Name = "Arial"
         tbl.Range.Font.Size = 8
         tbl.Range.ParagraphFormat.SpaceBefore = 6
         tbl.Range.ParagraphFormat.SpaceAfter = 10
         tbl.Range.Cells.SetHeight RowHeight:=18, HeightRule:=wdRowHeightExactly

    Next

End Sub

The second macro, called FormatFigures, merely reduces the size of all figures in the active MS Word document to 45% of their original size.

Sub FormatFigures()

Dim shp As InlineShape


For Each shp In ActiveDocument.InlineShapes
    shp.ScaleHeight = 45
    shp.ScaleWidth = 45
Next

End Sub

Please also see my blog post RMarkdown: How to insert page breaks in a MS Word document.

How to install the ‘RWordPress’ package in R

The RWordPress package is a very convenient tool for publishing blog posts from R to WordPress. In his blog post Publish blog posts from R + knitr to WordPress, Yihui Xie explains how to install and use the package. Furthermore, the blog post How to publish with R Markdown in WordPress gives some additional information on how to use the package.

However, the package repository http://www.omegahat.org/R does not seem to exist anymore (2016-08-30).

Fortunately, the RWordPress package is also available from GitHub and, thus, can be easily installed using the devtools package.

Since RWordPress depends on the packages RCurl, XML, and XMLRPC, these packages need to be installed before we can actually install RWordPress.

Unlike RCurl and XML, the XMLRPC package is not available from the CRAN repository. Instead, it is available from GitHub.

Here is the code to install all required packages:

install.packages("devtools")
install.packages("RCurl")
install.packages("XML")
devtools::install_github("duncantl/XMLRPC")
devtools::install_github("duncantl/RWordPress")

Analysis of LEGIDA police reports. Part II: Word frequencies

In Part II of my series on the press releases published by the Leipzig Police Department (Polizeidirektion Leipzig) on the occasion of the demonstrations of the xenophobic LEGIDA movement, I show which words are used most frequently in these reports.

## Warning in readChar(con, 5L, useBytes = TRUE): cannot open compressed file 'C:/
## ProgrammeNK/GDrive/Projects/R/Polizeiberichte/Legida.RData', probable reason
## 'No such file or directory'

## Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection

The word frequency count covers (almost) only content-bearing words. So-called stopwords were excluded from the analysis. The stopword list used can be downloaded via the following link.
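How the objects used for plotting were built is not shown in this post. As a rough idea of what counting word frequencies without stopwords can look like in base R, here is a small sketch. The text in txt.toy and the stopword vector stopwords.toy are made up for this illustration and have nothing to do with the actual police reports or the stopword list mentioned above.

```r
# Made-up example text and stopword list
txt.toy <- "The police report that the rally started at the square and the rally ended at the station."
stopwords.toy <- c("the", "that", "at", "and")

# Split the lower-cased text into words at every run of non-letters
words.toy <- unlist(strsplit(tolower(txt.toy), "[^a-z]+"))

# Drop empty strings and stopwords, then count and sort
words.toy <- words.toy[nchar(words.toy) > 0 & !words.toy %in% stopwords.toy]
freq.toy  <- sort(table(words.toy), decreasing = TRUE)
freq.toy
```

For German text with umlauts, the character class in the split pattern would have to be extended accordingly.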

library(ggplot2)

plt.words <- ggplot(df.words, aes(interval, freq, fill = 500 - freq)) +
  geom_bar(stat="identity", position="dodge", width = 0.75) + 
  scale_size_area() +
  scale_y_continuous('', limits=c(0, max(df.words$freq)+10), breaks = seq(0, max(df.words$freq)+10, by = 20)) +
  scale_x_discrete('') +
  theme(legend.position="none") + 
  coord_flip() +
  ggtitle("Häufigste Wortnennungen") +  # "Most frequently used words"
  geom_text(aes(label = paste0(percent, '%'), ymax = 0), size = 3, fontface=2, 
            hjust = -0.5, vjust = 0.2)

plt.words


With the wordcloud package, the same information can also be displayed as a word cloud.

colfunc <- colorRampPalette(c("blue", "red"))

set.seed(4)
par(mar = c(0, 0, 0, 0))
wordcloud::wordcloud(txt.wc, 
                     scale=c(3,.3),
                     min.freq=3,
                     max.words=150,
                     random.order=FALSE,
                     colors = colfunc(200))


Looking at the word frequencies, it becomes apparent that the police press releases are often concerned with placing the events in time and space. On the one hand, times of day are reported very frequently (cf.). On the other hand, it can be seen that above all Richard-Wagner-Platz, Augustusplatz and Leipzig's main station (Leipziger Hauptbahnhof) are central locations of the LEGIDA demonstrations.

LEGIDA rallies: A heat map of day times using ggplot2

LEGIDA is a Leipzig-based offshoot of the right-wing and xenophobic PEGIDA movement.

Since January 2015, LEGIDA has held at least one rally per month against what they call the “Islamisation of the Western world”.

Usually, the Leipzig Police Department publishes online reports describing what happened at these rallies.

In my following blog posts, I will show what kind of information can be derived from these reports and how this information can be visualized. My blog posts will have a technical rather than a political character.

Today, I'm going to show how to find information about the time of day the rallies took place and how to visualize these specifications of time using a pie chart.

In the first code chunk, we will simply load the concatenated police report as a character vector named txt. The text of the police reports is rather unstructured. However, specifications of time are consistently given in the format: two digits followed by a colon followed by two digits, e.g. '19:00' for 7 p.m. Thus, all specifications of time can be extracted with a simple regular expression. Since we are not particularly interested in to-the-minute specifications, we save the hours as a numeric vector.

time.str <- unlist(regmatches(txt, gregexpr("\\d{2}\\:\\d{2}", txt)))
time.str

[1] "19:00" "17:00" "21:00" "15:00" "18:00" "17:00" "19:00" "18:45"
[9] "18:30" "20:20" "17:44" "21:45" "18:00" "17:30" "19:00" "20:20"
[17] "21:30" "19:00" "20:00" "21:15" "21:30" "22:30" "19:15" "19:45"
[25] "21:00" "18:15" "16:15" "19:30" "20:15" "21:30" "22:00" "16:45"
[33] "19:00" "17:20" "18:00" "19:00" "20:15" "19:00" "21:15" "22:45"
[41] "19:10" "19:40" "21:00" "22:00" "19:30" "21:30" "23:00" "21:45"
[49] "17:30" "19:00" "21:00" "19:00" "20:00" "21:15" "21:45" "17:15"
[57] "17:00" "18:00" "19:10" "21:00" "17:00" "19:15" "20:40" "20:15"
[65] "21:45" "18:00" "22:00" "18:30" "18:00" "18:30" "18:00" "18:45"
[73] "19:50" "20:00" "20:20" "21:15" "21:00" "18:45" "19:00" "20:40"
[81] "20:00" "20:45" "21:15" "19:50" "20:40" "21:00" "20:00" "20:45"
[89] "21:00" "19:00" "20:00" "20:50" "21:45" "21:00" "21:30" "19:00"
[97] "20:00" "20:45" "21:20" "21:30" "19:00" "19:30" "20:20" "21:00"
[105] "21:35" "18:00" "19:00" "18:00" "20:50" "19:00" "20:00" "20:50"
[113] "20:50" "19:00" "19:50" "20:35" "20:55" "17:35" "18:35" "19:00"
[121] "20:45" "21:00" "18:00" "18:45" "19:00" "19:35" "20:20" "20:30"
[129] "20:40" "22:00" "18:00" "19:20" "02:30" "19:05" "21:10" "17:40"
[137] "18:45" "20:00" "21:20" "18:30" "19:20" "20:00" "21:45" "19:00"
[145] "19:30" "20:45" "21:45" "23:00"

time.str <- as.numeric(stringr::str_sub(time.str, 1, 2))

[1] 19 17 21 15 18 17 19 18 18 20 17 21 18 17 19 20 21 19 20 21 21 22 19
[24] 19 21 18 16 19 20 21 22 16 19 17 18 19 20 19 21 22 19 19 21 22 19 21
[47] 23 21 17 19 21 19 20 21 21 17 17 18 19 21 17 19 20 20 21 18 22 18 18
[70] 18 18 18 19 20 20 21 21 18 19 20 20 20 21 19 20 21 20 20 21 19 20 20
[93] 21 21 21 19 20 20 21 21 19 19 20 21 21 18 19 18 20 19 20 20 20 19 19
[116] 20 20 17 18 19 20 21 18 18 19 19 20 20 20 22 18 19 2 19 21 17 18 20
[139] 21 18 19 20 21 19 19 20 21 23

Since we are only interested in the time span between 12 p.m. (noon) and 12 a.m. (midnight), we transform our numeric vector time.str into a factor containing only the hours 13 to 24. Values outside this span (such as the single 2) become NA and are dropped when the table is built. Afterwards, we save this vector as a table.

time.str <- factor(time.str, levels = c(13:24))
time.tab <- table(time.str)

In the next step, we create a table containing the proportions for each hour and save these specifications into a new vector named time.vecp.

time.tabp <- round(prop.table(table(time.str)), 2)
time.vecp <- as.numeric(as.character((time.tabp)))*100

[1] 0 0 1 1 7 15 24 23 22 4 1 0
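To see what happens in these two steps, here is a small sketch on made-up values. It is only meant to illustrate how factor levels, table() and prop.table() interact; the vector hours.toy is hypothetical.

```r
# Made-up hours; 2 lies outside the levels 13 to 24
hours.toy <- c(19, 19, 20, 2, 21)

# Values that are not among the levels become NA
hours.fac <- factor(hours.toy, levels = 13:24)

# table() drops NA by default, so only four values are counted
tab.toy <- table(hours.fac)

# Proportions in percent, one entry per level (hour)
pct.toy <- as.numeric(round(prop.table(tab.toy), 2)) * 100
names(pct.toy) <- names(tab.toy)
pct.toy
```

Because every level is kept in the table, hours without any rally still show up with a value of 0, which is what the clock chart needs.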

Finally, we want to visualize our results. Since the form of a clock can be very good reproduced with a pie chart, we first create a dataframe with twelf segments of the same size (time). To this dataframe, we add two more variables: our proportional time vector (value) and the labels for visualizing the clock (labs).

df <- data.frame(time = rep(1, 12),   # twelve equally sized segments
                 value = time.vecp,   # percentage of rallies per hour
                 labs = 1:12)         # clock-face labels

The pie chart is plotted using ggplot2. The result is a kind of heat map showing the times of day at which the LEGIDA rallies usually take place.

library(ggplot2)

  ggplot(df, aes(x = "", y = time, fill = value)) +
    geom_bar(width = 1, stat = "identity", colour = "grey") +
    scale_y_continuous('', limits=c(0, 12), breaks = seq(1,12,1),
                       labels=df$labs) +
    scale_x_discrete('') +
    scale_fill_distiller('Percent', palette = 'Oranges', space = "Lab", direction = 1) +
    coord_polar(theta = "y", start = 0) +
    labs(title = "LEGIDA clock") +
    theme_minimal() +
    theme(axis.text = element_text(size = 18))

Obviously, the rallies usually take place between 6 and 9 p.m.

Last update: 2016-08-30, after the 35th LEGIDA rally.

How to print tables with absolute and relative values in R

Introduction

In R, there are several ways to generate tables. While the table() function generates tables with absolute numbers, the prop.table() function returns tables with relative values (proportions). However, I couldn't find a function that returns a table with both absolute and relative values.

In this blog post, I show how to generate such a table.

Generate a random dataframe

In the first code snippet, we generate a random dataframe with two variables: Sex and Age. As you can see, generating random dataframes is easy and straightforward with Tyler Rinker's wakefield package.

library(wakefield)
df <- r_data_frame(n=200, sex, age)

Write the function

My function tab.func() combines three R functions:

describe() from the Hmisc package to return an object of class describe containing absolute and relative frequency values of a factor variable. To access these values, we need to subset this object using $values. This will return a matrix with the desired values.
t() to transpose this matrix, and
as.table() to transform this matrix into a table.

tab.func <- function (x) {
  y <- as.table(t(Hmisc::describe(x)$values))
  colnames(y) <- c('**n**', '**%**')
  return(y)
}

(UPDATE: In version 4 of the Hmisc package, the describe() function was rewritten. My function only works up to version 3.17.4)

The double asterisks around n and % are Markdown code used to return bold text.
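
Because tab.func() depends on the older describe() output mentioned in the UPDATE note, a version-independent variant can be sketched with base R only. The helper name tab.func2 and the small example factor are assumptions made for illustration; they are not part of the original post.

```r
# Hypothetical base R alternative (assumption): no Hmisc required
tab.func2 <- function(x) {
  n <- table(x)                          # absolute frequencies
  pct <- round(100 * prop.table(n), 1)   # relative frequencies in percent
  y <- cbind(as.vector(n), as.vector(pct))
  # One row per category, same Markdown column headers as tab.func()
  dimnames(y) <- list(names(n), c("**n**", "**%**"))
  as.table(y)
}

# Made-up example data: three women and two men
sex.demo <- factor(c("Male", "Female", "Female", "Male", "Female"))
tab.func2(sex.demo)
```

The result has one row per category and can be passed to knitr::kable() in the same way as the output of tab.func().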

Deploy the function

In the following code snippet, we apply this function to a categorical variable (Sex) which is part of the dataframe df.

mytable <- tab.func(df$Sex)


knitr::kable(mytable,
             caption = 'Table with absolute numbers and percentages')

	n	%
A	Male, Female	96, 104

Finally, we print this table with the kable() function from the knitr package.

RMarkdown: How to insert page breaks in a MS Word document

Introduction

RStudio offers the opportunity to build MS Word documents from RMarkdown files. However, since formatting options in Markdown are very limited, there is no ‘native’ Markdown code to insert page breaks in the final MS Word output file.

In this blog post, I explain how to define page breaks in the RMarkdown document that will be kept in the final MS Word document (.docx). My post is based on Richard Layton’s article Happy collaboration with Rmd to docx, which explains how to create an MS Word .docx template in order to modify the document design of an MS Word file created from an .Rmd file in RStudio.

The MS Word template

In the first step, we create a MS Word template called ‘mystyles.docx’ (How to…). This file must be saved in the same directory as the R Markdown file. For the following modifications we have to open this file with MS Word.

Modify style ‘Heading 5’

In the next step, we modify a predefined style. However, after modifying a predefined style, we can no longer use it in the originally intended way. Thus, we must choose a style that is hardly needed for any other purpose. In this blog post, we use the Heading 5 style.

To modify this style, we select the ‘Home’ ribbon tab and click the Styles window launcher in the Styles group (lower right corner, highlighted with a red circle).

We select ‘Heading 5’ in the Word document. In the Styles window, we scroll down until we find the style already assigned to the text we selected. In our case, the assigned style is ‘Heading 5’. (The figure shows ‘Heading 3’, but ‘Heading 5’ is meant.)

The following modifications must be made in the Modify Style menu:

Set the font color to ‘white’ (rather than ‘Automatic’).
Select the smallest font size (8 rather than 11).
Select ‘Page break before’ in the ‘Line and Page Breaks’ tab.
Set the line spacing to ‘Exactly’ and ‘1 pt’ in the ‘Indents and Spacing’ tab.

After these tweaks, the ‘Heading 5’ style will no longer format a heading of level 5. Instead it will insert a very small and white (and, thus, invisible) line followed by a page break.

The RMarkdown document

In the RMarkdown document, a few specifications must be made.

The YAML header

RMarkdown documents contain a metadata section called the YAML header. In this header, we specify the output format (word_document) and the name of the MS Word template (mystyles.docx).

---
title: 'Title'
date: "`r format(Sys.time(), '%d&period; %B %Y')`"
output: 
    word_document:
      reference_docx: mystyles.docx
---

The Markdown code #####, originally reserved for level 5 headings, is now used to insert page breaks in the final .docx document. Since we changed the font color to ‘white’ in the MS Word template, the text after the Markdown code (Page Break) will not be visible in the final document.

The following example shows how to insert a page break between two paragraphs.

Example: Markdown code to insert a page break

Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break.

##### Page Break

Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break.
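
Besides the Knit button in RStudio, the document can also be rendered from the R console. The following is only a minimal sketch, assuming that the rmarkdown package and pandoc are installed. The file name report.Rmd is hypothetical, and mystyles.docx is the template described above, expected in the working directory.

```r
# Hypothetical file name for the R Markdown document (assumption)
rmd.file <- "report.Rmd"

# Render only if rmarkdown is installed and both files exist
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    file.exists(rmd.file) && file.exists("mystyles.docx")) {
  rmarkdown::render(
    rmd.file,
    # word_document() takes the template via its reference_docx argument
    output_format = rmarkdown::word_document(reference_docx = "mystyles.docx")
  )
}
```

If the YAML header already names the template, calling rmarkdown::render() with the file name alone should be sufficient.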

Download

My MS Word template may be downloaded here.

PS

Since I don’t have an English version of MS Word, I could not make the screenshots myself. Instead, I have used internet links. Please click on the pictures to get to the web pages.

Please also see my blog post RMarkdown: How to format tables and figures in .docx files.

Importing Google Trends time series into R

One of the many applications that Google provides free of charge is Google Trends. According to Wikipedia, it is an internet service that “provides information on which search terms users of the Google search engine have entered and how often. The results are set in relation to the total search volume and are available at weekly resolution from the beginning of 2004 for the whole world or for individual regions.”

The scale values calculated by Google Trends range from 0 to 100, depending on the popularity of the term, with higher values indicating greater popularity. In addition, the Google Trends pages give the following guidance on interpreting the scale values:

The numbers […] show total searches for a term relative to the total number of searches done on Google over time. A line trending downward means that a search term’s relative popularity is decreasing. But that does not necessarily mean the total number of searches for that term is decreasing. It just means its popularity is decreasing compared to other searches.

Google Trends is interesting for research because it makes it possible to trace the popularity of individual terms over time. For example, as one study showed, the outbreak of a flu epidemic can be detected in this way.

The analysis of these data is made easier by an interface (API) provided by Google, which can be accessed from the statistical software GNU R.

In the very readable blog post GTrendsR package to Explore Google trending for Field Dependent Terms, Tyler Rinker explains how to import Google Trends data with GNU R and display them graphically.

In this blog post, I show how to import Google Trends data into R and store them in a dataframe. I am currently working on the analysis of the data.

Installing the R packages

Since the two R packages required for accessing the Google Trends API are hosted on GitHub rather than on CRAN, the devtools package is needed for the installation. Once this package is installed, the packages GTrendsR and gtrend can be installed as well.

install.packages('devtools', dep=TRUE)
devtools::install_github("dvanclev/GTrendsR")
devtools::install_github("trinker/gtrend")

Loading the R packages

In the next step, the packages required for accessing the Google Trends API (gtrend and GTrendsR) are loaded.

library(gtrend)
library(GTrendsR)

Data import

In the next code snippet, the terms relevant to the analysis are defined and stored in the vector terms. The function gtrend_scraper then imports the Google Trends data. This function requires a Gmail address and the corresponding password as well as the search terms (terms). In addition, the geo argument restricts the search to a particular country. In this example, I chose Germany with ‘DE’ (other countries: ‘US’ = USA, ‘FR’ = France, ‘UK’ = Great Britain, etc.).

terms <- c("Kino", "Theater", "Oper")
out <- gtrend_scraper("youremail@gmail.com", "password", terms, geo = 'DE')

Finally, the result of the query is stored in a dataframe.

df.trends <- trend2long(out)

With the kable() function from the R package knitr and the head() function, the first six rows of the data set can be displayed as a table:

knitr::kable(head(df.trends), align = 'c')

term	start	end	trend
Kino	2004-01-04	2004-01-10	64
Kino	2004-01-11	2004-01-17	49
Kino	2004-01-18	2004-01-24	48
Kino	2004-01-25	2004-01-31	50
Kino	2004-02-01	2004-02-07	51
Kino	2004-02-08	2004-02-14	52

As you can see, the data set contains four variables:

term contains the search terms;
start and end are date variables marking the beginning and end of the measurement period;
trend stores the score, which ranges from 0 to 100.
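
As a quick check of this structure, the six rows shown in the table above can be rebuilt by hand and summarised with base R. This is only a sketch: the dataframe df.demo is typed in from the table rather than downloaded, and the use of aggregate() is an assumption for illustration, not part of the original workflow.

```r
# Rebuild the six displayed rows by hand (no API call involved)
df.demo <- data.frame(
  term  = rep("Kino", 6),
  start = as.Date(c("2004-01-04", "2004-01-11", "2004-01-18",
                    "2004-01-25", "2004-02-01", "2004-02-08")),
  end   = as.Date(c("2004-01-10", "2004-01-17", "2004-01-24",
                    "2004-01-31", "2004-02-07", "2004-02-14")),
  trend = c(64, 49, 48, 50, 51, 52)
)

# Number of days between start and end of each measurement period
period.days <- as.numeric(df.demo$end - df.demo$start)
unique(period.days)

# Mean score per search term
trend.mean <- aggregate(trend ~ term, data = df.demo, FUN = mean)
trend.mean
```

With the full data set returned by trend2long(), the same aggregate() call would give one mean score per search term.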
