Data Management – Scripts & Statistics

How to use lists in RMarkdown and Quarto

RMarkdown and Quarto documents are “dynamic analysis documents that combine code, rendered output (such as figures), and prose”. Since everything in R is an object, lots of R objects (numbers, charts, tables, statistical models, etc.) are created while rendering the document. In large documents, it can be difficult to keep track with where and when these objects are created which sometimes makes debugging a cumbersome and time consuming job.

However, in this blog post, I will show how to use lists to improve clarity throughout the document. Let me give you an example:

Creating a `list` with named objects

In R, a “list” itself is an object and can be created using the native list() function. In the following code snippet, we save list as an object (“lst.a”) and add one named object which is numeric (‘nmb.1’). The str()function shows all objects our list contains.

lst.a <- list(
  nmb.1 = 10
)

str(lst.a)

## List of 1
##  $ nmb.1: num 10

Reusing objects

Next, we try to create a new list and check whether it’s possible to use one object (‘nmb.1’) to compute another object (‘nmb.2’):

lst.b <- list(
  nmb.1 = 10,
  nmb.2 = nmb.1 + 10
)

## Error:
## ! object 'nmb.1' not found

As we can see, the list() function cannot create objects and reuse them at the same time. Fortunately, the {tibble} package contains the lst() function which is able to do so:

lst.b <- tibble::lst(
  nmb.1 = 10,
  nmb.2 = nmb.1 + 10
)
str(lst.b)

## List of 2
##  $ nmb.1: num 10
##  $ nmb.2: num 20

Saving plots in lists

Unfortunately, base R plots cannot be saved as object and, thus, cannot be put into a list. When we evaluate the following line of Rcode, we see that the plot() function is evaluated immediately (as a side effect). The object it was assigned to remains empty (“NULL”).

lst.c <- list(
  plt = plot(1:10)
)

plot of chunk lst-plot

lst.c$plt

## NULL

However, the {ggplot2} package produces plots which can be assigned to objects and, thus, can be put into lists.

lst.d <- list(
  ggplt = data.frame(x = 1:10, y = 1:10) %>% ggplot(aes(x, y)) + geom_point()
)
lst.d$ggplt

plot of chunk lst-ggplot

Statistical models, data.frames etc.

It probably doesn’t come as a surprise that many more objects can be put into a list. In the following example, we us the ‘mtcars’ data.frame to build a simple linear model explaining horsepower (hp) by miles per gallon (mpg). At the same time, we put this model into a list (‘lst.e’) both as a list (‘mdl.lm’) and a tibble.

put a linear model explaining horsepower (hp) by miles per gallon (mpg) both as a list and a tibble (‘tib.lm’).

lst.e <- lst(
  mdl.lm = lm(hp ~ mpg, data = mtcars),
  tib.lm = broom::tidy(mdl.lm)
)

str(lst.e,  list.len = 5)

## List of 2
##  $ mdl.lm:List of 12
##   ..$ coefficients : Named num [1:2] 324.08 -8.83
##   .. ..- attr(*, "names")= chr [1:2] "(Intercept)" "mpg"
##   ..$ residuals    : Named num [1:32] -28.7 -28.7 -29.8 -25.1 16 ...
##   .. ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##   ..$ effects      : Named num [1:32] -829.8 296.3 -23.6 -20 19.3 ...
##   .. ..- attr(*, "names")= chr [1:32] "(Intercept)" "mpg" "" "" ...
##   ..$ rank         : int 2
##   ..$ fitted.values: Named num [1:32] 139 139 123 135 159 ...
##   .. ..- attr(*, "names")= chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##   .. [list output truncated]
##   ..- attr(*, "class")= chr "lm"
##  $ tib.lm: tibble [2 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ term     : chr [1:2] "(Intercept)" "mpg"
##   ..$ estimate : num [1:2] 324.08 -8.83
##   ..$ std.error: num [1:2] 27.43 1.31
##   ..$ statistic: num [1:2] 11.81 -6.74
##   ..$ p.value  : num [1:2] 8.25e-13 1.79e-07

Using lists with RMarkdown and Quarto

Finally, we are going to save everything into a single list and put the list elements into a text.

lst.final <- lst(
  nmb.1 = nrow(mtcars),
  nmb.2 = ncol(mtcars),
  ggplt = mtcars %>% ggplot(aes(x = hp, y = mpg)) +
    geom_point() + geom_smooth(method='lm', se = FALSE) + ggthemes::theme_clean(),
  mdl.lm = lm(hp ~ mpg, data = mtcars),
  tib.lm = broom::tidy(mdl.lm, conf.int = TRUE)
)

Putting text and `R` code together

This is just an example text showing how to put text and code into:

The ‘mtcars’ dataset consists of lst.final$nmb.1 rows and lst.final$nmb.2 columns.

Rendering to:

The ‘mtcars’ dataset consists of 32 rows and 11 columns.

The relation between horsepower and miles per gallon is visualised in the following figure:

lst.final$ggplt

Horsepower and miles per gallon

Statistical models can be easily put into a table using the {gtsummary} package:

library(gtsummary)

tbl_regression(lst.final$mdl.lm, intercept = TRUE) %>% 
  as_hux_table()

Characteristic	Beta	95% CI	p-value
(Intercept)	324	268, 380	<0.001
mpg	-8.8	-12, -6.2	<0.001
Abbreviation: CI = Confidence Interval

How to Assign Variable Labels in R

Intro

Defining variable labels is a useful way to describe and document datasets. Unlike SPSS, which makes it very easy to define variable labels using the data editor, base R doesn't provide any function to define variable labels (as far as I know).

However, Daniel Luedecke's R package sjlablled fills this gap. Let's give an example.

Defining variable labels

First, we load the mtcars data frame and define variable labels for all of the 11 variables:

data(mtcars)
labs <- c("Miles/(US) gallon", "Number of cylinders", "Displacement (cu.in.)", 
    "Gross horsepower", "Rear axle ratio", "Weight (1000 lbs)", "1/4 mile time", 
    "V/S", "Transmission", "Number of forward gears", "Number of carburetors")

Assigning labels to variables

Second, we assign the variable labels to the variables of the mtcars data frame:

library(sjlabelled)
mtcars <- set_label(mtcars, label = labs)

When we have a look at the mtcars data frame using RStudio's data viewer, we find the variable labels placed right underneath the variable names:

Moreover, we may as well save both variable names and labels into a data frame:

library(dplyr) # for data manipulation
library(knitr) # for printing tables
df <- get_label(mtcars) %>%
        data.frame() %>%
          rename_at(vars(1), funs(paste0('var.labs'))) %>%
            mutate(var.names = colnames(mtcars)) 
kable(df, align = 'lc')

var.labs	var.names
Miles/(US) gallon	mpg
Number of cylinders	cyl
Displacement (cu.in.)	disp
Gross horsepower	hp
Rear axle ratio	drat
Weight (1000 lbs)	wt
¼ mile time	qsec
V/S	vs
Transmission	am
Number of forward gears	gear
Number of carburetors	carb

How to parse Citavi files using R

Intro

In academic writing, the use of reference management software has become essential. A couple of years ago, I started using the free open-source software Zotero. However, most of my workmates at Leipzig University work with Citavi, a commercial software which is widely used at German, Austrian, and Swiss universities. The current version of Citavi – Citavi 5 – was released in April 2015.

plot of chunk pic

In this blog post, I show how to import Citavi files into R.

Packages required

Citavi organizes references in SQLite databases. Although Citavi files end with .ctv5 rather than .sqlite, they may be parsed as sqlite files. For reproducing the code of this blog post, the following R packages are required.

library(RSQLite)
library(DBI)
library(tidyverse)
library(stringr)

Data

Import

A connection to the database myrefs.ctv5 may be established using the dbConnect() function of the DBI and the SQLite() function of the RSQLite package.

# connect to Citavi file
con <- DBI::dbConnect(RSQLite::SQLite(), 'myrefs.ctv5')

A list of all tables may be returned using the dbListTables() function of the DBI package. Out database contains 38 tables.

# get a list of all tables
dbListTables(con)

[1] “Annotation” “Category”
[3] “CategoryCategory” “Changeset”
[5] “Collection” “DBVersion”
[7] “EntityLink” “Group”
[9] “ImportGroup” “ImportGroupReference”
[11] “Keyword” “KnowledgeItem”
[13] “KnowledgeItemCategory” “KnowledgeItemCollection”
[15] “KnowledgeItemGroup” “KnowledgeItemKeyword”
[17] “Library” “Location”
[19] “Periodical” “Person”
[21] “ProjectSettings” “ProjectUserSettings”
[23] “Publisher” “Reference”
[25] “ReferenceAuthor” “ReferenceCategory”
[27] “ReferenceCollaborator” “ReferenceCollection”
[29] “ReferenceEditor” “ReferenceGroup”
[31] “ReferenceKeyword” “ReferenceOrganization”
[33] “ReferenceOthersInvolved” “ReferencePublisher”
[35] “ReferenceReference” “SeriesTitle”
[37] “SeriesTitleEditor” “TaskItem”

Creating data frames

For reading the tables of the database, we need the dbGetQuery() function of the DBI package. Each table contains a number of variables. In case we want to save all variables into a data frame, we select them using the asterisk.

df.author <- dbGetQuery(con,'select * from Person')

In case we are only interested in a couple of variables, we need to specify their names separated by commas.

df.author <- dbGetQuery(con,'select ID, FirstName, LastName, Sex from Person')
df.keyword <- dbGetQuery(con,'select ID, Name from Keyword')
df.refs <- dbGetQuery(con,'select ID, Title, Year, Abstract, CreatedOn, ISBN, PageCount, PlaceOfPublication, ReferenceType   from Reference')

Data wrangling

In the next step, we try to join our data frames using the dplyr package.

mydata <- df.refs %>%
  left_join(df.author, by = 'ID') %>%
    left_join(df.keyword, by = 'ID')

## Error in eval(expr, envir, enclos): Can't join on 'ID' x 'ID' because of incompatible types (list / list)

However, R returns an error message disclosing that the data frames cannot be joined by their ID variable because of incompatible types.

When we take a closer look at the ID variables we see that they are organized as list containing 1603 elements of type raw. Moreover, each of the list elements consists of 16 alphanumerical elements.

typeof(df.author$ID)

## [1] "list"

str(df.author$ID[[1]])

##  raw [1:16] 41 5b 18 51 ...

length(df.author$ID)

## [1] 1603

In order to be able to join our data frames we need to convert the type of the ID variables from list to character. Furthermore, we collapse the 16 list elements into single strings separated by hyphens.

# df.author
for (i in 1:nrow(df.author)){
  df.author$ID[[i]] <- as.character(df.author$ID[[i]])
  df.author$ID[[i]] <- str_c(df.author$ID[[i]], sep = "", collapse = "-")
}
df.author$ID <- unlist(df.author$ID)
# df.keyword
for (i in 1:nrow(df.keyword)){
  df.keyword$ID[[i]] <- as.character(df.keyword$ID[[i]])
  df.keyword$ID[[i]] <- str_c(df.keyword$ID[[i]], sep = "", collapse = "-")
}
df.keyword$ID <- unlist(df.keyword$ID)
# df.refs
for (i in 1:nrow(df.refs)){
  df.refs$ID[[i]] <- as.character(df.refs$ID[[i]])
  df.refs$ID[[i]] <- str_c(df.refs$ID[[i]], sep = "", collapse = "-")
}
df.refs$ID <- unlist(df.refs$ID)

typeof(df.refs$ID)

## [1] "character"

head(df.refs$ID)

## [1] "34-26-a7-db-e2-79-ac-4c-ad-21-45-91-37-3f-e9-b0"
## [2] "75-e6-c1-fd-65-0a-82-48-a8-c8-e9-80-2a-79-db-2c"
## [3] "0b-8f-73-86-82-0a-c8-48-84-f7-d2-7a-04-df-12-65"
## [4] "7a-79-2b-1e-4a-ef-ae-40-ae-06-d5-bd-5d-e9-a7-cb"
## [5] "12-3c-0e-70-a2-2a-f9-48-ad-e4-b7-7e-85-63-97-96"
## [6] "bc-ae-06-78-89-48-df-47-88-b6-5f-92-20-9b-4c-7c"

Joining the data frames

Finally, our data frames may be joined by their ID variables.

mydata <- df.refs %>%
  left_join(df.author, by = 'ID') %>%
    left_join(df.keyword, by = 'ID')

Results

Our final data frame contains 1066 cases and 13 variables, that is:

colnames(mydata)

##  [1] "ID"                 "Title"              "Year"              
##  [4] "Abstract"           "CreatedOn"          "ISBN"              
##  [7] "PageCount"          "PlaceOfPublication" "ReferenceType"     
## [10] "FirstName"          "LastName"           "Sex"               
## [13] "Name"

How to import a bunch of Excel files with multiple sheets

The Problem

I have been recently conducting an evaluation study in the field of social work with elderly people. With the purpose to provide advice to elderly peoply regarding age-related questions about housing, home care etc, 10 offices for senior citizens were founded. Each of the offices is requested to document its activities (number of persons, events etc.) on a monthly basis. The data need to be entered in a prestructured Excel table. Since the offices started working about 2.5 years ago, I needed to handle 300 Excel sheets (30 months * 10 offices).

In a first step, I decided to create one Excel file for every office each containing 30 sheets (one sheet per month). While the files are named after the offices (abbreviated with 4 characters), the sheets are named after the following pattern: YYYY.MM (year followed by month).

The Solution

Since I did not find a solution for my problem with the packages I usually use to import Excel files into R (xlsx, readxl), I searched the internet for help. Fortunatelly, I found the paper “How to import and merge many Excel files; each with multiple sheets of data for statistical analysis.” by Jon Starkweather. The paper is really worth reading and gives a very comprehensive description on the subject matter. The following code snippets stem from Starkweather's paper.

In a first step, we have to load the following packages:

library(rJava)
library(XLConnect, pos = 4)

In a second step, we define the file type, we want to import (.xls), save the sheet names of the Excel files into a new vector called sheet.names (since the sheet names in each of the files are identical, we may extract them from any of the 10 files) and create another vector (e.names) containing the names for the variables we want to import (in this case 28).

file.names <- list.files(pattern='*.xls')
sheet.names <- getSheets(loadWorkbook('Name.xls'))
e.names <- paste0(rep('v', 28), c(1:28))

In a thirt step, we create a data frame with 28 variables, named
v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16, v17, v18, v19, v20, v21, v22, v23, v24, v25, v26, v27, v28
and one row containing NAs only.

data.1 <- data.frame(matrix(rep(NA,length(e.names)),
                            ncol = length(e.names)))
names(data.1) <- e.names

Finally, we use 2 for-loops to import all the files and sheets and bind them to a data frame we can use for analysis.

for (i in 1:length(file.names)) {
    wb <- loadWorkbook(file.names[i])
    for (j in 1:length(sheet.names)) {
        ss <- readWorksheet(wb, sheet.names[j], startCol = 2, header = TRUE)
        condition <- rep(sheet.names[j], nrow(ss))
        sub.id <- rep(file.names[i], nrow(ss))
        s.frame <- seq(1:nrow(ss))
        df.1 <- data.frame(sub.id, condition, s.frame, ss)
        names(df.1) <- e.names
        data.1 <- rbind(data.1, df.1)
        rm(ss, condition, s.frame, sub.id, df.1)
    }
    rm(wb)
}

In the mentioned paper, Jon Starkweather elaborates in detail on what each line of each for-loop is doing.

How to number and reference tables and figures in R Markdown files

Introduction

R Markdown is a great tool to make research results reproducible. However, in scientific research papers or reports, tables and figures usually need to be numbered and referenced. Unfortunately, R Markdown has no “native” method to number and reference table and figure captions. The recently published bookdown package makes it very easy to number and reference tables and figures (Link). However, since bookdown uses LaTex functionality, R Markdown files created with bookdown cannot be converted into MS Word (.docx) files.

In this blog post, I will explain how to number and reference tables and figures in R Markdown files using the captioner package.

Packages required

The following code will install load and / or install the R packages required for this blog post. The dataset I will be using in this blog post is named bundesligR and part of the bundesligR package. It contains “all final tables of Germany's highest football league, the Bundesliga” Link.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(knitr, captioner, bundesligR, stringr)

In the first code snippet, we create a table using the kable function of the knitr package. With caption we can specify a simple table caption. As we can see, the caption will not be numbered and, thus, cannot be referenced in the document.

knitr::kable(bundesligR::bundesligR[c(1:6), c(2,3,11,10)],
             align = c('c', 'l', 'c', 'c'),
             caption = "German Bundesliga: Final Table 2015/16, Position 1-6")

Position	Team	Points	GD
1	FC Bayern Muenchen	88	63
2	Borussia Dortmund	78	48
3	Bayer 04 Leverkusen	60	16
4	Borussia Moenchengladbach	55	17
5	FC Schalke 04	52	2
6	1. FSV Mainz 05	50	4

Table numbering

Thanks to Alathea Letaw's captioner package, we can number tables and figures.
In a first step, we define a function named table_nums and apply it to the tables' name and caption. We define both table name and table caption. Furthermore, we may also define a prefix (Tab. for tables and Fig. for figures).

table_nums <- captioner::captioner(prefix = "Tab.")

tab.1_cap <- table_nums(name = "tab_1", 
                        caption = "German Bundesliga: Final Table 2015/16, Position 7-12")
tab.2_cap <- table_nums(name = "tab_2", 
                        caption = "German Bundesliga: Final Table 2015/16, Position 12-18")

The next code snippet combines both inline code and a code chunk. With fig.cap = tab.1_cap, we specify the caption of the first table. It is important to separate inline code and code chunk. Otherwise the numbering won't work.

Tab. 1: German Bundesliga: Final Table 2015/16, Position 7-12

Position	Team	Points	GD
7	Hertha BSC	50	0
8	VfL Wolfsburg	45	-2
9	1. FC Koeln	43	-4
10	Hamburger SV	41	-6
11	FC Ingolstadt 04	40	-9
12	FC Augsburg	38	-10

Table referencing

Since we have received a numbered table, it should also be possible to reference the table. However, we can not just
use the inline code table_nums('tab_1'). Otherwise, we wi'll get the following output:

[1] “Tab. 1: German Bundesliga: Final Table 2015/16, Position 7-12”

In order to return the desired output (prefix Tab. and table number), I have written the function f.ref. Using a regular expression, the function returns all characters of the table_nums('tab_1') output located before the first colon.

f.ref <- function(x) {
  stringr::str_extract(table_nums(x), "[^:]*")
}

When we apply this function to tab_1, the inline code returns the following result:

As we can see in f.ref("tab_1"), the Berlin based football club Hertha BSC had position seven in the final table.

As we can see in Tab. 1, the Berlin based football club Hertha BSC had position seven in the final table.

Just to make the table complete, Tab. 2 shows positions 13 to 18 of the final Bundesliga table.

Tab. 2: German Bundesliga: Final Table 2015/16, Position 12-18

knitr::kable(bundesligR::bundesligR[c(13:18), c(2,3,11,10)],
             align = c('c', 'l', 'c', 'c'),
             row.names = FALSE)

Position	Team	Points	GD
13	Werder Bremen	38	-15
14	SV Darmstadt 98	38	-15
15	TSG 1899 Hoffenheim	37	-15
16	Eintracht Frankfurt	36	-18
17	VfB Stuttgart	33	-25
18	Hannover 96	25	-31

And what about figures?

Figures can be numbered and referenced following the same principle.

How to parse Evernote export files (.enex) using R

Evernote is a “cross-platform […] app designed for note taking, organizing, and archiving” (Wikipedia). All notes can be tagged and exported. I'm using Evernote, above all, to save and tag interesting blog posts related to R.

plot of chunk logo

In this blog post, I show how to import and parse an exported Evernote file with R.

Exporting the data from Evernote

In a first step, I've exported all of my notes tagged with 'R':

Open the Evernote client;
Select all notes to be exported;
Go to 'File' > 'Export';
Select option 'Export as a file in ENEX format (.enex)' from the format options box;
Name the file 'Evernote.enex' and save it into your RStudio project folder.

Importing the data into R

Since the '.enex' file has xml properties, the 'Evernote.enex' file can be imported using the XML package. Because of its structure, the imported file cannot be transformed into a dataframe right away. Instead, we need to transform it into a list (using the XML::xmlToList function).

library(XML)
xmlfile <- xmlParse("Evernote.enex")
xmllist <- xmlToList(xmlfile, addAttributes = FALSE)

In the following section, I show how to create a dataframe based on the xmllist object.

Building a data frame

First, we generate an empty data frame. The number of rows (262) is determined by the number of elements in the xmllist object and the number of columns is set to zero.

mydata <- data.frame(matrix(NA, ncol = 0, nrow = length(xmllist)))
dim(mydata)

[1] 262 0

Second, we read the names of the note titles and save it into a variable called title which is part of our data frame mydata.

for (i in 1:length(xmllist)){
  mydata$title[i] <- unlist(xmllist[[i]]['title'])
}

head(mydata$title, 10)

[1] “Network visualization in R with the igraph package | Rules of Reason”
[2] “More debate analysis with R”
[3] “Analyzing networks of characters in 'Love Actually' – Variance Explained”
[4] “Web scraping in R”
[5] “Color Quantization in R”
[6] “Zellingenach: A visual exploration of the spatial patterns in the endings of German town and village names in R | rud.is”
[7] “Waterfall plots – what and how?”
[8] “Sentiment Analysis on Donald Trump using R and Tableau | DataScience+”
[9] “Version 0.9 of timeline on CRAN”
[10] “Date Formats in R”

In a next step, we obtain the dates the notes were created. In order to receive a variable of the date class, the variable 'create' must be formated. Using the stringr package, we extract year, month and day and save it into the same variable.

for (i in 1:nrow(mydata)){
  mydata$created[i] <- xmllist[[i]]['created']
}


mydata$created <- as.Date(paste0(stringr::str_sub(mydata$created, 1, 4), 
                                 '-', 
                                 stringr::str_sub(mydata$created, 5, 6), 
                                 '-',
                                 stringr::str_sub(mydata$created, 7, 8)))

head(mydata$created, 5)

[1] “2016-01-06” “2016-01-06” “2016-01-05” “2016-01-05” “2016-01-04”

Furthermore, the http addresses of the notes can be read like this:

for (i in 1:nrow(mydata)){
  mydata$www[i] <- xmllist[[i]]['note-attributes']
}

mydata$www <- unlist(qdapRegex::ex_url(mydata$www,
                        trim=TRUE,
                        clean=TRUE,
                        extract=TRUE))

mydata$www <- stringr::str_sub(mydata$www, 1, nchar(mydata$www)-2)

head(mydata$www)

[1] “https://rulesofreason.wordpress.com/2012/11/05/network-visualization-in-r-with-the-igraph-package/”
[2] “http://www.r-bloggers.com/more-debate-analysis-with-r/”
[3] “http://varianceexplained.org/r/love-actually-network/”
[4] “http://cpsievert.github.io/slides/web-scraping/#1”
[5] “http://blog.ryanwalker.us/2016/01/color-quantization-in-r.html”
[6] “http://rud.is/b/2016/01/03/zellingenach-a-visual-exploration-of-the-spatial-patterns-in-the-endings-of-german-town-and-village-names-in-r/”

Finally, we want to read the tags and save them into a variable. Since the number of tags differs between the notes, we have to assess the number of tags for each note:

# number of tags
for (i in 1:nrow(mydata)){
  mydata$num.tag[i] <- length(which(names(xmllist[[i]])=="tag"))
}

head(mydata$num.tag, 20)

[1] 2 2 3 2 2 3 2 5 2 3 3 2 2 3 3 2 2 3 3 3

Since we want to save each tag into a single variable, we need to know the maximum number of tags.

tag.num <- max(mydata$num.tag)
tag.num

[1] 5

With the next code snippet we add three variables to our dataframe: both the position of the first and last tag as numeric variables and a variable (of class list) containing the positions of all tags.

# position of first tag
for (i in 1:nrow(mydata)){
  mydata$pos.1[i] <- which(names(xmllist[[i]])=="tag")[1]
}
# position of last tag
mydata$pos.2 <- mydata$pos.1 + mydata$num.tag - 1
# position of tags
for (i in 1:nrow(mydata)){
  mydata$pos.all[i] <- list(c(mydata$pos.1[i]:mydata$pos.2[i]))
}
# remove pos.1 and pos.2
mydata$pos.1 <- NULL
mydata$pos.2 <- NULL

Since we don't need the variables pos.1 and pos.2 for further processing, we remove them from our dataframe.

In the next step, we create 5 empty variables that will later on contain the tag names.

# create 5 new columns
num.col <- ncol(mydata) 
for (i in (ncol(mydata) + 1):(ncol(mydata) + tag.num)){
  mydata[, i] <- NA
  colnames(mydata)[i] <- paste0('tag.', i - num.col)
}

The following code snipped intents to write the tag names into the variables tag.1 to tag.5.

for (j in (num.col + 1):ncol(mydata)){
  for (i in 1:nrow(mydata)){
    mydata[i, j]  <- xmllist[[i]][mydata$pos.all[[i]][j - num.col]][[1]]
  }}

However, evaluating the code returns the following error message:

Error in '[<-.data.frame'('*tmp*', i, j, value = NULL) : replacement has length zero

Has anybody got an idea how to get the preceding code snippet working? I'd appreciate every piece of advice.

Thus, I decided to write one loop for each of the five variables. This is definetely not best practice, but it works.

# 1st tag
for (i in 1:nrow(mydata)){
  mydata$tag.1[i]  <- xmllist[[i]][mydata$pos.all[[i]][1]][1]
}
# 2nd tag
for (i in 1:nrow(mydata)){
  mydata$tag.2[i]  <- xmllist[[i]][mydata$pos.all[[i]][2]][1]
}
# 3rd tag
for (i in 1:nrow(mydata)){
  mydata$tag.3[i]  <- xmllist[[i]][mydata$pos.all[[i]][3]][1]
}
# 4th tag
for (i in 1:nrow(mydata)){
  mydata$tag.4[i]  <- xmllist[[i]][mydata$pos.all[[i]][4]][1]
}
# 5th tag
for (i in 1:nrow(mydata)){
  mydata$tag.5[i]  <- xmllist[[i]][mydata$pos.all[[i]][5]][1]
}

In the following step, we define a function (source) replacing NULL by NA and apply this function to each of the five tag variables:

# define function
nullToNA <- function(x) {
  x[sapply(x, is.null)] <- NA
  return(x)
}

# apply function
for (i in (num.col+1):ncol(mydata)){
  for (j in 1:nrow(mydata)){
  mydata[j, i] <- nullToNA(mydata[j, i])
}}

Finally, we paste the values of the five tag variables into a single variable named tags. To do this, we use the paste2 function of the qdap package. Since we don't need the variables tag.1 to tag.5 for further processing, we remove them from the dataframe using the select function of the dplyr package.

mydata$tags <- qdap::paste2(mydata[(num.col+1):ncol(mydata)], 
                            sep = ", ", 
                            handle.na = TRUE, 
                            trim = TRUE)

mydata <- dplyr::select(mydata, -starts_with('tag.'))
mydata$pos.all <- NULL

The final dataframe consists of the following variables:

title containing the titles of the notes;
created containing the dates the notes were created;
www containing the notes' http addresses;
num.tag containing the number of tags for each note;
tags containing the tag names.

The following table gives an impression about how our final dataframe looks like.

knitr::kable(head(mydata), align = c('l', 'c', 'l', 'c', 'c'))

title	created	www	num.tag	tags
Network visualization in R with the igraph package \| Rules of Reason	2016-01-06	https://rulesofreason.wordpress.com/2012/11/05/network-visualization-in-r-with-the-igraph-package/	2	network analysis, R, NA, NA, NA
More debate analysis with R	2016-01-06	http://www.r-bloggers.com/more-debate-analysis-with-r/	2	text mining, R, NA, NA, NA
Analyzing networks of characters in 'Love Actually' – Variance Explained	2016-01-05	http://varianceexplained.org/r/love-actually-network/	3	network analysis, text mining, R, NA, NA
Web scraping in R	2016-01-05	http://cpsievert.github.io/slides/web-scraping/#1	2	webscraping, R, NA, NA, NA
Color Quantization in R	2016-01-04	http://blog.ryanwalker.us/2016/01/color-quantization-in-r.html	2	R, image processing, NA, NA, NA
Zellingenach: A visual exploration of the spatial patterns in the endings of German town and village names in R \| rud.is	2016-01-04	http://rud.is/b/2016/01/03/zellingenach-a-visual-exploration-of-the-spatial-patterns-in-the-endings-of-german-town-and-village-names-in-r/	3	text mining, geo, R, NA, NA

The packages used in this blog post can be loaded/installed using the following code:

pacman::p_load(XML, knitr, dplyr, qdap, stringr)

The xmllist object may be downloaded as an .RData file under the following link.

In one of my next blog posts, I will show how to analyse the tags.

RMarkdown: How to format tables and figures in .docx files

In research, we usually publish the most important findings in tables and figures. When writing research papers using Rmarkdown (*.Rmd), we have several options to format the output of the final MS Word document (.docx).
Tables can be formated using either the knitr package’s kable() function or several functions of the pander package.
Figure sizes can be determined in the chunk options, e.g.

{r name_of_chunk, fig.height=8, fig.width=12}.

However, options for customizing tables and figures are rather limited in Rmarkdown. Thus, I usually customize tables and figures in the final MS Word document.

In this blog post, I show how to quickly format tables and figures in the final MS Word document using a macro). MS Word macros are written in VBA (Visual Basic for Applications) and can be accessed from a menu list or from the toolbar and run by simply clicking. There are loads of tutorials explaining how to write a macro for MS Word, e.g http://www.addictivetips.com/microsoft-office/create-macros-in-word-2010/.

The following two macros are very helpful to format drafts. Since I want drafts to be as compact as possible, tables and figures should not to be too space consuming.

The first macro called FormatTables customizes the format of all tables of the active MS Word document. With wdTableFormatGrid2, we use a table style predefined in MS Word. A list of other table styles can be found under the follwing link. Furthermore, we define font name (Arial) and font size (8 pt), space before (6 pt) and after (10 pt) the table. Finally, the row height is set to 18 pt exactly.

Sub FormatTables()

 Dim tbl As Table
    For Each tbl In ActiveDocument.Tables
         tbl.AutoFormat wdTableFormatGrid2
         tbl.Range.Font.Name = "Arial"
         tbl.Range.Font.Size = 8
         tbl.Range.ParagraphFormat.SpaceBefore = 6
         tbl.Range.ParagraphFormat.SpaceAfter = 10
         tbl.Range.Cells.SetHeight RowHeight:=18, HeightRule:=wdRowHeightExactly

    Next

End Sub

The second macro called FormatFigures merely reduces the size of all figures in the active MS Word document to 45% of its original size.

Sub FormatFigures()

Dim shp As InlineShape


For Each shp In ActiveDocument.InlineShapes
    shp.ScaleHeight = 45
    shp.ScaleWidth = 45
Next

End Sub

Please also see my blog post RMarkdown: How to insert page breaks in a MS Word document.

How to install the ‘RWordPress’ package in R

The RWordPress package is a very convenient tool for publishing blog posts from R to WordPress. In his blog post Publish blog posts from R + knitr to WordPress, Yihui Xie explains how to install and use the package. Furthermore, the blog post How to publish with R Markdown in WordPress gives some additional information on how to use the package.

However, the package repository http://www.omegahat.org/R does not seem to exist anymore (2016-08-30).

Fortunatelly, the RWordPress package is also available from Github and, thus, can be easily installed using the devtools package.

Since RWordPress depends on the packages RCurl, XML, and XMLRPC, these packages need to be installed before we can actually install RWordPress.

Unlike RCurl and XML, the XMLRPC package is not available from the CRAN repository. Instead, it is available from Github.

Here is the code to install all required packages:

install.packages("devtools")
install.packages("RCurl")
install.packages("XML")
devtools:::install_github("duncantl/XMLRPC")
devtools:::install_github("duncantl/RWordPress")

How to print tables with absolute and relative values in R

Introduction

In R, there are several ways to generate tables. while the table() function generates tables with absolute numbers, the prop.table() function returns tables with relative values (percentages). However, I couldn't find a function to return a table with both absolute and relative values.

In this blog post, I show how to generate such a table.

Generate a random dataframe

In the first code snippet, we generate a random dataframe with two variables: Sex and Age. As you can see, generating random dataframes is very easy and straightforward with Tyler Rinker's Wakefield package.

library(wakefield)
df <- r_data_frame(n=200, sex, age)

Write the function

My function tab.func() combines three R functions:

describe() from the Hmisc package to return an object of class describe containing absolute and relative frequency values of a factor variable. To access these values, we need to subset this object using $values. This will return a matrix with the desired values.
t() to transpose this matrix, and
as.table() to transform this matrix into a table.

tab.func <- function (x) {
  y <- as.table(t(Hmisc::describe(x)$values))
  colnames(y) <- c('**n**', '**%**')
  return(y)
}

(UPDATE: In version 4 of the Hmisc package, the describe() function was rewritten. My function only works up to version 3.17.4)

The double asterisks around n and % are Markdown code used to return bold text.

Deploy the function

In the following code snippet, we deploy this function to a categorial variable (Sex) which is part of the dataframe df.

mytable <- tab.func(df$Sex)


knitr::kable(mytable,
             caption = 'Table with absolute numbers and percentages')

	n	%
A	Male, Female	96, 104

Finaly, we print this table with the kable() function from the knitr package.

RMarkdown: How to insert page breaks in a MS Word document

Introduction

RStudio offers the opportunity to build MS Word documents from RMarkdown files. However, since formatting options in Markdown are very limited, there is no ‘native’ Markdown code to insert page breaks in the final MS Word output file.

In this blogpost I explain, how to define page breaks in the RMarkdown document that will be kept in the final MS Word document (.docx). My post is based on Richard Layton’s article Happy collaboration with Rmd to docx which explains how to create a MS Word .docx template in order to modify the document design of a MS Word file created from a .Rmd-file in RStudio.

The MS Word template

In the first step, we create a MS Word template called ‘mystyles.docx’ (How to…). This file must be saved in the same directory as the R Markdown file. For the following modifications we have to open this file with MS Word.

Modify style ‘Heading 5’

In the next step, we modify a predefined style. However, after modifying a predefined style, we cannot use it anymore in the originally intented way. Thus, we must choose a style hardly needed for any other purpose. In this blogpost, we use the Heading 5 style.

To modify this style, we select the ‘Home‘ ribbon tab and click the Styles window launcher in the Styles group (lower right corner, highlighted with red circle).

We select ‘Heading 5’ in the Word document. In the Styles window, we scroll down until we find the style already assigned to the text we selected. In our case, the assigned style is ‘Heading 5’. (In the figure it says ‘Heading 3’. However, we actually mean ‘Heading 5’)

The following modifications must be made in the Modify Style menu:

Set the font color to ‘white’ (rather than ‘Automatic’).
Select the smallest font size (8 rather than 11).
Select ‘Page break before’ in the ‘Line and Page Breaks’ tab.

Set the line spacing to ‘Exactly’ and ‘1 pt’ in the ‘Indents and Spacing’ tab.

After these tweaks, the ‘Heading 5’ style will no longer format a heading of level 5. Instead it will insert a very small and white (and, thus, invisible) line followed by a page break.

The RMarkdown document

In the RMarkdown document, a few specifications must be made.

The YAML header

RMarkdown documents contain a metadata section called YAML header. In this header, we specify the output format (word_document) and the name of the MS Word template (mystyles.docx).

---
title: 'Title'
date: "`r format(Sys.time(), '%d&period; %B %Y')`"
output: 
    word_document:
      reference_docx: mystyles.docx
---

The Markdown code ##### being originally reserved to format header 5 will be used to insert page breaks in the final .docx document. Since we modified the font color to ‘white’ in the MS Word template, the specification after the Markdown code (Page Break) will not appear in the final document.

The following example shows how to insert a page break between two paragraphs.

Example: Markdown code to insert a page break

Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break. Text before page break.

##### Page Break

Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break. Text after page break.

Download

My MS Word template may be downloaded here.

PS

Since I don’t have an English version of MS Word, I could not make the screenshots myself. Instead, I have used internet links. Please click on the pictures to get to the web pages.

Please also see my blog post RMarkdown: How to format tables and figures in .docx files.

Creating a list with named objects

Reusing objects

Saving plots in lists

Statistical models, data.frames etc.

Putting text and R code together

Share this:

Intro

Defining variable labels

Assigning labels to variables

Share this:

Intro

Packages required

Data

Import

Creating data frames

Data wrangling

Joining the data frames

Results

Share this:

The Problem

The Solution

Share this:

Introduction

Packages required

Table numbering

Table referencing

And what about figures?

Share this:

Exporting the data from Evernote

Importing the data into R

Building a data frame

Share this:

Share this:

Share this:

Introduction

Generate a random dataframe

Write the function

Deploy the function

Share this:

Introduction

The MS Word template

Modify style ‘Heading 5’

The RMarkdown document

The YAML header

Example: Markdown code to insert a page break

Download

PS

Share this:

Creating a `list` with named objects

Putting text and `R` code together