TRinker's R Blog

Splitting and Combining R pdf Graphics

Posted on October 8, 2012 by tylerrinker

A question that often comes up on various help lists is how to combine or split the output from an R graphics device. Maybe you have looped/combined multiple visuals into a single pdf to avoid cluttering your working directory and now you want to pull various pages out. Or maybe you have several different pdfs of various sizes you’d like to combine into a single multi-page file (example: click here). This post uses 2 short videos to demonstrate combining and splitting R-produced pdfs.

This post serves two purposes:

To show Windows users how to combine and split pdfs (sorry, this works only for Windows users)
To challenge R bloggers who use other operating systems to perform the same combine and split tasks

First you’ll need to download PDF24 Editor (a free program)

Click Here

Combining Multiple R pdf Graphics in a Single File

Splitting R pdf Pages into Separate Files

Pretty easy. Now I challenge R bloggers who use Mac and Linux to provide the same “FREE” functionality for their platforms. Ideally, someone has an approach that spans multiple platforms.

If you have an alternate method for any operating system please provide a link to your blog in the comments below.
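One partial, cross-platform angle on the challenge is sketched below. It only helps if you still have the code that drew the plots, because it relies on base R's pdf() device doing the combining or splitting when the graphics are created; it does nothing for pdfs that already exist. The directory and file names are made up for illustration.

```r
# Sketch only: out_dir and the file names are illustrative assumptions.
out_dir <- file.path(tempdir(), "pdf_demo")
dir.create(out_dir, showWarnings = FALSE)

# Combine: with onefile = TRUE each new plot becomes a page of one file.
combined <- file.path(out_dir, "combined.pdf")
pdf(combined, onefile = TRUE)
plot(1:10, main = "page 1")
hist(c(1, 2, 2, 3, 3, 3), main = "page 2")
invisible(dev.off())

# Split: onefile = FALSE plus a %03d pattern writes one file per page.
pdf(file.path(out_dir, "page%03d.pdf"), onefile = FALSE)
plot(1:10, main = "page 1")
hist(c(1, 2, 2, 3, 3, 3), main = "page 2")
invisible(dev.off())

split_files <- list.files(out_dir, pattern = "^page[0-9]{3}\\.pdf$")
split_files
```

The same plotting code is run twice here only to show both modes side by side; in practice you would pick one.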

Posted in visualization | Tagged combine, pdf, pdf24 editor, png, R, split, visual, visualization | 10 Comments

Presidential Debates with qdap-beta

Posted on October 4, 2012 by tylerrinker

qdap brief intro
For the past year I’ve been working on a package (qdap) to assist my field in quantitative discourse analysis; basically looking at patterns in language. It’s still a ways from being finished and lacks documentation (roxygen2 is my friend), but after seeing the presidential debates yesterday I decided to try using some of the package’s functions on a transcript of the dialogue.

Getting qdap to work may take some finagling because the package relies on the openNLP package. You have to make sure you have the correct version of java installed. I know the package is able to be installed on all three major OS. You’ll also notice quickly that the tm, ggplot2, and wordcloud packages are relied upon as well.

Note: I display the graphics here with .png files but recommend .pdf or .svg as the image is much clearer. For a combined pdf version of the graphics in this post click here.

Getting and cleaning transcripts of the debate

    library(qdap)
    url_dl("pres.deb1.docx")   #downloads a docx file of the debate to wd

    # the read.transcript function allows reading in of docx file
    # special thanks to Bryan Goodrich for his work on this
    dat <- read.transcript("pres.deb1.docx", col.names=c("person", "dialogue"))
    truncdf(dat)
    left.just(dat)

    # qprep wrapper for several lower level qdap functions
    # removes brackets & dashes; replaces numbers, symbols & abbreviations
    dat$dialogue <- qprep(dat$dialogue)

    # sentSplit splits turns of talk into sentences
    # special thanks to Dason Kurkiewicz for his work on this
    dat2 <- sentSplit(dat, "dialogue", stem.col=FALSE)
    htruncdf(dat2)   #view a truncated version of the data(see also truncdf)

Wordclouds (relies on Ian Fellows’ wordcloud package)

    #first put a unique character between words we want to keep together
    dat2$dia2 <- space_fill(dat2$dialogue, c("Governor Romney", "President Obama",
        "middle class", "The President", "Mister President"))

    #Generate target words to color by
    tw <- list(
        health = c("health", "insurance", "medic", "obamacare", "hospital"),
        economic = c("econom", "jobs", "unemploy", "business", "banks", "budget",
            "market", "paycheck"),
        foreign = c("war ", "terror", "foreign"),
        class = c("middle~~class", "poor", "rich"),
        opponent = c("romney ", "obama", "the~~president", "mister~~president")
    )

    #create stop word list from qdap data set Top25Words but exclude he and I
    sw <- exclude(Top25Words, "he", "I")

    #the word cloud by grouping variable function
    with(dat2, trans.cloud(dia2, person, proportional = TRUE, target.words = tw,
        cloud.colors = c("red", "blue", "black", "orange", "purple", "gray45"),
        legend = names(tw), stopwords=sw, max.word.size = 4, char2space = "~~"))

Visuals of the trans.cloud function

Gantt Plot of the dialogue over time
Obviously (when you see the output), this uses Hadley Wickham’s ggplot2.

    # special thanks to Andrie de Vries for his work on this function
    with(dat2, gantt_plot(dialogue, person, xlab = "duration(words)", x.tick=TRUE,
        minor.line.freq = NULL, major.line.freq = NULL, rm.horiz.lines = FALSE))

Visualization of the Gantt Plot

Formality scores (how formal a person’s language is)
This concept comes from:

Heylighen, F., & Dewaele, J.-M. (2002). Variation in the contextuality of language: An empirical measure. Foundations of Science, 7(3), 293–340. doi:10.1023/A:1019661126744

The code can be run in parallel because this is a slower function. It uses openNLP to first map parts of speech for every word.
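For readers who want to see what the score is made of: the cited paper builds it from part-of-speech percentages as F = (noun + adjective + preposition + article - pronoun - verb - adverb - interjection + 100) / 2. The sketch below only applies that formula to made-up counts. The function name formality_score and the toy counts are hypothetical, and qdap's formality function may map tags or handle details differently.

```r
# Hypothetical helper: F-score from a named vector of part-of-speech counts.
formality_score <- function(counts) {
  formal <- c("noun", "adjective", "preposition", "article")
  contextual <- c("pronoun", "verb", "adverb", "interjection")
  pct <- 100 * counts / sum(counts)   # percentages of all words
  unname((sum(pct[formal]) - sum(pct[contextual]) + 100) / 2)
}

# Made-up counts for 100 words ("other" covers everything not in the formula)
toy <- c(noun = 27, adjective = 9, preposition = 14, article = 10,
         pronoun = 9, verb = 16, adverb = 4, interjection = 1, other = 10)
formality_score(toy)
```

With these counts the formal categories make up 60% of the words and the contextual ones 30%, so the score is (60 - 30 + 100) / 2.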

    #parallel about 1:20 on 8 GB ram 8 core i7 machine
    v1 <- with(dat2, formality(dialogue, person, parallel=TRUE))
    plot(v1)

    #about 4 minutes on 8GB ram i7 machine
    v2 <- with(dat2, formality(dialogue, person))
    plot(v2)

    # note you can resupply the output from formality back
    # to formality and change arguments. This avoids the need for
    # openNLP, saving time.
    v3 <- with(dat2, formality(v1, person))
    plot(v3, bar.colors=c("Dark2"))

Output and plot from the formality function

      person word.count formality
    1 ROMNEY       4068     61.82
    2 LEHRER        765     61.31
    3  OBAMA       3595     58.30

Afterthought: I was remiss to mention that the word clouds are proportional (argument proportional = TRUE) for all words spoken rather than frequency per person. This enables comparison across clouds.

Posted in ggplot2, qdap, word cloud | Tagged discourse analysis, formality, gantt, gantt plot, qdap, quantitative discourse analysis, R, transcript, transcript analysis, word cloud, wordcloud | 33 Comments

Add Text Annotations to ggplot2 Faceted Plot (an easier approach)

Posted on September 7, 2012 by tylerrinker

I recently posted a blog about adding text to a ggplot2 faceted plot (LINK).

I was unhappy with how long it takes to create the text data frame used to label the plot. Then yesterday, when version 0.9.2 of ggplot2 was announced, I got to reading about how ggplot2 objects are stored and decided that I could extract a great deal of the information for plotting the text directly from the ggplot2 object.

After I did it I decided to wrap the function up into a package that I can add more ggplot2 extension functions to in the future.

Optionally Download the Package:

    library(devtools)
    install_github("trinker/acc.ggplot2")   #older devtools: install_github("acc.ggplot2", "trinker")
    library(acc.ggplot2)

Here’s the Function Code and a Few Examples:

    library(ggplot2)

    qfacet_text <- function(ggplot2.object, x.coord = NULL, y.coord = NULL,
        labels = NULL, ...) {
        require(ggplot2)
        #pull the data and the facet variables out of the ggplot2 object
        dat <- ggplot2.object$data
        rows <- ggplot2.object$facet[[1]][[1]]
        cols <- ggplot2.object$facet[[2]][[1]]
        fcol <- dat[, as.character(cols)]
        frow <- dat[, as.character(rows)]
        len <- length(levels(factor(fcol))) * length(levels(factor(frow)))
        vars <- data.frame(expand.grid(levels(factor(frow)), levels(factor(fcol))))
        colnames(vars) <- c(as.character(rows), as.character(cols))
        if (any(class(ggplot2.object) %in% c("ggplot", "gg"))) {
            if (is.null(labels)) {
                labels <- LETTERS[1:len]
            }
            if (length(x.coord) == 1) {
                x.coord <- rep(x.coord, len)
            }
            if (length(y.coord) == 1) {
                y.coord <- rep(y.coord, len)
            }
            text.df <- data.frame(x = x.coord, y = y.coord, vars, labs=labels)
        } else {
            if (class(ggplot2.object) == "qfacet") {
                text.df <- ggplot2.object$dat
                #the plotted columns are x and y, so those are the ones to update
                if (!is.null(x.coord)) {
                    text.df$x <- x.coord
                }
                if (!is.null(y.coord)) {
                    text.df$y <- y.coord
                }
                if (!is.null(labels)) {
                    text.df$labs <- labels
                }
                ggplot2.object <- ggplot2.object$original
            }
        }
        p <- ggplot2.object + geom_text(aes(x, y, label=labs, group=NULL),
            data=text.df, ...)
        print(p)
        v <- list(original = ggplot2.object, new = p, dat = text.df)
        class(v) <- "qfacet"
        invisible(v)
    }

Examples (using the same basic examples as my previous blog post):

    #alter mtcars to make some variables factors
    mtcars[, c("cyl", "am", "gear")] <- lapply(mtcars[, c("cyl", "am", "gear")], as.factor)

    p <- ggplot(mtcars, aes(mpg, wt, group = cyl)) +
        geom_line(aes(color=cyl)) +
        geom_point(aes(shape=cyl)) +
        facet_grid(gear ~ am) +
        theme_bw()

    z <- qfacet_text(ggplot2.object = p, x.coord = 33, y.coord = 2.2, labels = 1:6,
        color="red")
    str(z); names(z)   #look at what's returned

    #approach 1 (alter the text data frame and pass the qfacet object)
    z$dat[5, 1:2] <- c(15, 5)
    qfacet_text(z, color="red")

    #approach 2 (alter the original ggplot object)
    qfacet_text(p, x.coord = c(33, 33, 33, 33, 15, 33),
        y.coord = c(2.2, 2.2, 2.2, 2.2, 5, 2.2), 1:6, color="red")

    #all the same things you can pass to geom_text qfacet_text takes
    qfacet_text(z, labels = paste("beta ==", 1:6), size = 3, color = "grey50",
        parse = TRUE)

Notice at the end that you can pass qfacet_text either a ggplot object or an object returned by qfacet_text. The qfacet_text function invisibly returns a list with the original ggplot2 object, the new ggplot2 object and the text data frame. This enables the user to alter the coordinates in the data frame and pass the qfacet object back to qfacet_text, thus altering the text position. There’s actual documentation for this package and function, so ?qfacet_text should get you a help file with the same example.
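The "return a classed list invisibly, then accept it back" pattern can be tried without any plotting. The toy function below is hypothetical (relabel is not part of acc.ggplot2); it is only a sketch of the same idea using base R.

```r
# Hypothetical toy function: same return pattern, no plotting involved.
relabel <- function(obj, labels = NULL) {
  if (inherits(obj, "relabel")) {
    # A previous result was passed back in: reuse its stored pieces.
    dat <- obj$dat
    original <- obj$original
  } else {
    original <- obj
    dat <- data.frame(value = obj, labs = LETTERS[seq_along(obj)],
                      stringsAsFactors = FALSE)
  }
  if (!is.null(labels)) dat$labs <- labels
  out <- list(original = original, dat = dat)
  class(out) <- "relabel"
  invisible(out)   # nothing prints unless the caller assigns or prints it
}

z <- relabel(c(10, 20, 30))
z$dat$labs[2] <- "moved"   # tweak the stored data frame
z2 <- relabel(z)           # pass the object back in
z2$dat$labs
```

The class attribute is what lets the function tell a fresh input from one of its own earlier results.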

PS this gave me a chance to actually run roxygen2 for the first time to create documentation. Also a pretty slick Hadley Wickham package.

The Plot:

Posted in annotate, ggplot2, text | Tagged add text, annotate, facet, faceted, facetted, ggplot, ggplot2, R, text | 8 Comments

Add Text Annotations to ggplot2 Faceted Plot

Posted on September 1, 2012 by tylerrinker

In my experience with R learners there are two basic types. The “show me the code and what it does and let me play” type and the “please give me step by step directions” type. I’ve broken the following tutorial on plotting text on faceted ggplot2 plots into 2 sections:

The Complete Code and Final Outcome

A Bit of Explanation

Hopefully, whatever learner you are you’ll be plotting text on faceted graphics in no time.

Section 1: The Complete Code and Final Outcome

    library(ggplot2)

    mtcars[, c("cyl", "am", "gear")] <- lapply(mtcars[, c("cyl", "am", "gear")], as.factor)

    p <- ggplot(mtcars, aes(mpg, wt, group = cyl)) +
        geom_line(aes(color=cyl)) +
        geom_point(aes(shape=cyl)) +
        facet_grid(gear ~ am) +
        theme_bw()
    p

    len <- length(levels(mtcars$gear)) * length(levels(mtcars$am))

    vars <- data.frame(expand.grid(levels(mtcars$gear), levels(mtcars$am)))
    colnames(vars) <- c("gear", "am")
    dat <- data.frame(x = rep(15, len), y = rep(5, len), vars, labs=LETTERS[1:len])

    p + geom_text(aes(x, y, label=labs, group=NULL),data=dat)

    dat[1, 1:2] <- c(30, 2)   #to change specific locations

    p + geom_text(aes(x, y, label=labs, group=NULL), data=dat)

    p + geom_text(aes(x, y, label=paste("beta ==", labs), group=NULL), size = 4,
        color = "grey50", data=dat, parse = T)

Section 2: A Bit of Explanation
The following portion of the tutorial provides a bit more of a step by step procedure for plotting text to faceted plots as well as a visual to go with the code.

The initial non annotated plot
First, let’s make a faceted line plot with the mtcars data set. I reclassed a few variables to make factors.

    library(ggplot2)

    mtcars[, c("cyl", "am", "gear")] <- lapply(mtcars[, c("cyl", "am", "gear")], as.factor)

    p <- ggplot(mtcars, aes(mpg, wt, group = cyl)) +
        geom_line(aes(color=cyl)) +
        geom_point(aes(shape=cyl)) +
        facet_grid(gear ~ am) +
        theme_bw()
    p

Add text to each facet
The key here is a new data frame with three pieces of information (ggplot2 seems to like information given in a data frame).

Coordinates to plot the text

The faceted variable levels

The labels to be supplied

The first information piece is the coordinates (two columns x and y) to plot the text in each facet. Generally I find that one set of coordinates will work in most of the facet boxes and I just use rep to make these coordinates (I suppose the recycling rule could be used if you added it to an already existing data frame).

The second information piece is the faceted variable labels (in our case gear ~ am). There’re many ways to achieve this but I like a combination of levels and expand.grid. I renamed these columns to be exactly the same as the variable names (gear & am) I used in the original data frame (mtcars in this case).

Lastly, you must make the labels. I chose letters so you can track what piece of the data frame is plotted in which facet.
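To see how these three pieces fit together without drawing anything, here is a small base R sketch. The level vectors are typed in by hand (assuming gear and am have been made factors as above), and it uses the recycling rule instead of rep for the coordinates.

```r
gear_levels <- c("3", "4", "5")   # levels of mtcars$gear as a factor
am_levels <- c("0", "1")          # levels of mtcars$am as a factor

# expand.grid varies its first argument fastest, which fixes the row order
vars <- expand.grid(gear = gear_levels, am = am_levels)

# A length-one value is recycled to every row, so rep() is optional here
vars$x <- 15
vars$y <- 5
vars$labs <- LETTERS[seq_len(nrow(vars))]
vars
```

Because the first variable varies fastest, labels A to C land in the three gear rows of the first am column and D to F in the second.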

Your data should look something like this:

       x y gear am labs
    1 30 2    3  0    A
    2 15 5    4  0    B
    3 15 5    5  0    C
    4 15 5    3  1    D
    5 15 5    4  1    E
    6 15 5    5  1    F

Note that the group=NULL is essential to let ggplot2 know you’re dealing with a new data set and the mapping from before can be forgotten (or at least this is how I understand it).

    #long cut way to find number of facets
    len <- length(levels(mtcars$gear)) * length(levels(mtcars$am))

    vars <- data.frame(expand.grid(levels(mtcars$gear), levels(mtcars$am)))
    colnames(vars) <- c("gear", "am")
    dat <- data.frame(x = rep(15, len), y = rep(5, len), vars, labs=LETTERS[1:len])

    p + geom_text(aes(x, y, label=labs, group=NULL),data=dat)

Moving just one text location
I can usually find one spot that works for almost every text label, except for that one doggone facet that just won’t match up with the other coordinates. In this case label A is that pesky label. The key here is to figure out which text labels you want to move and alter those coordinates appropriately.

    dat[1, 1:2] <- c(30, 2)   #to change specific locations

    p + geom_text(aes(x, y, label=labs, group=NULL), data=dat)

Adding equation (Greek letters/math) and alter size/color
To annotate with math code use the parse = T argument in geom_text. For more on plotting math code see this ggplot wiki and this SO question. To alter the size just throw a size argument in geom_text. I also toned down the color of the text a bit to allow the line to pop the most visually.
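One thing worth checking before plotting is that every label is a valid R expression, since parsed labels go through R's parser. The helper below is a hypothetical sketch using base R only; it assumes the label text is parsed the same way parse(text = ...) would parse it.

```r
labs <- LETTERS[1:3]
math_labels <- paste("beta ==", labs)

# Hypothetical helper: TRUE if the text parses as an R expression.
parses <- function(txt) {
  !inherits(try(parse(text = txt), silent = TRUE), "try-error")
}

sapply(math_labels, parses)   # complete expressions such as "beta == A"
parses("beta ==")             # an incomplete expression does not parse
```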

    p + geom_text(aes(x, y, label=paste("beta ==", labs), group=NULL), size = 4,
        color = "grey50", data=dat, parse = T)

If you have suggestions for improvement, links, or other thoughts please leave a comment.

Posted in annotate, ggplot2 | Tagged annotate, facet, faceted, ggplot, ggplot2 geom_text, R, text | 16 Comments

Parallelization: Speed up Functions in a Package

Posted on August 19, 2012 by tylerrinker

Well, I bought a new computer a month back (i7, 8GB memory). Finally more than one core and a chance to try parallelization. I saw this blog post a while back and was intrigued, and was further intrigued when I saw that plyr/reshape2 has some parallelization capabilities (LINK). Let me say up front that this is my first experience, so there may be better ways, but it sped up my code by over four times.

Let me warn you now: when I first read A No BS Guide to the Basics of Parallelization in R, I tried to see how many cores I had on my computer (this shows my ignorance, which may be of comfort to some of you; others will stop reading this blog post immediately). 1 is the loneliest number, especially if you’re attempting to run on multiple cores.

Suggestion: if you type detectCores() and see 1, you can’t run code in parallel, at least not by running it on different cores of your machine.
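A minimal sketch of turning that check into something a script can act on. The "leave one core free" rule and the variable names are assumptions for illustration, not a recommendation from the parallel package.

```r
library(parallel)

n_cores <- detectCores()   # may be NA if the count cannot be determined

# Hypothetical rule: leave one core free, never go below one worker.
workers <- if (is.na(n_cores)) 1L else max(1L, n_cores - 1L)
use_parallel <- workers > 1L
c(cores = n_cores, workers = workers)
```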

Background (skip this if you are short on time)
I’m working on a package (qdap) and have a function (pos) that takes a long time to run. It is basically finding parts of speech by sentence (each sentence is a cell and there are thousands of them). I rely on openNLP for the pos tagging but the whole process is time consuming. I figured perfect time to try this parallelization out.

I skimmed the Task View for parallel computing, knew I was out of my league, and decided to just focus on my problem, not the whole parallelization concept. Back to wrathematics’ blog post, where I discovered my silly Windows machine was not compatible with mclapply but saw hope with clusterApply(). Using ?clusterApply I saw that parLapply is described as a parallel version of lapply. I like lapply and decided that was what I’d go with.

Working with parallel coding in functions (skip to here)
These are the three major problems/differences I encountered with parLapply over lapply inside a function:

You need to pass/export the functions and variables you’ll be needing in the parLapply using makeCluster & clusterExport. See Andy Garcia’s helpful response to my question about this (LINK)

You have to specify the envir argument of clusterExport as envir=environment(). See GSee’s helpful response to my question about this (LINK)

You have to explicitly stop the cluster when you’re finished using it, much like closing a connection you opened. You stop the cluster using the stopCluster function (see the stopCluster(cl) call near the end of par.test in the code below).

EDIT: Martin Morgan of stackoverflow.com gives a solution that addresses both the first and second problems. He suggests passing all objects directly to parLapply (LINK).
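A rough sketch of that "pass everything directly" idea, under the assumption that the worker function only needs what it is handed as arguments. The names worker and par.test2 are hypothetical, and on.exit is swapped in here as one way to make sure the cluster is stopped.

```r
library(parallel)

pos <- function(txt) {
  paste(sapply(strsplit(tolower(txt), " "), nchar), collapse = " | ")
}

# Everything the worker needs arrives as an argument.
worker <- function(i, text.var, pos) pos(text.var[i])

par.test2 <- function(text.var, n.workers = 2) {
  cl <- makeCluster(n.workers)
  on.exit(stopCluster(cl))   # the cluster is stopped even if an error occurs
  # Arguments after the function are shipped to the workers with the call,
  # so clusterExport() and its envir argument are not needed here.
  parLapply(cl, seq_along(text.var), worker, text.var = text.var, pos = pos)
}

res <- par.test2(rep("I wish I ran in parallel.", 4))
res[[1]]
```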

Below is an example of taking a non parallel function and making it run in parallel:

    library(parallel)
    detectCores()   #make sure you have > 1 core

    nonpar.test <- function(text.var, gc.rate=10){
        ntv <- length(text.var)
        require(parallel)
        pos <- function(i) {
            paste(sapply(strsplit(tolower(i), " "), nchar), collapse=" | ")
        }
        x <- lapply(seq_len(ntv), function(i) {
                x <- pos(text.var[i])
                if (i%%gc.rate==0) gc()
                return(x)
            }
        )
        return(x)
    }

    nonpar.test(rep("I wish I ran in parallel.", 20))

    par.test <- function(text.var, gc.rate=10){
        ntv <- length(text.var)
        require(parallel)
        pos <- function(i) {
            paste(sapply(strsplit(tolower(i), " "), nchar), collapse=" | ")
        }
    #======================================
        cl <- makeCluster(mc <- getOption("cl.cores", 4))
        clusterExport(cl=cl, varlist=c("text.var", "ntv", "gc.rate", "pos"),
            envir=environment())
        x <- parLapply(cl, seq_len(ntv), function(i) {
    #======================================
                x <- pos(text.var[i])
                if (i%%gc.rate==0) gc()
                return(x)
            }
        )
        stopCluster(cl)   #stop the cluster
        return(x)
    }

    par.test(rep("I wish I ran in parallel.", 20))

Notice that only the lines between the #==== markers, plus the stopCluster(cl) call, change. Once you get it down, working with parLapply is pretty easy.

Note:
It doesn’t always make sense to run in parallel, as it takes time to make the cluster. In pos I added parallel as an argument because for smaller text vectors running in parallel doesn’t make sense (it’s slower).
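A minimal sketch of what such a switch might look like. count.chars is a hypothetical stand-in, not the pos function from qdap, and nchar stands in for the slow tagging step.

```r
library(parallel)

# Hypothetical function with a parallel switch.
count.chars <- function(text.var, parallel = FALSE, n.workers = 2) {
  if (!parallel) {
    return(lapply(text.var, nchar))   # small jobs: skip the cluster start-up cost
  }
  cl <- makeCluster(n.workers)
  on.exit(stopCluster(cl))
  parLapply(cl, text.var, nchar)
}

x <- rep("I wish I ran in parallel.", 20)
same <- identical(count.chars(x), count.chars(x, parallel = TRUE))
same
```

Both paths return the same list, so the caller can flip the switch based on the size of the job.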

Wonderings and future direction:
The pos function I have in qdap uses a progress bar. So far I haven’t been able to make a progress bar work with parLapply, but it’s less of a need because the parallel version is so much faster.

Benchmarking (1 run)

    > system.time(pos(rajSPLIT$dialogue, parallel=T))
       user  system elapsed
       2.35    0.08  199.53
    > system.time(pos(rajSPLIT$dialogue, progress.bar =F))
       user  system elapsed
     816.61   16.74  833.47

This is benchmarked using the rajSPLIT$dialogue which is the text from Romeo and Juliet, a data set in qdap. This consists of 2151 rows or 23,943 words.

Hopefully this blog post is useful to those learning some parallelization. Check out the Task View, the documentation for the parallel package and the vignette for the parallel package.

If you have suggestions for improvement, links, or help on getting a progress bar with parLapply please leave a comment.

Posted in parallel | Tagged faster function, parallel, parallel computing, parallelization, R, speed up function | 9 Comments

Hangman in R: A learning experience

Posted on July 29, 2012 by tylerrinker

I love when people take a sophisticated tool and use it to play video games. Take R for example. I first saw someone create a game for R at talk.stats.com. My friend Dason inspired me to more efficiently waste time in R with his version of minesweeper. The other day I had an immense amount of work to do and decided it was the perfect time to make a hangman game.

Now some of the skills to create hangman were outside my typical uses and skills for R. It caused me to stretch and grow a bit. The purpose of this post is two fold:

To share the hangman game with people who have nothing better to do than waste time on a childhood game

To share the learning experiences I had in creating the game

First the hangman game

I have the code for the function posted here, and I have also saved the code and data set (word list) for the function at github. You can get the package that contains the hangman game and data set either by downloading the zip ball or tar ball, decompressing it and running R CMD INSTALL on it, or by using the devtools package to install the development version:

    # install.packages("devtools")
    library(devtools)
    install_github("trinker/hangman")   #older devtools: install_github("hangman", "trinker")

To play type hangman() into the console and hit enter.

Here’s a screenshot of the game

Now for the learning

Here’s the code for the hangman function:

    hangman <- function(reset.score = FALSE) {
        opar <- par()$mar
        on.exit(par(mar = opar))
        par(mar = rep(0, 4))
        x1 <- DICTIONARY[sample(1:nrow(DICTIONARY), 1), 1]
        x <- unlist(strsplit(x1, NULL))
        len <- length(x)
        x2 <- rep("_", len)
        chance <- 0
        if(!exists("wins", mode="numeric", envir = .GlobalEnv) | reset.score){
            assign("wins", 0, envir = .GlobalEnv)
        }
        if(!exists("losses", mode="numeric", envir = .GlobalEnv) | reset.score){
            assign("losses", 0, envir = .GlobalEnv)
        }
        win1 <- 0
        win <- win1/len
        wrong <- character()
        right <- character()
        print(x2, quote = FALSE)
        circle <- function(x, y, radius, units=c("cm", "in"), segments=100,
            lwd = NULL){
            units <- match.arg(units)
            if (units == "cm") radius <- radius/2.54
            plot.size <- par("pin")
            plot.units <- par("usr")
            units.x <- plot.units[2] - plot.units[1]
            units.y <- plot.units[4] - plot.units[3]
            ratio <- (units.x/plot.size[1])/(units.y/plot.size[2])
            size <- radius*units.x/plot.size[1]
            angles <- (0:segments)*2*pi/segments
            unit.circle <- cbind(cos(angles), sin(angles))
            shape <- matrix(c(1, 0, 0, 1/(ratio^2)), 2, 2)
            ellipse <- t(c(x, y) + size*t(unit.circle %*% chol(shape)))
            lines(ellipse, lwd = lwd)
        } #taken from John Fox: http://tolstoy.newcastle.edu.au/R/help/06/04/25821.html
        hang.plot <- function(){ #plotting function
            plot.new()
            parts <- seq_len(length(wrong))
            if (identical(wrong, character(0))) {
                parts <- 0
            }
            text(.5, .9, "HANGMAN", col = "blue", cex=2)
            if (!6 %in% parts) {
                text(.5, .1, paste(x2, collapse = " "), cex=1.5)
            }
            text(.05, .86, "wrong", cex=1.5, col = "red")
            text(.94, .86,"correct", cex=1.5, col = "red")
            text(.05, .83, paste(wrong, collapse = "\n"), offset=.3, cex=1.5,
                adj=c(0,1))
            text(.94, .83, paste(right, collapse = "\n"), offset=.3, cex=1.5,
                adj=c(0,1))
            segments(.365, .77, .365, .83, lwd=2)
            segments(.365, .83, .625, .83, lwd=2)
            segments(.625, .83, .625, .25, lwd=2)
            segments(.58, .25, .675, .25, lwd=2)
            if (1 %in% parts) {
                circle(.365, .73, .7, lwd=4)
                if (!6 %in% parts) {
                    text(.365, .745, "o o", cex=1)
                }
                if (!5 %in% parts) {
                    text(.365, .71, "__", cex = 1)
                }
                text(.36, .73, "<", cex=1)
            }
            if (2 %in% parts) {
                segments(.365, .685, .365, .4245, lwd=7)
            }
            if (3 %in% parts) {
                segments(.365, .57, .45, .63, lwd=7)
            }
            if (4 %in% parts) {
                segments(.365, .57, .29, .63, lwd=7)
            }
            if (5 %in% parts) {
                segments(.365, .426, .43, .3, lwd=7)
                text(.365, .71, "O", cex = 1.25, col = "red")
            }
            if (6 %in% parts) {
                segments(.365, .426, .31, .3, lwd = 7)
                text(.365, .745, "x x", cex=1)
                text(.5, .5, "You Lose", cex=8, col = "darkgreen")
                text(.5, .1, paste(x, collapse = " "), cex=1.5)
            }
            if (win1 == len) {
                text(.5, .5, "WINNER!", cex=8, col = "green")
                text(.505, .505, "WINNER!", cex=8, col = "darkgreen")
            }
        } #end of hang.plot
        guess <- function(){ #start of guess function
            cat("\n","Choose a letter:","\n")
            y <- scan(n=1,what = character(0),quiet=T)
            if (y %in% c(right, wrong)) {
                stop(paste0("You've already guessed ", y))
            }
            if (!y %in% letters) {
                stop(paste0(y, " is not a letter"))
            }
            if (y %in% x) {
                right <<- c(right, y)
                win1 <<- sum(win1, sum(x %in% y))
                win <<- win1/len
                message(paste0("Correct!","\n"))
            } else {
                wrong <<- c(wrong, y)
                chance <<- length(wrong)
                message(paste0("The word does not contain ", y, "\n"))
            }
            x2[x %in% right] <<- x[x %in% right]
            print(x2, quote = FALSE)
            hang.plot()
        } #end of guess function
        hang.plot()
        while(all(win1 != len & chance < 6)){
            try(guess())
        }
        if (win == 1) {
            outcome <- "\nCongratulations! You Win!\n"
            assign("wins", wins + 1, envir = .GlobalEnv)
        } else {
            outcome <- paste("\nSorry. You lose. The word is:", x1, "\n")
            assign("losses", losses + 1, envir = .GlobalEnv)
        }
        cat(outcome)
        cat(paste0("\nwins: ", wins, " | losses: ", losses, "\n"))
        text(.5, .2, paste0("wins: ", wins, " | losses: ", losses), cex = 3,
            col = "violetred")
    }

Things I tried and learned:

Translating simple game rules into systematic logic

try

plotting dynamically (text vs. mtext)

while loop

assign

I used try one other time in a web scraping function. If you don’t know anything about this function it allows you to try to do something and if an error occurs move onto the next step. This allows the game user to input wrong information yet the function doesn’t stop but instead recovers and prints a message.
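Here is a small sketch of that recover-and-continue behavior, with the interactive input replaced by a scripted vector so it can be run as is. check.guess is a hypothetical stand-in for the validation done inside the game's guess function.

```r
# Hypothetical validator: errors on anything that is not a lower case letter.
check.guess <- function(y) {
  if (!y %in% letters) stop(paste0(y, " is not a letter"))
  y
}

guesses <- c("a", "7", "b")   # scripted input in place of scan()
accepted <- character()
for (g in guesses) {
  res <- try(check.guess(g), silent = TRUE)
  if (inherits(res, "try-error")) next   # bad input: skip it and carry on
  accepted <- c(accepted, res)
}
accepted
```

Without the try wrapper the stop call on "7" would end the loop before "b" was ever looked at.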

I first tried plotting the symbols and text with mtext. Thanks to some help at Stack Overflow I found out the text function is a more controllable choice. I also grabbed a circle plotting function from John Fox to avoid calling a package that plots circles.

This was my first need for a while loop. Generally I use the apply functions, but in this case the game logic demanded that I repeat something until one of two circumstances was met (win or loss of the game).
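The loop-until-win-or-loss logic can be sketched without any plotting or keyboard input. The word and the scripted guesses below are made up, and the extra turn check exists only so the scripted version cannot run past its input.

```r
word <- c("c", "a", "t")
guesses <- c("a", "x", "t", "q", "c")   # scripted in place of scan()
right <- character()
wrong <- character()
turn <- 0

# Keep going until the word is complete, six misses pile up,
# or the scripted guesses run out.
while (!all(word %in% right) && length(wrong) < 6 && turn < length(guesses)) {
  turn <- turn + 1
  g <- guesses[turn]
  if (g %in% word) right <- c(right, g) else wrong <- c(wrong, g)
}

outcome <- if (all(word %in% right)) "win" else "loss"
outcome
```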

assign is a nice function, though I generally don’t use it as I can get away with <<- (cringe if you want, but if you think it through the <<- operator can be handy).
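A side-by-side sketch of the two approaches, kept out of the global environment so it is safe to paste anywhere. make.scoreboard and score.env are hypothetical names used only for illustration.

```r
# <<- updates a variable in the enclosing function's environment
make.scoreboard <- function() {
  wins <- 0
  add.win <- function() {
    wins <<- wins + 1
    invisible(wins)
  }
  get.wins <- function() wins
  list(add.win = add.win, get.wins = get.wins)
}
board <- make.scoreboard()
board$add.win()
board$add.win()
board$get.wins()

# assign names the target environment explicitly
score.env <- new.env()
assign("losses", 0, envir = score.env)
assign("losses", get("losses", envir = score.env) + 1, envir = score.env)
get("losses", envir = score.env)
```

The hangman function uses assign with .GlobalEnv for the running score and <<- for state that lives inside the function, which is the same split shown here.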

So I encourage you to write your own R game as you’ll likely learn a bit, while effectively wasting time and will provide enjoyment to others.

Warning: not tested on a Linux or Mac machine

Posted in games | Tagged game, games, hangman, R, try, while loop, word game | 3 Comments

igraph and SNA: an amateur’s dabbling

Posted on June 30, 2012 by tylerrinker

I’ve been playing with the igraph package a bit lately (see previous post HERE) and wanted to approach a problem I once visited in the past. The basic gist of the problem is this:

Students in a class are asked their top three favorite students to work with (rank order). After a social intervention this same question is posed again to students. The intended outcome of the intervention is that the distribution of students receiving many or very few choices will diminish. In other words the dorks will become less dorky and the popular students will become less popular. The idea is to visualize this relationship.

Here is a script of one such visualization. It’s a bit light on annotations but merely experimenting with the code should give a good sense of what is occurring.

    library(igraph)
    set.seed(101)

    #create a data set
    X <- lapply(1:10, function(i) sample(LETTERS[c(1:10)[-i]], 3))
    Y <- data.frame(person = LETTERS[1:10], sex = rbinom(10, 1, .5),
        do.call(rbind, X))
    names(Y)[3:5] <- paste0("choice.", 1:3)

    #reshape the data to long format
    Z <- reshape(Y, direction="long", varying=3:5)
    colnames(Z)[3:4] <- c("choice.no", "choice")
    rownames(Z) <- NULL
    Z <- Z[, c(1, 4, 3, 2)]

    #turn the data into a graph structure
    edges <- as.matrix(Z[, 1:2])
    g <- graph.data.frame(edges, directed=TRUE)
    V(g)$label <- V(g)$name

    #change label size based on number of votes
    SUMS <- data.frame(table(Z$choice))
    SUMS$Var1 <- as.character(SUMS$Var1)
    SUMS <- SUMS[order(as.character(SUMS$Var1)), ]
    SUMS$Freq <- as.integer(SUMS$Freq)
    label.size <- 2
    V(g)$label.cex <- log(scale(SUMS$Freq) + max(abs(scale(SUMS$Freq))) + label.size)

    #Color edges that are reciprocal red
    x <- t(apply(edges, 1, sort))
    x <- paste0(x[, 1], x[, 2])
    y <- x[duplicated(x)]
    COLS <- ifelse(x %in% y, "red", "gray40")
    E(g)$color <- COLS

    #reverse score the choices.no and weight
    E(g)$width <- (4 - Z$choice.no)*2

    #color vertex based on sex
    V(g)$gender <- Y$sex
    V(g)$color <- ifelse(V(g)$gender==0, "pink", "lightblue")

    #plot it
    opar <- par()$mar; par(mar=rep(0, 4))   #Give the graph lots of room
    plot.igraph(g, layout=layout.auto(g))
    par(mar=opar)

For an additional script of this analysis with 20 students click here.

For helpful igraph documentation click here

Posted in igraph | Tagged graph, igraph, R, sna, social network analysis, sociogram | 4 Comments

igraph and structured text exploration

Posted on June 29, 2012 by tylerrinker

I am in the slow process of developing a package to bridge structured text formats (i.e. classroom transcripts) with the tons of great R packages that visualize and analyze quantitative data (If you care to play with a rough build of this package (qdap) see: https://github.com/trinker/qdap). One of the packages qdap will bridge to is igraph.

A while back I came across a blog post on igraph and word statistics (LINK). It inspired me to learn a little bit about graphing and the igraph package and provided a nice intro to learn. As I play with this terrific package I feel it is my duty to share my experiences with others who are just starting out with igraph as well. The following post is a script and the plots created with a word frequency matrix (similar to a term document matrix from the tm package) and igraph:

Build a word frequency matrix and convert to an adjacency matrix

    set.seed(10)
    X <- matrix(rpois(100, 1), 10, 10)
    colnames(X) <- paste0("Guy_", 1:10)
    rownames(X) <- c('The', 'quick', 'brown', 'fox', 'jumps', 'over', 'a', 'bot',
        'named', 'Dason')
    X   #word frequency matrix
    Y <- X >= 1
    Y <- apply(Y, 2, as, "numeric")   #boolean matrix
    rownames(Y) <- rownames(X)
    Z <- t(Y) %*% Y   #adjacency matrix
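To see what that last matrix product holds, here is the same computation on a tiny hand-made matrix (3 words by 3 people, numbers invented for illustration). Each off-diagonal cell of Z counts the words two people share; each diagonal cell counts the distinct words a person used.

```r
X <- matrix(c(2, 0, 1,
              0, 3, 1,
              1, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("fox", "bot", "named"), paste0("Guy_", 1:3)))

Y <- (X >= 1) * 1   # 1 if the person used the word at all, else 0
Z <- crossprod(Y)   # same result as t(Y) %*% Y
Z

diag(Z)      # distinct words per person, equal to colSums(Y)
colSums(X)   # total word counts, which is a different quantity
```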

Build a graph from the above matrix

    library(igraph)   #load igraph before its functions are used

    g <- graph.adjacency(Z, weighted=TRUE, mode ='undirected')

    # remove loops
    g <- simplify(g)

    # set labels and degrees of vertices
    V(g)$label <- V(g)$name
    V(g)$degree <- degree(g)

    #Plot a Graph
    set.seed(3952)
    layout1 <- layout.auto(g)
    #for more on layout see:
    browseURL("http://finzi.psych.upenn.edu/R/library/igraph/html/layout.html")
    opar <- par()$mar; par(mar=rep(0, 4))   #Give the graph lots of room
    plot(g, layout=layout1)

Alter widths of edges based on similarity of people’s dialogue

    #adjust the widths of the edges and add distance measure labels
    #use 1 - binary (?dist) a proportion distance of two vectors
    #1 is perfect and 0 is no overlap (using 1 - binary)
    edge.weight <- 7   #a maximizing thickness constant
    z1 <- edge.weight*(1-dist(t(X), method="binary"))
    E(g)$width <- c(z1)[c(z1) != 0]   #remove 0s: these won't have an edge
    z2 <- round(1-dist(t(X), method="binary"), 2)
    E(g)$label <- c(z2)[c(z2) != 0]
    plot(g, layout=layout1)   #check it out!
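A worked example of the 1 - binary measure on two short made-up vectors may help. The binary method in dist treats non-zero entries as "on" and returns, among positions where at least one vector is on, the proportion where only one is on.

```r
a <- c(1, 1, 0, 0)   # person a uses words 1 and 2
b <- c(1, 0, 1, 0)   # person b uses words 1 and 3

d <- dist(rbind(a, b), method = "binary")
# At least one is "on" in positions 1, 2 and 3; only one is "on" in 2 and 3,
# so the distance is 2/3 and the similarity used for edge width is 1/3.
sim <- 1 - as.numeric(d)
sim
```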

Scale the label cex based on word counts

    SUMS <- diag(Z)   #number of distinct words per person (same as colSums(Y))
    label.size <- .5   #a maximizing label size constant
    V(g)$label.cex <- (log(SUMS)/max(log(SUMS))) + label.size
    plot(g, layout=layout1)   #check it out!

Add vertex coloring based on factoring

    #add factor information via vertex color
    set.seed(15)
    V(g)$gender <- rbinom(10, 1, .4)
    V(g)$color <- ifelse(V(g)$gender==0, "pink", "lightblue")
    plot(g, layout=layout1)   #check it out!
    plot(g, layout=layout1, edge.curved = TRUE)   #curve it up
    par(mar=opar)   #reset margins

Try it interactively with tkplot

    #interactive version
    tkplot(g)   #an interactive version of the graph
    tkplot(g, edge.curved =TRUE)

This is just scratching the surface of igraph’s capabilities. Click here for a link to more igraph documentation.

This post was me toying with different ideas and concepts. If you see a way to improve the code/thinking please leave a comment.

For a .txt version of this demonstration click here

Posted in igraph, text | Tagged graph, igraph, R, structured text, text | 12 Comments

reshape (from base) Explained: Part II

Posted on May 6, 2012 by tylerrinker

Part II Explains More Complex Wide to Long With base reshape

In part I of this base reshape tutorial we went over the basics of reshaping data with reshape. We learned two rules that help us to be more efficient and effective in using this powerful base tool:

RULE 1: Stack repeated measures/Replicate and stack everything else

RULE 2: Naming your columns in a way R likes makes your life easier

In part II we will be looking at more complex wide to long reshapes (more than one series of repeated measures) by building on what we learned in part I. Let’s start by generating some data with two series of nested repeated measures:

set.seed(10)
dat <- data.frame(id=paste0("ID.", 1:5),
    sex=sample(c("male", "female"), 5, replace=TRUE),
    matrix(rpois(30, 10), 5, 6))
colnames(dat)[-c(1:2)] <- paste0(rep(1:2, each=3),
    rep(c("work", "home", "church"), 2))
dat

Which looks like this:

    id    sex 1work 1home 1church 2work 2home 2church
1 ID.1 female     7     8       7    10     6      10
2 ID.2   male    10    13      10     7    13      15
3 ID.3   male    11    10       6    10    10       7
4 ID.4 female     6     8      12     9    15       7
5 ID.5   male     9    11      15    10    10      12

As you can see we have nested repeated measures at three different locations (work, home, church) at two different times (the 1 or 2 prefix). Now let’s follow Rule 2 and get our names in a way R likes them (You may ask why I didn’t name them correctly to begin with? Fair question. Let me ask one though. Have you ever got a data set 100% the way you wanted it to be?).

names(dat) <- gsub("([0-9]+)([a-z]+)", "\\2\\.\\1", names(dat))
###############################################################
# BASICALLY, THIS SAYS FIND THE NAMES THAT ARE NUMERICALPHA.  #
# OTHERWISE LEAVE IT ALONE. THE [0-9]+ SAYS FIND THE NUMERIC  #
# STRING (PLUS SIGN SAYS FIND ALL THE PRECEDING CHARACTERS 1  #
# OR MORE TIMES). THE [a-z]+ SAYS FIND THE ALPHA STRING (PLUS #
# AGAIN MEANS FIND THE ALPHAS 1 OR MORE TIMES). THE "." IS    #
# CHARACTERS I'M INSERTING AND THE 1 AND 2 CORRESPOND TO THE  #
# PARENTHESES IN THE ARGUMENT OF gsub. BASICALLY FLIP         #
# FLOPPING THE POSITION OF 1 AND 2.                           #
###############################################################
#==================================================================
##############################################################
# OR MANUAL REPLACEMENT. YOU CAN SEE WHERE REGEX CAN COME IN #
# HANDY AS THE DATA SET GROWS.                               #
##############################################################
#names(dat)[-c(1:2)] <- c("work.1", "home.1", "church.1",
#    "work.2", "home.2", "church.2")

Which now looks like:

    id    sex work.1 home.1 church.1 work.2 home.2 church.2
1 ID.1 female      7      8        7     10      6       10
2 ID.2   male     10     13       10      7     13       15
3 ID.3   male     11     10        6     10     10        7
4 ID.4 female      6      8       12      9     15        7
5 ID.5   male      9     11       15     10     10       12
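If the regex still feels like magic, it can help to try the swap on a plain character vector first. This is a small sketch with the column names typed in by hand (old_names and new_names are just illustrative variable names). A plain "." in the replacement behaves the same as the escaped "\\." because the replacement string is not a pattern.

```r
old_names <- c("id", "sex", "1work", "1home", "1church",
               "2work", "2home", "2church")

# \\1 is the digit group and \\2 is the letter group; emit them swapped
# with a period in between. Names with no digit-then-letters pattern
# ("id", "sex") pass through untouched.
new_names <- gsub("([0-9]+)([a-z]+)", "\\2.\\1", old_names)
new_names
```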

Alright in part I we learned the following arguments:

data – dataframe you’re supplying reshape

direction – either ‘long’ or ‘wide’ (in this case we are going to long so choose that)

varying – the repeated measures columns we want to stack (takes indexes or column names; I’m lazy and will use indexes, but if you want names use: c("colname1", "colname2", "colname…n"))

v.names – This is what we call the measurements (values) of each repeated measure. Name it anything you want.

timevar – This is what we’ll call the times of each repeated measure (the categorical variable if you will). Name it anything you want.

times – Basically this is your:

(# of starting rept. meas. cols.) ÷ (final # of stacked cols.) = (times vector length)
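As a worked example of that division, here is the arithmetic for the six repeated measures columns in this post under the three layouts we build below. The variable names are only for illustration.

```r
n_repeated <- 6  # work/home/church measured at two times

# Length the times vector needs for each target layout
n_times <- c(
    two_time_columns    = n_repeated / 2,  # TIME_1, TIME_2
    three_place_columns = n_repeated / 3,  # WORK, HOME, CHURCH
    one_value_column    = n_repeated / 1   # a single VALUE column
)
n_times
```

So stacking into two time columns needs three labels (the places), stacking into three place columns needs two (the times), and the full double stack needs all six.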

In the first example we want a time 1 and a time 2 column, made by stacking all the locations for time 1 in one column and all the locations for time 2 in another (these are the v.names columns). Since we have two times we’ll need two column names (I called them TIME_1 and TIME_2 but this is up to you). We’ll keep track of the locations in the timevar column. The major difference between simple repeated measures and more complex repeated measures is that we don’t supply a single index of columns to varying but a list of indexes. This is where rule 1 becomes important. What are you stacking? In this case we want to take everything in time 1 and stack it, do the same for time 2, and use timevar to keep track of the locations. In the example code below I have:

The bare bones example (no time column)

An example with a time column (numeric values for cells)

An example with time column and locations for cell values (adj. w/ times arg.)

################
# BARE MINIMUM #
################
reshape(dat,                           #dataframe
    direction="long",                  #wide to long
    varying=list(c(3:5), c(6:8)),      #repeated measures list of indexes
    idvar='id')

###################################################
# STACKING OF TIME 1 AND 2 REPEAT EVERYTHING ELSE #
###################################################
reshape(dat,                           #dataframe
    direction="long",                  #wide to long
    varying=list(c(3:5), c(6:8)),      #repeated measures list of indexes
    #idvar='id',                       #1 or more of what's left
    timevar="PLACE",                   #the repeated measures times
    v.names=c("TIME_1", "TIME_2"))     #the repeated measures values

##################################################
# STACKING OF TIME 1 AND 2 WITH NAMED TIME CELLS #
##################################################
dat2 <- reshape(dat,                   #dataframe
    direction="long",                  #wide to long
    varying=list(c(3:5), c(6:8)),      #repeated measures list of indexes
    #idvar='id',                       #1 or more of what's left
    timevar="PLACE",                   #the repeated measures times
    v.names=c("TIME_1", "TIME_2"),     #the repeated measures values
    times =c("wrk", "hom", "chr"))
row.names(dat2) <- NULL
dat2

The final outcome is:

     id    sex PLACE TIME_1 TIME_2
1  ID.1 female   wrk      7     10
2  ID.2   male   wrk     10      7
3  ID.3   male   wrk     11     10
4  ID.4 female   wrk      6      9
5  ID.5   male   wrk      9     10
6  ID.1 female   hom      8      6
7  ID.2   male   hom     13     13
8  ID.3   male   hom     10     10
9  ID.4 female   hom      8     15
10 ID.5   male   hom     11     10
11 ID.1 female   chr      7     10
12 ID.2   male   chr     10     15
13 ID.3   male   chr      6      7
14 ID.4 female   chr     12      7
15 ID.5   male   chr     15     12
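Because the data above comes from a random draw, it can be hard to see exactly which cell went where. Here is the same list-of-indexes idea on a tiny hand-made data frame (the data frame wide, its values and its two places are made up for illustration) where every value is unique, so you can trace each one from wide to long.

```r
wide <- data.frame(
    id     = c("a", "b"),
    work.1 = c(1, 2), home.1 = c(3, 4),   # time 1 block (columns 2:3)
    work.2 = c(5, 6), home.2 = c(7, 8)    # time 2 block (columns 4:5)
)

long <- reshape(wide,
    direction = "long",
    varying   = list(2:3, 4:5),        # one index vector per stacked column
    v.names   = c("TIME_1", "TIME_2"), # names for the stacked columns
    timevar   = "PLACE",               # column that tracks the location
    times     = c("wrk", "hom"),       # 4 columns / 2 stacked columns = 2 labels
    idvar     = "id")
row.names(long) <- NULL
long
```

Each element of the varying list becomes one output column, and the position inside each element lines up with the times vector: the first column of each block is "wrk" and the second is "hom".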

This may be what we want, but what if we wanted to have a work, home and church column by stacking all the times for work on each other, all the times for home and all the times for church (these are the v.names columns)? Well, we do this with the list of indexes we supply to varying. This again is rule number 1. We know we have three v.names columns (the locations) so we need three indexes to pass as a list to varying. We want to stack all the times for work so we supply the index of 3 (work.1) and 6 (work.2) and do the same for home (c(4, 7)) and church (c(5, 8)). We now switch timevar to TIME because it’s no longer keeping track of the locations, and the v.names will be given the three locations as names. We also could supply a times argument to reshape but there’s no need, considering the default numeric index (1, 2) already makes sense.

################################
# STACKING OF THE THREE PLACES #
################################
dat3 <- reshape(dat,                            #dataframe
    direction="long",                           #wide to long
    varying=list(c(3, 6), c(4, 7), c(5, 8)),    #repeated measures list of indexes
    #idvar='id',                                #1 or more of what's left
    timevar="TIME",                             #the repeated measures times
    v.names=c("WORK", "HOME", "CHURCH"))        #the repeated measures values
row.names(dat3) <- NULL
dat3

Remember rule 2? The rule about naming. It’s on these more complex reshapes (more than one series of repeated measures/nested repeated measures) that proper naming pays off. We passed varying a list of indexes because reshape can’t figure out who’s who if you haven’t named the columns correctly, but since we named them with the location, followed by a period, and then a numeric index, our life is easy peasy cheesy. Look below and you’ll see all we do is tell varying which columns are repeated measures and he figures out what to stack from the names. Additionally, there’s no need to supply the argument v.names because R is such a smarty he figured it out all by himself (what a big boy).

You ask, well why didn’t this work for stacking above with two times (the dat2 example)? Good question. It doesn’t work because the names need to have the form measurement_column_name.time_column. Our rename job at the beginning gave us work.time, home.time, church.time, so in this example our three measurement columns will be work, home, and church, and the numeric index after each name indicates which time. If we wanted to have it easy for the dat2 example we would have had to name the repeated measures time_1.1, time_1.2, time_1.3, time_2.1, time_2.2, time_2.3, where the dot numeric index at the end stands for the three locations. If you’re interested in seeing this please see the link to the script of this demonstration found at the bottom of this article, as it contains extra code not found in this post.

So you have three approaches

Name it correctly (just indexes 1:n)

Provide a list of indexes (who cares about names)

Both name correctly and list of indexes (safety my friend)

###############################################################
# STACKING OF THE THREE PLACES REWARDED BY GOOD COLUMN NAMING #
###############################################################
dat3 <- reshape(dat,                       #dataframe
    direction="long",                      #wide to long
    varying=3:8,                           #indexes
    #idvar='id',                           #1 or more of what's left
    timevar="TIME")                        #the repeated measures times
    #v.names=c("WORK", "HOME", "CHURCH"))  #Rewarded: no need for v.names
row.names(dat3) <- NULL
dat3

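Which gives us (when reshape guesses the names itself, the three measurement columns come out in lower case as work, home and church):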

     id    sex TIME WORK HOME CHURCH
1  ID.1 female    1    7    8      7
2  ID.2   male    1   10   13     10
3  ID.3   male    1   11   10      6
4  ID.4 female    1    6    8     12
5  ID.5   male    1    9   11     15
6  ID.1 female    2   10    6     10
7  ID.2   male    2    7   13     15
8  ID.3   male    2   10   10      7
9  ID.4 female    2    9   15      7
10 ID.5   male    2   10   10     12

Hold the phone Fenster!

So let me get this straight. If I’ve been a good R user and followed Rule #2 (name the way R liketh) then all I have to provide reshape is data, direction and varying (maybe idvar)? Yep, that’s right. See, I told you that nameology was important; it makes your life easy. Don’t believe me? Try it out:

reshape(dat, direction="long", varying=3:8)

See reshape is actually pretty simple once you figure it out.
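Here is the name-guessing trick on a small hand-made data frame (wide and its values are made up for illustration), so you can check by eye that reshape pairs the columns up by their prefix and uses the suffix as the time.

```r
wide <- data.frame(
    id     = c("a", "b"),
    work.1 = c(1, 2), home.1 = c(3, 4),
    work.2 = c(5, 6), home.2 = c(7, 8)
)

# No v.names, timevar or times: reshape splits each name at the period,
# uses the part before it as the column name and the part after as the time
long <- reshape(wide, direction = "long", varying = 2:5)
row.names(long) <- NULL
long
```

The time column gets the default name "time" and the measurement columns are named work and home, taken straight from the column names.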

But sometimes we need to stack all the repeated measures into one column (for certain analysis and visualizations) and keep track of both time and location. To do this we simply supply all repeated measures columns to varying (indexes 3:8) as a vector (not a list as we only want one final column and lists are for when we want multiple repeated measures columns), provide v.names and timevar with appropriate names (I chose LOC_TIME for timevar as both the nested repeated measures of location and time will be in this column), and last give a vector of names to the times argument. Keep in mind that reshape will stack the columns you gave to varying in the order you supplied them. To figure out the number of times (as stated above) we take the original number of columns and divide by the total number of end columns (6 ÷ 1 = 6) which means we have to supply 6 names to the times argument (otherwise we have the numeric 1-6 default which can be pretty difficult to keep track of). This is where paste and R’s recycling rule comes in handy. Simply supply paste with the first vector of repeated measure series (location) and then the second, but use rep with the second providing each = (#of first series of repeated measures). The recycling rule will take care of the rest.

###############################################################
# DOUBLE STACK. STACK TIMES AND PLACES AND NOTE EACH TIME AND #
# PLACE. # of TIMES = # OF COLUMNS STACKED.                   #
###############################################################
dat4 <- reshape(dat,                   #dataframe
    direction="long",                  #wide to long
    varying=3:8,                       #repeated measures vector of indexes
    #idvar='id'),                      #1 or more of what's left
    timevar="LOC_TIME",                #the repeated measures times
    v.names=c("VALUE"),                #the repeated measures values
    times =paste(c("work", "home", "church"), rep(1:2, each=3)))
row.names(dat4) <- NULL
dat4

This gives us:

     id    sex LOC_TIME VALUE
1  ID.1 female   work 1     7
2  ID.2   male   work 1    10
3  ID.3   male   work 1    11
4  ID.4 female   work 1     6
5  ID.5   male   work 1     9
6  ID.1 female   home 1     8
7  ID.2   male   home 1    13
8  ID.3   male   home 1    10
.
.
.
29 ID.4 female church 2     7
30 ID.5   male church 2    12
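The paste plus rep trick that builds the six labels is worth running on its own. This is just the times argument pulled out of the call (places and times_arg are illustrative names):

```r
places <- c("work", "home", "church")

# rep(..., each = 3) gives 1 1 1 2 2 2; paste recycles the three
# places across it, so the labels follow the column order 3:8
times_arg <- paste(places, rep(1:2, each = length(places)))
times_arg
```

The order matters: the labels have to line up with the order of the columns handed to varying, because reshape stacks them in exactly that order.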

This is nice but the information for the timevar (location and time) is all garbled together and may make analysis or visualization functions difficult. The best approach would be to split this data into two different columns. Many people are familiar with Wickham’s colsplit from the reshape2 package. This is one approach. I also have a function called colsplit2 that uses only base R and that I keep in my .Rprofile (I actually call it colsplit as well but for namespace purposes we’ll call it colsplit2). This is similar to Wickham’s but a little different. With Wickham’s you provide just the one column and it splits it into two, and you then need to cbind it back to the original somehow. My function takes the dataframe and the column to be split and outputs a new data frame with two columns in the same place as the original singular column. This is a base alternative if you’re attempting to avoid dependencies. For this tutorial I’ll use my function but the downloadable script has both methods.

#############################################
# ALTERNATE BASE METHOD OF COLUMN SPLITTING #
#############################################
colsplit2 <- function(dataframe, splitcol, new.names=NULL, sep=""){
    if(is.numeric(dataframe[, splitcol])) stop("splitcol can not be numeric")
    X <- data.frame(do.call(rbind, strsplit(as.vector(
        dataframe[, splitcol]), split = sep)))
    z <- if (!is.numeric(splitcol)) match(splitcol, names(dataframe)) else splitcol
    if (!is.null(new.names)) colnames(X) <- new.names
    if (z!=1 & ncol(dataframe) > z) {
        cbind(dataframe[, 1:(z-1), drop=FALSE], X,
            dataframe[, (z + 1):ncol(dataframe), drop=FALSE])
    } else {
        if (z!=1 & ncol(dataframe) == z) {
            cbind(dataframe[, 1:(z-1), drop=FALSE], X)
        } else {
            if (z==1 & ncol(dataframe) > z) {
                cbind(X, dataframe[, (z + 1):ncol(dataframe), drop=FALSE])
            } else {
                X
            }
        }
    }
} #END OF colsplit2 FUNCTION

dat4 <- colsplit2(dat4, "LOC_TIME", c("place", "time"), " ")

We now have:

     id    sex  place time VALUE
1  ID.1 female   work    1     7
2  ID.2   male   work    1    10
3  ID.3   male   work    1    11
4  ID.4 female   work    1     6
5  ID.5   male   work    1     9
6  ID.1 female   home    1     8
7  ID.2   male   home    1    13
8  ID.3   male   home    1    10
.
.
.
29 ID.4 female church    2     7
30 ID.5   male church    2    12
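The heart of the splitting is just strsplit plus do.call(rbind, ...). Here is that step on its own, as a rough sketch with three hand-typed labels (loc_time, parts and pieces are illustrative names, and converting time to integer is my own addition for the example):

```r
loc_time <- c("work 1", "home 1", "church 2")

# strsplit returns a list with one character vector per label;
# rbind-ing the list turns it into a two-column character matrix
parts <- do.call(rbind, strsplit(loc_time, " ", fixed = TRUE))

pieces <- data.frame(
    place = parts[, 1],
    time  = as.integer(parts[, 2])  # the split pieces start out as text
)
pieces
```

This only works cleanly when every label splits into the same number of pieces, which is the case here because each label is one place, one space, one time.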

Let’s do a bit of visualization with one of my favorite packages, Wickham’s ggplot2. For social sciences (and particularly repeated measures) the faceting with facet_grid is pretty nice. We’ll make one little change to the time column to make the labels on facet_grid nicer. I use a paste approach that alters the actual variable because it’s easier to explain, but in real practice I don’t like to alter a variable; I prefer to add another column or approach it by other means. The website Cookbook for R provides a very nice alternative to altering your variable content using the labeller argument of facet_grid (look under the heading Modifying facet label text in the link).

###############################################################
# MAKE THE NAMES ON LABELS PRETTY FOR GGPLOT FACETING (ONE OF #
# MANY APPROACHES)                                            #
###############################################################
dat4$time <- paste("time", dat4$time)

########################
# PLOT IT WITH GGPLOT2 #
########################
library(ggplot2)
ggplot(data=dat4, aes(sex, VALUE)) + geom_boxplot() + facet_grid(place~time)
ggplot(data=dat4, aes(place, VALUE)) + geom_boxplot() + facet_grid(time~sex)

faceted boxplot 1

faceted boxplot 2

In Part III of this series we’ll look at the less used long to wide format

For a .txt version of this demonstration click here

Posted in reshape | Tagged data, data prep, data set, long, long to wide, R, reshape, reshape 2, wide, wide to long | 3 Comments

reshape (from base) Explained: Part I

Posted on May 3, 2012 by tylerrinker

This Post Will Explain the Basics of Wide to Long With base reshape (part I)

Often your data set is in wide format and some sort of analysis or visualization requires putting the data set into long format. Hadley Wickham has a package for reshaping data called reshape2 that is pretty handy for quickly reshaping data with the melt and cast (dcast/acast) functions. I learned to use this long before I learned the base function reshape for doing the same task. I suspect many of you are in the same boat and may never have learned to use base’s reshape, period. There’s a reason for that: the arguments are not instinctive like Wickham’s package, the description of the function (LINK) is very difficult for beginners, and this function is actually two functions in one (a wide to long as well as a long to wide function).

So you’ve mastered Wickham’s reshape2 package and are thinking “Why the fudgesicle should I learn a confusing function like reshape when I got Hadley?” Here’s my list:

It’s powerful

It’s flexible

It’s in base (no dependencies)

It may be faster for large data sets

So which approach should you use? The best one for the job. Alright let’s tear into base’s reshape and take some mystery away from how to work it (I got 2 rules to help guide your thinking).

RULE 1: Stack repeated measures/Replicate and stack everything else

Basically you want to:

Take repeated measures columns and stack them as a measures column

Put their column names next to them in a new times column (so you can keep track of which time is which time)

And then replicate everything else that’s left as many times as you had repeated measures and stack it all.

So in the data frame below we have 3 repeated measures (time1, time2, time3) and the “everything else” is the id column.

    id time1 time2 time3
1 ID.1  5.01  5.12  8.62
2 ID.2 79.40 81.42 81.29
3 ID.3 80.37 83.12 85.92

We want to stack the last three columns, making sure to put their respective column name next to them and then we want to replicate the id part of the data frame and stack it 3 times because that’s how many repeated measures we have. So the final product will look like this:

    id time results
1 ID.1    1    5.01
2 ID.2    1   79.40
3 ID.3    1   80.37
4 ID.1    2    5.12
5 ID.2    2   81.42
6 ID.3    2   83.12
7 ID.1    3    8.62
8 ID.2    3   81.29
9 ID.3    3   85.92
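Rule 1 can be done by hand in a few lines, which makes it clearer what reshape is doing for you. This is only a sketch of the idea with the small data frame above typed in manually, not how reshape is implemented (measure_cols is an illustrative name):

```r
wide <- data.frame(
    id    = c("ID.1", "ID.2", "ID.3"),
    time1 = c(5.01, 79.40, 80.37),
    time2 = c(5.12, 81.42, 83.12),
    time3 = c(8.62, 81.29, 85.92)
)
measure_cols <- c("time1", "time2", "time3")

long <- data.frame(
    # replicate "everything else" once per repeated measure
    id      = rep(wide$id, times = length(measure_cols)),
    # keep track of which column each value came from
    time    = rep(seq_along(measure_cols), each = nrow(wide)),
    # stack the repeated measures column by column
    results = unlist(wide[measure_cols], use.names = FALSE)
)
long
```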

RULE 2: Naming your columns in a way R likes makes your life easier

Here’s a stackoverflow.com example of someone with this very problem of not naming the columns the way R likes (this was added after this blog was written) [LINK click here].

When I got this little fact down reshape became a lot easier to operate. So how does R like your columns to look? Well R doesn’t give a rip what your “everything else” columns look like but the repeated measures it likes in the form “time.1” or a word common to all repeated measures -> followed by a period -> followed by sequence of numbers or alpha.numeric

I promise you getting this down makes your life easier. It enables varying to figure out what columns are what with more complex problems. Alright let’s generate some data using the DFgen function and look at ways to rename the columns (you can source it if you haven’t saved it to your .Rprofile). The last three columns are our repeated measures.

###########################
# LOAD THE DFgen FUNCTION #
###########################
source("http://dl.dropbox.com/u/61803503/DFgen_fun.txt")

######################
# GENERATE SOME DATA #
######################
set.seed(10);dat <- DFgen()[1:5, -c(6:10)]

The data set looks like this:

    id   group hs.grad  race gender score time1 time2 time3
1 ID.1   treat     yes white   male -1.24 51.39 52.15 53.76
2 ID.2 control     yes black   male -0.46 32.21 35.07 33.10
3 ID.3 control     yes white   male -0.83 43.36 45.46 46.22
4 ID.4   treat      no white   male  0.34 71.63 72.06 74.49
5 ID.5 control     yes white female  1.07  9.26 12.24 11.02

Now let’s rename time1, time2 and time3 the way R likes (makes life easy peasy cheesy). There are two approaches: 1) I’ll do it manually because regex is kinda a pain to learn 2) I’ll use regex because a) I like to show off b) I am somehow brilliant and know how already c) my data set is huge (many # of vars) and it’s more of a pain to do it manually d) all of the above. I ain’t gonna lie, regex takes some learning but can be a valuable asset and a time saver.

#Variable Rename Method 1
names(dat)[7:9] <- c("time.1", "time.2", "time.3")
dat

#Variable Rename Method 2
set.seed(10);dat <- DFgen()[1:5, -c(6:10)] #reload the data set with the old names
names(dat) <- gsub("([a-z])([0-9])", "\\1\\.\\2", names(dat))
########################################################################
# Basically this says find all the letters a-z followed by all numbers #
# 0-9, split them apart into pieces 1 and two then the second part     #
# says take pieces one and two put a period between them and put them  #
# back together. If there's not a pattern of alpha followed by         #
# numeric then leave those names alone.                                #
########################################################################
dat
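If you can’t source DFgen, you can still convince yourself the two methods agree by running them on the column names alone. This sketch types the nine names in by hand (nms, manual and by_regex are illustrative names):

```r
nms <- c("id", "group", "hs.grad", "race", "gender", "score",
         "time1", "time2", "time3")

# Method 1: overwrite the last three names by position
manual <- nms
manual[7:9] <- c("time.1", "time.2", "time.3")

# Method 2: put a period between any letter and the digit right after it
by_regex <- gsub("([a-z])([0-9])", "\\1.\\2", nms)

identical(manual, by_regex)
```

Keep in mind the regex touches every name where a lower case letter is directly followed by a digit, so check the result if other columns in your own data happen to fit that pattern.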

Alright we’ve satisfied the R beast’s desire for nicely formatted names, now our life is easy. Let’s learn the bare minimum of what reshape needs now. You have to tell reshape:

data – dataframe you’re supplying reshape

direction – either ‘long’ or ‘wide’ (in this case we are going to long so choose that)

varying – the repeated measures columns we want to stack (takes indexes or column names; I’m lazy and will use indexes, but if you want names use: c("colname1", "colname2", "colname…n"))

Alright let’s see what that gives us:

reshape(dat, direction="long", varying=7:9)

Which yields:

         id   group hs.grad  race gender score  time
ID.1.1 ID.1   treat     yes white   male -1.24 51.39
ID.2.1 ID.2 control     yes black   male -0.46 32.21
ID.3.1 ID.3 control     yes white   male -0.83 43.36
ID.4.1 ID.4   treat      no white   male  0.34 71.63
ID.5.1 ID.5 control     yes white female  1.07  9.26
ID.1.2 ID.1   treat     yes white   male -1.24 52.15
ID.2.2 ID.2 control     yes black   male -0.46 35.07
ID.3.2 ID.3 control     yes white   male -0.83 45.46
ID.4.2 ID.4   treat      no white   male  0.34 72.06
ID.5.2 ID.5 control     yes white female  1.07 12.24
ID.1.3 ID.1   treat     yes white   male -1.24 53.76
ID.2.3 ID.2 control     yes black   male -0.46 33.10
ID.3.3 ID.3 control     yes white   male -0.83 46.22
ID.4.3 ID.4   treat      no white   male  0.34 74.49
ID.5.3 ID.5 control     yes white female  1.07 11.02

This ain’t bad but (a) the row names are annoying, (b) time is the measurements and (c) speaking of time where’s that column? Well we need to add some cute little arguments to get what we want. Let’s look at some more arguments and see what they’ll give us:

v.names – This is what we call the measurements (values) of each repeated measure. Name it anything you want.

timevar – This is what we’ll call the times of each repeated measures (the categorical variable if you will).  Name it anything you want.

Basically these guys are column renamers. Also, specifying timevar puts that column into the data set (remember he was nowhere to be found in the last step). Remember you can call them anything you want. Let’s see what they’re doing:

reshape(dat, direction="long", varying=7:9, idvar='id', timevar="TIME", v.names="RESULTS")

Which yields (only showing the first 6 rows of data):

         id   group hs.grad  race gender score TIME RESULTS
ID.1.1 ID.1   treat     yes white   male -1.24    1   51.39
ID.2.1 ID.2 control     yes black   male -0.46    1   32.21
ID.3.1 ID.3 control     yes white   male -0.83    1   43.36
ID.4.1 ID.4   treat      no white   male  0.34    1   71.63
ID.5.1 ID.5 control     yes white female  1.07    1    9.26
ID.1.2 ID.1   treat     yes white   male -1.24    2   52.15

But what if times in the TIME column weren’t really 1, 2, and 3 but were 3 locations like “work”, “home”, “church” and we want the data to represent this (this can make our life easier later on for analysis and visuals so we aren’t having to remember what 1,2 & 3 are)? Well we can via:

times – This guy is the way we specify what the 1, 2 and 3 are. You must supply as many names as there are distinct numeric values in this column.

If you don’t mind I’m also going to rename the rows at this point because I can’t stand anything but ordinal numbers for rownames (and this is my blog so who’s going to stop me?).

dat2 <- reshape(dat,                   #dataframe
    direction="long",                  #wide to long
    varying=7:9,                       #repeated measures index
    idvar='id',                        #1 or more of what's left
    timevar="TIME",                    #The repeated measures times
    v.names="RESULTS",                 #the repeated measures values
    times =c("wrk", "hom", "chr"))

######################################
# RENAME THE ROWS TO ORDINAL NUMBERS #
######################################
row.names(dat2) <- NULL
dat2

Which yields:

     id   group hs.grad  race gender score TIME RESULTS
1  ID.1   treat     yes white   male -1.24  wrk   51.39
2  ID.2 control     yes black   male -0.46  wrk   32.21
3  ID.3 control     yes white   male -0.83  wrk   43.36
4  ID.4   treat      no white   male  0.34  wrk   71.63
5  ID.5 control     yes white female  1.07  wrk    9.26
6  ID.1   treat     yes white   male -1.24  hom   52.15
7  ID.2 control     yes black   male -0.46  hom   35.07
8  ID.3 control     yes white   male -0.83  hom   45.46
9  ID.4   treat      no white   male  0.34  hom   72.06
10 ID.5 control     yes white female  1.07  hom   12.24
11 ID.1   treat     yes white   male -1.24  chr   53.76
12 ID.2 control     yes black   male -0.46  chr   33.10
13 ID.3 control     yes white   male -0.83  chr   46.22
14 ID.4   treat      no white   male  0.34  chr   74.49
15 ID.5 control     yes white female  1.07  chr   11.02

If you’re the type who skipped over rule 2 (rename columns the way R likes), throwing caution to the wind, in this case you’ll get away with it because the format is pretty simple. In fact I’d probably not rename the columns here, but if R squawks this is one of the first things to fix.
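To back that up, here is a small sketch using the three-row data frame from Rule 1, typed in by hand and left with its unfriendly names time1, time2 and time3. Because v.names and times are supplied, reshape never has to guess anything from the names. The variable names wide and long are just for illustration.

```r
wide <- data.frame(
    id    = c("ID.1", "ID.2", "ID.3"),
    time1 = c(5.01, 79.40, 80.37),
    time2 = c(5.12, 81.42, 83.12),
    time3 = c(8.62, 81.29, 85.92)
)

long <- reshape(wide,
    direction = "long",
    varying   = 2:4,                     # columns left with their original names
    idvar     = "id",
    timevar   = "TIME",
    v.names   = "RESULTS",               # one stacked column, so 3 / 1 = 3 times
    times     = c("wrk", "hom", "chr"))
row.names(long) <- NULL
long
```

With a single stacked column there is nothing ambiguous about which columns go together, which is why the naming rule only starts to bite on the more complex reshapes.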

In part II we’ll explore more complex reshapes like double stacks and more than one set of repeated measures series.

In part II of this reshape series we’ll be looking at more complex reshapes

For a .txt version of this demonstration click here

Posted in reshape | Tagged data, data prep, long, long to wide, R, reshape, reshape 2, wide, wide to long | 1 Comment
