Splitting and Combining R pdf Graphics

A question that often comes up on various help lists is how to combine or split the output of an R graphics device. Maybe you have looped/combined multiple visuals into a single pdf to avoid cluttering your working directory and now you want to pull various pages out. Or maybe you have several pdfs of various sizes you’d like to combine into a single multi-page file (example-click here-). This post uses 2 short videos to demonstrate combining and splitting R-produced pdfs.


This post serves two purposes:

  1. To show Windows users how to combine and split pdfs (sorry, this works only for Windows users)
  2. To challenge R bloggers who use other operating systems to perform the same combine and split tasks

 


First you’ll need to download PDF24 Editor (a free program)

Click Here


Combining Multiple R pdf Graphics in a Single File


Splitting R pdf Pages into Separate Files


Pretty easy. Now I challenge R bloggers who use Mac and Linux to provide the same “FREE” functionality for their platforms. Ideally, someone has an approach that spans multiple platforms.

If you have an alternate method for any operating system please provide a link to your blog in the comments below.
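One partial, cross-platform route stays inside R: decide at plot time whether the pdf device writes one multi-page file or one file per page. This is a minimal sketch using only the base pdf() device. The temporary directory and file names are made up for illustration, and it does not split or merge pdfs that already exist.

```r
# Write into a temporary directory so the working directory stays clean
out_dir <- tempfile("pdf_demo_")
dir.create(out_dir)

# One multi-page file: each new plot starts a new page (onefile = TRUE is the default)
combined <- file.path(out_dir, "combined.pdf")
pdf(combined, onefile = TRUE)
for (i in 1:3) plot(1:10, (1:10)^i, main = paste("Plot", i))
invisible(dev.off())

# One file per page: the %03d in the file name is replaced by the page number
pdf(file.path(out_dir, "page_%03d.pdf"), onefile = FALSE)
for (i in 1:3) plot(1:10, (1:10)^i, main = paste("Plot", i))
invisible(dev.off())

split_files <- list.files(out_dir, pattern = "^page_[0-9]{3}[.]pdf$")
split_files
```

If you already know you will want single pages later, writing them separately up front avoids the need for an external splitting tool.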

Posted in visualization | 10 Comments

Presidential Debates with qdap-beta

qdap brief intro
For the past year I’ve been working on a package (qdap) to assist my field in quantitative discourse analysis; basically looking at patterns in language. It’s still a ways from being finished and lacks documentation (roxygen2 is my friend), but after seeing the presidential debates yesterday I decided to try using some of the package’s functions on a transcript of the dialogue.

Getting qdap to work may take some finagling because the package relies on the openNLP package, so you have to make sure you have the correct version of Java installed. I know the package can be installed on all three major operating systems. You’ll also quickly notice that qdap relies on the tm, ggplot2, and wordcloud packages as well.

Note: I display the graphics here with .png files but recommend .pdf or .svg as the image is much clearer. For a combined pdf version of the graphics in this post click here.

Getting and cleaning transcripts of the debate

library(qdap)
url_dl("pres.deb1.docx")  #downloads a docx file of the debate to wd
# the read.transcript function allows reading in of docx file 
# special thanks to Bryan Goodrich for his work on this
dat <- read.transcript("pres.deb1.docx", col.names=c("person", "dialogue"))
truncdf(dat)
left.just(dat)
# qprep wrapper for several lower level qdap functions
# removes brackets & dashes; replaces numbers, symbols & abbreviations
dat$dialogue <- qprep(dat$dialogue)  
# sentSplit splits turns of talk into sentences
# special thanks to Dason Kurkiewicz for his work on this
dat2 <- sentSplit(dat, "dialogue", stem.col=FALSE)  
htruncdf(dat2)   #view a truncated version of the data(see also truncdf)

Wordclouds (relies on Ian Fellows’ wordcloud package)

#first put a unique character between words we want to keep together
dat2$dia2 <- space_fill(dat2$dialogue, c("Governor Romney", "President Obama", 
    "middle class", "The President", "Mister President"))

#Generate target words to color by
tw <- list(
        health=c("health", "insurance", "medic", "obamacare", "hospital"), 
        economic = c("econom", "jobs", "unemploy", "business", "banks", 
            "budget", "market", "paycheck"),
        foreign = c("war ", "terror", "foreign"),
        class = c("middle~~class", "poor", "rich"),
        opponent = c("romney ", "obama", "the~~president", "mister~~president")
)

#create stop word list from qdap data set Top25Words but exclude he and I
sw <- exclude(Top25Words, "he", "I")

#the word cloud by grouping variable function
with(dat2, trans.cloud(dia2, person, 
    proportional = TRUE,
    target.words = tw,
    cloud.colors = c("red", "blue", "black", "orange", "purple", "gray45"),
    legend = names(tw),
    stopwords=sw, 
    max.word.size = 4,
    char2space = "~~"))

Visuals of the trans.cloud function
wordcloud 1
wordcloud 2
wordcloud 3
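To make the “keep words together” step less mysterious, here is a rough base R sketch of the idea behind space_fill and the char2space argument. The fill_spaces helper is hypothetical and is not qdap's implementation. It only shows the substitute-then-restore trick, and it is case sensitive.

```r
# Hypothetical helper: glue the words of each phrase together with a marker
fill_spaces <- function(text, phrases, sep = "~~") {
    for (p in phrases) {
        joined <- gsub(" ", sep, p, fixed = TRUE)   # "middle class" -> "middle~~class"
        text <- gsub(p, joined, text, fixed = TRUE) # swap it into the text
    }
    text
}

x <- c("Thank you Governor Romney", "the middle class matters")
filled <- fill_spaces(x, c("Governor Romney", "middle class"))
filled

# After counting words, turn the marker back into a space for display
restored <- gsub("~~", " ", filled, fixed = TRUE)
restored
```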

Gantt Plot of the dialogue over time
Obviously (when you see the output), this uses Hadley Wickham’s ggplot2.

# special thanks to Andrie de Vries for his work on this function
with(dat2, gantt_plot(dialogue, person,  xlab = "duration(words)", x.tick=TRUE,
    minor.line.freq = NULL, major.line.freq = NULL, rm.horiz.lines = FALSE))

Visualization of the Gantt Plot
Gantt Plot

Formality scores (how formal a person’s language is)
This concept comes from:

Heylighen, F., & Dewaele, J.-M. (2002). Variation in the 
    contextuality of language: An empirical measure. Foundations 
    of Science, 7(3), 293–340. doi:10.1023/A:1019661126744

The code can be run in parallel because this is a slower function. It uses openNLP to first map parts of speech for every word.

#parallel about 1:20 on 8 GB ram 8 core i7 machine
v1 <- with(dat2, formality(dialogue, person, parallel=TRUE))
plot(v1)
#about 4 minutes on 8GB ram i7 machine
v2 <- with(dat2, formality(dialogue, person)) 
plot(v2)
# note you can resupply the output from formality back
# to formality and change arguments.  This avoids the need for
# openNLP, saving time.
v3 <- with(dat2, formality(v1, person))
plot(v3, bar.colors=c("Dark2"))

Output and plot from the formality function

  person word.count formality
1 ROMNEY       4068     61.82
2 LEHRER        765     61.31
3  OBAMA       3595     58.30

formality
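The formality measure itself is simple once the parts of speech are tagged. As I read the Heylighen & Dewaele paper cited above, the F-score adds the percentage frequencies of nouns, adjectives, prepositions and articles, subtracts those of pronouns, verbs, adverbs and interjections, adds 100 and halves the result. The sketch below uses made-up counts. formality_score is a hypothetical helper, not qdap's code, and qdap's own output may differ in detail.

```r
# Hypothetical helper: F-score from part-of-speech counts
# counts: named numeric vector of word counts per part of speech
# total:  total number of words (defaults to the sum of the counts supplied)
formality_score <- function(counts, total = sum(counts)) {
    formal     <- c("noun", "adjective", "preposition", "article")
    contextual <- c("pronoun", "verb", "adverb", "interjection")
    pct <- 100 * counts / total  # percentage frequency of each category
    (sum(pct[formal]) - sum(pct[contextual]) + 100) / 2
}

# Made-up counts for a 100 word turn of talk
example_counts <- c(noun = 30, adjective = 10, preposition = 15, article = 10,
    pronoun = 10, verb = 20, adverb = 4, interjection = 1)
f <- formality_score(example_counts)
f  # (65 - 35 + 100) / 2 = 65
```

Scores run from 0 to 100, with higher values indicating more formal language.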

Afterthought: I neglected to mention that the word clouds are proportional (argument proportional = TRUE) to all words spoken rather than to the frequency per person. This enables comparison across clouds.

Posted in ggplot2, qdap, word cloud | 33 Comments

Add Text Annotations to ggplot2 Faceted Plot (an easier approach)

I recently posted a blog about adding text to a ggplot2 faceted plot (LINK).

I was unhappy with the amount of time it takes to create the text data frame used to label the plot. Then yesterday, when the new version of ggplot2 (0.9.2) was announced, I got to reading about how ggplot2 objects are stored and decided that I could extract a great deal of the information for plotting the text directly from the ggplot2 object.

After I did it I decided to wrap the function up into a package that I can add more ggplot2 extension functions to in the future.

Optionally Download the Package:

 
# install.packages("devtools")
library(devtools)
install_github("trinker/acc.ggplot2")
library(acc.ggplot2)

Here’s the Function Code and a Few Examples:

 
library(ggplot2)

qfacet_text <- function(ggplot2.object, x.coord = NULL, y.coord = NULL, 
    labels = NULL, ...) {
    require(ggplot2)
    dat <- ggplot2.object$data
    rows <- ggplot2.object$facet[[1]][[1]]
    cols <- ggplot2.object$facet[[2]][[1]]
    fcol <- dat[, as.character(cols)]
    frow <- dat[, as.character(rows)]
    len <- length(levels(factor(fcol))) *  length(levels(factor(frow)))
    vars <- data.frame(expand.grid(levels(factor(frow)), levels(factor(fcol))))
    colnames(vars) <- c(as.character(rows), as.character(cols))
    if (any(class(ggplot2.object) %in% c("ggplot", "gg"))) {
        if (is.null(labels)) {
            labels <- LETTERS[1:len]
        }
        if (length(x.coord) == 1) {
           x.coord <- rep(x.coord, len)
        }
        if (length(y.coord) == 1) {
           y.coord <- rep(y.coord, len)
        }
        text.df <- data.frame(x = x.coord, y = y.coord, vars, labs=labels)
    } else {
        if (class(ggplot2.object) == "qfacet") {
            text.df <- ggplot2.object$dat
            #the text data frame stores the coordinates in columns x and y
            if (!is.null(x.coord)) {
                text.df$x <- x.coord
            }
            if (!is.null(y.coord)) {
                text.df$y <- y.coord
            }
            if (!is.null(labels)) {
                text.df$labs <- labels
            }
            ggplot2.object <- ggplot2.object$original
        }
    }
    p <- ggplot2.object + geom_text(aes(x, y, label=labs, group=NULL), 
        data=text.df, ...)
    print(p)
    v <- list(original = ggplot2.object, new = p, dat = text.df)
    class(v) <- "qfacet"
    invisible(v)
}

Examples (using the same basic examples as my previous blog post):

 
#alter mtcars to make some variables factors
mtcars[, c("cyl", "am", "gear")] <- lapply(mtcars[, 
    c("cyl", "am", "gear")], as.factor)

p <- ggplot(mtcars, aes(mpg, wt, group = cyl)) + 
    geom_line(aes(color=cyl)) +
    geom_point(aes(shape=cyl)) + 
    facet_grid(gear ~ am) +
    theme_bw()  

z <- qfacet_text(ggplot2.object = p, x.coord = 33, y.coord = 2.2, labels = 1:6, color="red")
str(z); names(z)  #look at what's returned

#approach 1 (alter the text data frame and pass the qfacet object)
z$dat[5, 1:2] <- c(15, 5)
qfacet_text(z, color="red")

#approach 2 (alter the original ggplot object)
qfacet_text(p, x.coord = c(33, 33, 33, 33, 15, 33), 
    y.coord = c(2.2, 2.2, 2.2, 2.2, 5, 2.2), labels = 1:6, color="red")

#all the same things you can pass to geom_text qfacet_text takes
qfacet_text(z, labels = paste("beta ==", 1:6), 
    size = 3, color = "grey50", parse = TRUE)

Notice at the end you can pass qfacet_text either a ggplot object or an object returned by qfacet_text. The qfacet_text function invisibly returns a list with the original ggplot2 object, the new ggplot2 object and the text data frame. This enables the user to alter the coordinates in the data frame and pass the qfacet object back to qfacet_text, thus altering the text position. There’s actual documentation for this package and function, so ?qfacet_text should get you a help file with the same example.

PS this gave me a chance to actually run roxygen2 for the first time to create documentation. Also a pretty slick Hadley Wickham package.

The Plot:
ggplot facet with text

Posted in annotate, ggplot2, text | 8 Comments

Add Text Annotations to ggplot2 Faceted Plot


In my experience with R learners there are two basic types. The “show me the code and what it does and let me play” type and the “please give me step by step directions” type. I’ve broken the following tutorial on plotting text on faceted ggplot2 plots into 2 sections:

  1. The Complete Code and Final Outcome
  2. A Bit of Explanation

Hopefully, whatever learner you are you’ll be plotting text on faceted graphics in no time.


Section 1: The Complete Code and Final Outcome

library(ggplot2)

mtcars[, c("cyl", "am", "gear")] <- lapply(mtcars[, c("cyl", "am", "gear")], as.factor)

p <- ggplot(mtcars, aes(mpg, wt, group = cyl)) + 
    geom_line(aes(color=cyl)) +
    geom_point(aes(shape=cyl)) + 
    facet_grid(gear ~ am) +
    theme_bw()                                                                      
p                                                                     
 

len <- length(levels(mtcars$gear)) *  length(levels(mtcars$am))

vars <- data.frame(expand.grid(levels(mtcars$gear), levels(mtcars$am)))
colnames(vars) <- c("gear", "am")
dat <- data.frame(x = rep(15, len), y = rep(5, len), vars, labs=LETTERS[1:len])

p + geom_text(aes(x, y, label=labs, group=NULL),data=dat) 


dat[1, 1:2] <- c(30, 2)   #to change specific locations
p + geom_text(aes(x, y, label=labs, group=NULL), data=dat) 


p + geom_text(aes(x, y, label=paste("beta ==", labs), group=NULL), size = 4, 
    color = "grey50", data=dat, parse = TRUE) 

final


Section 2: A Bit of Explanation

The following portion of the tutorial provides a bit more of a step by step procedure for plotting text to faceted plots as well as a visual to go with the code.

The initial non annotated plot
First, let’s make a faceted line plot with the mtcars data set. I reclassed a few variables to make factors.

mtcars[, c("cyl", "am", "gear")] <- lapply(mtcars[, c("cyl", "am", "gear")], as.factor)

p <- ggplot(mtcars, aes(mpg, wt, group = cyl)) + 
    geom_line(aes(color=cyl)) +
    geom_point(aes(shape=cyl)) + 
    facet_grid(gear ~ am) +
    theme_bw()                                                                      
p                                                                     

initial
Add text to each facet
The key here is a new data frame with three pieces of information (ggplot2 seems to like information given in a data frame).

  1. Coordinates to plot the text
  2. The faceted variable levels
  3. The labels to be supplied

The first information piece is the coordinates (two columns x and y) to plot the text in each facet. Generally I find that one set of coordinates will work in most of the facet boxes and I just use rep to make these coordinates (I suppose the recycling rule could be used if you added it to an already existing data frame).

The second information piece is the faceted variable levels (in our case gear ~ am). There are many ways to achieve this but I like a combination of levels and expand.grid. I renamed these columns to be exactly the same as the variable names (gear & am) I used in the original data frame (mtcars in this case).

Lastly, you must make the labels. I chose letters so you can track what piece of the data frame is plotted in which facet.

Your data should look something like this:

   x y gear am labs
1 30 2    3  0    A
2 15 5    4  0    B
3 15 5    5  0    C
4 15 5    3  1    D
5 15 5    4  1    E
6 15 5    5  1    F

Note that the group=NULL is essential to let ggplot2 know you’re dealing with a new data set and the mapping from before can be forgotten (or at least this is how I understand it).

#long cut way to find number of facets
len <- length(levels(mtcars$gear)) *  length(levels(mtcars$am))

vars <- data.frame(expand.grid(levels(mtcars$gear), levels(mtcars$am)))
colnames(vars) <- c("gear", "am")
dat <- data.frame(x = rep(15, len), y = rep(5, len), vars, labs=LETTERS[1:len])

p + geom_text(aes(x, y, label=labs, group=NULL),data=dat)  

second
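The recycling rule mentioned earlier can stand in for rep: data.frame() recycles length-one values to match the longest column. This is a minimal base R sketch. The facet levels are typed in by hand here (an assumption for the sake of a self-contained example) rather than pulled from mtcars.

```r
# Facet levels typed in directly so the example runs on its own
facet_levels <- expand.grid(gear = c("3", "4", "5"), am = c("0", "1"))

# x = 15 and y = 5 are recycled to one value per facet
dat <- data.frame(x = 15, y = 5, facet_levels, 
    labs = LETTERS[seq_len(nrow(facet_levels))])
dat
```

The result has the same shape as the data frame built with rep above: one row per facet, with columns x, y, gear, am and labs.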

Moving just one text location
I can usually find one spot that works for the text in almost every facet, except for that one doggone facet that just won’t match up with the other coordinates. In this case label A is that pesky label. The key here is to figure out which text labels you want to move and alter those coordinates appropriately.

dat[1, 1:2] <- c(30, 2)   #to change specific locations
p + geom_text(aes(x, y, label=labs, group=NULL), data=dat) 

third
Adding equation (Greek letters/math) and alter size/color
To annotate with math code use the parse = TRUE argument in geom_text. For more on plotting math code see this ggplot wiki and this SO question. To alter the size just throw a size argument in geom_text. I also toned down the color of the text a bit to allow the line to pop the most visually.

p + geom_text(aes(x, y, label=paste("beta ==", labs), group=NULL), size = 4, 
    color = "grey50", data=dat, parse = TRUE)  

final
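If the math labels feel like magic, it may help to see what parsing a label does in base R. This is a small sketch with base graphics rather than ggplot2 (my assumption being that the same plotmath rules apply to the parsed labels in both). The strings are turned into R expressions, and "beta ==" is then drawn as the Greek letter followed by an equals sign.

```r
labs <- paste("beta ==", 1:3)  # plain character strings
exprs <- parse(text = labs)    # now unevaluated R expressions
class(exprs)

pdf(NULL)                      # draw to a null device so no file is created
plot(1:3, 1:3, type = "n")
text(1:3, 1:3, labels = exprs) # rendered with plotmath rules
invisible(dev.off())
```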

If you have suggestions for improvement, links, or other thoughts please leave a comment.

Posted in annotate, ggplot2 | 16 Comments

Parallelization: Speed up Functions in a Package

Well I bought a new computer a month back (i7, 8GB memory). Finally more than one core and a chance to try parallelization. I saw this blog post a while back and was intrigued, and was further intrigued when I saw that plyr/reshape2 has some parallelization capabilities (LINK). Let me say up front this is my first experience so there may be better ways, but it sped up my code by over four times.

parallel computing

Let me warn you now, when I first read the A No BS Guide to the Basics of Parallelization in R I tried to see how many cores I had on my computer (this shows my ignorance; which may be of comfort to some of you, others will stop reading this blog post immediately). 1 is the loneliest number especially if you’re attempting to run on multiple cores.

Suggestion: if you type detectCores() and see 1, you can’t run code in parallel, at least not by running it on different cores of your machine.

Background (skip this if you are short on time)
I’m working on a package (qdap) and have a function (pos) that takes a long time to run. It is basically finding parts of speech by sentence (each sentence is a cell and there are thousands of them). I rely on openNLP for the pos tagging but the whole process is time consuming. I figured perfect time to try this parallelization out.

I skimmed the Task View for parallel computing, knew I was out of my league, and decided to just focus on my problem, not the whole parallelization concept. Back at wrathematics’ blog post I discovered my silly Windows machine was not compatible with mclapply but saw hope with clusterApply(). Using ?clusterApply I saw that parLapply is described as a parallel version of lapply. I like lapply and decided that was what I’d go with.

Working with parallel coding in functions (skip to here)
These are the three major problems/differences I encountered with parLapply over lapply inside a function:

    1. You need to pass/export the functions and variables you’ll be needing in the parLapply using makeCluster & clusterExport. See Andy Garcia’s helpful response to my question about this (LINK)
    2. You have to specify the envir argument of clusterExport as envir=environment(). See GSee’s helpful response to my question about this (LINK)
    3. You have to explicitly stop the cluster when you’re finished using it, much like closing a connection you opened. You stop the cluster using the stopCluster function (see the stopCluster(cl) call near the end of par.test in the code below).

      EDIT: Martin Morgan of stackoverflow.com gives a solution that addresses both the first and second problems. He suggests passing all objects directly to parLapply (LINK).

       
      Below is an example of taking a non parallel function and making it run in parallel:

       library(parallel)
      detectCores()  #make sure you have > 1 core
      
      nonpar.test <- function(text.var, gc.rate=10){ 
          ntv <- length(text.var)
          require(parallel)
          pos <-  function(i) {
              paste(sapply(strsplit(tolower(i), " "), nchar), collapse=" | ")
          }
          x <- lapply(seq_len(ntv), function(i) {
                  x <- pos(text.var[i])
                  if (i%%gc.rate==0) gc()
                  return(x)
              }
          )
          return(x)
      }
      
      nonpar.test(rep("I wish I ran in parallel.", 20))
      
      par.test <- function(text.var, gc.rate=10){ 
          ntv <- length(text.var)
          require(parallel)
          pos <-  function(i) {
              paste(sapply(strsplit(tolower(i), " "), nchar), collapse=" | ")
          }
      #======================================
          cl <- makeCluster(mc <- getOption("cl.cores", 4))
          clusterExport(cl=cl, varlist=c("text.var", "ntv", "gc.rate", "pos"), 
              envir=environment())
          x <- parLapply(cl, seq_len(ntv), function(i) {
      #======================================
                  x <- pos(text.var[i])
                  if (i%%gc.rate==0) gc()
                  return(x)
              }
          )
          stopCluster(cl)  #stop the cluster
          return(x)
      }
      
      par.test(rep("I wish I ran in parallel.", 20))

      Notice that only the lines between the #==== markers (creating the cluster, exporting the objects and calling parLapply) and the stopCluster(cl) line change. Once you get it down, working with parLapply is pretty easy.
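For comparison, here is one way to read the suggestion in the EDIT above: hand parLapply the data and the worker function directly so that no clusterExport call is needed. This is a sketch under my own assumptions, not Martin Morgan's code. The names word.lengths and par.test2 and the default of 2 cores are made up for illustration.

```r
library(parallel)

# Worker function defined at the top level; it only uses base R functions
word.lengths <- function(txt) {
    paste(sapply(strsplit(tolower(txt), " "), nchar), collapse = " | ")
}

par.test2 <- function(text.var, n.cores = 2) {
    cl <- makeCluster(n.cores)
    on.exit(stopCluster(cl))  # stop the cluster even if an error occurs
    # text.var and word.lengths are shipped to the workers with the call
    parLapply(cl, text.var, word.lengths)
}

res <- par.test2(rep("I wish I ran in parallel.", 20))
res[[1]]
```

The on.exit call plays the same role as the explicit stopCluster(cl) line in par.test, with the difference that it also runs when something goes wrong partway through.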

      Note:
      It doesn’t always make sense to run in parallel as it takes time to make the cluster. In the pos function I added parallel as an argument because for smaller text vectors running in parallel doesn’t make sense (it’s slower).

      Wonderings and future direction:
      The pos function I have in qdap uses a progress bar. So far I haven’t been able to make a progress bar work with parLapply, but it’s less of a need because it is so much faster.

      Benchmarking (1 run)

      > system.time(pos(rajSPLIT$dialogue, parallel=T))
         user  system elapsed 
         2.35    0.08  199.53 
      
      > system.time(pos(rajSPLIT$dialogue, progress.bar =F))
         user  system elapsed 
       816.61   16.74  833.47

      This is benchmarked using the rajSPLIT$dialogue which is the text from Romeo and Juliet, a data set in qdap. This consists of 2151 rows or 23,943 words.

      Hopefully this blog post is useful to those learning some parallelization. Check out the Task View, the documentation for the parallel package and the vignette for the parallel package.

      If you have suggestions for improvement, links, or help on getting a progress bar with parLapply please leave a comment.

Posted in parallel | 9 Comments

Hangman in R: A learning experience

I love when people take a sophisticated tool and use it to play video games. Take R for example. I first saw someone create a game for R at talk.stats.com. My friend Dason inspired me to more efficiently waste time in R with his version of minesweeper. The other day I had an immense amount of work to do and decided it was the perfect time to make a hangman game.

Now some of the skills needed to create hangman were outside my typical uses and skills for R. It caused me to stretch and grow a bit. The purpose of this post is twofold:

  1. To share the hangman game with people who have nothing better to do than waste time on a childhood game
  2. To share the learning experiences I had in creating the game

First the hangman game

I have the code for the function posted here, and I have also saved the code and data set (word list) for the function at github. You can install the package that contains the hangman game and data set either by downloading the zip ball or tar ball, decompressing it and running R CMD INSTALL on it, or by using the devtools package to install the development version:

# install.packages("devtools")
library(devtools)
install_github("trinker/hangman")

To play type hangman() into the console and hit enter.

Here’s a screenshot of the game

hangman game
Now for the learning

Here’s the code for the hangman function:

hangman <- function(reset.score = FALSE) {
    opar <- par()$mar
    on.exit(par(mar = opar))
    par(mar = rep(0, 4))
    x1 <- DICTIONARY[sample(1:nrow(DICTIONARY), 1), 1]
    x <- unlist(strsplit(x1, NULL))
    len <- length(x)
    x2 <- rep("_", len)
    chance <- 0
    if(!exists("wins", mode="numeric", envir = .GlobalEnv)  | reset.score){
        assign("wins", 0, envir = .GlobalEnv)
    }
    if(!exists("losses", mode="numeric", envir = .GlobalEnv) | reset.score){
        assign("losses", 0, envir = .GlobalEnv)
    }
    win1 <- 0
    win <- win1/len
    wrong <- character()
    right <- character()
    print(x2, quote = FALSE)
    circle <- function(x, y, radius, units=c("cm", "in"), segments=100,  
        lwd = NULL){ 
        units <- match.arg(units) 
        if (units == "cm") radius <- radius/2.54 
        plot.size <- par("pin") 
        plot.units <- par("usr") 
        units.x <- plot.units[2] - plot.units[1] 
        units.y <- plot.units[4] - plot.units[3] 
        ratio <- (units.x/plot.size[1])/(units.y/plot.size[2]) 
        size <- radius*units.x/plot.size[1] 
        angles <- (0:segments)*2*pi/segments 
        unit.circle <- cbind(cos(angles), sin(angles)) 
        shape <- matrix(c(1, 0, 0, 1/(ratio^2)), 2, 2) 
        ellipse <- t(c(x, y) + size*t(unit.circle %*% chol(shape))) 
        lines(ellipse, lwd = lwd) 
    } #taken from John Fox: http://tolstoy.newcastle.edu.au/R/help/06/04/25821.html
    hang.plot <- function(){ #plotting function
        plot.new()
        parts <- seq_len(length(wrong))
        if (identical(wrong, character(0))) {
            parts <- 0
        }
        text(.5, .9, "HANGMAN", col = "blue", cex=2)   
        if (!6 %in% parts) { 
            text(.5, .1, paste(x2, collapse = " "), cex=1.5) 
        }
        text(.05, .86, "wrong", cex=1.5, col = "red") 
        text(.94, .86,"correct", cex=1.5, col = "red")
        text(.05, .83, paste(wrong, collapse = "\n"), offset=.3, cex=1.5, 
            adj=c(0,1))
        text(.94, .83, paste(right, collapse = "\n"), offset=.3, cex=1.5, 
            adj=c(0,1))
        segments(.365, .77, .365, .83, lwd=2)
        segments(.365, .83, .625, .83, lwd=2)
        segments(.625, .83, .625, .25, lwd=2)
        segments(.58, .25, .675, .25, lwd=2)
        if (1 %in% parts) {
            circle(.365, .73, .7, lwd=4)
            if (!6 %in% parts) { 
                text(.365, .745, "o o", cex=1)
            }
            if (!5 %in% parts) { 
                text(.365, .71, "__", cex = 1)
            }
            text(.36, .73, "<", cex=1)
        }
        if (2 %in% parts) {
            segments(.365, .685, .365, .4245, lwd=7)
        }
        if (3 %in% parts) {
            segments(.365, .57, .45, .63, lwd=7)
        }
        if (4 %in% parts) {
            segments(.365, .57, .29, .63, lwd=7)
        }
        if (5 %in% parts) {
            segments(.365, .426, .43, .3, lwd=7)
            text(.365, .71, "O", cex = 1.25, col = "red")
        }
        if (6 %in% parts) {
            segments(.365, .426, .31, .3, lwd = 7)
            text(.365, .745, "x  x", cex=1)
            text(.5, .5, "You Lose", cex=8, col = "darkgreen") 
            text(.5, .1, paste(x, collapse = " "), cex=1.5) 
        }
        if (win1 == len) {
            text(.5, .5, "WINNER!", cex=8, col = "green")
            text(.505, .505, "WINNER!", cex=8, col = "darkgreen")
        }
    } #end of hang.plot
    guess <- function(){#start of guess function
        cat("\n","Choose a letter:","\n") 
        y <- scan(n=1,what = character(0),quiet=T)
        if (y %in% c(right, wrong)) {
            stop(paste0("You've already guessed ", y))
        }
        if (!y %in% letters) {
            stop(paste0(y, " is not a letter"))
        }
        if (y %in% x) {
            right <<- c(right, y)
            win1 <<- sum(win1, sum(x %in% y)) 
            win <<- win1/len 
            message(paste0("Correct!","\n"))
        } else {
            wrong  <<- c(wrong, y)
            chance  <<- length(wrong)
            message(paste0("The word does not contain ", y, "\n"))
        }
        x2[x %in% right] <<- x[x %in% right]
        print(x2, quote = FALSE)
        hang.plot()
    }#end of guess function
    hang.plot()
    while(all(win1 != len & chance < 6)){ 
        try(guess())
    } 
    if (win == 1) {
        outcome <- "\nCongratulations! You Win!\n"
        assign("wins", wins + 1, envir = .GlobalEnv)
    } else {
        outcome <- paste("\nSorry. You lose. The word is:", x1, "\n")
        assign("losses", losses + 1, envir = .GlobalEnv)
    }
    cat(outcome)
    cat(paste0("\nwins: ", wins, " | losses: ", losses, "\n"))
    text(.5, .2, paste0("wins: ", wins, "  |  losses: ", 
        losses), cex = 3, col = "violetred")
}

Things I tried and learned:

  1. Translating simple game rules into systematic logic
  2. try
  3. plotting dynamically (text vs. mtext)
  4. while loop
  5. assign

I had used try one other time, in a web scraping function. If you don’t know this function, it lets you try to do something and, if an error occurs, move on to the next step instead of stopping. This allows the game user to input wrong information without the function stopping; instead it recovers and prints a message.
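Here is a stripped down sketch of that recover-and-continue behaviour, separate from the game. The safe_sqrt function and the inputs are made up for illustration.

```r
# Hypothetical function that stops on bad input, like guess() does
safe_sqrt <- function(x) {
    if (!is.numeric(x)) stop("input is not numeric")
    sqrt(x)
}

inputs <- list(4, "a", 9)
results <- vector("list", length(inputs))
for (i in seq_along(inputs)) {
    # silent = TRUE suppresses the printed error; the loop keeps going either way
    results[[i]] <- try(safe_sqrt(inputs[[i]]), silent = TRUE)
}

# Failed attempts come back as objects of class "try-error"
failed <- vapply(results, inherits, logical(1), what = "try-error")
failed
```

In hangman the call is simply try(guess()), so the error message is printed and the while loop asks for another letter.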

I first tried plotting the symbols and text with mtext. Thanks to some help at Stack Overflow I found out the text function is a more controllable choice. I also grabbed a circle plotting function from John Fox to avoid calling a package that plots circles.

This was my first need for a while loop. Generally I use the apply functions, but in this case the game logic demanded I repeat something until one of two conditions was met (a win or a loss of the game).
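The loop logic can be seen on its own in this non-interactive sketch. The word and the sequence of guesses are made up, and a fixed vector of guesses stands in for the scan() call so the example runs without user input.

```r
word <- strsplit("graph", NULL)[[1]]
guesses <- c("a", "z", "g", "r", "q", "p", "h")  # stand-in for user input

shown <- rep("_", length(word))
wrong <- character()
i <- 0

# keep going until the word is complete, 6 wrong guesses, or no guesses left
while (any(shown == "_") && length(wrong) < 6 && i < length(guesses)) {
    i <- i + 1
    g <- guesses[i]
    if (g %in% word) {
        shown[word == g] <- g  # reveal every position of the letter
    } else {
        wrong <- c(wrong, g)
    }
}

paste(shown, collapse = " ")
wrong
```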

assign is a nice function, though I generally don’t use it as I can get away with <<- (cringe if you want, but if you think it through the <<- operator can be handy).
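A small side by side sketch of the two, using a private environment instead of .GlobalEnv so nothing is written to your workspace. The names score, record_win and make_counter are made up for illustration.

```r
# assign: you say exactly which environment gets the value
score <- new.env()
assign("wins", 0, envir = score)
record_win <- function() {
    assign("wins", get("wins", envir = score) + 1, envir = score)
}
record_win()
record_win()
get("wins", envir = score)

# <<-: R walks up the enclosing environments until it finds the name
make_counter <- function() {
    count <- 0
    function() {
        count <<- count + 1  # updates count in make_counter's environment
        count
    }
}
next_guess <- make_counter()
next_guess()
n <- next_guess()
n
```

In the hangman function <<- updates right, wrong and win1 in the body of hangman(), while assign writes the running score to the global environment so it survives between games.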

So I encourage you to write your own R game: you’ll likely learn a bit while effectively wasting time, and you’ll provide enjoyment to others.

Warning: not tested on a Linux or Mac machine

Posted in games | 3 Comments

igraph and SNA: an amateur’s dabbling

I’ve been playing with the igraph package a bit lately (see previous post HERE) and wanted to approach a problem I once visited in the past. The basic gist of the problem is this:

Students in a class are asked their top three favorite students to work with (rank order).  After a social intervention this same question is posed again to students.  The intended outcome of the intervention is that the number of students receiving many or very few choices will diminish.  In other words the dorks will become less dorky and the popular students will become less popular.  The idea is to visualize this relationship.

Here is a script of one such visualization.  It’s a bit light on annotations but merely experimenting with the code should give a good sense of what is occurring.

 
library(igraph)
set.seed(101)
#create a data set
X <-lapply(1:10, function(i) sample(LETTERS[c(1:10)[-i]], 3))
Y <- data.frame(person = LETTERS[1:10], sex = rbinom(10, 1, .5), do.call(rbind, X))
names(Y)[3:5] <- paste0("choice.", 1:3)

#reshape the data to long format
Z <- reshape(Y, direction="long", varying=3:5)
colnames(Z)[3:4] <- c("choice.no",  "choice")
rownames(Z) <- NULL
Z <- Z[, c(1, 4, 3, 2)]

#turn the data into a graph structure
edges <- as.matrix(Z[, 1:2])
g <- graph.data.frame(edges, directed=TRUE)
V(g)$label <- V(g)$name

#change label size based on number of votes
SUMS <- data.frame(table(Z$choice))
SUMS$Var1 <- as.character(SUMS$Var1)
SUMS <- SUMS[order(as.character(SUMS$Var1)), ]
SUMS$Freq <- as.integer(SUMS$Freq)
label.size <- 2
V(g)$label.cex <- log(scale(SUMS$Freq) + max(abs(scale(SUMS$Freq)))+ label.size)

#Color edges that are reciprocal red
x <- t(apply(edges, 1, sort))
x <- paste0(x[, 1], x[, 2])
y <- x[duplicated(x)]
COLS <- ifelse(x %in% y, "red", "gray40")
E(g)$color <- COLS

#reverse score the choices.no and weight
E(g)$width <- (4 - Z$choice.no)*2

#color vertex based on sex
V(g)$gender <- Y$sex
V(g)$color <- ifelse(V(g)$gender==0, "pink", "lightblue")

#plot it
opar <- par()$mar; par(mar=rep(0, 4)) #Give the graph lots of room
plot.igraph(g, layout=layout.auto(g))
par(mar=opar)
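The plot shows one time point only. To put a number on the “spread of choices” idea, one option (my own assumption, not part of the script above) is to count the choices each student receives before and after and compare the standard deviations. Both data sets below are simulated, so the numbers themselves mean nothing.

```r
set.seed(101)
students <- LETTERS[1:10]

# each student picks three classmates other than themselves
pick <- function(i) sample(students[-i], 3)
before <- unlist(lapply(seq_along(students), pick))
after  <- unlist(lapply(seq_along(students), pick))

# factor() with explicit levels keeps students who received zero choices
received_before <- table(factor(before, levels = students))
received_after  <- table(factor(after, levels = students))

# a smaller sd afterwards would mean the choices are spread more evenly
spread <- c(before = sd(as.vector(received_before)), 
    after = sd(as.vector(received_after)))
spread
```

Using explicit factor levels matters here: a plain table() silently drops anyone nobody picked, which would also throw off the label sizes in the script above.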

For an additional script of this analysis with 20 students click here.

For helpful igraph documentation click here

Posted in igraph | 4 Comments

igraph and structured text exploration

I am in the slow process of developing a package to bridge structured text formats (i.e. classroom transcripts)  with the tons of great R packages that visualize and analyze quantitative data (If you care to play with a rough build of this package (qdap) see: https://github.com/trinker/qdap). One of the packages qdap will bridge to is igraph.

A while back I came across a blog post on igraph and word statistics (LINK).  It inspired me to learn a little bit about graphing and the igraph package and provided a nice intro to learn.  As I play with this terrific package I feel it is my duty to share my experiences with others who are just starting out with igraph as well.   The following post is a script and the plots created with a word frequency matrix (similar to a term document matrix from the tm package) and igraph:

Build a word frequency matrix and convert to an adjacency matrix

set.seed(10)
X <- matrix(rpois(100, 1), 10, 10)
colnames(X) <- paste0("Guy_", 1:10)
rownames(X) <- c('The', 'quick', 'brown', 'fox', 'jumps',
    'over', 'a', 'bot', 'named', 'Dason')
X #word frequency matrix
Y <- X >= 1
Y <- apply(Y, 2, as, "numeric") #boolean matrix
rownames(Y) <- rownames(X)
Z <- t(Y) %*% Y  #adjacency matrix
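To see what the adjacency matrix holds, here is the same computation on a made-up 3 word by 3 person matrix that is small enough to check by hand. Each off-diagonal cell counts the words two people share, and each diagonal cell counts the distinct words a person used.

```r
# Made-up word frequency matrix: rows are words, columns are people
X <- matrix(c(2, 0, 1,
              0, 1, 1,
              3, 0, 1), nrow = 3, byrow = TRUE,
    dimnames = list(c("fox", "dog", "bot"), c("Guy_1", "Guy_2", "Guy_3")))

Y <- (X >= 1) * 1  # 1 if the person used the word at all, else 0
Z <- t(Y) %*% Y    # person by person adjacency matrix
Z

diag(Z)     # distinct words per person: 2, 1, 3
colSums(X)  # total word counts per person: 5, 1, 3
```

Guy_1 and Guy_3 share two words (fox and bot), so Z["Guy_1", "Guy_3"] is 2, while Guy_1 and Guy_2 share none.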

Build a graph from the above matrix

library(igraph)
g <- graph.adjacency(Z, weighted=TRUE, mode ='undirected')
# remove loops
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)

#Plot a Graph
set.seed(3952)
layout1 <- layout.auto(g)
#for more on layout see:
browseURL("http://finzi.psych.upenn.edu/R/library/igraph/html/layout.html")
opar <- par()$mar; par(mar=rep(0, 4)) #Give the graph lots of room
plot(g, layout=layout1)


Alter widths of edges based on dissimilarity of people’s dialogue

#adjust the widths of the edges and add distance measure labels
#use 1 - binary (?dist) a proportion distance of two vectors
#1 is perfect and 0 is no overlap (using 1 - binary)

edge.weight <- 7  #a maximizing thickness constant
z1 <- edge.weight*(1-dist(t(X), method="binary"))
E(g)$width <- c(z1)[c(z1) != 0] #remove 0s: these won't have an edge
z2 <- round(1-dist(t(X), method="binary"), 2)
E(g)$label <- c(z2)[c(z2) != 0]
plot(g, layout=layout1) #check it out! 
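If the "binary" method of dist is new to you, this small sketch (the vectors a and b are made up) shows that 1 - binary distance is the share of positions where both vectors are "on" out of the positions where at least one is "on":

```r
# Two hypothetical people and whether they used each of five words
a <- c(1, 1, 0, 1, 0)
b <- c(1, 0, 0, 1, 1)

# dist() works on rows, so bind the two vectors as rows
similarity <- 1 - dist(rbind(a, b), method = "binary")

# The same proportion computed by hand
by_hand <- sum(a & b) / sum(a | b)

as.numeric(similarity)
by_hand
```

Both words 1 and 4 are shared, and words 1, 2, 4 and 5 are used by at least one of the two, so the overlap works out to 2 out of 4.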


Scale the label cex based on word counts

SUMS <- diag(Z) #distinct words used per person (same as colSums(Y))
label.size <- .5 #a maximizing label size constant
V(g)$label.cex <- (log(SUMS)/max(log(SUMS))) + label.size
plot(g, layout=layout1) #check it out!
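A quick sketch of what the scaling formula does, using made-up counts (SUMS_toy is hypothetical).  The largest count always ends up at 1 + label.size and everything else is scaled down on the log scale:

```r
SUMS_toy   <- c(2, 4, 8)   # hypothetical counts for three people
label.size <- .5           # same constant as in the script above

cex_toy <- log(SUMS_toy) / max(log(SUMS_toy)) + label.size
cex_toy
```

One thing to watch for: a count of 1 gives log(1) = 0, so that label gets exactly label.size, and a count of 0 gives -Inf, which would need handling before plotting.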
 


Add vertex coloring based on factoring

#add factor information via vertex color
set.seed(15)
V(g)$gender <- rbinom(10, 1, .4)
V(g)$color <- ifelse(V(g)$gender==0, "pink", "lightblue")

plot(g, layout=layout1) #check it out!
plot(g, layout=layout1, edge.curved = TRUE) #curve it up

par(mar=opar) #reset margins 



Try it interactively with tkplot

#interactive version
tkplot(g)  #an interactive version of the graph
tkplot(g, edge.curved =TRUE) 

This is just scratching the surface of igraph’s capabilities. Click here for a link to more igraph documentation.

This post was me toying with different ideas and concepts. If you see a way to improve the code/thinking please leave a comment.

For a .txt version of this demonstration click here


reshape (from base) Explained: Part II

Part II Explains More Complex  Wide to Long With base reshape 

In part I of this base reshape tutorial we went over the basics of reshaping data with reshape.  We learned two rules that help us to be more efficient and effective in using this powerful base tool:

RULE 1: Stack repeated measures/Replicate and stack everything else

RULE 2: Naming your columns in a way R likes makes your life easier

In part II we will be looking at more complex wide to long reshapes (more than one series of repeated measures) by building on what we learned in part I.  Let’s start by generating some data with two series of (nested) repeated measures:

set.seed(10)
dat <- data.frame(id=paste0("ID.", 1:5), 
    sex=sample(c("male", "female"), 5, replace=TRUE), 
    matrix(rpois(30, 10), 5, 6))
colnames(dat)[-c(1:2)] <- paste0(rep(1:2, each=3), 
    rep(c("work", "home", "church"), 2))
dat

Which looks like this:

    id    sex 1work 1home 1church 2work 2home 2church
1 ID.1 female     7     8       7    10     6      10
2 ID.2   male    10    13      10     7    13      15
3 ID.3   male    11    10       6    10    10       7
4 ID.4 female     6     8      12     9    15       7
5 ID.5   male     9    11      15    10    10      12

As you can see we have nested repeated measures at three different locations (work, home, church) at two different times (the 1 or 2 prefix).  Now let’s follow Rule 2 and get our names in a way R likes them (You may ask why I didn’t name them correctly to begin with?  Fair question.  Let me ask one though.  Have you ever gotten a data set that was 100% the way you wanted it to be?).

names(dat) <- gsub("([0-9]+)([a-z]+)", "\\2\\.\\1", names(dat))
###############################################################
# BASICALLY, THIS SAYS FIND THE NAMES THAT ARE NUMERIC-ALPHA; #
# OTHERWISE LEAVE THEM ALONE. THE [0-9]+ SAYS FIND THE        #
# NUMERIC STRING (THE PLUS SIGN SAYS MATCH THE PRECEDING      #
# CHARACTER CLASS 1 OR MORE TIMES). THE [a-z]+ SAYS FIND THE  #
# ALPHA STRING (PLUS AGAIN MEANS 1 OR MORE TIMES). THE "." IS #
# THE CHARACTER I'M INSERTING AND THE 1 AND 2 CORRESPOND TO   #
# THE PARENTHESES IN THE PATTERN ARGUMENT OF gsub. BASICALLY  #
# WE'RE FLIP-FLOPPING THE POSITIONS OF 1 AND 2.               #
###############################################################
#==================================================================
##############################################################
# OR MANUAL REPLACEMENT. YOU CAN SEE WHERE REGEX CAN COME IN #
# HANDY AS THE DATA SET GROWS.                               #
##############################################################
#names(dat)[-c(1:2)] <- c("work.1", "home.1", "church.1", 
#    "work.2", "home.2", "church.2")

Which now looks like:

    id    sex work.1 home.1 church.1 work.2 home.2 church.2
1 ID.1 female      7      8        7     10      6       10
2 ID.2   male     10     13       10      7     13       15
3 ID.3   male     11     10        6     10     10        7
4 ID.4 female      6      8       12      9     15        7
5 ID.5   male      9     11       15     10     10       12
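If the regex still feels like magic, here is the same gsub call run on a small made-up vector of names (old_names is hypothetical) so you can see which names are touched and which are left alone:

```r
# Hypothetical column names: two "everything else" columns and three
# repeated measures written as number followed by letters
old_names <- c("id", "sex", "1work", "2home", "12church")

# Group 1 = the digits, group 2 = the letters; the replacement writes
# group 2, a period, then group 1
new_names <- gsub("([0-9]+)([a-z]+)", "\\2.\\1", old_names)
new_names
```

Names with no digits-then-letters pattern (id, sex) pass through unchanged, and because of the plus sign a multi-digit prefix such as 12 moves as one piece.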

Alright in part I we learned the following arguments:

        • data – dataframe you’re supplying reshape
        • direction – either ‘long’ or ‘wide’ (in this case we are going to long so choose that)
        • varying – the repeated measures columns we want to stack (takes indexes or column names; I’m lazy and will use indexes, but if you want names use: c(“colname1”, “colname2”, “colname…n”))
        • v.names – This is what we call the measurements (values) of each repeated measure.  Name it anything you want.
        • timevar – This is what we’ll call the times of each repeated measures (the categorical variable if you will).  Name it anything you want.
        • times – Basically this is your:

 (# of starting rept. meas. cols.) ÷ (final # of stacked cols.) = (times vector length)
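The arithmetic can be sketched in a couple of lines.  The object names here (varying_idx and friends) are just for illustration, and the indexes are the ones used for the two-times stack in this post:

```r
# Six repeated measures columns stacked into two value columns
varying_idx <- list(3:5, 6:8)

n_start   <- length(unlist(varying_idx))  # starting repeated measures columns
n_stacked <- length(varying_idx)          # final number of stacked columns
n_times   <- n_start / n_stacked          # length the times vector must have

n_times
```

So for this layout a times vector needs three entries, one per column inside each element of the list.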

In the first example we want a time 1 column and a time 2 column, made by stacking all the locations for time 1 in one column and all the locations for time 2 in another (these are the v.names columns).  Since we have two times we’ll need two column names (I called them TIME_1 and TIME_2 but this is up to you).  We’ll keep track of the locations in the timevar column.  The major difference between simple repeated measures and more complex repeated measures is that we don’t supply a single index of columns to varying but a list of indexes.  This is where rule 1 becomes important.  What are you stacking?  In this case we want to take everything in time 1 and stack it, do the same for time 2, and use timevar to keep track of the locations.  In the example code below I have:

  1. The bare bones example (default time and value column names)
  2. An example with named time and value columns (numeric values in the time cells)
  3. An example with named columns and locations for the time cell values (adjusted with the times argument)

################
# BARE MINIMUM #
################
reshape(dat,                      #dataframe
    direction="long",             #wide to long
    varying=list(c(3:5), c(6:8)), #repeated measures list of indexes
    idvar='id')

###################################################
# STACKING OF TIME 1 AND 2 REPEAT EVERYTHING ELSE #
###################################################
reshape(dat,                      #dataframe
    direction="long",             #wide to long
    varying=list(c(3:5), c(6:8)), #repeated measures list of indexes
    #idvar='id',                  #1 or more of what's left
    timevar="PLACE",              #the repeated measures times
    v.names=c("TIME_1", "TIME_2"))#the repeated measures values

##################################################
# STACKING OF TIME 1 AND 2 WITH NAMED TIME CELLS #
##################################################
dat2 <- reshape(dat,               #dataframe
    direction="long",              #wide to long
    varying=list(c(3:5), c(6:8)),  #repeated measures list of indexes
    #idvar='id',                   #1 or more of what's left
    timevar="PLACE",               #the repeated measures times
    v.names=c("TIME_1", "TIME_2"), #the repeated measures values
    times =c("wrk", "hom", "chr")) 
row.names(dat2) <- NULL
dat2

The final outcome is:

     id    sex PLACE TIME_1 TIME_2
1  ID.1 female   wrk      7     10
2  ID.2   male   wrk     10      7
3  ID.3   male   wrk     11     10
4  ID.4 female   wrk      6      9
5  ID.5   male   wrk      9     10
6  ID.1 female   hom      8      6
7  ID.2   male   hom     13     13
8  ID.3   male   hom     10     10
9  ID.4 female   hom      8     15
10 ID.5   male   hom     11     10
11 ID.1 female   chr      7     10
12 ID.2   male   chr     10     15
13 ID.3   male   chr      6      7
14 ID.4 female   chr     12      7
15 ID.5   male   chr     15     12
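Because the numbers above depend on the random seed, here is a small deterministic sketch of the same list-of-indexes idea that you can check by eye.  The data frame wide_toy and its values are made up, with two places and two times:

```r
# Hypothetical wide data: two places (work, home) at two times
wide_toy <- data.frame(id = c("a", "b"),
                       work.1 = c(1, 2), home.1 = c(3, 4),
                       work.2 = c(5, 6), home.2 = c(7, 8))

long_toy <- reshape(wide_toy,
    direction = "long",
    varying = list(2:3, 4:5),          # time 1 columns, then time 2 columns
    timevar = "PLACE",                 # keeps track of the place
    v.names = c("TIME_1", "TIME_2"),   # one name per element of the list
    times = c("wrk", "hom"))           # one label per column within an element
row.names(long_toy) <- NULL
long_toy
```

Each element of the list becomes one stacked column, and the position within each element decides which PLACE label the row gets.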

This may be what we want but what if we wanted to have a work, home and church column by stacking all the times for work on each other, all the times for home and all the times for church (these are the v.names columns)?  Well we do this with the list of indexes we supply to varying.  This again is rule number 1.  We know we have three v.names columns (the locations) so we need three indexes to pass as a list to varying.  We want to stack all the times for work so we supply the indexes 3 (work.1) and 6 (work.2) and do the same for home (c(4, 7)) and church (c(5, 8)).  We now switch timevar to TIME because it’s no longer keeping track of the locations and the v.names will be given the three locations as names.  We could also supply a times argument to reshape but there’s no need since the default numeric index (1, 2) already describes the two times.

################################
# STACKING OF THE THREE PLACES #
################################
dat3 <- reshape(dat,                          #dataframe
    direction="long",                         #wide to long
    varying=list(c(3, 6), c(4, 7), c(5, 8)),  #repeated measures list of indexes
    #idvar='id',                              #1 or more of what's left
    timevar="TIME",                           #the repeated measures times
    v.names=c("WORK", "HOME", "CHURCH"))      #the repeated measures values
row.names(dat3) <- NULL
dat3

Remember rule 2?  The rule about naming.  It’s on these more complex reshapes (more than one series of repeated measures/nested repeated measures) that proper naming pays off.  We passed varying a list of indexes because reshape can’t figure out who’s who if you haven’t named the columns correctly, but since we named them to have the three locations followed by a period and then a numeric index our life is easy peasy cheesy.  Look below and you’ll see all we do is tell varying what columns are repeated measures and he figures out what to stack from the names.  Additionally, there’s no need to supply the argument v.names because R is such a smarty he figured it out all by himself (what a big boy).  You ask well why didn’t this work for stacking above with two times (the dat2 example)?  Good question.  It doesn’t work because we need to have the form measurement_column_name.time_column.  Our rename job at the beginning produced work.time, home.time, church.time.  So in this example our three measurement columns will be work, home, and church and the numeric index after each name indicates which time.  If we wanted to have it easy for the dat2 example we would have had to name the repeated measures time_1.1, time_1.2, time_1.3, time_2.1, time_2.2, time_2.3.  The dot numeric index at the end stands for the three locations.  If you’re interested in seeing this please see the link to the script of this demonstration found at the bottom of this article as it contains extra code not found in this post.

So you have three approaches

  1. Name it correctly (just indexes 1:n)
  2. Provide a list of indexes (who cares about names)
  3. Both name correctly and list of indexes (safety my friend)

###############################################################
# STACKING OF THE THREE PLACES REWARDED BY GOOD COLUMN NAMING #
###############################################################
dat3 <- reshape(dat,                          #dataframe
    direction="long",                         #wide to long
    varying=3:8,                              #indexes
    #idvar='id',                              #1 or more of what's left
    timevar="TIME")                           #the repeated measures times
    #v.names=c("WORK", "HOME", "CHURCH"))     #Rewarded: no need for v.names
row.names(dat3) <- NULL
dat3

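Which gives us (when reshape guesses the names the value columns come out lowercase, as work, home and church; the capitalized headers below come from the v.names version above):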

     id    sex TIME WORK HOME CHURCH
1  ID.1 female    1    7    8      7
2  ID.2   male    1   10   13     10
3  ID.3   male    1   11   10      6
4  ID.4 female    1    6    8     12
5  ID.5   male    1    9   11     15
6  ID.1 female    2   10    6     10
7  ID.2   male    2    7   13     15
8  ID.3   male    2   10   10      7
9  ID.4 female    2    9   15      7
10 ID.5   male    2   10   10     12

Hold the phone Fenster!

So let me get this straight.  If I’ve been a good R user and followed Rule #2 (name the way R liketh) then all I have to provide reshape is data, direction and varying (maybe idvar)?  Yep that’s right.  See, I told you that nameology was important; it makes your life easy.  Don’t believe me?  Try it out:

reshape(dat, direction="long", varying=3:8)

See reshape is actually pretty simple once you figure it out.
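Here is a small deterministic sketch of that name-guessing at work.  The data frame wide_toy is made up; the only thing that matters is that the repeated measures are named measure.time:

```r
# Hypothetical wide data with well-behaved names (measure.time)
wide_toy <- data.frame(id = c("a", "b"),
                       work.1 = c(1, 2), home.1 = c(3, 4),
                       work.2 = c(5, 6), home.2 = c(7, 8))

# Only data, direction and varying; reshape splits the names on the period
long_toy <- reshape(wide_toy, direction = "long", varying = 2:5)
long_toy
```

The part before the period becomes the value column name (work, home) and the part after it becomes the entry in the default time column.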

But sometimes we need to stack all the repeated measures into one column (for certain analyses and visualizations) and keep track of both time and location.  To do this we simply supply all repeated measures columns to varying (indexes 3:8) as a vector (not a list, as we only want one final column and lists are for when we want multiple repeated measures columns), provide v.names and timevar with appropriate names (I chose LOC_TIME for timevar as both the nested repeated measures of location and time will be in this column), and last give a vector of names to the times argument.  Keep in mind that reshape will stack the columns you gave to varying in the order you supplied them.  To figure out the number of times (as stated above) we take the original number of columns and divide by the total number of end columns (6 ÷ 1 = 6), which means we have to supply 6 names to the times argument (otherwise we get the default numeric 1-6, which can be pretty difficult to keep track of).  This is where paste and R’s recycling rule come in handy.  Simply supply paste with the first vector of repeated measure series (location) and then the second, but use rep with the second, providing each = (# of first series of repeated measures).  The recycling rule will take care of the rest.

###############################################################
# DOUBLE STACK. STACK TIMES AND PLACES AND NOTE EACH TIME AND #
# PLACE.  # of TIMES = # OF COLUMNS STACKED.                  #
###############################################################
dat4 <- reshape(dat,              #dataframe
    direction="long",             #wide to long
    varying=3:8,                  #repeated measures vector of indexes
    #idvar='id',                  #1 or more of what's left
    timevar="LOC_TIME",           #the repeated measures times
    v.names=c("VALUE"),           #the repeated measures values
    times =paste(c("work", "home", "church"), rep(1:2, each=3)))
row.names(dat4) <- NULL
dat4

This gives us:

     id    sex LOC_TIME VALUE
1  ID.1 female   work 1     7
2  ID.2   male   work 1    10
3  ID.3   male   work 1    11
4  ID.4 female   work 1     6
5  ID.5   male   work 1     9
6  ID.1 female   home 1     8
7  ID.2   male   home 1    13
8  ID.3   male   home 1    10
.
.
.
29 ID.4 female church 2     7
30 ID.5   male church 2    12
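The times vector built with paste and rep is worth looking at on its own, since its order has to match the order of the columns handed to varying.  A minimal sketch (loc_time_labels is just an illustrative name):

```r
# The three locations are recycled against 1, 1, 1, 2, 2, 2
loc_time_labels <- paste(c("work", "home", "church"), rep(1:2, each = 3))
loc_time_labels
```

That gives six labels in the order work 1, home 1, church 1, work 2, home 2, church 2, which lines up with columns 3 through 8 of the renamed data set.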

This is nice but the information for the timevar (location and time) is all mashed together in one column and may make analysis or visualization functions difficult.  The best approach would be to split this data into two different columns.  Many people are familiar with Wickham’s colsplit from the reshape2 package.  This is one approach.  I also have a function called colsplit2 that uses only base R and that I keep in my .Rprofile (I actually call it colsplit as well but for namespace purposes we’ll call it colsplit2).  This is similar to Wickham’s but a little different.  With Wickham’s you provide just the one column, it splits it into two, and you then need to cbind the result back to the original somehow.  My function takes the dataframe and the column to be split and outputs a new data frame with two columns in the same place as the original singular column.  This is a base alternative if you’re attempting to avoid dependencies.  For this tutorial I’ll use my function but the downloadable script has both methods.

#############################################
# ALTERNATE BASE METHOD OF COLUMN SPLITTING #
#############################################
colsplit2 <- function(dataframe, splitcol, new.names=NULL, sep=""){
    if(is.numeric(dataframe[, splitcol])) stop("splitcol can not be numeric")
    #split each value on sep and bind the pieces into new columns
    X <- data.frame(do.call(rbind, strsplit(as.vector(
        dataframe[, splitcol]), split = sep)))
    #z is the position of the column being split
    z <- if (!is.numeric(splitcol)) match(splitcol, names(dataframe)) else splitcol
    if (!is.null(new.names)) colnames(X) <- new.names
    #put the new columns back where the original column was
    if (z!=1 & ncol(dataframe) > z) {
        cbind(dataframe[, 1:(z-1), drop=FALSE], X, 
            dataframe[, (z + 1):ncol(dataframe), drop=FALSE])
    } else {
        if (z!=1 & ncol(dataframe) == z) {
            cbind(dataframe[, 1:(z-1), drop=FALSE], X)
        } else {
            if (z==1 & ncol(dataframe) > z) {
                cbind(X, dataframe[, (z + 1):ncol(dataframe), drop=FALSE])
            } else {
                X
            }
        }
    }
} #END OF colsplit2 FUNCTION

dat4 <- colsplit2(dat4, "LOC_TIME", c("place", "time"), " ")

We now have:

     id    sex  place time VALUE
1  ID.1 female   work    1     7
2  ID.2   male   work    1    10
3  ID.3   male   work    1    11
4  ID.4 female   work    1     6
5  ID.5   male   work    1     9
6  ID.1 female   home    1     8
7  ID.2   male   home    1    13
8  ID.3   male   home    1    10
.
.
.
29 ID.4 female church    2     7
30 ID.5   male church    2    12
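The heart of the split is just strsplit plus rbind.  Here is a stripped-down sketch on a made-up vector (loc_time_toy is hypothetical) that also turns the time piece back into a number, which the split alone does not do:

```r
# Hypothetical combined labels like the LOC_TIME column
loc_time_toy <- c("work 1", "home 1", "church 2")

# Split on the space and bind the pieces into a two-column matrix
pieces <- do.call(rbind, strsplit(loc_time_toy, " ", fixed = TRUE))

split_cols <- data.frame(place = pieces[, 1],
                         time  = as.integer(pieces[, 2]),
                         stringsAsFactors = FALSE)
split_cols
```

This only works cleanly when every label splits into the same number of pieces; with ragged labels rbind would recycle values.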

Let’s do a bit of visualization with one of my favorite packages, Wickham’s ggplot2.  For social sciences (and particularly repeated measures) the faceting with facet_grid is pretty nice.  First, one little change to the time column to make the labels on facet_grid nicer.  I use a paste approach that alters the actual variable because it’s easier to explain, but in real practice I don’t like to alter a variable; I prefer to add another column or approach it by other means.  The website Cookbook for R provides a very nice alternative to altering your variable content using the labeller argument of facet_grid (look under the heading Modifying facet label text in the link).

###############################################################
# MAKE THE NAMES ON LABELS PRETTY FOR GGPLOT FACETING (ONE OF #
# MANY APPROACHES)                                            #
###############################################################
dat4$time <- paste("time", dat4$time)
########################
# PLOT IT WITH GGPLOT2 #
########################
library(ggplot2)
ggplot(data=dat4, aes(sex, VALUE)) +
    geom_boxplot() + facet_grid(place~time)

ggplot(data=dat4, aes(place, VALUE)) +
    geom_boxplot() + facet_grid(time~sex)

faceted boxplot 1

faceted boxplot 2
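For the "add another column" approach mentioned above, a minimal base sketch might look like this (toy_long and the time_label column are made-up names for illustration):

```r
# Hypothetical long data with a numeric time column
toy_long <- data.frame(place = c("work", "home"),
                       time  = c(1, 2),
                       VALUE = c(7, 8))

# Leave time alone and keep the pretty labels in a column of their own
toy_long$time_label <- paste("time", toy_long$time)
toy_long
```

The facets could then be built on time_label instead of time, and the original column stays available for any analysis that needs it.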

In Part III of this series we’ll look at the less used long to wide format.

For a .txt version of this demonstration click here


reshape (from base) Explained: Part I

This Post Will Explain the Basics of Wide to Long With base reshape (part I)

Often your data set is in wide format and some sort of analysis or visualization requires putting the data set into long format.  Hadley Wickham has a package for reshaping data called reshape2 that is pretty handy for quickly reshaping data with the melt and cast functions.  I learned to use this long before I learned the base function reshape for doing the same task.  I suspect many of you are in the same boat and may never have learned to use base’s reshape, period.  There’s a reason for that: the arguments are not as intuitive as those in Wickham’s package, the description of the function (LINK) is very difficult for beginners, and this function is actually two functions in one (a wide to long as well as a long to wide function).

So you’ve mastered Wickham’s reshape2 package and are thinking “Why the fudgesicle should I learn a confusing function like reshape when I got Hadley?”  Here’s my list:

  1. It’s powerful
  2. It’s flexible
  3. It’s in base (no dependencies)
  4. It may be faster for large data sets

So which approach should you use?  The best one for the job.  Alright let’s tear into base’s reshape and take some mystery away from how to work it (I got 2 rules to help guide your thinking).

RULE 1: Stack repeated measures/Replicate and stack everything else

Basically you want to:

  1. Take repeated measures columns and stack them as a measures column
  2. Put their column names next to them in a new times column (so you can keep track of which time is which time)
  3. And then replicate everything else that’s left as many times as you had repeated measures and stack it all.

So in the data frame below we have 3 repeated measures (time1, time2, time3) and the “everything else” is the id column.

    id time1 time2 time3
1 ID.1  5.01  5.12  8.62
2 ID.2 79.40 81.42 81.29
3 ID.3 80.37 83.12 85.92

We want to stack the last three columns, making sure to put their respective column name next to them and then we want to replicate the id part of the data frame and stack it 3 times because that’s how many repeated measures we have.  So the final product will look like this:

    id time results
1 ID.1    1    5.01
2 ID.2    1   79.40
3 ID.3    1   80.37
4 ID.1    2    5.12
5 ID.2    2   81.42
6 ID.3    2   83.12
7 ID.1    3    8.62
8 ID.2    3   81.29
9 ID.3    3   85.92
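Before touching reshape it may help to see Rule 1 done by hand.  This is only a sketch of the idea in plain base R (the object names are made up) using the three-row data frame above:

```r
# The small wide data frame shown above
wide_toy <- data.frame(id = c("ID.1", "ID.2", "ID.3"),
                       time1 = c(5.01, 79.40, 80.37),
                       time2 = c(5.12, 81.42, 83.12),
                       time3 = c(8.62, 81.29, 85.92))
measures <- c("time1", "time2", "time3")

long_toy <- data.frame(
    # replicate "everything else" once per repeated measure
    id = rep(wide_toy$id, times = length(measures)),
    # note which repeated measure each stacked value came from
    time = rep(seq_along(measures), each = nrow(wide_toy)),
    # stack the repeated measures columns one after another
    results = unlist(wide_toy[measures], use.names = FALSE))
long_toy
```

This is the same stack, label and replicate recipe that reshape carries out for you, with more bookkeeping handled along the way.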

RULE 2: Naming your columns in a way R likes makes your life easier

Here’s a stackoverflow.com example of someone with this very problem of not naming columns the way R likes (this was added after this blog post was written) [LINK click here].

When I got this little fact down reshape became a lot easier to operate.  So how does R like your columns to look?  Well R doesn’t give a rip what your “everything else” columns look like but it likes the repeated measures in the form “time.1”, that is: a word common to all repeated measures -> followed by a period -> followed by a sequence of numbers or alphanumeric characters.

I promise you getting this down makes your life easier.  It enables varying to figure out what columns are what with more complex problems.  Alright let’s generate some data using the DFgen function and look at ways to rename the columns (you can source it if you haven’t saved it to your .Rprofile).  The last three columns are our repeated measures.

###########################
# LOAD THE DFgen FUNCTION #
###########################
source("http://dl.dropbox.com/u/61803503/DFgen_fun.txt")
######################
# GENERATE SOME DATA #
######################
set.seed(10);dat <- DFgen()[1:5, -c(6:10)]
redat <- dat #keep a copy with the original names for rename method 2

The data set looks like this:

    id   group hs.grad  race gender score time1 time2 time3
1 ID.1   treat     yes white   male -1.24 51.39 52.15 53.76
2 ID.2 control     yes black   male -0.46 32.21 35.07 33.10
3 ID.3 control     yes white   male -0.83 43.36 45.46 46.22
4 ID.4   treat      no white   male  0.34 71.63 72.06 74.49
5 ID.5 control     yes white female  1.07  9.26 12.24 11.02

Now let’s rename time1, time2 and time3 the way R likes (makes life easy peasy cheesy).  There are two approaches: 1) I’ll do it manually because regex is kinda a pain to learn 2) I’ll use regex because a) I like to show off b) I am somehow brilliant and know how already c) my data set is huge (many # of vars) and it’s more of a pain to do it manually d) all of the above.  I ain’t gonna lie, regex takes some learning but can be a valuable asset and a time saver.

#Variable Rename Method 1
names(dat)[7:9] <- c("time.1", "time.2", "time.3")
dat

#Variable Rename Method 2
dat <- redat #reload the data set with the old names
names(dat) <- gsub("([a-z])([0-9])", "\\1\\.\\2", names(dat))
########################################################################
# Basically this says find a letter a-z followed by a digit 0-9, split #
# them apart into pieces one and two; then the second part says take   #
# pieces one and two, put a period between them and put them back      #
# together. If there's not a pattern of alpha followed by numeric      #
# then leave those names alone.                                        #
########################################################################
dat
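To watch the regex work without needing the DFgen data, here is the same call on a made-up vector of names (old_names is hypothetical):

```r
# Hypothetical names: some "everything else" columns and some repeated measures
old_names <- c("id", "hs.grad", "time1", "time2", "time10")

# A letter followed by a digit gets a period slipped in between
new_names <- gsub("([a-z])([0-9])", "\\1.\\2", old_names)
new_names
```

Names with no letter-then-digit pattern are left alone, and a multi-digit suffix such as 10 only picks up a period at the point where the letters end.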

Alright we’ve satisfied the R beast’s desire for nicely formatted names, now our life is easy. Let’s learn the bare minimum of what reshape needs now.  You have to tell reshape:

  • data – dataframe you’re supplying reshape
  • direction – either ‘long’ or ‘wide’ (in this case we are going to long so choose that)
  • varying – the repeated measures columns we want to stack (takes indexes or column names; I’m lazy and will use indexes, but if you want names use: c(“colname1”, “colname2”, “colname…n”))

Alright let’s see what that gives us:

reshape(dat,
    direction="long",
    varying=7:9)

Which yields:

         id   group hs.grad  race gender score  time
ID.1.1 ID.1   treat     yes white   male -1.24 51.39
ID.2.1 ID.2 control     yes black   male -0.46 32.21
ID.3.1 ID.3 control     yes white   male -0.83 43.36
ID.4.1 ID.4   treat      no white   male  0.34 71.63
ID.5.1 ID.5 control     yes white female  1.07  9.26
ID.1.2 ID.1   treat     yes white   male -1.24 52.15
ID.2.2 ID.2 control     yes black   male -0.46 35.07
ID.3.2 ID.3 control     yes white   male -0.83 45.46
ID.4.2 ID.4   treat      no white   male  0.34 72.06
ID.5.2 ID.5 control     yes white female  1.07 12.24
ID.1.3 ID.1   treat     yes white   male -1.24 53.76
ID.2.3 ID.2 control     yes black   male -0.46 33.10
ID.3.3 ID.3 control     yes white   male -0.83 46.22
ID.4.3 ID.4   treat      no white   male  0.34 74.49
ID.5.3 ID.5 control     yes white female  1.07 11.02

This ain’t bad but (a) the row names are annoying, (b) the column called time actually holds the measurements and (c) speaking of time where’s that column?  Well we need to add some cute little arguments to get what we want.  Let’s look at some more arguments and see what they’ll give us:

  • v.names – This is what we call the measurements (values) of each repeated measure.  Name it anything you want.
  • timevar – This is what we’ll call the times of each repeated measures (the categorical variable if you will).  Name it anything you want.

Basically these guys are column renamers.  Also by specifying timevar it puts that column into the data set (remember he was nowhere to be found in the last step; the guessed values column, time, had overwritten the default timevar column of the same name).  Remember you can call them anything you want.  Let’s see what they’re doing:

reshape(dat,
    direction="long",
    varying=7:9,
    idvar='id',
    timevar="TIME",
    v.names="RESULTS")

Which yields (only showing the first 6 rows of data):

         id   group hs.grad  race gender score TIME RESULTS
ID.1.1 ID.1   treat     yes white   male -1.24    1   51.39
ID.2.1 ID.2 control     yes black   male -0.46    1   32.21
ID.3.1 ID.3 control     yes white   male -0.83    1   43.36
ID.4.1 ID.4   treat      no white   male  0.34    1   71.63
ID.5.1 ID.5 control     yes white female  1.07    1    9.26
ID.1.2 ID.1   treat     yes white   male -1.24    2   52.15
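If you can’t source DFgen, the same call can be tried on the small three-row data frame from Rule 1.  This is only a sketch with that toy data (wide_toy is a made-up name and the columns are already renamed the way R likes):

```r
# The Rule 1 data frame with R-friendly names
wide_toy <- data.frame(id = c("ID.1", "ID.2", "ID.3"),
                       time.1 = c(5.01, 79.40, 80.37),
                       time.2 = c(5.12, 81.42, 83.12),
                       time.3 = c(8.62, 81.29, 85.92))

long_toy <- reshape(wide_toy,
    direction = "long",
    varying = 2:4,          # the repeated measures columns
    idvar = "id",
    timevar = "time",       # name for the column of times
    v.names = "results")    # name for the column of stacked values
row.names(long_toy) <- NULL
long_toy
```

The result should match the hand-made final product shown under Rule 1: an id column, a time column of 1s, 2s and 3s, and a results column holding the stacked values.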
But what if the times in the TIME column weren’t really 1, 2, and 3 but were 3 locations like “work”, “home”, “church” and we want the data to represent this (this can make our life easier later on for analysis and visuals so we aren’t having to remember what 1, 2 & 3 are)?  Well we can via:

  • times – This guy is the way we specify what the 1, 2 and 3 are.  You must supply as many names as you have distinct time values in this column.

If you don’t mind I’m also going to rename the rows at this point because I can’t stand anything but ordinal numbers for rownames (and this is my blog so who’s going to stop me?).

dat2 <- reshape(dat,        #dataframe
    direction="long",       #wide to long
    varying=7:9,            #repeated measures index
    idvar='id',             #1 or more of what's left
    timevar="TIME",         #The repeated measures times
    v.names="RESULTS",      #the repeated measures values
    times =c("wrk", "hom", "chr"))
######################################
# RENAME THE ROWS TO ORDINAL NUMBERS #
######################################
row.names(dat2) <- NULL
dat2

Which yields:

     id   group hs.grad  race gender score TIME RESULTS
1  ID.1   treat     yes white   male -1.24  wrk   51.39
2  ID.2 control     yes black   male -0.46  wrk   32.21
3  ID.3 control     yes white   male -0.83  wrk   43.36
4  ID.4   treat      no white   male  0.34  wrk   71.63
5  ID.5 control     yes white female  1.07  wrk    9.26
6  ID.1   treat     yes white   male -1.24  hom   52.15
7  ID.2 control     yes black   male -0.46  hom   35.07
8  ID.3 control     yes white   male -0.83  hom   45.46
9  ID.4   treat      no white   male  0.34  hom   72.06
10 ID.5 control     yes white female  1.07  hom   12.24
11 ID.1   treat     yes white   male -1.24  chr   53.76
12 ID.2 control     yes black   male -0.46  chr   33.10
13 ID.3 control     yes white   male -0.83  chr   46.22
14 ID.4   treat      no white   male  0.34  chr   74.49
15 ID.5 control     yes white female  1.07  chr   11.02

If you’re the type who skipped over rule 2 (rename columns the way R likes), throwing caution to the wind, in this case you’ll get away with it because the format is pretty simple.  In fact I’d probably not rename the columns but if R squawks this is one of the first things to fix.

In part II of this reshape series we’ll explore more complex reshapes like double stacks and more than one series of repeated measures.

For a .txt version of this demonstration click here
