Sentence Drawing: Function vs. Art

I recently was reading the book “Functional Art” and came across the work of Stefanie Posavec. Her Sentence Drawings (click here to see and click here to learn) caught my attention. Here is a ggplot2 rendition:

From what I understand about this visualization technique it’s meant to show the aesthetic and organic beauty of language (click here for interview with artist). I was captivated and thus I began the journey of using ggplot2 to recreate a Sentence Drawing.


Getting Started

I decided to use data sets from the qdap package.

Installing Packages from GitHub

# install.packages("devtools")
library(devtools)
install_github("ggthemes", "jrnold")
install.packages("qdap")
install.packages("scales")

invisible(lapply(c("qdap", "ggplot2", "ggthemes", "scales", "grid"), 
    require, character.only = TRUE))

Right Turn Function

Stefanie Posavec describes the process for creating the Sentence Drawing by making a right turn at the end of each sentence. I went straight to work creating an inefficient solution to making right hand turns. Realizing the inefficiency, I asked for help and utilized this response from flodel. Here is the solution as a function that you’ll need to run.

turn_it <- function(dataframe, len.col, turn = -pi/2) {

    dat <- dataframe
    dat[, "turn"] <- rep(turn, nrow(dataframe))
    dat <- within(dat, { 
        facing <- pi/2 + cumsum(turn)
        move <- dat[, len.col] * exp(1i * facing)
        position <- cumsum(move)
        x2 <- Re(position)
        y2 <- Im(position)
        x1 <- c(0, head(x2, -1))
        y1 <- c(0, head(y2, -1))
    })

    dat[, c("x1", "y1", "x2", "y2")] <- 
        lapply(dat[, c("x1", "y1", "x2", "y2")], round, digits=0)
    data.frame(dataframe, dat[, c("x1", "y1", "x2", "y2")])
}

Plot It

Here are the turns represented visually.

n <- 15
set.seed(11)
(dat <- data.frame(id = paste("X", 1:n, sep="."), 
    lens=sample(1:25, n, replace=TRUE)))
##      id lens
## 1   X.1    7
## 2   X.2    1
## 3   X.3   13
## 4   X.4    1
## 5   X.5    2
## 6   X.6   24
## 7   X.7    3
## 8   X.8    8
## 9   X.9   23
## 10 X.10    4
## 11 X.11    5
## 12 X.12   12
## 13 X.13   23
## 14 X.14   22
## 15 X.15   19
ggplot(turn_it(dat, "lens"), aes(x = x1, y = y1, xend = x2, yend = y2)) + 
    geom_segment(aes(color=id), size=3,lineend = "round") + 
    ylim(c(-40, 10)) + xlim(c(-20, 40))

plot of chunk fig1


Apply to Romeo and Juliet

Now that I had this accomplished I set to work with Romeo and Juliet.

Setting Up a Data Set

dat2b <- rajSPLIT
dat2b$wc <- wc(rajSPLIT$dialogue)
dat2b <- dat2b[!is.na(dat2b[, "wc"]), ]

## Reassign names to family affiliation
dat2b[, "fam.aff"] <- factor(lookup(as.character(dat2b[, "fam.aff"]), 
    levels(dat2b[, "fam.aff"])[1:3], qcv(Escalus, Capulet, Montague), 
    missing = NULL))

## Make dataframe with the beginning coordinates of each act
beg_act <- do.call(rbind, lapply(with(turn_it(dat2b, "wc"), 
    split(turn_it(dat2b, "wc"), act)), function(x) {
        x[1, qcv(act, x1, y1, x2, y2)]
}))

Romeo and Juliet Plotted

ggplot(turn_it(dat2b, "wc"), aes(x = x1, y = y1, xend = x2, yend = y2)) + 
    geom_segment(aes(color=fam.aff), lineend = "butt", size=1) +
    #geom_point(x=0, y=0, size=5, shape="S") +
    #geom_point(data=dat4b, aes(x=-106, y=-273), size=5, shape="E") + 
    geom_point(data=beg_act, aes(x = x1, y=y1), size=2.3,
        colour = "grey25") +
    geom_text(data=beg_act, aes(x = x1, y=y1, label = paste("Act", act)), 
        colour = "grey25", hjust = -.1, size=5, fontface = "bold") +
    guides(colour = guide_legend(override.aes = list(alpha = 1))) + 
    theme_few() + 
    scale_colour_few(name="Family\nAffiliation") +
    theme(axis.ticks = element_blank(), 
        axis.text = element_blank(),  
        axis.title= element_blank(),
        legend.position = c(.1, .85),
        legend.title.align = .5) +
    ggtitle("Romeo and Juliet Family\nAffiliation: Sentence Drawing")

plot of chunk fig2

After this I wanted to try to fill by sentence level polarity using a newer polarity (sentiment) algorithm from qdap.

poldat <- polarity(dat2b[, "dialogue"])

ggplot(turn_it(poldat[["all"]], "wc"), aes(colour=polarity)) + 
    geom_segment(aes(x = x1, y = y1, xend = x2, yend = y2), 
        lineend = "round", size=1) + 
    theme_few() +
    theme(panel.background = element_rect(fill = "grey20"),
        axis.ticks = element_blank(), 
        axis.text = element_blank(),  
        axis.title= element_blank(),
        legend.direction = "horizontal",
        legend.title = element_text(colour="white"),
        legend.background = element_blank(),
        legend.text = element_text(colour="white"),
        legend.position = c(.80, .07))  + 
    scale_colour_gradient2(name="", low = muted("blue"), 
        mid = "white", high = muted("red"))  +
    guides(colour = guide_colorbar(barwidth = 11, barheight = .75)) +
    ggtitle("Sentence Polarity: Sentence Drawing")

plot of chunk fig3


Thoughts…

While I like the aesthetics and organic feel of Stefanie Posavec’s Sentence Drawings I can’t help but to ask what this is showing me; what does such a visual afford the audience? I concluded that it captures that language isn’t linear but recursive and intricately linked. Posavec describes the tight spirals as choppy and the extended ones as flowing and smooth. However, I believe there are better ways to capture this sentiment while still balancing the notion of organic recursivity with identifying structure.

Visual representations, like this turn of talk plot below, capture meaningful patterns in the data and allow for comparisons but present the data as linear, when it really is not.

out <- tot_plot(dat2b, "dialogue", grouping.var = "fam.aff",
    facet.vars = "act", tot=FALSE, plot = FALSE)

out + theme(legend.position = "bottom") + 
    labs(fill="Family\nAffiliation")

plot of chunk fig4

Again, there must be a balance between capturing the essence of language and understanding the structure. Perhaps using pre-attentive attributes in a meaningful way would be a start to allowing Posavec’s representation to be more useful in finding the narrative in the data. The right hand turn she uses is arbitrary. I ask, what if the turn were meaningful, towards a particular demographic variable. I also could see the benefit of the use of Yihui’s animation package to show the fluid nature of the conversation. I may return to this blog post but I invite others to attempt the challenge of showing something meaningful in the data, while capturing the controlled chaos of language.

Click here for a complete script of this blog post


*Blog post created using the reports package

Posted in discourse analysis, ggplot2, text, Uncategorized, visualization | Tagged , , , , , , , , , , | 8 Comments

Correspondence Analysis in R

Correspondence analysis (from a layman’s perspective) is like principal components analysis for categorical data. It can be useful to discover structure in this type of data. My friend Gianmarco Alberti, an archaeologist, has put together an in depth web site detailing the history, use and worked R examples of correspondence analysis. It’s like a FREE ebook 🙂

The website: http://cainarchaeology.weebly.com/

Enjoy!!

Posted in analysis | Tagged , , , , , , | Leave a comment

paste, paste0, and sprintf

I find myself pasting urls and lots of little pieces together lately. Now paste is a standard go to guy when you wanna glue some stuff together. But often I find myself pasting and getting stuff like this:

paste(LETTERS)
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
[18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

Rather than the desired…

[1] "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

When I get into those situations I think, “Oh better use collapse instead”; but never really think before using paste (That is whether I collapse or sep and why). This is inefficient and causes me to lack the time to write quality articles for Fox News (JK for those taking me serious). This tutorial will give some basic and clear direction about the following functions:

paste(x)
paste0(x)
sprintf(x, y)

paste

paste has 3 arguments.

paste (..., sep = " ", collapse = NULL)

The ... is the stuff you want to paste together and sep and collapse are the guys to get it done. There are three basic things I paste together:

  1. A bunch of individual character strings.
  2. 2 or more vectors pasted element for element.
  3. One vector smushed together.

Here's an example of each, though not with the correct arguments (I'm building suspense here):

paste("A", 1, "%")       #A bunch of individual character strings.
paste(1:4, letters[1:4]) #2 or more vectors pasted element for element.
paste(1:10)              #One vector smushed together.

Here's the sep/collapse rule for each:

  1. A bunch of individual character strings – You want sep
  2. 2 or more vectors pasted element for element. – You want sep
  3. One vector smushed together.- Smushin requires collapse

So here they are with the correct arguments:

paste("A", 1, "%")       #A bunch of individual character strings.
paste(1:4, letters[1:4]) #2 or more vectors pasted element for element.
paste(1:10, collapse="") #One vector smushed together.

This yields:

> paste("A", 1, "%")       #A bunch of individual character strings.
[1] "A 1 %"
> paste(1:4, letters[1:4]) #2 or more vectors pasted element for element.
[1] "1 a" "2 b" "3 c" "4 d"
> paste(1:10, collapse="") #One vector smushed together.
[1] "12345678910"

paste0

paste0 is short for:

paste(x, sep="")

So it allows us to be lazier and more efficient. I'm lazy so I use paste0 a lot.

paste0("a", "b") == paste("a", "b", sep="")
## [1] TRUE

'nuff said.


sprintf

I discovered this guy a while back but realized it's value in pasting recently. Much of my work on the reports (Rinker, 2013) package requires that I piece together lots of chunks of url and insert user specific pieces. This can be a nightmare with all the quotation marks. A typical take may look like this:

person <-"Grover"
action <-"flying"
message(paste0("On ", Sys.Date(), " I realized ", person, " was...\n", action, " by the street"))
## On 2013-09-14 I realized Grover was... flying by the street

No joke it took me 6 tries before I formatted that without an error (missing quotes, spaces, and commas).

But we can use sprintf to make one string (less commas + less quotations marks = less errors) and feed the elements that may differ from user to user or time to time. Let's look at an example to see what I mean:

person <-"Grover"
action <-"flying"
message(sprintf("On %s I realized %s was...\n%s by the street", Sys.Date(), person, action))
## On 2013-09-14 I realized Grover was... flying by the street

Boom first time. It's easy to figure out the spacing and there aren't the commas and quotation marks to deal with. Just use the %s marker to denote that some element goes here and then feed it in as a vector after the character string. For some applications sprintf is a superior choice over paste/paste0.


Note that these are not extensive, all-encompassing rules but guides for general use. Also be aware the sprintf is even cooler than I demonstrated here.

*Created using the reports package


References

Posted in paste, Uncategorized | Tagged , , , , , | 20 Comments

GitHub Package Ideas I Stole

One of my favorite sources of good ideas is looking at the GitHub repositories of others and modeling my repos after the good ideas I see others doing. Here's Steve Jobs on stealing ideas:

In the past few weeks I've spotted three simple things other maintainers are doing that I liked and which I promptly incorporated into my own repos. I wanted to shine a light on those three ideas.


Versioning Semantics

The first idea comes from Carl Boettiger, the maintainer of the knitcitations (Boettiger, 2013) package. In his NEWS file for knitcitations Carl provides the reader with his versioning semantics as seen below.

Releases will be numbered with the following semantic versioning format:

<major>.<minor>.<patch>

And constructed with the following guidelines:

* Breaking backward compatibility bumps the major (and resets the minor 
  and patch)
* New additions without breaking backward compatibility bumps the minor 
  (and resets the patch)
* Bug fixes and misc changes bumps the patch

Until Yihui's blog post I thought 1.0.0 implied maturity. To some maintainers this may be the case but I choose to follow Carl's model and I feel it is important to tell the user of your package what the versioning means. In fact a major version bump may indicate an increased likeliness of bugs, not maturity. To you Carl I say “Well played sir!”


Contact

The second practice I picked up is Tal Galili's use of a Contact section included in the README.md file of the installr (Galili, 2013) package as seen below.

Contact

You are welcome to:

* submit suggestions and bug-reports at: https://github.com/talgalili/installr/issues
* send a pull request on: https://github.com/talgalili/installr/
* compose a friendly e-mail to: tal.galili@gmail.com

Simple but brilliant. Tal is explicit about where to submit both suggestions and bug reports as a GitHub's issues page often implies just bugs and such, not improvements. Because of the history of R many users will attempt to contact you via email directly, thus you explain the same problem many times rather than answering it once in a public forum. Tal's format is direct yet diplomatic in that it directs the user to co-action rather than asking for fixes (see Yihui's blog post about this). Notice that email is listed as last. This reminds me of my first grade teacher's mantra “Ask three before me.”


NEWS.md

Last I saw Yihui Xie's use of a NEWS.md file in the knitr (Xie, 2013) package. The typical NEWS file is plain text and boring. The inclusion of a NEWS.md is much prettier to look at and creates a better experience for the package user. Including a NEWS.md merely requires a quick file conversion via:

file.copy("NEWS", "NEWS.md")

I then included NEWS.md in my .Rbuildignore to avoid clogging up CRAN needlessly. To you Yihui I say very nice, high five!


I hope these little stolen bits of goodness are useful to fellow idea thieves. Please be sure to provide feedback in the comments below.

Blog post created using the reports (Rinker, 2013) package

Get the .Rmd file here


References

Posted in knitr, package creation, reports, Uncategorized | Tagged , , , , , , | 5 Comments

How do I re-arrange??: Ordering a plot revisited

Back in October of last year I wrote a blog post about reordering/rearanging plots. This was, and continues to be, a frequent question on list serves and R help sites. In light of my recent studies/presenting on The Mechanics of Data Visualization, based on the work of Stephen Few (2012); Few (2009), I realized I was remiss in explaining the ordering of variables from largest to smallest bar (particularly Cleveland Dot Plots and Bar Plots). It is often much more meaningful to arrange (order) factor levels by size of other numeric variable(s). This allows for easier pattern recognition over the standard aphabetic arrangement of levels.

The post will take you through a demonstration of sorting bars/points on another variable, however it assumes you already know how that if you want to reorder/rearrange in a plot you must reorder the factor levels (if you do not know this see this blog post). We then explore my GitHub package package plotflow to add efficiency to re-leveling in the workflow. After we learn how to sort by bar/point size we will look at a applied use. I will use ggplot2 because this is my go to plotting system.


Section 1: Reordering by Bar/Point Size

Create a data set we can alter

mtcars3 <-mtcars2 <-data.frame(car=rownames(mtcars), mtcars, row.names=NULL)
mtcars3$cyl  <-mtcars2$cyl <-as.factor(mtcars2$cyl)
head(mtcars2)
##                 car  mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1         Mazda RX4 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2     Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3        Datsun 710 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4    Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6           Valiant 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

In this example it's difficult to find trends and patterns in the data.

An Example of Unordered Bars/Points

library(ggplot2)
library(gridExtra)
x <-ggplot(mtcars2, aes(y=car, x=mpg)) + 
    geom_point(stat="identity")

y <-ggplot(mtcars2, aes(x=car, y=mpg)) + 
    geom_bar(stat="identity") + 
    coord_flip()

grid.arrange(x, y, ncol=2)

plot of chunk order1

Below we use the levels argument to factor in conjunction with order to order the levels of car by miles per gallong (mpg).

An Example of Ordered Bars/Points

## Relevel the cars by mpg
mtcars3$car <-factor(mtcars2$car, levels=mtcars2[order(mtcars$mpg), "car"])

x <-ggplot(mtcars3, aes(y=car, x=mpg)) + 
    geom_point(stat="identity")

y <-ggplot(mtcars3, aes(x=car, y=mpg)) + 
    geom_bar(stat="identity") + 
    coord_flip()

grid.arrange(x, y, ncol=2)

plot of chunk order2

This is an example when a factor's levels each has a unique row. This is not always the case. For instance if we we to use mtcars2cyl rather than mtcars2$car as the factor we'd have multiple observations for each cylinder level. In these instances we'd most likely utilize the ording by some summarizing variable as seen in the ordering mtcars2$carb by average mpg below.

An Example of Ordered and Faceted Bars/Points

## Relevel the carb by average mpg
(ag_mtcars <-aggregate(mpg ~ carb, mtcars3, mean))
##   carb   mpg
## 1    1 25.34
## 2    2 22.40
## 3    3 16.30
## 4    4 15.79
## 5    6 19.70
## 6    8 15.00
mtcars3$carb <-factor(mtcars2$carb, levels=ag_mtcars[order(ag_mtcars$mpg), "carb"])

ggplot(mtcars3, aes(y=carb, x=mpg)) + 
    geom_point(stat="identity", size=2, aes(color=carb))

plot of chunk order3

The last plot in this section adds faceting to further draw distinction and allow for pattern recognition. The ordering of the facets can also be changed by reordering factor levels in a way that is sensible for representing the narrative the data is telling.

An Example of Ordered and Faceted Bars/Points

ggplot(mtcars3, aes(y=car, x=mpg)) + 
    geom_point(stat="identity") +
    facet_grid(cyl~., scales = "free", space="free")

plot of chunk order4

Recapping Section 1: Reordering by Bar/Point Size

In this first section we learned:

  1. Ordering factors by a numeric variable increases the ability to recognize patterns
  2. We can have (a) one row per factor level or (b) multiple rows per factor level.
    • The first scenerio requires feeding the dataframe with the levels reordered through order.
    • The second scenerio requires some sort of aggregation by summary statistic before using order and feeding to the levels argument of factor.
  3. Adding faceting can increase the ability to further find patterns among the ordered figure.

Section 2: Speeding Up the Workflow With the plotflow Package

Because I have the need to reorder factors by other numeric variables frequently and using order and sometimes aggregate is tedious and annoying I have wrapped this process up as a function called order_by in the plotflow package. I pretty much ripped off the entire function from Thomas Wutzler. This function allows the user to sort a dataframe by 1 or more numeric variables and return the new dataframe with a releveled factor. This is useful in that a new dataframe is created rather than tampering with the original. The function also allows for a summery stat to be passed via te FUN argument in a similar fashion as aggregate. This approach save typing and is more intuitive.

Getting the plotflow package

To get plotflow you can install the devtools package and use the install_github function:

# install.packages("devtools")

library(devtools)
install_github("plotflow", "trinker")

What Does order_by do?

library(plotflow)
dat <-aggregate(cbind(mpg, hp, disp)~carb, mtcars, mean)
dat$carb <-factor(dat$carb)

## compare levels (data set looks the same though)
dat$carb
## [1] 1 2 3 4 6 8
## Levels: 1 2 3 4 6 8
order_by(carb, ~-hp + -mpg, data = dat)$carb
## [1] 1 2 3 4 6 8
## Levels: 8 4 3 6 2 1

By defualt order_by returns a dataframe however we can also tell order_by to return a vector by setting df=FALSE.

## Return just the vector with new levels
order_by(carb, ~ -hp + -mpg, dat, df=FALSE)
## [1] 1 2 3 4 6 8
## Levels: 8 4 3 6 2 1

Let's see order_by in action.

Use order_by to Order Bars

library(ggplot2)

## Reset the data from Section 1
dat2 <-data.frame(car=rownames(mtcars), mtcars, row.names=NULL)
ggplot(order_by(car, ~ mpg, dat2), aes(x=car, y=mpg)) + 
    geom_bar(stat="identity") + 
    coord_flip() + ggtitle("Order Pretty Easy")

plot of chunk order5

Aggregated by Summary Stat

###Carb Ordered By Summary (Mean) of mpg

## Ordered points with the order_by function
a <-ggplot(order_by(carb, ~ mpg, dat2, mean), aes(x=carb, y=mpg)) +
    geom_point(stat="identity", aes(colour=carb)) +
    coord_flip() + ggtitle("Ordered Dot Plots Made Easy")

## Reverse the ordered points
b <-ggplot(order_by(carb, ~ -mpg, dat2, mean), aes(x=carb, y=mpg)) +
    geom_point(stat="identity", aes(colour=carb)) +
    coord_flip() + ggtitle("Reverse Order Too!")

grid.arrange(a, b, ncol=1)

plot of chunk order6

Nested Usage (order_by on an order by dataframe)

ggplot(order_by(gear, ~mpg, dat2, mean), aes(mpg, carb)) +
    geom_point(aes(color=factor(cyl))) +
    facet_grid(gear~., scales="free") + ggtitle("I'm Nested (Yay for me!)")

plot of chunk order7

The order_by function makes life a little easier.


Section 3: Using order_by on Real Data

Now I turn the attention to a real life usage of ordering a factor by a numeric variable in order to see patterns. A while back Abraham Mathew presented a blog post utilizing some interesting data on job satisfaction within bigger technology companies. His demonstrations showed various ways to utilize ggplot2 to visualize the data.

As I read the post I was also reading a bit of Stephen Few's work, which recomends ordering bars/dotplots to better see patterns. This visualization, which Mathew produced with ggplot2, is captivating:

However, I believed that by order the bars as Stephen Few (2012); Few (2009) suggests may enhance our ability to see a pattern; which of the four variables are linked?

In this next section we'll grab the data, clean it, reshape it, relevel the factors and plot in a more meaningful way to reveal patterns not seen before. Let's begin by loading the following packages:

library(RCurl)
library(XML)
library(rjson)
library(ggplot2)
library(qdap)
library(reshape2)
library(gridExtra)

Now we can scrape the data and extract the required pieces.

URL <-"http://www.payscale.com/top-tech-employers-compared-2012/job-satisfaction-survey-data"
doc   <-htmlTreeParse(URL, useInternalNodes=TRUE)
nodes <-getNodeSet(doc, "//script[@type='text/javascript']")[[19]][[1]]
dat <-gsub("];", "]", capture.output(nodes)[5:27])
ndat <-data.frame(do.call(rbind, fromJSON(paste(dat, collapse = ""))))[, -2]
ndat[, 1:5] <-lapply(ndat, unlist)
IBM <-grepl("International Business Machines", ndat[, 1])
ndat[IBM, 1] <-bracketXtract(ndat[IBM, 1])
ndat[, 1] <-sapply(strsplit(ndat[, 1], "\\s|,"), "[", 1)

At this point we relevel the factor level Employer.Name by job satisfaction.

## Re-level with order_by
ndat[, "Employer.Name"] <-order_by(Employer.Name, ~Job.Satisfaction, ndat, df=FALSE)
colnames(ndat)[1] <-"Employer"
ndat
##           Employer Job.Satisfaction Work.Stress Job.Meaning Job.Flexibility
## 1            Adobe           0.6875      0.7031      0.4532          0.8594
## 2       Amazon.com           0.7723      0.7010      0.4901          0.7376
## 3              AOL           0.7714      0.6572      0.4118          0.7714
## 4            Apple           0.7800      0.6510      0.7114          0.7567
## 5             Dell           0.6890      0.6275      0.4983          0.8712
## 6             eBay           0.7097      0.6087      0.5824          0.8153
## 7         Facebook           0.8750      0.6875      0.8125          0.9375
## 8           Google           0.7987      0.5660      0.6387          0.8334
## 9  Hewlett-Packard           0.5807      0.6034      0.4335          0.8733
## 10           Intel           0.7339      0.6677      0.6892          0.8896
## 11             IBM           0.6414      0.6637      0.4631          0.8946
## 12        LinkedIn           1.0000      0.6923      0.8462          0.9166
## 13       Microsoft           0.6777      0.6181      0.6099          0.9281
## 14     Monster.com           0.7273      0.8181      0.5454          0.8181
## 15           Nokia           0.7400      0.4800      0.5600          0.8200
## 16          Nvidia           0.7692      0.5897      0.5385          0.7692
## 17          Oracle           0.6713      0.6406      0.4221          0.9218
## 18  Salesforce.com           0.8667      0.7334      0.6667          0.8275
## 19         Samsung           0.6596      0.7447      0.6595          0.6170
## 20            Sony           0.7500      0.6667      0.5217          0.8750
## 21          Yahoo!           0.6762      0.5333      0.5145          0.8750

Now we can reshape the data to long format which ggplot2 prefers almost exclusively.

## Melt the data to long format
mdat <-melt(ndat)
mdat[, 2] <-factor(gsub("\\.", " ", mdat[, 2]), 
    levels = gsub("\\.", " ", colnames(ndat)[-1]))

head(mdat)
##     Employer         variable  value
## 1      Adobe Job Satisfaction 0.6875
## 2 Amazon.com Job Satisfaction 0.7723
## 3        AOL Job Satisfaction 0.7714
## 4      Apple Job Satisfaction 0.7800
## 5       Dell Job Satisfaction 0.6890
## 6       eBay Job Satisfaction 0.7097

Now our data is cleaned and reshaped with Employer releveled by job stisfaction. I chose this (job stisfaction) as the variable of interest because of literature I've read around job performance, teacher retention and job satisfaction. Let's see if re-leveling the factor has an improvement on the trends and patterns we can see.

ggplot(data=mdat, aes(x=Employer, y=value, fill=factor(Employer))) + 
  geom_bar(stat="identity") + coord_flip() + ylim(c(0, 1)) + 
  facet_wrap( ~ variable, ncol=2) + theme(legend.position="none") + 
  ggtitle("Plot 3: Employee Job Satisfaction at Top Tech Companies") +
  ylab(c("Job Satisfaction"))

plot of chunk order8

The first thing I noticed after the reordering is that Job Meaning and Job Satisfaction appear to be related. In general, higher satisfaction corresponds with greater meaning. I also noticed that Flexibility and Stress do not appear to correspond with satisfaction. This made me curious and so I ran a simple regression model with Satisfaction as the outcome and the other three variables as predictors. The story from the regression model is similar to the visualization.

mod <-lm(Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility, data=ndat)
mod
## 
## Call:
## lm(formula = Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility, 
##     data = ndat)
## 
## Coefficients:
##     (Intercept)      Work.Stress      Job.Meaning  Job.Flexibility  
##          0.3101           0.1062           0.5241           0.0733
anova(mod)
## Analysis of Variance Table
## 
## Response: Job.Satisfaction
##                 Df Sum Sq Mean Sq F value Pr(&gt;F)    
## Work.Stress      1 0.0069  0.0069    1.45 0.2452    
## Job.Meaning      1 0.0816  0.0816   17.04 0.0007 ***
## Job.Flexibility  1 0.0006  0.0006    0.13 0.7260    
## Residuals       17 0.0814  0.0048                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mod)
## 
## Call:
## lm(formula = Job.Satisfaction ~ Work.Stress + Job.Meaning + Job.Flexibility, 
##     data = ndat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12043 -0.03002 -0.00263  0.03268  0.11915 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(&gt;|t|)    
## (Intercept)       0.3101     0.2413    1.29   0.2160    
## Work.Stress       0.1062     0.2147    0.49   0.6273    
## Job.Meaning       0.5241     0.1288    4.07   0.0008 ***
## Job.Flexibility   0.0733     0.2058    0.36   0.7260    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0692 on 17 degrees of freedom
## Multiple R-squared:  0.523,  Adjusted R-squared:  0.438 
## F-statistic: 6.21 on 3 and 17 DF,  p-value: 0.00483

The model accounts for ~50% of the variability in Job Satisfaction. While the model is significant there clearly is more than just Meaninging that impacts Satisfaction. I Decided to do a bit more plotting and use the preattentive attributes of color and size to represent Flexibility and Stress in the visual model.

theplot <-ggplot(data=ndat, aes(x = Job.Meaning, y = Job.Satisfaction)) + 
    geom_smooth(method="lm", fill = "blue", alpha = .1, size=1) +  
    geom_smooth(color="red", fill = "pink", alpha = .3, size=1) +
    xlim(c(.4, .9)) +
    geom_point(aes(size = Job.Flexibility, colour = Work.Stress)) +
    geom_text(aes(label=Employer), size = 3, hjust=-.1, vjust=-.1) +
    scale_colour_gradient(low="gold", high="red") 

theplot

plot of chunk order9

There is certainly a pullby this group of tech companies, that may be an unaccounted variable in the model.

theplot + annotation_custom(grob=circleGrob(r = unit(.4,"npc")), xmin=.47, xmax=.57, ymin=.72, ymax=.82)

If we view the data as two separate smootherd regression lines we get a more predictable model. This indicates a variable that we have not included.

ndat$outs <-1
ndat$outs[ndat$Employer %in% qcv(AOL, Amazon.com, Nvidia, Sony)] <-0

ggplot(data=ndat, aes(x = Job.Meaning, y = Job.Satisfaction)) + 
    geom_smooth(method="lm", fill = "blue", alpha = .1, size=1, aes(group=outs)) +  
    geom_smooth(color="red", fill = "pink", alpha = .3, size=1) +
    xlim(c(.4, .9)) +
    geom_point(aes(size = Job.Flexibility, colour = Work.Stress)) +
    geom_text(aes(label=Employer), size = 3, hjust=-.1, vjust=-.1) +
    scale_colour_gradient(low="gold", high="red") 

plot of chunk order10


We've learned:

  1. Re-leveling/re-ordering a factor by a numeric variable(s) can lead to important pattern detection in data.
  2. The levels argument to factor is key to the reordering.
  3. order and sometimes aggregate allows the re0leving to occur.
  4. The order_by function in the plotflow package can make re-leveling easier.
  5. 5. Faceting can amplify the distinction made by the re-leveling.

*Created using the reports (Rinker, 2013) package


References

  • Stephen Few, (2009) Now You See It: Simple Visualization Techniques for Quantitative
    Analysis.
  • Stephen Few, (2012) Show me the numbers: Designing tables and graphs to enlighten.
  • Tyler Rinker, (2013) reports: Package to assist in report writing. http://github.com/trinker/reports
Posted in factor, ggplot2, Uncategorized, visualization, work flow | Tagged , , , , , , , | 12 Comments

ggplot2 Choropleth of Supreme Court Decisions: A Tutorial

I don't do much GIS but I like to. It's rather enjoyable and involves a tremendous skill set. Often you will find your self grabbing data sets from some site, scraping, data cleaning and reshaping, and graphing. On the ride home from work yesterday I heard an NPR talk about the Supreme Court decisions being very close with this court. This got me wondering if there is a data base with this information and the journey began. This tutorial is purely exploratory but you will learn to:

  1. Grab .zip files from a data base and read into R
  2. Clean data
  3. Reshape data with reshape2
  4. Merge data sets
  5. Plot a Choropleth map in ggplot2
  6. Arrange several grid plots with gridExtra

I'm lazy and like a good challenge. I challenged myself to not manually open a file so I downloaded Biobase from bioconductor to open the pdf files for the codebook. Also I used my own package qdap because it had some functions I like and I'm used to using them. This blog post was created in the dev. version of the reports package using the wordpress_rmd template.

Also note that this is designed to be instructional. I broke the code up into chunks with an explanation in between, This is extremely annoying if you just want to run the code so for this sort of person I have provided the code here. Thanks to Bryan Goodrich of talkstats.com who has many programming and statistics talents and shared much with me on GIS topics.

Enjoy!

Load Packages

## download Biobase so we don't have to manually open codebook
source("http://bioconductor.org/biocLite.R")
biocLite("Biobase", suppressUpdates = TRUE)
library(qdap)
## Load initial required packages
lapply(qcv(ggplot2, maps, ggthemes, Biobase), require, character.only = T)

Get Data

The Supreme Court Codebook and opened without clicking

## download the pdf code book and open it
url_dl(SCDB_2012_01_codebook.pdf, url = "http://scdb.wustl.edu/_brickFiles/2012_01/")
openPDF(file.path(getwd(), "SCDB_2012_01_codebook.pdf"))

The Supreme Court Data; learn to download and open a zip file

temp <-tempfile()
download.file("http://scdb.wustl.edu/_brickFiles/2012_01/SCDB_2012_01_caseCentered_Citation.csv.zip", 
    temp)
dat <-read.csv(unz(temp, "SCDB_2012_01_caseCentered_Citation.csv"))
unlink(temp)
htruncdf(dat, 6, 6)
##   caseId docket caseIs voteId dateDe decisi usCite sctCit ledCit lexisC
## 1 1946-0 1946-0 1946-0 1946-0 11/18/      1 329 U. 67 S.  91 L.  1946 U
## 2 1946-0 1946-0 1946-0 1946-0 11/18/      1 329 U. 67 S.  91 L.  1946 U
## 3 1946-0 1946-0 1946-0 1946-0 11/18/      1 329 U. 67 S.  91 L.  1946 U
## 4 1946-0 1946-0 1946-0 1946-0 11/25/      7 329 U. 67 S.  91 L.  1946 U
## 5 1946-0 1946-0 1946-0 1946-0 11/25/      1 329 U. 67 S.  91 L.  1946 U
## 6 1946-0 1946-0 1946-0 1946-0 11/25/      1 329 U. 67 S.  91 L.  1946 U
##   term natura  chief docket caseNa dateAr dateRe petiti petiti respon
## 1 1946   1301 Vinson     24 HALLIB 1/9/19 10/23/    198   <NA>    172
## 2 1946   1301 Vinson     12 CLEVEL 10/10/ 10/17/    100   <NA>     27
## 3 1946   1301 Vinson     21 CHAMPL 11/8/1 10/18/    209   <NA>     27
## 4 1946   1301 Vinson     26 UNITED 1/31/1 10/25/     27   <NA>    170
## 5 1946   1301 Vinson     50 UNITED 10/25/            27   <NA>    176
## 6 1946   1301 Vinson     46 RICHFI 10/24/           198   <NA>      4
##   respon jurisd adminA adminA threeJ caseOr caseOr caseSo caseSo lcDisa
## 1   <NA>      6   <NA>   <NA>      0     51      6     29   <NA>      0
## 2   <NA>      1   <NA>   <NA>      0    123     52     30   <NA>      0
## 3   <NA>      2     66   <NA>      1    107     42    107     42      0
## 4   <NA>      1   <NA>   <NA>      0      3   <NA>      3   <NA>      0
## 5   <NA>      1   <NA>   <NA>      0      3   <NA>      3   <NA>      0
## 6      6      2    117      6      0    302      6    300      6      1
##   certRe lcDisp lcDisp declar caseDi caseDi partyW preced voteUn issue
## 1     11      2      1      1      3      0      1      0      0 80180
## 2      4      2      1      1      2      0      0      0      0 10500
## 3      1   <NA>      2      1      2      0      0      0      0 80250
## 4     10   <NA>      2      1      2      0      0      0      0 20150
## 5      2   <NA>      2      1      3      0      1      0      0 80060
## 6      1      3      2      3      3      0      1      0      0 80100
##   issueA decisi decisi author author lawTyp lawSup lawMin majOpi majOpi
## 1      8      2      0      4   <NA>      6    600 35 U.S     78     78
## 2      1      1      0      4   <NA>      6    600 18 U.S     81     87
## 3      8      2      0      1   <NA>      2    207            84     78
## 4      2      2      0      4   <NA>      6    600 49 Sta     87     87
## 5      8      1      0      7   <NA>   <NA>   <NA>            78     78
## 6      8      1      0      2   <NA>      1    129            81     87
##   splitV majVot minVot
## 1      1      8      1
## 2      1      6      3
## 3      1      5      4
## 4      1      5      3
## 5      1      6      3
## 6      1      7      1

Source a Codebook for State Keys Used By Supreme Court Data

source("http://copy.com/zEtAXJC8tG7yv7Zz")
head(state.key)
##   code          state
## 1    1        alabama
## 2    2         alaska
## 3    3 american samoa
## 4    4        arizona
## 5    5       arkansas
## 6    6     california

Clean Data

Clean Supreme Court Data

dat$state <-lookup(dat$caseOriginState, state.key)
dat2 <-dat[!is.na(dat$state), ]
dat_state <-data.frame(with(dat2, prop.table(table(state))))
head(dat_state)
##        state     Freq
## 1    alabama 0.030063
## 2     alaska 0.005010
## 3    arizona 0.017954
## 4   arkansas 0.010438
## 5 california 0.103549
## 6   colorado 0.009603

Before I get started with any sizable graphing project I start with the bare minimum and add to the code. Dr. Hadley Wickham has provided just such a minimal example for Choropleth mapping on pages 10-11 in the Changes and Additions guide

Minimal Choropleth Example

crimes <-data.frame(state = tolower(rownames(USArrests)), USArrests)

states_map <-map_data("state")

ggplot(crimes, aes(map_id = state)) +
    geom_map(aes(fill = Murder), map = states_map) +
    expand_limits(x = states_map$long, y = states_map$lat) 

plot of chunk unnamed-chunk-6


Map the Data

Here I use the maps package to get the state longitude and latitudes for the shapes.

states_map <-map_data("state")
head(states_map)
##     long   lat group order  region subregion
## 1 -87.46 30.39     1     1 alabama      <NA>
## 2 -87.48 30.37     1     2 alabama      <NA>
## 3 -87.53 30.37     1     3 alabama      <NA>
## 4 -87.53 30.33     1     4 alabama      <NA>
## 5 -87.57 30.33     1     5 alabama      <NA>
## 6 -87.59 30.33     1     6 alabama      <NA>

Plot the Data

Use map_id to map the states. expand_limits is filled with the states_map data set's latitude and longitude.

ggplot(dat_state, aes(map_id = state)) +
    geom_map(aes(fill = Freq), map = states_map, color ="black") +
    expand_limits(x = states_map$long, y = states_map$lat) +
    theme_few()+
    theme(legend.position = "bottom",
         axis.ticks = element_blank(), 
         axis.title = element_blank(), 
         axis.text =  element_blank()) +
    scale_fill_gradient(low="white", high="blue") +
    guides(fill = guide_colorbar(barwidth = 10, barheight = .5)) + 
    ggtitle("Choropleth Supreme Court")

plot of chunk unnamed-chunk-8

Notice the little trick of moving the legend to the bottom and making it narrow? This is done with:

theme(legend.position = "bottom")

## and

guides(fill = guide_colorbar(barwidth = 10, barheight = 0.5))

Generate labels

I said to myself, “Self I forgot all the state names; this needs labels”. Here is a question on label centering I asked at stackoverflow. The trick to supplying text data to ggplot is it has to be in a data.frame format. Note that you need the state (region) name, latitude and longitude. I also added angle to be able to manually twist the angles of individual labels. The trick here was to take the mean of the range of the shape file lats/longs in an answer provided by Andrie from stackoverflow.com. Note that it is extremely important that you are now adding a new date set to ggplot and you need to unmap the map_id with map_id = NULL otherwise ggplot2 will become enraged and refuse to comply with what you consider to be a reasonable request.

cnames <-aggregate(cbind(long, lat) ~ region, data = states_map, FUN = function(x) mean(range(x)))
cnames$angle <-0
head(cnames)
##        region    long   lat angle
## 1     alabama  -86.69 32.63     0
## 2     arizona -111.92 34.17     0
## 3    arkansas  -92.14 34.75     0
## 4  california -119.26 37.28     0
## 5    colorado -105.55 39.00     0
## 6 connecticut  -72.75 41.53     0

Plot With Labels 1

ggplot(dat_state, aes(map_id = state)) +
    geom_map(aes(fill = Freq), map = states_map, color ="black") +
    expand_limits(x = states_map$long, y = states_map$lat) +
    theme_few()+
    theme(legend.position = "bottom",
         axis.ticks = element_blank(), 
         axis.title = element_blank(), 
         axis.text =  element_blank()) +
    scale_fill_gradient(low="white", high="blue") +
    guides(fill = guide_colorbar(barwidth = 10, barheight = .5)) + 
    geom_text(data=cnames, aes(long, lat, label = region,  
        angle=angle, map_id =NULL), size=2.5) + 
    ggtitle("Choropleth Supreme Court (With Labels 1)")

plot of chunk unnamed-chunk-11

You can manually adjust the labels with indexing the dataframe cnames.

Manually Move State Locations and Change Angle

cnames[11, c(2:3)] <-c(-114.5, 43.5)  # alter idaho's coordinates
cnames[17, 3] <-30.75  # alter louisiana's coordinates
cnames[21, c(2:3)] <-c(-84.5, 43)  # alter michigan's coordinates
cnames[23, 4] <-90  # alter mississippi's angle
cnames[9, c(2, 4)] <-c(-81.5, 90)  # alter florida's angle and coordinates

Plot With Labels 2

ggplot(dat_state, aes(map_id = state)) +
    geom_map(aes(fill = Freq), map = states_map, color ="black") +
    expand_limits(x = states_map$long, y = states_map$lat) +
    theme_few()+
    theme(legend.position = "bottom",
         axis.ticks = element_blank(), 
         axis.title = element_blank(), 
         axis.text =  element_blank()) +
    scale_fill_gradient(low="white", high="blue") +
    guides(fill = guide_colorbar(barwidth = 10, barheight = .5)) + 
    geom_text(data=cnames, aes(long, lat, label = region,  
        angle=angle, map_id =NULL), size=2.5) + 
    ggtitle("Choropleth Supreme Court (With Labels 2)")

plot of chunk unnamed-chunk-13


Further Exploring the Data

From there a new thought emerged. It seemed that a few states had the most percentage of Supreme Court cases originating in them. I wondered if this had something to do with population. I wanted to compare a population Choropleth. This meant grabbing more data and the US Census database is just the place.

Download and read in a zip file just like the Supreme Court Data.

## Download the US census database
temp <-tempfile()
download.file("http://www2.census.gov/census_2000/datasets/demographic_profile/0_All_State/2khxx.zip", 
    temp)
demo <-read.csv(unz(temp, "2khxx.csv"))
unlink(temp)

Clean Data

Clean Census Data and Merge With dat_state From Above

## browseURL("http://www2.census.gov/census_2000/datasets/demographic_profile/Alabama/2kh01.pdf")
vars <-data.frame(codes = qcv(X281421906, X138053563, X143368343, X35.3, United.States), 
    var = qcv(pop, male, female, med_age, state))

colnames(demo)[colnames(demo) %in% vars[, 1]] <-lookup(colnames(demo)[colnames(demo) %in% vars[, 1]], vars)

demo$state <-tolower(demo$state)
demo <-demo[, colnames(demo) %in% vars[, 2]]
demo <-demo[demo$state %in% tolower(state.name), ]

## Merge it
dat_state <-merge(demo, dat_state, by = "state")

One thing led to another and before I knew it I decided to include male to female ratio and median age in the analysis. My first approach (which I did not like) was to combine all the data into a long format data set that I could pass to ggplot2 and then facet by the chosen variables. This required me to use a common scale for the variables. I used the apply function to scale the variables.

Clean Census Data and Reshape it using the melt function from Reshape2.

library(reshape2)
dat_state <-transform(dat_state, per.male = male/c(male + female))
colnames(dat_state)[6] <-"case_origin"
dat_state2 <-melt(data.frame(dat_state[, 1, drop = FALSE], apply(dat_state[, 
    -c(1, 3:4)], 2, scale)))
head(dat_state2)
##        state variable    value
## 1    alabama      pop -0.18913
## 2     alaska      pop -0.80673
## 3    arizona      pop -0.07863
## 4   arkansas      pop -0.47588
## 5 california      pop  4.56783
## 6   colorado      pop -0.21271

Faceted Plot Attempt 1

ggplot(dat_state2, aes(map_id = state)) +
    geom_map(aes(fill = value), map = states_map, color ="black") +
    expand_limits(x = states_map$long, y = states_map$lat) +
    theme_few()+
    theme(legend.position = "bottom",
         axis.ticks = element_blank(), 
         axis.title = element_blank(), 
         axis.text =  element_blank()) +
    scale_fill_gradient(low="white", high="blue") +
    guides(fill = guide_colorbar(barwidth = 10, barheight = .5)) + 
    geom_text(data=cnames, aes(long, lat, label = region,  
        angle=angle, map_id =NULL), size=2.5) +
    facet_grid(variable~.)

plot of chunk unnamed-chunk-17


A Hunger for Better Display

This was unsatisfying in that I had to use a common scale and the meaning was lost. Also I couldn't control individual map colors easily (though I'm sure there's a way). I decided to instead create 4 separate plots and feed them to grid.arange of the gridExtra package (a compliment to ggplot2).

plot1 <-ggplot(dat_state, aes(map_id = state)) +
    geom_map(aes(fill = case_origin), map = states_map, color ="black") +
    expand_limits(x = states_map$long, y = states_map$lat) +
    theme_few()+
    theme(axis.ticks = element_blank(), 
         axis.title = element_blank(), 
         axis.text =  element_blank()) +
    scale_fill_gradient(low="white", high="orange", name="Percent") +
    guides(fill = guide_colorbar(barwidth = .5, barheight = 10)) + 
    geom_text(data=cnames, aes(long, lat, label = region,  
        angle=angle, map_id =NULL), size=2.5) +
    ggtitle("Origin of Supreme Court Case (percent)") 

plot2 <-ggplot(dat_state, aes(map_id = state)) +
    geom_map(aes(fill = pop), map = states_map, color ="black") +
    expand_limits(x = states_map$long, y = states_map$lat) +
    theme_few()+
    theme(axis.ticks = element_blank(), 
         axis.title = element_blank(), 
         axis.text =  element_blank()) +
    scale_fill_gradient(low="white", high="red", name="People") +
    guides(fill = guide_colorbar(barwidth = .5, barheight = 10)) + 
    geom_text(data=cnames, aes(long, lat, label = region,  
        angle=angle, map_id =NULL), size=2.5) +
    ggtitle("State Populations")

plot3 <-ggplot(dat_state, aes(map_id = state)) +
    geom_map(aes(fill = med_age), map = states_map, color ="black") +
    expand_limits(x = states_map$long, y = states_map$lat) +
    theme_few()+
    theme(axis.ticks = element_blank(), 
         axis.title = element_blank(), 
         axis.text =  element_blank()) +
    scale_fill_gradient(low="white", high="darkgreen", name="Age") +
    guides(fill = guide_colorbar(barwidth = .5, barheight = 10)) + 
    geom_text(data=cnames, aes(long, lat, label = region,  
        angle=angle, map_id =NULL), size=2.5) +
    ggtitle("Median Age")

plot4 <-ggplot(dat_state, aes(map_id = state)) +
    geom_map(aes(fill = per.male), map = states_map, color ="black") +
    expand_limits(x = states_map$long, y = states_map$lat) +
    theme_few()+
    theme(axis.ticks = element_blank(), 
         axis.title = element_blank(), 
         axis.text =  element_blank()) +
    scale_fill_gradient(low="white", high="blue", name="Percent Male") +
    guides(fill = guide_colorbar(barwidth = .5, barheight = 10)) + 
    geom_text(data=cnames, aes(long, lat, label = region,  
        angle=angle, map_id =NULL), size=2.5) +
    ggtitle("Gender Distribution")

library(gridExtra)
grid.arrange(plot1, plot3, plot2, plot4, ncol = 2)

plot of chunk unnamed-chunk-18

I did not like the alignment of the plot edges and didn't know how to solve the problem. I asked on stackoverflow.com and Kohske gave the following approaches in his response:

Using grid.draw and Aligning Plot Edges

library(gtable)

p1 <-ggplotGrob(plot1)
p2 <-ggplotGrob(plot2)
p3 <-ggplotGrob(plot3)
p4 <-ggplotGrob(plot4)

library(gtable)
grid.draw(cbind(rbind(p1, p2, size="last"), rbind(p3, p4, size="last"), size = "first"))

plot of chunk unnamed-chunk-19

Eliminating the Plot Box and Aligning the Legends

This approach looks nice (though the plot box had to be removed as the legend covered it). It is not recomended by Kohske as it's a bit hacky. If someone has a better approach please share.

plot1b <-plot1 + theme(panel.border = element_blank())
plot2b <-plot2 + theme(panel.border = element_blank())
plot3b <-plot3 + theme(panel.border = element_blank())
plot4b <-plot4 + theme(panel.border = element_blank())

p1b <-ggplotGrob(plot1b)
p2b <-ggplotGrob(plot2b)
p3b <-ggplotGrob(plot3b)
p4b <-ggplotGrob(plot4b)

gt <-cbind(rbind(p1b, p2b, size="last"), rbind(p3b, p4b, size="last"), size = "first")

for (i in which(gt$layout$name == "guide-box")) {
  gt$grobs[[i]] <-gt$grobs[[i]]$grobs[[1]]
}

grid.draw(gt)

plot of chunk unnamed-chunk-20

The final out put was very satisfying. I did notice that yes indeed the states of high population also had a high number of Supreme Court cases originating in them. That was sensible. I noted that New York and Alabama seemed to have more Supreme Court cases originating in them in comparison to their populations (but only slightly). There is still a ton of data in the two data sets left to explore (particularly from a time series perspective). Feel free to experiment yourself with the data.


Please be sure to provide feedback in the comments below.

Posted in data, ggplot2, reports, Uncategorized, visualization | Tagged , , , , , , , , | 31 Comments

The Mechanics of Data Visualization

I recently presented about the mechanics of data visualization at the CLaRI Literacy Conference to a group of researchers, teachers and school administrators. The presentation is based on the work of Few (2012; 2009).

unnamed-chunk-10

While the presentation itself is not about R directly, the slides were created by reports(Rinker, 2013)/slidify (Vaidyanathan, 2012) and most of the graphics were also created within R. I have included the following for your learning:

  1. YouTube Video of the presentation
  2. The presentation slides
  3. The .Rmd and .bib to create the presentation
  4. The extra_functions.R file

Video of the Presentation


References

  • Stephen Few, (2009) Now you see it: Simple visualization techniques for quantitative
    analysis.
  • Stephen Few, (2012) Show me the numbers: Designing tables and graphs to enlighten.
  • Tyler Rinker, (2013) reports: Package to asssist in report writing. http://github.com/trinker/reports
  • R. Vaidyanathan, (2012) slidify: Generate reproducible html5 slides from R markdown. http://ramnathv.github.com/slidify/

You feedback is greatly appreciated! Please post comments below.

Posted in ggplot2, reports, slidify, Uncategorized, visualization | Tagged , , , , , , , , , , , | 2 Comments

qdap 0.2.2 released

I’m very pleased to announce the release of qdap 0.2.2

logo

This is the third installment of the qdap package available at CRAN. The qdap package automates many of the tasks associated with quantitative discourse analysis of transcripts containing discourse, including frequency counts of sentence types, words, sentence, turns of talk, syllable counts and other assorted analysis tasks. The package provides parsing tools for preparing transcript data but may be useful for many other language processing tasks. Many functions enable the user to aggregate data by any number of grouping variables providing analysis and seamless integration with other R packages that undertake higher level analysis and visualization of text.

 The biggest change is that qdap now is compiled for Mac users!!!  No need to download from source.  Just use:

install.packages(“qdap”)

Some of the changes in version 0.2.2 include:


NEW FEATURES

  • tot_plot– a visualizing function that uses a bar graph to visualize patterns in sentence length and grouping variables by turn of talk.
  • beg2char and char2end– functions to grab text from beginning of string to a character or from a character to the end of a string.
  • ngrams– function to calculate ngrams by grouping variable.

BUG FIXES

  • genXtract labels returned the word “right” rather than the right edge string. See here for an example of the old behavior. This behavior has been fixed.
  • gradient_cloud‘s min.freq locked at 1. This has been fixed. (Manuel Fdez-Moya)
  • termco would produce an error if single length named vectors were passed to match.list and no multilength vectors were supplied. Also an error was thrown if an unnamed multilength vector was passed to match.list. This behavior has been fixed.

For a complete list of changes see qdap’s NEWS

Development Version
github

Posted in discourse analysis, qdap, text | Tagged , , , , , , , , , , , , , | 2 Comments

Sharing my R notes

I started working with R 2 1/2 years ago. I remember opening R closing it and thinking it was the dumbest thing ever (command line to a non programmer is not inviting). Now it’s my constant friend. From the beginning I took notes to remind myself all of the things I learned and relearned. They’ve been invaluable to me in learning. They are not particularly well arranged nor do they credit sources properly. There are likely bad or outdated practices in there but I figured they may be helpful to others learning the language and so I’m sharing.

Note that :

1) they are poorly arranged
2) they may have mistakes
3) they don’t credit others work properly or at all

They were for me but now I think maybe others will find them useful so here they are:

*Note that the file is larger ~7000KB and 274 pages worth.

Posted in Uncategorized | Tagged , , , , , , , , | 26 Comments

Animations Understood

When I first saw a graphic made from Yihui’s animation package (Xie, 2013) I was amazed at the magic and thought “I could never do that”.

Passage of time…

One night I found myself bored and as usual avoiding work. I decided to try learning how to make an animation and an epiphany hit me in the head.


Figure 1: Animated gif of guy getting hit in the head.



Basically I realized the animation package works just like those flipbooks you made as a kid. You know the ones teachers would yell at you for and would lament the waste of tablet paper. Go on and enjoy this flipbook and allow yourself to be taken back to 3rd grade. Ahh.

Video: A clever use of a tablet.




Now where were we. Ah yes, animation, an electronic flipbook for dorky people. Basically here’s how it works:

  1. Create a scene (static components in a plot)
  2. Create element(s) that will change
  3. Wrap it all into a function
  4. Use the animation package to output an MP4 video, HTML file or an animated GIF.

Let’s do this…

Set It Up

First source this circle drawing function I stole from John Fox and load the animation package.

 source("http://dl.dropboxusercontent.com/u/61803503/wordpress/circle_fun.txt")
library(animation)

Create that function to draw a scene

 FUN <- function(y = 0.8) {
    opar <- par()$mar
    on.exit(par(mar = opar))
    par(mar = rep(0, 4))
    plot.new()
    circle(0.5, 0.6, 1, "cm", , 4)
    segments(0.5, 0.2, 0.5, 0.54, lwd = 4)
    segments(0.4, 0, 0.5, 0.2, lwd = 4)
    segments(0.6, 0, 0.5, 0.2, lwd = 4)
    segments(0.5, 0.4, 0.3, 0.5, lwd = 4)
    segments(0.5, 0.4, 0.7, 0.5, lwd = 4)
    points(0.5, y, pch = -9742L, cex = 4, col = "firebrick3")
}

FUN()
plot of chunk unnamed-chunk-2
Figure 2: Static plot of guy with a static phone.



This part runs FUN and allows the phone to “drop”.

This is where you supply multiple values to the portions of the graphic that will change. Build a function that runs recursively, outputting multiple graphics.

 oopt <- animation::ani.options(interval = 0.1)

FUN2 <- function() {
    lapply(seq(1.01, 0.69, by = -0.02), function(i) {
        FUN(i)
        animation::ani.pause()
    })
}

## FUN2()

Now save it any of the following formats

saveGIF(FUN2(), interval = 0.1, outdir = "images/animate")

saveVideo(FUN2(), interval = 0.1, outdir = "images/animate", 
    ffmpeg = "C:/Program Files (x86)/ffmpeg-latest-win32-static/ffmpeg-20130306-git-28adecf-win32-static/bin/ffmpeg.exe")

saveLatex(FUN2(), autoplay = TRUE, loop = FALSE, latex.filename = "tester.tex",
    caption = "animated dialogue", outdir = "images/animate", ani.type = "pdf",
    ani.dev = "pdf", ani.width = 5, ani.height = 5.5, interval = 0.1)

saveHTML(FUN2(), autoplay = FALSE, loop = FALSE, verbose = FALSE, outdir = "images/animate/new",
    single.opts = "'controls': ['first', 'previous', 'play', 'next', 'last', 'loop', 'speed'], 'delayMin': 0")

Oh yeah here’s the HTML version


Created using the reports (Rinker, 2013) package
Get the .Rmd file here
Just the R code


References

Rinker TW (2013). reports: Package to asssist in report writing. University at Buffalo/SUNY, Buffalo, New York. version 0.1.3, http://github.com/trinker/reports.

Xie Y (2013). animation: A gallery of animations in statistics and utilities to create animations. R package version 2.2, http://CRAN.R-project.org/package=animation.

Posted in animation, reports, Uncategorized, visualization | Tagged , , , , , , | 7 Comments