knitr2wordpress and gradient_cloud Revisited

This post serves three functions:

  1. It allows me to revisit an old blogpost
  2. It lets me test out the new-ish knitr function knit2wp and RWordPress
  3. It enables me to avoid the massive amount of reading I need to do and still feel like I'm doing “work”

The following packages are needed to run the code:

install.packages(c("knitr", "qdap"))
install.packages("RWordPress", repos = "http://www.omegahat.org/R", type = "source")
library(qdap)
library(knitr)
library(RWordPress)

*Mac users see this link and this link

In this blogpost I explored the use of gradient word clouds. It took 31 lines of code to plot the figure. I'm lazy (though I tell others I'm efficient) and 31 lines is enough to keep me from exploring with the gradient word cloud. In a recent update to qdap I included a function that reduces the code in that post to 6 lines, making gradient clouds more accessible.

Grab the Presidential Debate Transcript

# download transcript of the debate to working directory
url_dl(pres.deb1.docx)

Read in the Data

# load multiple files with read transcript and assign to working directory
dat1 <- read.transcript("pres.deb1.docx", c("person", "dialogue"))

# qprep for quick cleaning
dat1$dialogue <- qprep(dat1$dialogue)

# view a truncated version of the data (see also htruncdf)
left.just(htruncdf(dat1, 10, 45))
##    person dialogue                                     
## 1  LEHRER We'll talk about specifically about health ca
## 2  ROMNEY What I support is no change for current retir
## 3  LEHRER And what about the vouchers?                 
## 4  ROMNEY So that's that's number one. Number two is fo
## 5  OBAMA  Jim, if I if I can just respond very quickly,
## 6  LEHRER Talk about that in a minute.                 
## 7  OBAMA  but but but overall.                         
## 8  LEHRER OK.                                          
## 9  OBAMA  And so...                                    
## 10 ROMNEY That's that's a big topic. Can we can we stay

Remove Lehrer (need bivariate variable) and plot

dat2 <- rm_row(dat1, 1, "LEHRER")  #make a bivariate column (remove LEHRER)

gradient_cloud(dat2$dialogue, dat2$person, title = "Debate", X = "blue", Y = "red", 
    stopwords = BuckleySaltonSWL, max.word.size = 2.2, min.word.size = 0.55)

plot of chunk grad_cloud

Notice we have control over min/max word size, the two colors and stopwords? Easy huh?

Try a Few more with Different Parameters

gradient_cloud(dat2$dialogue, dat2$person, title = "fun", X = "green", Y = "orange")

gradient_cloud(dat2$dialogue, dat2$person, title = "fun", rev.binary = TRUE)

gradient_cloud(dat2$dialogue, dat2$person, title = "fun", max.word.size = 5, 
    min.word.size = 0.025)

Now Discussion on knitr to WordPress

Here is the Rmd (text) of the file used to make this post.

Here's the format I used to send the file to WordPress.com

options(WordPressLogin = c(USERNAME = "PASSWORD"), WordPressURL = "https://trinkerrstuff.wordpress.com/xmlrpc.php")
library(knitr)

knit2wp(file.path("C:/Users/trinker/Desktop/gradient_clouds_revisited/PRESENTATION", 
    "gradient_clouds_revisited.Rmd"), title = "knitr2wordpress and gradient_cloud Revisited", 
    shortcode = TRUE)

knit2wp("yourfile.Rmd", title = "knitr2wordpress and gradient_cloud Revisited")

Where USERNAME and PASSWORD are your WordPress username and password.
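If you'd rather not type your password into the script, one option is to read the login from environment variables. This is only a minimal sketch: the variable names WP_USER and WP_PASS are made up for illustration (they are not something knitr or RWordPress looks for), and the fallbacks are the same placeholders used above.

```r
# Hypothetical environment variable names; set them outside the script,
# e.g. in ~/.Renviron, so the password never appears in the .Rmd file.
wp_user <- Sys.getenv("WP_USER", unset = "USERNAME")
wp_pass <- Sys.getenv("WP_PASS", unset = "PASSWORD")

# WordPressLogin expects a named character vector: c(username = "password")
login <- setNames(wp_pass, wp_user)

options(
    WordPressLogin = login,
    WordPressURL = "https://trinkerrstuff.wordpress.com/xmlrpc.php"
)
```

The options() call has the same shape as the one shown earlier; only the source of the two strings changes.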


Please note that I was initially confused about where the base.url and
base.dir options go. For more on this problem see this thread.

Posted in knitr, qdap, text, Uncategorized, visualization, word cloud

qdap 0.2.1 Released

I’m very pleased to announce the release of qdap 0.2.1

This is the second installment of the qdap package available at CRAN. The qdap package automates many of the tasks associated with quantitative discourse analysis of transcripts containing discourse, including frequency counts of sentence types, words, sentences, turns of talk, syllable counts and other assorted analysis tasks. The package provides parsing tools for preparing transcript data. Many functions enable the user to aggregate data by any number of grouping variables, providing analysis and seamless integration with other R packages that undertake higher level analysis and visualization of text.

logo

Note: qdap is not compiled for Mac users. For installation instructions for Mac users, or for users of other operating systems having difficulty installing qdap, please click here.


Some of the changes in version 0.2.1 include:

NEW FEATURES

* `gradient_cloud`: Binary gradient Word Cloud – A new plotting function
that plots and colors words for a binary variable based on which group of
the binary variable uses the term more frequently.

* `new_project`: A project template generating function designed to increase
efficiency and standardize work flow. The project comes with a .Rproj file
for easy use with RStudio as well as a .Rprofile that handles loading and sourcing
of packages, data and project functions. This function uses the reports package
to generate an extensive reports folder.

BUG FIXES

* `word_associate` colors the word cloud appropriately and deals with the error
caused by a grouping variable not containing any words from 1 or more of the
vectors of a list supplied to match string

* `trans.cloud` produced an error when expand.target was TRUE. This error has
been eliminated.

* `termco` would eliminate > 1 columns matching an identical search.term found
in a second vector of match.list. termco now counts repeated terms multiple
times.

* `cm_df.transcript` did not give the correct speaker labels (fixed).


For a complete list of changes see qdap’s NEWS

Development Version
github

Posted in qdap, text, Uncategorized, visualization, work flow

reports 0.1.2 Released

I’m very pleased to announce the release of reports: an R package to assist in the workflow of writing academic articles and other reports.

This is the first CRAN release of reports: http://cran.r-project.org/web/packages/reports/index.html

The reports package assists in writing reports and presentations by providing a framework that brings together existing R, LaTeX/.docx and Pandoc tools. The package is designed to be used with RStudio, MiKTeX/TeX Live/LibreOffice, knitr, knitcitations, Pandoc and pander (and installr for Windows users). The user will want to download these free programs/packages to maximize the effectiveness of the reports package. Functions with two letter names are general text formatting functions for copying text from articles for inclusion as a citation.

reports

Github development version: https://github.com/trinker/reports

As reports is further developed the following are planned: (a) a help video section and (b) a vignette detailing workflow and use of reports.

Check out this introductory video:

Quick start slides:

HTML5 Slides

For more on the potential use of reports see this blog post.

Posted in reports, work flow

Workflow w/ reports package

NOTE: THIS IS NOW A PACKAGE SEE THIS LINK FOR DETAILS

Let me start with a video for people who just want to see what I’m demo-ing first:

I’ve been interested in speeding up workflow lately and spending a lot of time doing so. I’ve seen people already try to tackle this in R in the past.  This blog post covers many aspects of workflow and increasing productivity.  John Myles White has tackled this problem and created the ProjectTemplate package.  The idea is terrific but the problem is that R users are so varied in their workflows that it’s difficult to make one workflow template for everyone.  I’ve given up on that.  Instead I propose:

1. The R community modularize workflow into field dependent pieces.

For instance in qdap, an R package for quantitative discourse analysis, I’ve added a work flow template that people in my field would find suitable.  However, the report writing part I intentionally left underdeveloped because I plan to add the reports package as a piece of the workflow.  While my entire work flow is likely only useful for discourse analysis people, the reports section is much more generalizable.  In this way we build work flow from modular pieces.

2. Make the pieces flexible (within reason).

For example in the beta version of reports I have added the ability for users to submit templates via doc_temp (not sure how well this will work) which provides a template that alters the documents that the new_report template will generate. The doc_temp function is similar to package.skeleton.  The functionality will be similar to the way CRAN or CTAN house packages, with the templates library housed within the package, provided it doesn’t get too large. The submissions still need to conform to a standard (the within reason part) though the user may choose to keep their template local.

3. Use existing tools (powerful, flexible and efficient).

R has had some great developments in tools; combined with LaTeX, we can really speed up workflow: RStudio, knitr, MiKTeX/TeX Live, bibtex, knitcitations and of course R, to name a few.  By utilizing all these tools we really maximize productivity in that we’re not going to multiple places and reloading libraries and user defined functions.  As an example, recently, R bloggers Daniel Lüdecke and Andrew Landgraf discussed custom functions that they use frequently.  By placing these in the extra_functions.R script and then opening the project with RStudio, the project’s .Rprofile will source and load these functions automatically. Better still, if these are constantly used functions that don’t yet have a package home, the user can supply the path(s) to new_report and the code will be added automatically to the report project’s .Rprofile for sourcing.

The idea is to generate a template that is fast and flexible which keeps everything for a report housed in one place.  In this way the report framework of the reports package can be added as a piece to the rest of your workflow.
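To make the .Rprofile idea concrete, here is a rough sketch of the kind of thing a project .Rprofile could do: source every helper script found in a folder. This is an illustration only, not the code that reports actually writes; the folder name and the se helper are made up, and a temporary directory is used so the example runs anywhere.

```r
# Sketch of what a project .Rprofile could do: source every helper script
# found in a folder. The folder and file names are made up for this example.
helper_dir <- file.path(tempdir(), "extra_functions_demo")
dir.create(helper_dir, showWarnings = FALSE)

# Write a tiny helper script so the example is self-contained
writeLines(
    "se <- function(x) sd(x) / sqrt(length(x))  # standard error of the mean",
    file.path(helper_dir, "extra_functions.R")
)

# Source all .R files in the folder into the global environment
scripts <- list.files(helper_dir, pattern = "\\.R$", full.names = TRUE)
invisible(lapply(scripts, source))

# The helper is now available without any explicit loading step
se(c(2, 4, 6, 8))
```

Because R runs a project's .Rprofile at startup, code like the last few lines would make the helpers available as soon as the project is opened.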

Trying the reports package

#INSTALLING
library(devtools)
# current devtools expects "user/repo" rather than separate repo and user arguments
install_github("trinker/reports")

#GETTING STARTED
library(reports)
# setwd("~/your/favorite/directory/here")
new_report("New")

#PLAY AROUND A BIT
templates()   #current internally housed templates

new_report("new proj2", templates(FALSE)[2]) #quantitative Rnw
new_report("new proj3", templates(FALSE)[3]) #qualitative docx

I encourage you to view the intro video, look at the help manual, check out the html5 introductory slides and just play with the reports package a bit.  I want your feedback to make a tool others can use to help them in their work flow. If your comments are more substantial please use the Issue Tracking of GitHub.

Posted in qdap, work flow

qdap 0.2.0 released

This is the first CRAN release of qdap (qdap 0.2.0) found here.  qdap (Quantitative Discourse Analysis Package) is an R package designed to assist in quantitative discourse analysis. The package stands as a bridge between qualitative transcripts of dialogue and statistical analysis and visualization.

The qdap package automates many of the tasks associated with quantitative discourse analysis of transcripts containing discourse, including frequency counts of sentence types, words, sentences, turns of talk, syllable counts and other assorted analysis tasks. The package provides parsing tools for preparing transcript data. Many functions enable the user to aggregate data by any number of grouping variables, providing analysis and seamless integration with other R packages that undertake higher level analysis and visualization of text. This provides the user with a more efficient and targeted analysis.

qdap’s development version can be found here.

As qdap is further developed the following tasks are planned: (a) a github hosted website via staticdocs (b) a help video section and (c) a vignette detailing workflow and use of qdap.

If you spot bugs or would like to request features please use qdap’s github site.


Special thanks to Dason of talkstats.com for his patience in teaching and mentoring me through the package creation process.

Also thank you to Hadley Wickham for his great package development tools and documentation of the process.

Posted in discourse analysis, package creation, qdap

Tips for R Package Creation

I’m being tortured by the mistakes of my past self. I think I’ve made most every mistake possible in creating a package and I want to go back in time and tell year-ago me all I know now. But it seems require(timetravel) isn’t working on my machine. So instead I’ll share with other new package creators what I’ve learned along the way in a sort of tips list (Letterman style). To give context, I am working on documenting a package (qdap) after its functions are finished (bad idea) and am lamenting all the mistakes, as this was my first package attempt and it’s a major undertaking.

Here are the things (riddled with helpful links) I wish I had known then that I know now:

  1. Start Small – It’s easier to learn to drive in a car than a dump truck.  I suggest making a small package even if it’s for fun to learn the process first (a game, music player  or fun visualization may be perfect for this).  This way you can refer back to this package often for “How did I do that?”   
  2. Use git – GitHub, Bitbucket or some other git host works great for uploading a repository to the cloud (a Dropbox-style interface) so that you can back up your repo as well as share and collaborate with others.  (here’s a clip about github that’s slightly out of date but still good: LINK)  The issues tab is awesome for documenting bugs and requests.
  3. Use Rstudio – When I first started on qdap, as a windows user, the package creation process was painful.  Rstudio makes your life so much better.  Here’s a video example of how quick it is to create a package with Rstudio LINK 1 and a slightly out of date video of the interface between git and Rstudio LINK 2.
  4. Become familiar with “Writing R Extensions” manual – This is the rule book.  It’s like a club, if you don’t have the right look you aren’t getting in. Nuff said.
  5. Steal – github was designed to collaborate (aka stealing).  Find a trusted package developer and steal their format and design.  I personally steal from two places: Hadley Wickham’s github and Dason Kurkiewicz’s github.  All their files are there for easy sourcing.
  6. Document as you go – Trust me documenting over time is easier than documenting at the end.
  7. Document with roxygen2 – roxygen2 is a less painful way to write documentation (I recommend actually doing an .Rd file, aka a documentation file, by hand to feel the pain and appreciate roxygen2).  Here’s where stealing other people’s format is extremely useful; look at this Hadley .R file.  It’s nice when you’ve used roxygen2 to run roxygenize("path/to/repo") and the documentation is created.
  8. Use devtools – There are some great developmental tools in devtools (though many if not most/all are incorporated into Rstudio).
  9. Use testthat – I didn’t get why this was useful until I started trying to make changes to my package at the end.  Ever pull a thread on a sweater and it makes a big hole?  That’s what a change in a package can do, and testthat can help to make sure the changes don’t make a big hole.
  10. Learn to debug – I had no clue how cool  browser() was or how to use it when I started.  Here’s a nice video on R’s debugging tools: LINK.  Debugging stinks, debugging without tools really stinks.
  11. Reduce, Recycle, Reuse – Try to think “will I use this code chunk later?”  If the answer is yes, break it off as a function of its own and throw it in the package as an internal “helper” function.  This saves time and makes the code more readable.  Also try to make the code compact but as fast as possible.  Benchmarking and Rcpp can make the code faster.
  12. Make friends/learning community – The folks at talkstats.com and stackoverflow.com have been a tremendous help in asking about the process and getting feedback.  I wouldn’t know about most of the above things if it were not for these two learning places.
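
As a small illustration of tips 6, 7 and 9 together, here is a sketch of a helper documented with roxygen2-style comments and checked right away. The function count_words is invented for this example (it is not a qdap function), and the check uses base R's stopifnot so the snippet runs without any packages; in a real package the same assertion would normally sit in a testthat test file.

```r
#' Count the words in each element of a character vector
#'
#' A made-up helper used only to illustrate roxygen2 comments.
#'
#' @param x A character vector.
#' @return An integer vector the same length as \code{x}.
#' @examples
#' count_words(c("a big hole", "sweater"))
count_words <- function(x) {
    # split on runs of whitespace after trimming the ends
    pieces <- strsplit(trimws(x), "\\s+")
    lengths(pieces)
}

# A quick sanity check; in a package this would live in a testthat test file
stopifnot(identical(count_words(c("a big hole", "sweater")), c(3L, 1L)))
```

Writing the comment block and the check at the same time as the function is the "document as you go" habit in miniature.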

Special thanks to Dason of talkstats.com for his patience in teaching and mentoring me through the package creation process.

Posted in package creation, Uncategorized

Gradient Word Clouds

I like word clouds because they are visually appealing and provide a ton of information in a small space. Ever since I saw Drew Conway’s post (LINK) I have been looking for ways to improve word clouds. One of the nice features of Drew’s post was that he colored the words according to the gradient. Unfortunately, Drew’s cloud lacks some of the aesthetic wow factor that Ian Fellows’ wordcloud package is known for.

This post is going to show you how to color words with a gradient based on degree of usage between two individuals. For me it’s going to help me learn the following things:

  1. How to use knitr + markdown to make a blog post (I’ve been using knitr for reproducible latex/beamer reports).
  2. How to use gradients in base (i.e. outside of ggplot2 that I’ve come to depend on).
  3. How to make a gradient color bar in base.

Installing and Loading qdap and wordcloud

First you’ll need some packages to get started. I’m using my own beta package qdap plus Fellows’ wordcloud package. If you download qdap, wordcloud is part of the install. For the legend we’ll be using the plotrix package.

library(qdap)
library(wordcloud)
library(plotrix)

Reading in data

Now we’ll need some data. I happen to have presidential debate data (debate # 1) left over that we can still mine.

# download transcript of the debate to working directory
url_dl(pres.deb1.docx)

# load multiple files with read transcript and assign to working directory
dat1 <- read.transcript("pres.deb1.docx", c("person", "dialogue"))

# qprep for quick cleaning
dat1$dialogue <- qprep(dat1$dialogue)

#view a truncated version of the data (see also htruncdf)
left.just(htruncdf(dat1, 10, 45))
##    person dialogue
## 1  LEHRER We'll talk about specifically about health ca
## 2  ROMNEY What I support is no change for current retir
## 3  LEHRER And what about the vouchers?
## 4  ROMNEY So that's that's number one. Number two is fo
## 5  OBAMA  Jim, if I if I can just respond very quickly,
## 6  LEHRER Talk about that in a minute.
## 7  OBAMA  but but but overall.
## 8  LEHRER OK.
## 9  OBAMA  And so...
## 10 ROMNEY That's that's a big topic. Can we can we stay

Setting Up the Data

  1. Make a word frequency matrix
  2. Remove Lehrer’s words
  3. Scale the word usage
  4. Create a binned fill variable
word.freq <- with(dat1, wfdf(dialogue, person))[, -2]
csums <- colSums(word.freq[, -1])
conv.fact <- csums[2]/csums[1]
word.freq$ROMNEY2 <- word.freq[, "ROMNEY"] * conv.fact
#colSums(word.freq[, -1])
word.freq[, "total"] <- rowSums(word.freq[, -1])
word.freq$continum <- with(word.freq, ROMNEY2-OBAMA)
word.freq <- word.freq[word.freq$total != 0,] #remove Lehrer-only words
MAX <- max(word.freq$continum[!is.infinite(word.freq$continum)])
word.freq$continum <- ifelse(is.infinite(word.freq$continum), MAX, word.freq$continum)
conv.fact2 <- abs(range(word.freq$continum ))
conv.fact2 <- max(conv.fact2)/min(conv.fact2)
word.freq$continum <- ifelse(word.freq$continum > 0, word.freq$continum * conv.fact2, word.freq$continum)
cuts <- c(-250, -25, -15, -10, -5, -2.5, -1.5, -1, -.5, -.25)
cuts <- sort(c(cuts, 0, abs(cuts)))
word.freq$fill.var <- cut(word.freq$continum, breaks=cuts )
head(word.freq, 10)
##         Words ROMNEY OBAMA ROMNEY2   total continum    fill.var
## 1           a     83    72  73.125 228.125   1.5470   (1.5,2.5]
## 2        aarp      0     1   0.000   1.000  -1.0000   (-1.5,-1]
## 3        able      6     7   5.286  18.286  -1.7138 (-2.5,-1.5]
## 4       about     11    11   9.691  31.691  -1.3087   (-1.5,-1]
## 5       above      1     0   0.881   1.881   1.2111     (1,1.5]
## 6     abraham      0     2   0.000   2.000  -2.0000 (-2.5,-1.5]
## 7  absolutely      2     2   1.762   5.762  -0.2379   (-0.25,0]
## 8     academy      0     1   0.000   1.000  -1.0000   (-1.5,-1]
## 9      accept      1     0   0.881   1.881   1.2111     (1,1.5]
## 10 accomplish      1     0   0.881   1.881   1.2111     (1,1.5]
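
The binning step can be tried in isolation with base R. The sketch below uses a few made-up difference scores (not the debate data) and builds the break points the same way as cuts above, so you can see how cut() turns a continuous score into the interval labels that end up in fill.var.

```r
# Toy usage differences (made-up values, not taken from the debate data)
diffs <- c(-30, -1.2, -0.1, 0.4, 2, 12)

# Symmetric break points built the same way as `cuts` in the post
neg <- c(-250, -25, -15, -10, -5, -2.5, -1.5, -1, -.5, -.25)
breaks <- sort(c(neg, 0, abs(neg)))

# cut() assigns each value to a half-open interval (lower, upper]
bins <- cut(diffs, breaks = breaks)
data.frame(diffs, bins)
```

With 21 break points there are 20 bins, and the bins are deliberately narrower near zero, where most of the words fall.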

Convert the Binned Variable to Colors

I was not sure how to produce gradients outside of ggplot2 and so I asked on stackoverflow.com and received a terrific and simple answer from thelatemail (LINK). Now we’ll create a color column based on the fill.var using qdap’s lookup that uses an environment to recode.

colfunc <- colorRampPalette(c("red", "blue"))
word.freq$colors <- lookup(word.freq$fill.var, levels(word.freq$fill.var),
    rev(colfunc(length(levels(word.freq$fill.var)))))
head(word.freq, 10)
##         Words ROMNEY OBAMA ROMNEY2   total continum    fill.var  colors
## 1           a     83    72  73.125 228.125   1.5470   (1.5,2.5] #BB0043
## 2        aarp      0     1   0.000   1.000  -1.0000   (-1.5,-1] #5000AE
## 3        able      6     7   5.286  18.286  -1.7138 (-2.5,-1.5] #4300BB
## 4       about     11    11   9.691  31.691  -1.3087   (-1.5,-1] #5000AE
## 5       above      1     0   0.881   1.881   1.2111     (1,1.5] #AE0050
## 6     abraham      0     2   0.000   2.000  -2.0000 (-2.5,-1.5] #4300BB
## 7  absolutely      2     2   1.762   5.762  -0.2379   (-0.25,0] #780086
## 8     academy      0     1   0.000   1.000  -1.0000   (-1.5,-1] #5000AE
## 9      accept      1     0   0.881   1.881   1.2111     (1,1.5] #AE0050
## 10 accomplish      1     0   0.881   1.881   1.2111     (1,1.5] #AE0050
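
For readers without qdap installed, the same recoding idea can be sketched in base R by indexing the palette with the factor's integer codes. This is a swapped-in technique shown on made-up values, not what lookup does internally.

```r
# Factor with ordered bins, standing in for fill.var (toy values)
fill.var <- cut(c(-2, -0.3, 0.3, 2), breaks = c(-3, -1, 0, 1, 3))

# One color per bin, running from blue to red
colfunc <- colorRampPalette(c("blue", "red"))
palette_cols <- colfunc(nlevels(fill.var))

# Indexing by the factor's integer codes plays the role of a lookup table
cols <- palette_cols[as.integer(fill.var)]
data.frame(fill.var, cols)
```

The lowest bin gets pure blue and the highest pure red, with the bins in between shaded along the ramp.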

Plot the Word Cloud and Gradient Legend

Now that we have color gradients let’s use wordcloud to plot and plotrix’s color.legend to make a legend. I didn’t know how to create the gradient legend either and asked again on stackoverflow where I received an answer from Dason and mnel (LINK). Both great answers but I went with Dason’s.

par(mar=c(7,1,1,1))
wordcloud(word.freq$Words, word.freq$total, colors = word.freq$colors,
    min.freq = 1, ordered.colors = TRUE, random.order = FALSE, rot.per=0,
    scale = c(5, .7))
# Add legend
COLS <- colfunc(length(levels(word.freq$fill.var)))
color.legend(.025, .025, .25, .04, qcv(Romney,Obama), COLS)

gradient word cloud

Note: If you plot to the console graphics device you can’t get a large enough size to plot all the words comfortably. I achieved the above results plotting externally to png @ 1000 x 1000 (w x h)
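
Plotting externally looks roughly like the sketch below. The file name is arbitrary, a placeholder plot stands in for the wordcloud and color.legend calls, and it assumes your R build has a working png device.

```r
# Write the plot to a 1000 x 1000 pixel PNG instead of the console device.
# The file name is arbitrary; a temporary folder keeps the example tidy.
out_file <- file.path(tempdir(), "gradient_cloud_demo.png")

png(out_file, width = 1000, height = 1000)
par(mar = c(7, 1, 1, 1))
# A placeholder plot; the wordcloud() and color.legend() calls would go here
plot(1:10, main = "placeholder for the word cloud")
dev.off()

file.exists(out_file)
```

Everything drawn between png() and dev.off() goes to the file, so the margins set with par() apply to the saved image rather than to the console device.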

Concluding Thoughts

Alright, this is my first knitr generated blog post. Very easy. I regret not having tried it earlier 😦

I accomplished my goal of making a gradient word cloud and a gradient legend. The actual word cloud really isn’t that informative because there’re too many words and too little variation in word choice/colors. In some situations this approach may be useful but in this one I don’t like it. Secondly, I used the blue to red theme because it plays to the political parties but in this visualization better contrasting colors would be more appropriate. Overall I don’t feel I was successful in presenting information better than Drew Conway’s post.

What the Reader Can Take Away from the Post

  1. Using wordcloud’s user defined color feature
  2. Using qdap’s lookup to recode
  3. Creating gradients in base (easy)
  4. Creating the accompanying gradient legend

If the reader has improvements in scaling, visualizing parameters etc. please share these and other comments below.

For a .txt version of this script -click here-

Addendum:
To make a knitr output upload to wordpress.com I found help from
http://www.carlboettiger.info

Posted in discourse analysis, text, visualization, word cloud

Presidential Debates 2012

I have been playing with the beta version of qdap utilizing the presidential debates as a data set. qdap is in a beta phase lacking documentation though I’m getting there. In previous blog posts (presidential debate 1 LINK and VP debate LINK) I demonstrated some of the capabilities of qdap. I wanted to further show some of qdap’s capabilities while seeking to provide information about the debates.

In previous posts readers made comments or emailed regarding functionality of qdap. This was extremely helpful in working out bugs that arise on various operating systems. If you have praise or methods you used to run the qdap scripts please leave a comment saying so. However, if you are having difficulty please file an issue at qdap’s home, GitHub (LINK).

In this post we’ll be looking at:

1. A faceted gantt plot for each of the speeches via gantt_plot
2. Various word statistics via word_stats
3. A venn diagram showing the overlap in word usage via trans.venn
4. A dissimilarity matrix indicating closeness in speech via dissimilarity
5. igraph visualization of dissimilarity

Reading in the data sets and Cleaning

library(qdap) #load qdap
# download transcript of the debate to working directory
url_dl(pres.deb1.docx, pres.deb2.docx, pres.deb3.docx)   

# load multiple files with read transcript and assign to global environment
dat1 <- read.transcript("pres.deb1.docx", c("person", "dialogue"))
dat2 <- read.transcript("pres.deb2.docx", c("person", "dialogue"))
dat3 <- read.transcript("pres.deb3.docx", c("person", "dialogue"))

# qprep for quick cleaning
dat1$dialogue <- qprep(dat1$dialogue)
dat2$dialogue <- qprep(dat2$dialogue)
dat3$dialogue <- qprep(dat3$dialogue)

# Split each sentence into its own line
dat1b <- sentSplit(dat1, "dialogue") 
dat1$person <- factor(dat1$person , levels = qcv(ROMNEY, OBAMA, LEHRER))
dat2b <- sentSplit(dat2, "dialogue")  
dat3b <- sentSplit(dat3, "dialogue") 

# Create a large data frame by the three debates times
L1 <- list(dat1b, dat2b, dat3b)
L1 <- lapply(seq_along(L1), function(i) data.frame(L1[[i]], time = paste("time", i)))
dat4 <- do.call(rbind, L1)

#view a truncated version of the data (see also htruncdf)
truncdf(dat4)
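
The stacking step above follows a common base R pattern: tag each data frame with its position in a list, then bind the pieces together. Here is a tiny sketch on made-up data frames, so it runs without the transcripts.

```r
# Three small data frames standing in for the three debate transcripts
d1 <- data.frame(person = c("A", "B"), dialogue = c("hi", "hello"))
d2 <- data.frame(person = "A", dialogue = "again")
d3 <- data.frame(person = "B", dialogue = "bye")

# Tag each one with its position in the list, then stack them
L <- list(d1, d2, d3)
L <- lapply(seq_along(L), function(i) data.frame(L[[i]], time = paste("time", i)))
stacked <- do.call(rbind, L)
stacked
```

Iterating over seq_along(L) rather than L itself is what gives the anonymous function access to the index used for the time label.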

Faceted Gantt Plot

#reorder factor levels
dat4$person <- factor(dat4$person, 
    levels=qcv(terms="OBAMA ROMNEY CROWLEY LEHRER QUESTION SCHIEFFER"))

with(dat4, gantt_plot(dialogue, person, time, xlab = "duration(words)", scale = "free"))

rm3

Basic Word Statistics
This section utilizes the word_stats function in conjunction with ggplot2 to create a heat map for various descriptive word statistics. Below is a list of column names for the function’s default print method.

   column title description                           
1  n.tot        number of turns of talk               
2  n.sent       number of sentences                   
3  n.words      number of words                       
4  n.char       number of characters                  
5  n.syl        number of syllables                   
6  n.poly       number of polysyllables               
7  sptot        syllables per turn of talk            
8  wptot        words per turn of talk                
9  wps          words per sentence                    
10 cps          characters per sentence               
11 sps          syllables per sentence                
12 psps         poly-syllables per sentence           
13 cpw          characters per word                   
14 spw          syllables per word                    
15 n.state      number of statements                  
16 n.quest      number of questions                   
17 n.exclm      number of exclamations                
18 n.incom      number of incomplete statements       
19 p.state      proportion of statements              
20 p.quest      proportion of questions               
21 p.exclm      proportion of exclamations            
22 p.incom      proportion of incomplete statements   
23 n.hapax      number of hapax legomenon             
24 n.dis        number of dis legomenon               
25 grow.rate    proportion of hapax legomenon to words
26 prop.dis     proportion of dis legomenon to words  
z <- with(dat4, word_stats(dialogue, list(person, time), tot))
z$ts
z$gts
plot(z, low="white", high="black")
plot(z, label=TRUE, low="white", high="black", lab.digits=1)

heatmap

Venn Diagram
With proper stop word use and small, variable data sets a Venn diagram can be informative. In this case the overlap is fairly strong and less informative though labels are centered. Thus labels closer in proximity are closer in words used.

with(subset(dat4, person %in% qcv(ROMNEY, OBAMA)),  # %in% tests membership; == would recycle
    trans.venn(dialogue, list(person, time), 
    title.name = "Presidential Debates Word Overlap 2012")
)

venn

Dissimilarity Matrix

dat5 <- subset(dat4, person %in% qcv(ROMNEY, OBAMA))  # %in% tests membership; == would recycle
dat5$person <- factor(dat5$person, levels = qcv(OBAMA, ROMNEY))
#a word frequency matrix inspired by the tm package's DocumentTermMatrix
with(dat5, wfm(dialogue, list(person, time)))
#with row and column sums
with(dat5, wfdf(dialogue, list(person, time), margins = TRUE))
#dissimilarity (similar to a correlation)
#The default measure is 1 - binary or proportion overlap between grouping variable
(sim <- with(dat5, dissimilarity(dialogue, list(person, time))))
              OBAMA.time.1 OBAMA.time.2 OBAMA.time.3 ROMNEY.time.1 ROMNEY.time.2
OBAMA.time.2         0.293                                                      
OBAMA.time.3         0.257        0.303                                         
ROMNEY.time.1        0.317        0.261        0.245                            
ROMNEY.time.2        0.273        0.316        0.285         0.317              
ROMNEY.time.3        0.240        0.276        0.311         0.265         0.312
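
To get a feel for a “1 - binary” measure, here is a base R sketch on a toy word-frequency matrix. It assumes the measure is one minus the binary distance from dist(), which matches the description above but is not copied from qdap's source.

```r
# Toy word-by-group frequency matrix (rows = words, columns = groups)
m <- matrix(c(1, 2, 0, 0,
              3, 0, 1, 0),
            ncol = 2,
            dimnames = list(c("a", "b", "c", "d"), c("G1", "G2")))

# dist() works on rows, so transpose to compare groups.
# method = "binary" ignores counts and looks only at which words were used.
d <- dist(t(m), method = "binary")

# 1 - distance: share of words used by both, among words used by either
1 - d
```

Here three words are used by at least one group and only one ("a") is used by both, so the overlap is one third.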

Network Graph
The use of igraph may not always be the best way to view the data but this exercise shows one way this package can be utilized. In this plot the labels are sized based on number of words used. The distance measures that label the edges are taken from the dissimilarity function (1 – binary). Colors are based on political party.

library(igraph)
Z <- with(dat5, adjacency_matrix(wfm(dialogue, list(person, time))))
g <- graph.adjacency(Z$adjacency, weighted=TRUE, mode ='undirected')
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)

set.seed(3952)
layout1 <- layout.auto(g)
opar <- par()$mar; par(mar=rep(.5, 4)) #Give the graph lots of room
plot(g, layout=layout1)

edge.weight <- 9  #a maximizing thickness constant
z1 <- edge.weight * sim/max(sim)*sim
E(g)$width <- c(z1)[c(z1) != 0] #remove 0s: these won't have an edge
numformat <- function(val, digits = 2) { sub("^(-?)0.", "\\1.", sprintf(paste0("%.", digits, "f"), val)) }
z2 <- numformat(round(sim, 3), 3)
E(g)$label <- c(z2)[c(z2) != 0]
plot(g, layout=layout1) #check it out! 

label.size <- 15 #a maximizing label size constant
WC <- aggregate(dialogue~person +time, data=dat5, function(x)  sum(word.count(x), na.rm = TRUE))
WC <- WC[order(WC$person, WC$time), 3]
resize <- (log(WC)/max(log(WC)))
V(g)$label.cex <- 5 *(resize - .8)
plot(g, layout=layout1) #check it out!

V(g)$color <- ifelse(substring(V(g)$label, 1, 2)=="OB", "pink", "lightblue")

plot(g, layout=layout1)
tkplot(g)

igr

This blog post is a rough initial analysis of the three presidential debates. It was meant as a means of demonstrating the capabilities of qdap rather than providing in depth analysis of the candidates. Please share your experiences with using qdap in a comment below and suggestions for improvement via the issues page of qdap’s github site (LINK).

For a pdf version of all the graphics created in the blog post -click here-

Posted in ggplot2, igraph, qdap, text, Uncategorized, visualization

How do I re-arrange…?: Ordering a plot.

One of the most widely seen FAQs coming across listservs and R help sites is the question:

“How do I re-arrange/re-order (plotting geom/aesthetic such as bar/labels) in a (insert plot type here) using (insert graphics system here) in R?”

Don’t believe me? Google “reorder factor r plot” and see how many hits you get. I’d venture to say that in almost all cases when you use the words “plot” and “re-arrange”/“re-order” in a question the answer is…

Reorder your factor levels!!

Here’s a quick and dirty R theater demo of how to do this:

library(ggplot2)
ggplot(data=mtcars, aes(y=as.factor(carb), x=mpg, colour=hp)) +
    geom_point()

# Rearrange_Guy: But I want 2 to come first and 8 last
# Helpful_Gal: OK use rev with levels 

mtcars$carb2 <- factor(mtcars$carb, levels=rev(levels(factor(mtcars$carb))))

ggplot(data=mtcars, aes(y=carb2, x=mpg, colour=hp)) +
    geom_point()

# Rearrange_Guy: Well I just want to specify the order
# Helpful_Gal: OK type it in by hand then

mtcars$carb2 <- factor(mtcars$carb, levels=c("1", "2", "3", "6", "8", "4"))
ggplot(data=mtcars, aes(y=carb2, x=mpg, colour=hp)) +
    geom_point()

# Rearrange_Guy: What about faceting?  I bet it doesn't work for that.
# Helpful_Gal: Um yes it does.

ggplot(data=mtcars, aes(y=carb2, x=mpg, colour=hp)) +
    geom_point() + facet_grid(cyl~.)

# Rearrange_Guy: OK Helpful_Gal I want it to go 6, 4, and then 8
# Helpful_Gal: OK

mtcars$cyl2 <- factor(mtcars$cyl, levels=c("6", "4", "8"))
ggplot(data=mtcars, aes(y=carb2, x=mpg, colour=hp)) +
    geom_point() + facet_grid(cyl2~.)

# Rearrange_Guy: Why do you keep making new variables?
# Helpful_Gal: It's probably not the best idea to overwrite variables just for the sake of plotting
# Rearrange_Guy: Thank you for showing me the way of re-ordering and re-arranging.
# Helpful_Gal: You're welcome.

So if you catch yourself using “re-arrange”/“re-order” and “plot” in a question, think…

factor & levels
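
When the order you want comes from the data rather than from a hand-typed vector, base R's reorder() can set the levels for you. Here is a minimal sketch, assuming you want carb ordered by median mpg; carb_by_mpg is a name made up for this example and is not part of the demo above.

```r
# reorder() lives in the stats package, which is attached by default.
# carb_by_mpg is a hypothetical variable name; as in the demo, a new
# variable is created instead of overwriting carb.
mtcars$carb_by_mpg <- reorder(factor(mtcars$carb), mtcars$mpg, FUN = median)

# the levels now run from the lowest to the highest median mpg
levels(mtcars$carb_by_mpg)

# check the ordering statistic for each level
tapply(mtcars$mpg, mtcars$carb_by_mpg, median)
```

The new variable can be dropped into the same ggplot calls used above (y=carb_by_mpg), and the axis will follow the new level order.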
Posted in factor, ggplot2

Vice Presidential Debates with qdap-beta

After the presidential debates I used the beta version of qdap to provide some initial surface-level analysis (LINK to Presidential Debates with qdap-beta). In the comments of that post, a commenter (annon) provided a link to an analysis/visualization that uses bubbles to show the proportion of words, and colors and labels to show each candidate’s usage (LINK). I initially liked the graphic, but it was the shape and colors that appealed to me. Closer inspection reveals that it is hard to get information about the smaller words, and the bubbles make comparing across words difficult. I decided to attempt a visualization for the vice presidential debates using qdap and ggplot2.

I decided to use themes rather than individual words, grouping similar words together. This approach uses a function in qdap called termco. Here are the function’s arguments:

termco(text.var, grouping.var=NULL, match.list, short.term = FALSE, 
    ignore.case = TRUE, lazy.term = TRUE, elim.old = TRUE, 
    zero.replace = 0, output = "percent", digits = 2)

Basically you can supply a list of named character vectors (our themes) to this function as well as dialogue (the debate text) and grouping variable (person) and it will output a list with several data frames. You can get raw counts, percent/proportions or a combination of raw and percent/proportions by grouping variable (person) for each theme.
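
To make the shape of that output concrete, here is a base R sketch of the same idea: raw theme counts and percentages by person. It is not termco's actual implementation. The toy data, the count_theme helper, and the padding and lowercasing steps are all assumptions made for illustration.

```r
# Toy data standing in for the debate transcript (hypothetical text)
toy <- data.frame(
    person   = c("A", "B", "A", "B"),
    dialogue = c("jobs and the economy matter",
                 "health insurance costs too much",
                 "the budget helps business and jobs",
                 "doctors and hospitals need help"),
    stringsAsFactors = FALSE
)

themes <- list(
    health   = c(" health", " insurance", " hospital", " doctor"),
    economic = c(" econom", " jobs", " business", " budget")
)

# count_theme is a hypothetical helper: total fixed-string hits of all
# terms in one theme within one string
count_theme <- function(terms, text) {
    padded <- paste0(" ", tolower(text), " ")  # pad so edge words can match
    sum(vapply(terms, function(term) {
        hits <- gregexpr(term, padded, fixed = TRUE)[[1]]
        if (hits[1] == -1L) 0L else length(hits)
    }, integer(1)))
}

# collapse the dialogue by person, then count each theme per person
by_person <- vapply(split(toy$dialogue, toy$person), paste,
    character(1), collapse = " ")
raw <- sapply(themes, function(terms) {
    sapply(by_person, count_theme, terms = terms)
})

# percent of each person's words that hit each theme
wc <- lengths(strsplit(by_person, " +"))
prop <- round(100 * raw / wc, 2)

raw   # rows are people, columns are themes
prop
```

In this toy data person A only hits the economic theme (5 hits in 11 words) and person B only hits the health theme (4 hits in 10 words), which is the kind of contrast the raw and percent tables are meant to expose.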

The important part is the themes we supply to match.list. This function relies on gregexpr, meaning it will do partial matching, so there are some things you’ll want to think about when supplying the themes:

  1. If you want to find “read” but not “bread” or “reading”, use a leading and trailing white space, as in " read "
  2. If you want to find any word that starts with the root “read”, use only a leading white space, as in " read"
  3. The leading-space form will also find “ready”, so if you want only forms of the word “read” you’ll have to be explicit and put each form in the vector with leading and trailing white spaces; i.e., " read ", " reads ", " reader" (reader and readers), " reading "
  4. If you use " obama" and " obamacare", termco will count obamacare two times; instead use " obama " and " obamacare ", or just " obama"
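
The spacing rules in the list above can be checked directly with gregexpr in base R. This is only a sketch of fixed-string partial matching. The count_hits helper and the test sentences are made up for illustration, and termco may prepare the text differently before matching.

```r
# count_hits is a hypothetical helper: number of non-overlapping
# fixed-string matches of pattern in text
count_hits <- function(pattern, text) {
    hits <- gregexpr(pattern, text, fixed = TRUE)[[1]]
    if (hits[1] == -1L) 0L else length(hits)
}

txt <- " i read the bread label before reading that the reader was ready "

count_hits("read", txt)    # 5: read, bread, reading, reader, ready
count_hits(" read", txt)   # 4: read, reading, reader, ready
count_hits(" read ", txt)  # 1: only the whole word read

txt2 <- " obama defended obamacare "

count_hits(" obama", txt2)       # 2: obamacare is counted as obama too
count_hits(" obama ", txt2)      # 1
count_hits(" obamacare ", txt2)  # 1
```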

The basic form for the list of vectors supplied to match.list is:

target.words <- list(
    theme_1 = c(),
    theme_2 = c(),
    theme_n = c()
)

Let’s look at the results with some themes I examined for VP debates

library(qdap)

url_dl("vpres.deb1.docx")  #downloads a docx file of the debate to wd

dat <- read.transcript("vpres.deb1.docx", col.names=c("person", "dialogue"))
truncdf(dat)
left.just(dat)
dat$dialogue <- qprep(dat$dialogue)  
dat2 <- sentSplit(dat, "dialogue")  
htruncdf(dat2)   #view a truncated version of the data (see also truncdf)
dat2$person <- factor(Trim(dat2$person))

#the themes we're looking at (termco is only as good as the researcher who supplied these themes)
tw2 <- list(health=c(" health", " insurance", " medic", "obamacare", " hospital", " doctor"), 
        economic = c(" econom", " jobs", " unemploy", " business", " banks", " mortgage",
            " budget", " market", " paycheck", " wall street"),
        foreign = c(" war ", " terror", " foreign", "iran", "iraq", "sanctions", "nuclear", 
            "al qaida", "libya", "netanyahu", "israel", "africa", "afgha", " embassy", "russia"),
        democratic_people = c("the president", " obama ", " obamas", " obama's", "biden", 
            "the vice president", "mister vice president"),
        republican_people = c("my friend", " ryan", "romney"),
        obama_any_name = c("obama ", "obamas", "obama's", "the president"),
        "romney",  #you don't have to name a vector of length 1
        obama_by_name = c("obama ", "obamas", "obama's"))


(a <- with(dat2, termco(dialogue, person, tw2, short.term = TRUE)))

names(a)  #see what else is in the termco object
a$raw  #raw numbers of use
a$prop  #proportions or percentages of use
a$rnp  #default print for termco
plot(a)

For a txt version of the data frame that termco produces click here

Creating the graphic of the themes via ggplot2

library(ggplot2)
library(reshape2)
dat3 <- melt(a$raw[-2,], id=qcv(person, word.count)) #drop the moderator
dat3$labs <- melt(a$rnp[-2,], id=qcv(person, word.count))[, 4]
dat3$variable <- factor(dat3$variable, levels=names(sort(apply(a$prop[-2, -c(1:2)], 2, max))))
dat3$loc <- dat3$value - 6.5; dat3$loc[15] <- 7; dat3$loc[6] <- 65.75
dat3$cols <- rep("white", 16); dat3$cols[1] <- "black"

ggplot(dat3, aes(x=variable,  y=value, fill=person)) + 
    geom_bar(position="dodge", stat="identity")  +
    coord_flip() + theme_bw() + 
    theme(legend.position=c(.91, 0.07), legend.background = element_rect(color="grey60"),
        panel.grid.major=element_blank(),panel.grid.minor=element_blank()) +
    ylab("Occurrences") +
    xlab("Theme") +
    scale_fill_manual(values=c("#0000FF", "#FF0000"),
        name="Candidate", guide = guide_legend(reverse=TRUE)) +
    geom_text(aes(label = labs,  y = loc, x = variable),
              size = 5, position = position_dodge(width=0.9), color=dat3$cols)  + 
    scale_y_continuous(expand = c(0, 0), breaks=seq(0,80,20)) #value is numeric, so use a continuous scale

The graphic
vp themes

For a pdf version of the output click here

Discussion of the results
At first I ran a search to see who used the name Obama the most, and I saw that Vice President Biden used the name only once. I initially concluded (wrongly) that he was focused on himself; after all, the point of the vice presidential debates is to sell your boss as the winner. After more inspection of the terminology (via word clouds) I found that Biden refers to President Obama as “The President”. This must be an inner-circle respect thing that’s so ingrained in the Vice President that using the term “Mr. Obama” or “President Obama” just doesn’t happen for him.

I also noticed Ryan pushed the economic theme hard. Vice President Biden discussed the opposition quite a bit as well.

This was a quick and dirty demo. I didn’t put a tremendous amount of thought into the themes; the aim was to demonstrate how qdap can help the researcher represent themes numerically and visually.

Posted in discourse analysis, ggplot2, qdap, text, Uncategorized, visualization