Karin Groothuis gave a presentation on the mice package, which she co-authored. It imputes missing data and automatically pools the statistical results across the analyses of the separate imputed data sets. Slides: mice_R_TRUG
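A minimal sketch of that workflow (impute, analyse each completed data set, pool the results), using the nhanes example data that ships with mice; this is an illustration, not code from the talk:

```r
library(mice)

# nhanes ships with mice and contains missing values
imp <- mice(nhanes, m = 5, printFlag = FALSE, seed = 123)  # create 5 imputed data sets
fits <- with(imp, lm(bmi ~ age))                           # run the analysis on each set
pooled <- pool(fits)                                       # combine results (Rubin's rules)
summary(pooled)
```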
A function for sentiment analysis (on Twitter data, for example)
# based on https://github.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107
score.sentiment = function(sentences, pos.words, neg.words, .progress = 'none')
{
  require(plyr)
  require(stringr)
  # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
  # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress = .progress)
  scores.df = data.frame(score = scores, text = sentences)
  return(scores.df)
}
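The core of the scoring logic can be tried in isolation with base R, using a toy sentence and two tiny hand-made word lists (no plyr or stringr needed for a single sentence):

```r
pos.words <- c("good", "great", "happy")
neg.words <- c("bad", "sad", "awful")

sentence <- "What a GREAT day, not bad at all!"
sentence <- gsub("[[:punct:]]", "", sentence)      # strip punctuation
words <- unlist(strsplit(tolower(sentence), "\\s+"))

# match() gives the position of each word in the dictionary, or NA;
# !is.na() turns that into TRUE/FALSE, and sum() counts the TRUEs
score <- sum(!is.na(match(words, pos.words))) - sum(!is.na(match(words, neg.words)))
score  # one positive ("great") minus one negative ("bad") = 0
```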
Checking p-values in papers
Analysing Twitter data with the twitteR package
R code:
library(twitteR)
library(tm)
library(wordcloud)
library(stringr)
library(plyr)
#Log in to twitter (you can find these details on https://apps.twitter.com)
#Replace the strings between "" with the codes from your own Twitter account.
consumer_key <- "Consumer Key (API Key)"
consumer_secret <- "Consumer Secret (API Secret)"
access_token <- "Access Token"
access_secret <- "Access Token Secret"
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#Some examples
userTimeline("utwente", n = 10) # search for tweets of a specific user
searchTwitter("utwente", n = 10) # search for keywords
#Discover twitteR object
?searchTwitter
tweetsList <- searchTwitter("utwente", n = 10)
tweet <- tweetsList[[1]]
tweet$getScreenName()
tweet$getText()
tweet$favoriteCount
tweet$retweetCount
#Harvest tweets based on keyword.
mach_tweets <- searchTwitter("#prayforparis", n = 1500, lang = "en")
#Extract the text from the tweets in a vector
#See http://davetang.org/muse/2013/04/06/using-the-r_twitter-package/ for an
#approach in Windows.
mach_text <- sapply(mach_tweets, function(x) x$getText())
mach_text <- iconv(mach_text, to = "utf-8-mac")
###Some initial cleaning
# Remove URLs
mach_text <- gsub("(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", mach_text, ignore.case = TRUE)
# Remove @UserName
#mach_text <- gsub("@\\w+", "", mach_text)
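The effect of the two substitutions can be checked on a made-up example string (invented for illustration, not a real tweet):

```r
txt <- "Thoughts with Paris tonight https://t.co/abc123 @someuser"
txt <- gsub("(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", txt, ignore.case = TRUE)  # strip URLs
txt <- gsub("@\\w+", "", txt)                                                 # strip @mentions
txt  # note: the greedy URL pattern removes everything from "https" to the end of the string
```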
# Create a corpus
mach_corpus <- Corpus(VectorSource(mach_text))
# create document term matrix applying some transformations
tdm <- TermDocumentMatrix(mach_corpus,
                          control = list(removePunctuation = TRUE,
                                         stopwords = c("prayforparis", "paris", "http", "https", stopwords("english")),
                                         removeNumbers = TRUE, tolower = TRUE))
## further exploration of the term-document matrix
#frequent words
findFreqTerms(tdm, lowfreq = 100)
#association?
findAssocs(tdm, terms = "syria", corlimit = 0.3)
## define tdm as matrix
tdMatrix <- as.matrix(tdm)
# get word counts in decreasing order
word_freqs <- sort(rowSums(tdMatrix), decreasing=TRUE)
# create a data frame with words and their frequencies
df <- data.frame(word=names(word_freqs), freq=word_freqs)
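For a quick sanity check, the same kind of word counts can be produced without tm, using base R on a small invented text vector (this skips tm's punctuation and stop-word handling):

```r
texts <- c("pray for paris", "paris stands strong", "pray pray")
words <- unlist(strsplit(tolower(texts), "\\s+"))   # one long vector of words
freqs <- sort(table(words), decreasing = TRUE)      # counts, most frequent first
head(freqs)  # "pray" occurs 3 times, "paris" 2 times, the rest once
```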
# plot wordcloud
pdf("wcParis.pdf")
wordcloud(df$word, df$freq, random.order = FALSE, colors = brewer.pal(8, "Dark2"), min.freq = 20)
dev.off()
####Sentiment analyses####
# based on https://github.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107
# download Opinion Lexicon (Hu and Liu, KDD-2004) http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
listPosWords <- scan("positive-words.txt", what = "character", comment.char = ";")
listNegWords <- scan("negative-words.txt", what = "character", comment.char = ";")
sentScoreTweets <- score.sentiment(mach_text, listPosWords, listNegWords, .progress = "text")
hist(sentScoreTweets$score)
Next TRUG meeting: 16 November
Our next meeting will take place on 16 November at 12 o’clock in the Cubicus building, room C124. Elze Ufkes will give a presentation on the analysis of twitter data in R.
The slides of the last meeting can be found here:
https://twenterug.wordpress.com/wp-content/uploads/2014/11/sukaesi.ppt
February meeting
Our February meeting will take place on 25 February (next week Wednesday) at 12 o’clock in the Cubicus building, room C124. Sukaesi Marianti will give a presentation on the estimation of growth models in R.
The slides of the December meeting can be found here: https://twenterug.wordpress.com/wp-content/uploads/2014/11/trug_dec_2014_pipeline.pdf
Time & location of the next meeting
Due to the low response, we decided to change the location of the next meeting (so you don’t have to bicycle to far-away Boekelo anymore).
The meeting will now take place on 15 December (next Monday) at 12 o’clock in the Cubicus building, room C232a.
The February meeting has been postponed to 26 February (time, place and topic to be announced).
We hope to see you then!
Next TRUG meetings!
I am very happy to finally (!) announce the next two TRUG meetings!
Next meeting, Stéphanie van den Berg will give a presentation on a pipeline for analysing data on twins, using the R library knitr. The pipeline automatically reads in the data, runs the analysis and creates a LaTeX file with the results of the analysis. The LaTeX file is then automatically converted into a PDF file.
Depending on how many of you are able to come, the meeting will take place on December 15th (Monday) or December 18th (Thursday). Please indicate your preference:
We will inform you by e-mail about the exact date and location as soon as there is a clear preference for one of the dates. As usual, the meeting will take place at Stephanie’s little farm in Boekelo (time to be announced).
The meeting after that will take place on February 12th (Thursday). The presenter will be Sukaesi Marianti (topic to be announced).
There is of course still a lot of time left to make up your mind, but if you already know that you can (or cannot) attend, please let us know:
The dplyr package
In the June ’14 meeting of the Twente R User Group, Martin Schmettow gave a presentation on the R package ‘dplyr’. Code written with this package runs fast, can transparently deal with remote data sources and reads well. Furthermore, it interfaces nicely with the plyr and ggplot2 packages. You can find the slides of the presentation here: Dplyr package.
Next meeting: The dplyr package
At the next meeting, on Wednesday 25 June 2014, Martin Schmettow will give a presentation on the R library dplyr. The dplyr package provides useful tools for efficiently manipulating datasets in R. For those who are familiar with the package plyr, dplyr is the ‘next iteration’ of plyr: it focuses on data frames and is faster and easier to use.
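As a small taste of the verb-based style dplyr introduced, here is a sketch using the built-in mtcars data (assuming dplyr is installed; the column choices are ours, not from the talk):

```r
library(dplyr)

# average fuel economy per number of cylinders, most economical group first
mpg_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))
mpg_by_cyl
```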
We are meeting at Stephanie’s little farm in Boekelo at 17.30. A group of TRUG members is going to Boekelo by bicycle. In case you want to join us, we are meeting at the entrance of the Cubicus building at 17.00.