Text Mining

How to scrape, import and visualize Telegram chats

Intro

Telegram is a cloud-based and cross-platform instant messaging service. Unlike WhatsApp, Telegram clients exist not only for mobile devices but also for desktop operating systems. In addition, there is also a web based client.

In this blog post I show, how to import Telegram chats into R and how to plot a chat using the qdap package.

R packages

The following code will install load and / or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(readr, qdap, lubridate)

Scraping the data

A very straightforward way to save Telegram chats is to use the Chrome extension Save Telegram Chat History. On Quora, Stacy Wu explains how to use it:

Visit https://web.telegram.org.
Select a peer you want to get the chat history from.
Load the messages.
Select all text (Ctrl+A) and copy it (Ctrl+C).
Paste the text into a text editor (Ctrl+V) and save to a local file.

I have saved the chat to a local csv-file. Since the first two lines contain non-tabular information, they need to be manually removed. Furthermore, undesired line breaks must be manually removed as well.

Importing the data

The following line of code shows how to import the csv file into R using the read_csv() function of the readr package.

mydata <- readr::read_csv("telegram.csv", col_names=FALSE)

Data wrangling

After importing the file, our data frame consists of two string variables: X1 containing information about day and time of the conversations and X2 containing the names of the persons involved in the chat as well as the chat text. With the following lines of code we create 4 new variables:

day containing the dates of the chats,
time containing the times of day of the chats,
person containing the names of the persons involved in the chat,
txt containing the actual chat text.

mydata$day <- stringr::str_sub(mydata$X1, 1, 10)
mydata$day <- lubridate::dmy(mydata$day)
mydata$time <- stringr::str_sub(mydata$X1, 12, 19)
mydata$time <- lubridate::hms(mydata$time)
mydata$person <- stringr::str_extract(mydata$X2, "[^:]*")
mydata$person <- factor(mydata$person, levels = unique(mydata$person), labels = c('Me', 'Other'))
mydata$txt <- gsub(".*:\\s*","", mydata$X2)
mydata <- mydata[, c(3:6)]
head(mydata, 2)

## # A tibble: 2 × 4
##          day         time person   txt
##       <date> <S4: Period> <fctr> <chr>
## 1 2017-01-20  21H 10M 14S     Me Hello
## 2 2017-01-20  21H 11M 42S  Other  Huhu

Gradient word cloud

Since the chat involves only two persons, I decided to plot it as gradient word cloud, a visualization technique developed by Tyler Rinker. The function gradient_cloud() I use in the next code snippet is part of his qdap package. Gradient word clouds “color words with a gradient based on degree of usage between two individuals” (See).

gradient_cloud(mydata$txt, mydata$person, title = "Gradient word cloud of Telegram chat")

The chat I have ploted is very short and, thus, not very telling. I'm wondering how it looks in a couple of months.

Text mining PubMed using R – a case study

Intro

For a couple of years, I've been doing research in the field of psychooncology, which basically is “concerned both with the effects of cancer on a person's psychological health as well as the social and behavioral factors that may affect the disease process of cancer and/or the remission of it.” The most influential psycho-oncological journal is named “Psycho-Oncology”. The journal was founded in 1992 and is published monthly by Wiley-Blackwell.

Data

In this blog post, I'll play arround with some bibliographic data from the PubMed data base. With the following query, we receive bibliographical information for every paper published in Psycho-Oncology from 1997 up today.

("Psycho-oncology"[Jour])

As the following screen shot shows, our search query returned 2785 entries. The search results can be downloaded and saved into a .csv file by clicking on the arrow right beside Send to, choosing destination (File) and format (CSV).

Fig. 1: Screenshot from PubMed

R packages

The following code will install load and / or install the R packages required for this blog post.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(pacman::p_load(readr, stringr, ggplot2, qdap, qdapRegex))

Import

The data can be easily imported into R using the read_csv() function from the readr package.

# import .csv file
mydata <- read_csv("pubmed_result.csv")

Data manipulation

With the next code snippet, we do some data manipulation in order to make our data frame tidy. We start with removing the last row of the data frame containing data about a paper published in 1995, remove superfluous headlines, entries without author and select the variables we need for analysis. The str_detect() function stems from the stringr package.

# remove only article from 1995
mydata <- mydata[1:nrow(mydata)-1,]
# remove headlines
mydata <- subset(mydata, str_detect(URL, 'URL')==FALSE)
# remove entries without author
mydata <- subset(mydata, str_detect(Description, 'Author A.')==FALSE)
mydata <- subset(mydata, str_detect(Description, '\\[No authors listed\\]')==FALSE)
# Select variables
mydata <- mydata[, c('Title', 'Description', 'ShortDetails')]

In the next step, we extract the year of publication from the ShortDetails variable. Based on the year variable, we create a variable named time merging the years to four periods of time (five years sub-periods between 1998 and 2016). Furthermore, we calculate the number of authors for each paper using the str_count function of the stringr package. Since each author except for the last is followed by a comma, we just need to count the number of commas and add 1.

# Return Year of publication
mydata$Year <- as.numeric(str_sub(mydata$ShortDetails, 
                                  nchar(mydata$ShortDetails)-3, 
                                  nchar(mydata$ShortDetails)))
mydata$ShortDetails <- NULL
colnames(mydata) <- c('title', 'authors', 'year')
mydata$time <- ifelse(mydata$year <= 2001, 1,
                      ifelse(mydata$year >= 2002 & mydata$year <=2006, 2,
                             ifelse(mydata$year >= 2007 & mydata$year <=2011, 3, 4)))
mydata$year <- as.factor(mydata$year)
# Categorize years
mydata$time <- factor(mydata$time, levels = c(1:4),
                      labels = c('1997-2001', '2002-2006', '2007-2011', '2012-2016'))
# Calculate number of authors
mydata$auth.n <- str_count(mydata$authors, '\\,') + 1

Before we can plot these data, we need to create 4 more data frames:

df.pup containing the number of papers published each year between 1997 and 2006,
df.auth.1 containing the mean number of authors per paper and year,
df.auth.2 containing the mean number of authors per paper and time period,
df.auth.3 containing the maximum number of authors per paper and year,

df.pup <- data.frame(table(mydata$year))
df.auth.1 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$year), mean))
df.auth.2 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$time), mean))
df.auth.3 <- data.frame(aggregate(mydata$auth.n, list(time = mydata$year), max))

Now, we can plot the data using the ggplot2 package.

ggplot(df.pup, aes(x=Var1, y=Freq, fill=0-Freq)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=Freq), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, max(df.pup$Freq)+10), breaks = seq(0, max(df.pup$Freq)+10, by=20)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 2: Number of papers published each year",
       caption = "Data source: PubMed")

plot of chunk ggplot1

Except for a few years, there has been a continuous rise in the number of publications. Since January 2013, Psycho-Oncology has been available exclusively online. Thus, the number of annual papers published is not restricted by requirements of the printed version anymore. Consequently, there was an enormous rise in the number of publications in 2013. However, in the following year (2014) the number of publications declined quite remarkably. The reason for this decline may be a lack of submissions.

ggplot(df.auth.1, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, 10), breaks = seq(0,10, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 3: Mean number of authors per paper and year",
       caption = "Data source: PubMed")

plot of chunk ggplot2

According to figure 3 and 4, the number of authorships per article has been rising over the whole time period. This is in accordance with the general trend towards a rise in the number of personal authors per citation.

ggplot(df.auth.2, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, 10), breaks = seq(0,10, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 4: Mean number of authors per paper and time period",
       caption = "Data source: PubMed")

plot of chunk ggplot3

Between 1997 and 2016, the mean number of authors per publication has increased from about four to six. The enormous rise in the maximum number of authors (see Fig. 5) supports the idea of an inflation in the number of authors per paper.

ggplot(df.auth.3, aes(x=time, y=x, fill=0-x)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(x,1)), size = 4, 
            fontface=2, hjust =  .45, vjust = -0.75) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', limits=c(0, max(df.auth.3$x)+5), breaks = seq(0, max(df.auth.3$x)+5, by=5)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  labs(title = "Fig. 5: Max number of authors per paper and year",
      caption = "Data source: PubMed")

plot of chunk ggplot4

Before we analyze the data, some janitorial work needs to be done.
We start with (1) defining a list of stopwords, (2, 3) remove non-word characters, (4, 5) replace spaces in two word groups we want to be grouped together, and (6) remove the defined stopwords from the variable containing the citation titles.

sw <- c(Top200Words, 'among') # qdap
mydata$txt <- rm_non_words(mydata$title) # qdapRegex
mydata$txt <- str_replace_all(mydata$txt, "'", "") # stringr
keeps <- c('quality of life', 'health related')
mydata$txt <- space_fill(mydata$txt, keeps, sep = "~~") # qdap
mydata$txt <- rm_stop(mydata$txt, sw,
                      separate = FALSE) # qdap

In the next step, we generate a distribution table for all words of the citation titles. The resulting data frame (df.words) consits of three variables: interval (citation title terms), freq (term frequency), and percent (percentage of the term).

df.words <- dist_tab(breaker(mydata$txt)) # qdap
df.words <- df.words[,c(1, 2, 4)]
df.words <- subset(df.words, percent > 0.5)
df.words <- df.words[order(df.words$percent),] 
df.words$interval <- factor(df.words$interval, 
                            levels=df.words$interval,
                            labels=df.words$interval, 
                            ordered=TRUE) 
kable(head(df.words))

	interval	freq	percent
666	coping	141	0.51
2407	prostate	140	0.51
789	depression	143	0.52
1335	health	147	0.53
2678	review	150	0.54
2190	patient	154	0.56

With the next R code we create a list l.txt_time containing four sub lists.

# group titles by time period
l.txt_time <- by(mydata$txt, mydata$time, breaker)
str(l.txt_time)

## List of 4
##  $ 1997-2001: chr [1:2160] "relationship" "between" "trait" "anxiety" ...
##  $ 2002-2006: chr [1:3608] "measurement" "coping" "stress" "responses" ...
##  $ 2007-2011: chr [1:7310] "novel" "stop" "multidisciplinary" "clinic" ...
##  $ 2012-2016: chr [1:14606] "relationship" "between" "objectively" "measured" ...
##  - attr(*, "dim")= int 4
##  - attr(*, "dimnames")=List of 1
##   ..$ mydata$time: chr [1:4] "1997-2001" "2002-2006" "2007-2011" "2012-2016"
##  - attr(*, "call")= language by.default(data = mydata$txt, INDICES = mydata$time, FUN = breaker)
##  - attr(*, "class")= chr "by"

The following code snippet is to convert these four lists into separate data frames, each containing the most frequent citation title terms of the different periods of time.

# time span 1997-2001
df.t1 <- dist_tab(l.txt_time[[1]])
df.t1 <- df.t1[,c(1, 2, 4)]
df.t1 <- subset(df.t1, percent > 0.2)
df.t1 <- df.t1[order(df.t1$percent),] 
df.t1$interval <- factor(df.t1$interval, 
                         levels=df.t1$interval,
                         labels=df.t1$interval, 
                         ordered=TRUE) 
df.t1$year <- levels(mydata$time)[1]
# time span 2002-2006
df.t2 <- dist_tab(l.txt_time[[2]])
df.t2 <- df.t2[,c(1, 2, 4)]
df.t2 <- subset(df.t2, percent > 0.2)
df.t2 <- df.t2[order(df.t2$percent),] 
df.t2$interval <- factor(df.t2$interval, levels=df.t2$interval,
                         labels=df.t2$interval, ordered=TRUE) 
df.t2$year <- levels(mydata$time)[2]
# time span 2007-2011
df.t3 <- dist_tab(l.txt_time[[3]])
df.t3 <- df.t3[,c(1, 2, 4)]
df.t3 <- subset(df.t3, percent > 0.3)
df.t3 <- df.t3[order(df.t3$percent),] 
df.t3$interval <- factor(df.t3$interval, levels=df.t3$interval,
                         labels=df.t3$interval, ordered=TRUE) 
df.t3$year <- levels(mydata$time)[3]
# time span 2002-2016
df.t4 <- dist_tab(l.txt_time[[4]])
df.t4 <- df.t4[,c(1, 2, 4)]
df.t4 <- subset(df.t4, percent > 0.3)
df.t4 <- df.t4[order(df.t4$percent),] 
df.t4$interval <- factor(df.t4$interval, levels=df.t4$interval,
                         labels=df.t4$interval, ordered=TRUE) 
df.t4$year <- levels(mydata$time)[4]

Eventually, we create a data frame (df.time) containing the same variables as the data frame df.words and a variable (year) defining the period of time the paper was published.

# merging
df.time <- rbind(df.t1, df.t2, df.t3, df.t4)
df.time <- merge(df.time, df.words, by='interval')
df.time <- df.time[,1:4]
colnames(df.time) <- c('term', 'freq', 'percent', 'year')
df.time$year <- as.factor(df.time$year)

Finally, we plot the most frequent citation title terms over four periods of time.

ggplot(df.time, aes(x=term, y=percent, fill=year)) +
  geom_bar(stat="identity", position="dodge", width = .5) + 
  geom_text(aes(label=round(percent,1)), size = 4, 
            fontface=2, hjust =  -0.3, vjust = 0.4) + 
  theme_grey() +
  scale_size_area() +
  scale_y_continuous('', 
                     limits=c(0, max(df.time$percent)+1), 
                     breaks = seq(0, max(df.time$percent)+1, by=2)) +
  scale_x_discrete('') +
  theme(legend.position="none") +
  coord_flip() +
  facet_grid(. ~ year) +
  labs(title = "Fig. 6: Most frequent terms in time",
       caption = "Data source: PubMed")

plot of chunk ggplot5

Above all, the terms survivors and distress were increasingly used over time which reflects a growing interest in so-called psychological distress (for a critical discussion see) and cancer survivorship.

Auswertung von LEGIDA-Polizeiberichten. Teil II: Worthäufigkeiten

In Teil II meiner Serie über Pressemitteilungen, die von der Polizeidirektion Leipzig anlässlich der Demonstrationen der fremdenfeindlichen LEGIDA-Bewegung veröffentlicht wurden, zeige ich heute, welche Worte in diesen Berichten am häufigsten verwendet werden.

## Warning in readChar(con, 5L, useBytes = TRUE): kann komprimierte Datei 'C:/
## ProgrammeNK/GDrive/Projects/R/Polizeiberichte/Legida.RData' nicht öffnen.
## Grund evtl. 'Datei oder Verzeichnis nicht gefunden'

## Error in readChar(con, 5L, useBytes = TRUE): kann Verbindung nicht öffnen

Die Auszählung von Worthäfigkeiten bezieht sich dabei (fast) nur auf bedeutungstragende Worte. Sogenannte Stopwords wurden von der Analyse ausgeschlossen. Die verwendete Stopwords-Liste findet sich unter dem folgenden Link zum Download.

library(ggplot2)

plt.words <- ggplot(df.words, aes(interval, freq, fill = 500 - freq)) +
  geom_bar(stat="identity", position="dodge", width = 0.75) + 
  scale_size_area() +
  scale_y_continuous('', limits=c(0, max(df.words$freq)+10), breaks = seq(0, max(df.words$freq)+10, by = 20)) +
  scale_x_discrete('') +
  theme(legend.position="none") + 
  coord_flip() +
  ggtitle("HÃ¤ufigste Wortnennungen") +
  geom_text(aes(label = paste0(percent, '%'), ymax = 0), size = 3, fontface=2, 
            hjust = -0.5, vjust = 0.2)

plt.words

plot of chunk bar

Mit dem wordcloud-Paket lässt sich das Ganze auch als Wordcloud darstellen.

colfunc <- colorRampPalette(c("blue", "red"))

set.seed(4)
par(mar = c(0, 0, 0, 0))
wordcloud::wordcloud(txt.wc, 
                     scale=c(3,.3),
                     min.freq=3,
                     max.words=150,
                     random.order=FALSE,
                     colors = colfunc(200))

plot of chunk wc

Schaut man sich die Auszählung der Worthäufigkeiten an, so ist erkennbar, dass es in den polizeilichen Pressemitteilungen häufig um eine zeitliche und r?umliche Einordnung des Geschehens geht. So werden zum einen sehr oft Uhrzeiten berichtet (vgl.). Zum anderen lässt sich erkennen, dass vor allem der Richard-Wagner-Platz, der Augustusplatz sowie der Leipziger Hauptbahnhof zentrale Örtlichkeiten der LEGIDA-Demonstartionen sind.

Serie: Analyse interaktiver Patientenportale (Teil I)

Im Internet befindet sich eine Vielzahl von Informationen zum verschiedenen gesundheitsbezogenen Themen. Mit der Einführung des Web 2.0 haben sich in den letzten zehn Jahren eine Reihe von interaktiven Online-Foren etabliert, auf denen betroffene Personen zum einen die Möglichkeit haben, untereinander Erfahrungen auszutauschen und sich zum anderen ratsuchend an medizinische Experten wenden können. Der in diesen Foren geführte Diskurs enthält zahlreiche frei zugängliche Informationen über Fragen und Probleme der Betroffenen, die potentiell auch für die medizinische Versorgung relevant sind.

Da es sich bei den so geführten Diskursen um Texte handelt, kommen etablierte statistische Verfahren zur Analyse dieser Daten nicht in Frage. Andererseits stoßen aber auch die klassischen interpretativen Methoden der qualitativen Sozialforschung angesichts der im Netz verfügbaren riesigen Textkorpora schon allein aus forschungspraktischen Gründen an ihre Grenzen.

Zwar stehen für die Analyse solcher Textkospora einige neuere Analysemethoden zur Verfügung (“Computational Linguistics”, “Natural Language Processing”, “Text Mining” etc.), jedoch sind dise in den Sozialwissenschaften bisher wenig bis gar nicht etabliert.

In diesem Blog werde ich in unregelmäßigen Abständen über meine eigene Forschung auf dem Gebiet der quantitativen Textanalyse berichten.

Das Kanzlerduell 2013

Diese “Gradient Word Cloud” habe ich mit Hilfe des GNU R “qdap”-Paketes generiert. Die Analyse des “Kanzlerduells” zwischen Angela Merkel und Peer Steinbrück zeigt zum einen, welche bedeutungstragenten Wörter am häufigsten benutzt wurden (markiert durch die Größe des Worts) und zum anderen, welcher der beiden Kanzlerkandidaten das entsprechende Wort häufiger benutzt hat als der andere. Dabei wurden alle blau gefärbten Wörter von Angela Merkel und alle rot eingefärbten Wörter von Peer Steinbrück häufiger genannt.

Intro

R packages

Scraping the data

Importing the data

Data wrangling

Gradient word cloud

Share this:

Intro

Data

Fig. 1: Screenshot from PubMed

R packages

Import

Data manipulation

Text mining

Share this:

Share this:

Share this:

Share this: