Overview

Performing quantitative sentiment analysis of Tweets made by U.S. Senators in the 117th Congress provides an empirical means of understanding the way our Senators communicate. Moreover, it offers insight into how the sentiment of public officials has changed over time in America's increasingly polarized political climate. Studying the polarity of each Tweet's sentiment in relation to its public interaction metrics can also illuminate whether certain sentiments garner more interaction from Twitter users.

Explore the project on my website, ianmacdonald.me!

Inspiration

Inspiration for this project came primarily from two sources. The first was my own background knowledge of similar sentiment analysis projects, especially those examining the notorious, now-banned Twitter account of Donald Trump. The conclusions drawn by those analyses always seemed quite insightful (if not comical), and I have wanted to experiment with the same workflow of Twitter scraping and sentiment analysis ever since seeing them. The second came prior to the start of Cypher VII, when my hallmate Andrew suggested analyzing Twitter posts relating to the ongoing Supreme Court Justice confirmation hearings. I credit him with returning the idea of political Twitter analysis to the forefront of my mind. The original direction of Senatorial Sentiment was to contrast Senators' Twitter posts during their campaigns with those made during their terms, exploring whether they used a heightened level of positive or negative rhetoric. This eventually shifted into the present form of the project, which provides a more overarching analysis of recent Senatorial Twitter trends.

Acquiring the Tweets

  • Congressional Twitter URLs for all 100 Senators of the 117th Congress are sourced from a dataset compiled by UCSD political scientists.
  • The corresponding user IDs are looked up using Twitter API v2.
  • The Python Tweepy module is used to efficiently scrape the most recent several thousand Tweets from each Senator.
    • These Tweets are later processed to remove any retweets as they are not indicative of the Senator's own Tweeting patterns.
  • Each Tweet is stored as an object containing the Tweet ID, the text, the creation date, and the public interaction metrics such as likes and retweets.
  • Every politician is assigned a list containing their respective Tweet objects as well as their state and party affiliation before being written externally in a JSON format to reduce further API calls when performing subsequent analysis.
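
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names, the record field names, and the JSON layout are all hypothetical. The Tweepy import sits inside `fetch_recent_tweets` so that the storage helpers can be run without credentials or network access.

```python
import json


def fetch_recent_tweets(bearer_token, username, limit=3200):
    """Pull a user's most recent Tweets via Twitter API v2 (needs credentials)."""
    import tweepy  # third-party; imported lazily so the helpers below work without it

    client = tweepy.Client(bearer_token=bearer_token, wait_on_rate_limit=True)
    user = client.get_user(username=username).data  # resolve handle to a user ID
    pages = tweepy.Paginator(
        client.get_users_tweets,
        user.id,
        tweet_fields=["created_at", "public_metrics", "referenced_tweets"],
        max_results=100,  # largest page size this endpoint accepts
    )
    records = []
    for tweet in pages.flatten(limit=limit):
        refs = tweet.referenced_tweets or []
        records.append({
            "id": tweet.id,
            "text": tweet.text,
            "created_at": tweet.created_at.isoformat(),
            "likes": tweet.public_metrics["like_count"],
            "retweets": tweet.public_metrics["retweet_count"],
            # Flag retweets now so they can be removed in a later pass.
            "is_retweet": any(ref.type == "retweeted" for ref in refs),
        })
    return records


def drop_retweets(records):
    """Keep only Tweets the account authored itself."""
    return [r for r in records if not r["is_retweet"]]


def build_senator_entry(name, state, party, records):
    """Bundle a Senator's metadata with their retweet-free Tweet list."""
    return {"name": name, "state": state, "party": party,
            "tweets": drop_retweets(records)}


def save_archive(entries, path):
    """Write all entries to disk so later analysis needs no further API calls."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(entries, fh, indent=2)


def load_archive(path):
    """Read the archive back for analysis."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Separating the network call from the filtering and storage helpers means the API is hit once per Senator, and every later analysis step can work from the JSON file alone.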

Analysis Methodology

The Natural Language Toolkit (NLTK) is a tool for classifying, tokenizing, parsing, and otherwise processing language data. The NLTK Python module contains a variety of sentiment analysis tools, including ones built with social media text in mind, which discern the tone and sentiment of a string of text. Four scores are returned for each string passed to the analyzer: positive, negative, neutral, and compound. Based on the phrasing, word choice, and sentence structure, the module approximates how much of the sentiment of the string (the Tweet, in this case) is positive vs. negative vs. neutral. A single compound score then summarizes the overall sentiment of the string.

Prior to this process, any URLs are filtered out of the Tweets via regex, as the NLTK works most effectively with natural speech rather than computer-specific text strings. The compound scores are then summed and divided by the number of Tweets downloaded from each Senator (~3,000) to provide an overall sentiment score for the Senator's Twitter presence.
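
A rough sketch of this cleaning and averaging step is shown below. It assumes the four-score analyzer is NLTK's VADER `SentimentIntensityAnalyzer`, which the description matches but which is not named above; the URL pattern and all function names are illustrative. The scorer is passed in as a parameter so the averaging logic can be checked without downloading the VADER lexicon.

```python
import re

# Assumed pattern: anything starting with http:// or https:// up to the next space.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove http(s) links and collapse the whitespace they leave behind."""
    return " ".join(URL_PATTERN.sub("", text).split())


def make_vader_scorer():
    """Return a text -> compound-score function backed by NLTK's VADER analyzer."""
    import nltk  # third-party; imported lazily
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
    analyzer = SentimentIntensityAnalyzer()
    # polarity_scores returns a dict with "neg", "neu", "pos", and "compound".
    return lambda text: analyzer.polarity_scores(text)["compound"]


def average_compound(texts, scorer):
    """Mean compound score over URL-stripped texts; None if there are no texts."""
    scores = [scorer(strip_urls(t)) for t in texts]
    return sum(scores) / len(scores) if scores else None
```

With NLTK installed, `average_compound(tweet_texts, make_vader_scorer())` would give one Senator's overall score, where `tweet_texts` is a hypothetical list of that Senator's Tweet strings.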

Challenges During the Project

Twitter's own API limits interfere with compiling a comprehensive chronological archive of each Senator's Twitter presence, as only the 3,200 most recent Tweets from each user can be acquired. This results in an incomplete archive of the Senators' Tweets and introduces inconsistencies in the point in time from which each Senator is analyzed. Political and world events which overlapped with the timelines of only some Senators are therefore unaccounted for in the above analysis.
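
One way to make this inconsistency visible is to compute the time span each archive actually covers. The sketch below is illustrative only: it assumes each Tweet record carries an ISO 8601 `created_at` string and that the archive is a dictionary mapping a Senator's name to their list of records, neither of which is confirmed by the project.

```python
from datetime import datetime


def coverage_window(tweets):
    """Return (earliest, latest) creation times for one Senator's Tweets."""
    stamps = [datetime.fromisoformat(t["created_at"]) for t in tweets]
    return min(stamps), max(stamps)


def shared_window_start(archive):
    """Earliest moment from which every non-empty archive has coverage."""
    return max(coverage_window(tweets)[0]
               for tweets in archive.values() if tweets)
```

Restricting an analysis to Tweets created on or after `shared_window_start(archive)` would compare all Senators over the same period, at the cost of discarding older Tweets from less active accounts.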

Furthermore, it took Twitter several hours longer than anticipated to approve the request for an elevated developer account, leading to delays in testing the bulk Tweet scraping functionality.

Accomplishments Worth Celebrating

  • Successfully interfacing with Twitter API v2 to pull over a quarter of a million Tweets combined from all the current U.S. Senators
  • Storing this collection of Tweets in an easily readable and writable serialized JSON format to reduce API calls (~2.8 million lines of text!)
  • Discerning meaningful data from each Tweet to be analyzed and visualized using Tableau
  • Creating a project which provides meaningful statistics and information to the public regarding a highly germane subject (especially when considering the widespread notion of increasing political polarization)

Lessons Learned

Twitter's API has the capability to be good, but it also can be quite infuriating. Due to a lack of familiarity with web APIs in general, it took a substantially larger amount of time than originally planned to script the functions which interfaced with the API to find Tweets from specific users while filtering out retweets and other irrelevant information.

Tableau is an absolutely amazing piece of software for visualizing data, and it has a ton of features to be aware of. Certain features discovered midway through the project (such as using calculated fields to normalize data ranges) made the process way easier. With proper consideration for design theory and aesthetics, graphs and charts can quickly go from underwhelming to exceptional.

The Future of Senatorial Sentiment

Senatorial Sentiment could be taken in a lot of future directions if continued beyond its original scope as a hackathon project.

  • Use more advanced scraping tools which bypass the Twitter API limitations to acquire several thousand more Tweets per Senator
    • A complete chronology would allow the project to tackle its original question of "Do political figures use more highly charged (positive or negative) sentiments on their Twitter when campaigning?" as each dataset would be guaranteed to include that time period
  • Expand the project to include members of the House of Representatives as well, who were already included in the dataset of Twitter handles
    • This change would incorporate over a million additional Tweets into the dataset and could help clarify questions about how geography or partisan affiliation impact the polarization of sentiment
  • Broadcast the project to the public so that derivative works and analyses can be performed with these tools and datasets, with instructions available on the Senatorial Sentiment GitHub