Inspiration

In an era where media consumption plays a pivotal role in shaping opinions, our project is inspired by the need to address media bias and its implications for individual perspectives. Concerns about misinformation and polarization motivated us to create a tool that gives users a more nuanced understanding of news articles. Our goal is to combat the echo chamber effect, where individuals are exposed only to information that aligns with their existing beliefs. By labeling news articles as liberal, conservative, or central, our app aims to break through filter bubbles and provide users with a broader spectrum of perspectives. We aspire to enhance media literacy, encourage critical thinking, and foster open dialogue by offering a platform that promotes a more comprehensive and balanced consumption of news. Through this project, we hope to contribute to a more informed and interconnected society, where individuals are equipped to navigate the complexities of the media landscape, make conscious decisions about the information they consume, and engage in constructive conversations that transcend ideological boundaries.

What it does

The user searches for a topic, and our website gathers matching articles from popular news sites (Fox News, ABC News, The New York Times, and the New York Post). Each article is accompanied by a probability rating indicating how strongly it leans Republican or Democrat; if an article falls close to neutral, it is labeled as unbiased.

How we built it

Our project uses Natural Language Processing to classify the articles. Our training data comes from Reddit, specifically the /r/Democrats and /r/Republicans subreddits. We began by using the Reddit API via PRAW to scrape posts from these two subreddits, attempting to pull 1000 posts per subreddit for a total of 2000. To give the model as much signal as possible, we combined each post's title and body into a single text sample. We then used the Pandas library to convert the scraped data into a dataframe containing the combined text and the source subreddit, with the subreddit binarized so the model can treat it as a label. The dataset is split into train and test partitions. To clean and tokenize the text, we used CountVectorizer; for the model, we chose LogisticRegression, which is well suited to binary classification. The model's accuracy was 67%, well above the 50% baseline. To make the output more trustworthy, we classify predicted probabilities in the 40% to 60% range as unbiased; a rough sketch of the whole pipeline follows.
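A minimal sketch of that pipeline, assuming hot-post listings, an 80/20 split, and placeholder API credentials; the helper names and exact parameters are our illustration, not the team's actual code:

```python
import praw
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="news-spectrum",
)

def pull_posts(subreddit_name, label, limit=1000):
    """Combine each post's title and body into one text sample."""
    return [
        {"text": f"{post.title} {post.selftext}", "label": label}
        for post in reddit.subreddit(subreddit_name).hot(limit=limit)
    ]

# Binarized labels: 0 = /r/Democrats, 1 = /r/Republicans
df = pd.DataFrame(pull_posts("Democrats", 0) + pull_posts("Republicans", 1))

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

vectorizer = CountVectorizer(stop_words="english")
model = LogisticRegression(max_iter=1000)
model.fit(vectorizer.fit_transform(X_train), y_train)
print("accuracy:", model.score(vectorizer.transform(X_test), y_test))

def classify(text):
    """Map the predicted probability to a label; 40-60% counts as unbiased."""
    p = model.predict_proba(vectorizer.transform([text]))[0, 1]
    if 0.4 <= p <= 0.6:
        return "unbiased", p
    return ("conservative" if p > 0.6 else "liberal"), p
```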

For gathering the articles, we used web scraping with Beautiful Soup and Selenium. The sites we scraped are Fox News, ABC News, The New York Times, and the New York Post. From each article we scraped three kinds of data: the title, the body, and the href link. Since the title and body are what a site emphasizes to readers, we decided they were the data that would work best with the classification model; a simplified version of one scraper is sketched below.
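A simplified sketch of one of the scrapers, assuming Fox News search as the target; the URL pattern and CSS selectors are illustrative guesses, since every site needs its own selectors:

```python
from urllib.parse import quote_plus

from bs4 import BeautifulSoup
from selenium import webdriver

def scrape_fox(keyword):
    """Return the title and href of articles matching a search keyword."""
    driver = webdriver.Chrome()  # Selenium renders JavaScript-heavy pages
    try:
        driver.get("https://www.foxnews.com/search-results/search?q="
                   + quote_plus(keyword))
        soup = BeautifulSoup(driver.page_source, "html.parser")
    finally:
        driver.quit()

    articles = []
    for item in soup.select("article.article"):  # hypothetical selector
        link = item.select_one("a")
        if link is not None and link.get("href"):
            articles.append({
                "title": link.get_text(strip=True),  # headline text
                "href": link.get("href"),            # link to the full story
            })
    return articles
```

Fetching each article's body works the same way: follow the href, then pull the text out of the story's body container.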

For the front end, we used JavaScript, HTML, and CSS. The plan was that, once the search button was clicked, the JavaScript would call the Python code with the user's input as the keyword parameter, which would run the web scraper and then pass the returned articles to the classification model; one possible wiring is sketched below.
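One standard way to do that wiring (not the setup we finished during the hackathon) is to expose the Python pipeline as a small Flask endpoint that the page's JavaScript can call; scrape_fox and classify refer to the hypothetical helpers sketched earlier:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/search")
def search():
    """Scrape articles for the keyword and attach a bias label to each."""
    keyword = request.args.get("q", "")
    results = []
    for article in scrape_fox(keyword):
        label, prob = classify(article["title"])
        results.append({**article, "label": label, "probability": prob})
    return jsonify(results)

if __name__ == "__main__":
    app.run(port=5000)
```

On the page, the search button's handler would then simply call fetch("/search?q=" + keyword) and render the returned JSON.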

Challenges we ran into

This was the first hackathon for two of our members and the second for the other two. Prior to the hackathon, none of us had experience with web scraping or front-end development, so we had to learn and implement both during the event. We also had difficulty calling a Python function from JavaScript code.

Accomplishments that we're proud of

We are proud to have a working Natural Language Processing model with a fair level of accuracy, and to have successfully scraped article data from multiple news sites.

What we learned

Front-end development, web scraping, and text classification with LogisticRegression.

What's next for News Spectrum

We hope to increase the accuracy of our model. One way might be to analyze the context in which a word appears rather than relying on raw word frequencies; a lightweight first step is sketched below. We also want to make the web scraping process faster and more efficient. For the website, we could add more features, such as sorting the articles by the probability that they are biased towards a certain ideology.
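One possible first step toward context-awareness (our assumption, not a committed plan) is to swap the raw unigram counts for TF-IDF over word n-grams, so short phrases are scored as units rather than independent words:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Drop-in replacement for the CountVectorizer in the training sketch above.
vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams, e.g. "second amendment"
    stop_words="english",
    max_features=50_000,   # cap vocabulary size to keep the model small
)
```

A larger step would be to use contextual embeddings (e.g., a pretrained transformer) instead of bag-of-words features.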
