Inspiration

Organizing a class schedule is frustrating, and we wanted to make finding the "easy" gen eds less of a chore. For students who would rather challenge themselves, we also wanted an efficient way to gauge the general sentiment surrounding any course.

What it does

We scraped relevant comments about Purdue courses from Reddit and performed sentiment analysis on the top-voted comments. A Naive Bayes model trained on reviews from RateMyProfessor returns a quality, difficulty, and overall rating score for each course based on those comments. All of this is displayed in a user-friendly interface that lets users search for and pull up information on specific courses.

How we built it

The web interface uses a React front end and a Flask back end.

The Naive Bayes model was trained on reviews of professors that RateMyProfessor explicitly labels by difficulty and quality. Reviews with a difficulty score of 3 or less were labeled "easy"; all other reviews were labeled "hard". Labels for quality and overall rating, each of which is "awesome", "average", or "awful", were assigned in a similar fashion.
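The labeling rule above can be sketched in a few lines of Python. The difficulty rule comes straight from the description; the cutoffs for the three-way quality label are hypothetical, since the write-up only says they were assigned "in a similar fashion", and the function names are ours.

```python
def difficulty_label(score: float) -> str:
    """Map a numeric difficulty score to the binary label described above."""
    # Rule stated in the write-up: 3 or less is "easy", anything else is "hard".
    return "easy" if score <= 3 else "hard"


def quality_label(score: float, low: float = 2.0, high: float = 4.0) -> str:
    """Map a numeric quality or overall score to one of three labels.

    ASSUMPTION: `low` and `high` are hypothetical cutoffs chosen for
    illustration; the write-up does not state the real ones.
    """
    if score >= high:
        return "awesome"
    if score <= low:
        return "awful"
    return "average"


if __name__ == "__main__":
    for s in (1.0, 3.0, 4.5):
        print(s, difficulty_label(s), quality_label(s))
```

Keeping the cutoffs as parameters makes it easy to try different thresholds when the classes turn out to be imbalanced.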

Text data was transformed into a document-term matrix, which lets the model estimate the most probable class label from the frequency of the words that appear in a review. We then used that model to classify the sentiment of Reddit comments, which we hypothesized would follow patterns similar to those in RateMyProfessor reviews.
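As a rough illustration of this step, the sketch below builds a document-term matrix with sklearn's `CountVectorizer` and fits a `MultinomialNB` classifier on it. The training sentences are invented toy data, the choice of the multinomial variant is an assumption (the write-up only says "Naive Bayes"), and the helper names are ours, not the project's.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for labeled reviews (invented for illustration only).
TRAIN_TEXTS = [
    "easy class light homework",
    "easy exams and easy grading",
    "hard exams heavy workload",
    "hard homework and tough grading",
]
TRAIN_LABELS = ["easy", "easy", "hard", "hard"]


def train_difficulty_model(texts, labels):
    """Fit a word-count document-term matrix followed by Naive Bayes."""
    # CountVectorizer turns each text into a row of word counts;
    # MultinomialNB then learns per-class word frequencies from those rows.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    return model


def classify(model, comments):
    """Return the predicted label for each comment as plain Python strings."""
    return model.predict(comments).tolist()


if __name__ == "__main__":
    model = train_difficulty_model(TRAIN_TEXTS, TRAIN_LABELS)
    # Reddit comments would be passed in here in place of these examples.
    print(classify(model, ["easy class", "hard tough workload"]))
```

Wrapping both steps in one pipeline means new Reddit comments are vectorized with the same vocabulary that was learned from the training reviews.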

To ensure that only relevant Reddit comments were scraped, we kept only threads that mentioned specific course IDs or names and used only the comments with the most upvotes.
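A minimal sketch of that filter, operating on already-downloaded data rather than calling any Reddit API, might look like the following. The `Thread` and `Comment` classes, the `top_relevant_comments` function, and the choice to match on thread titles are all hypothetical; the project's actual matching logic is not described in detail.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Comment:
    body: str
    score: int  # net upvotes


@dataclass
class Thread:
    title: str
    comments: list = field(default_factory=list)


def top_relevant_comments(threads, course_id, top_n=5):
    """Return bodies of the highest-scored comments from threads about a course.

    `course_id` is expected as "SUBJECT NUMBER", e.g. "CS 180".
    """
    subject, number = course_id.split()
    # Match "CS 180", "cs180", "Cs  180", but not "CS 182" or "CS 1800".
    pattern = re.compile(
        rf"\b{re.escape(subject)}\s*{re.escape(number)}\b", re.IGNORECASE
    )
    relevant = [c for t in threads if pattern.search(t.title) for c in t.comments]
    relevant.sort(key=lambda c: c.score, reverse=True)
    return [c.body for c in relevant[:top_n]]


if __name__ == "__main__":
    threads = [
        Thread("Is CS 180 hard?", [Comment("Lots of projects", 42), Comment("meh", 1)]),
        Thread("Best dining hall", [Comment("Wiley", 99)]),
    ]
    print(top_relevant_comments(threads, "CS 180", top_n=1))
```

Filtering by thread first and ranking by score second keeps highly upvoted but off-topic comments out of the sentiment model's input.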

Challenges we ran into

- The web scraping portion
- Imbalances in the training data for the model

Accomplishments that we're proud of

- Web scraping course descriptions for 8,000 courses
- Web scraping Reddit for relevant course descriptions
- Creating a model that predicts the quality, difficulty, and overall rating of a course with around 70 percent accuracy for each label

What we learned

Josh: I learned how to use a Reddit API to scrape data from the site.

Arnob: I learned how to create global state using React context, and how to do web scraping.

Shelly: I learned how to use sklearn to convert textual data into a document-term matrix. This was also my first time implementing a Naive Bayes model for text classification.

What's next for BoilerLogs

In the future, we hope to add more courses to our course review catalog and improve our model's accuracy by training it on more data. We would also like to make the range for quality, difficulty, and overall rating more expansive, which would involve collecting more data.