Introduction

QuestionMatch is an NLP-driven program that detects, labels, and sorts questions from a large volume of messages, in order to connect them with answers. It can be easily integrated into a variety of platforms and contexts to boost efficiency.

In recent years, the popularity of online communities has skyrocketed. Platforms like Discord and Reddit are now widely used by individuals, organizations, and businesses alike to connect with their respective communities.

In particular, we envisioned QuestionMatch as a way to help students and schools. Each year, almost 4 million students enter college for the first time, full of questions and eager to meet their new peers. Before move-in day, online communities are often one of the primary sources of interaction and information. For advisors and older students willing to help, navigating this onslaught of questions can be both chaotic and difficult.

Inspiration

These past few months, all three of us witnessed that chaos firsthand. Questions on topics ranging from academics to logistics were often repeated, lost among other messages, or missed by the relevant people. As a result, we were inspired to create QuestionMatch to streamline this process: it identifies questions, labels them by topic, and groups them together in an organized, accessible way.

How we built it

QuestionMatch utilizes Python’s NLTK and Gensim libraries. The process consists of four steps: identification, processing, scoring, and labeling.

Steps

The first challenge in identification was accounting for the informal register of online communities: people often ask questions without question marks, append question marks to non-questions, and make unconventional grammatical choices. To address this, we used the NLTK nps_chat corpus to train a simple Naive Bayes classifier that decides whether a chat message is a question (83% accuracy on our test set). In addition, we hard-coded recognition of common interrogative phrases and added a minimum-length filter to boost accuracy.

Question identification

Following question identification, we processed the input message. We used NLTK’s tokenizer to split each question into tokens, then filtered out uninformative “stop words”. We then ran a modified version of the Lesk algorithm to determine each remaining keyword’s sense.

Question processing

Finally, to compare the keywords with our set topics, we used Sense2Vec, a variant of Gensim’s Word2Vec model that better accounts for word context. Using the model, we computed a semantic similarity score between the keyword set and each topic by averaging the individual keyword-topic scores. If a topic’s score exceeded a set threshold, we included that topic as a label for the question.
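The scoring step reduces to averaging per-keyword similarity scores and applying a cutoff. The sketch below shows that logic with hand-made toy vectors and cosine similarity in place of real Sense2Vec embeddings; the vectors, topics, and threshold value are all illustrative:

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy embeddings standing in for Sense2Vec vectors (hypothetical values)
VECTORS = {
    "dorm": [0.9, 0.1], "roommate": [0.8, 0.2],
    "housing": [1.0, 0.0], "academics": [0.0, 1.0],
}
TOPICS = ["housing", "academics"]   # illustrative topic set
THRESHOLD = 0.5                     # placeholder; the real cutoff was tuned

def topic_score(keywords, topic):
    # Average the keyword-topic similarities; unknown words are skipped
    sims = [cosine(VECTORS[k], VECTORS[topic]) for k in keywords if k in VECTORS]
    return sum(sims) / len(sims) if sims else 0.0

def labels(keywords):
    # Keep every topic whose averaged score clears the threshold
    return [t for t in TOPICS if topic_score(keywords, t) >= THRESHOLD]
```

With these toy vectors, `labels(["dorm", "roommate"])` returns `["housing"]`, since the averaged similarity to "housing" clears the threshold while "academics" does not.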

Implementation

To demonstrate QuestionMatch in a live environment, we integrated it into TigerBot, the bot previously developed for use in the Princeton 2026 server.

What we learned

Ultimately, during this project our team (all new to NLP) learned a lot about the intricacies of natural language processing and machine learning, particularly NLTK and Gensim. Tasks that initially seemed easy often became difficult after various nuances were considered. Developing QuestionMatch was an interesting exploration of the potential of computing.

Beyond expanding to more platforms, there are several ways we’d like to improve QuestionMatch given more time. Trying models more complex than Naive Bayes and Sense2Vec, such as neural networks, could yield greater accuracy in the long run. With looser time constraints, users could also train the models on more specific datasets, such as chat histories from their own communities.

Built With

Python, NLTK, Gensim
