Challenge 1
Embedding 1: An article that discusses notable or popular global records in art, music, or sports. Our second guess: an article that discusses shopping and great deals.
Embedding 2: A description of a financial “crisis” that a city faces or may soon face, and the events that have occurred or will occur.
Embedding 3: A medical/health-related news article that informs people about the current state of an epidemic. Potentially discusses how people or places are affected by the epidemic.
Embedding 4: Discusses a human-rights issue on which the government and the people have different opinions. Might additionally talk about protests and include opinions from notable activists within a human-rights movement.
Embedding 5: An article that talks about climate change and describes one of its consequences such as deforestation, biodiversity loss, an increase in natural disasters, or rising sea levels in detail.
Challenge 2
Repository: link
Problem Statement: Classify the source an article came from: Federal sources or CNN sources.
Preprocessing Techniques: We noticed that samples from Federal sources can be small and imbalanced when broken down into departments. To counteract the small sample size, we performed SMOTE or upsampled the minority classes to create balanced classes. We also split the samples into training and testing sets in an 80:20 ratio.
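A minimal sketch of the balancing-plus-split step, using simple upsampling with scikit-learn (the corpus, labels, and class counts below are hypothetical stand-ins for our data; SMOTE via the `imbalanced-learn` package would be a drop-in alternative for the upsampling part):

```python
from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy corpus: (text, label) pairs with an imbalanced "federal" class.
texts = ["doc %d" % i for i in range(100)]
labels = ["cnn"] * 80 + ["federal"] * 20

# Upsample the minority class (with replacement) until both classes match.
minority = [(t, l) for t, l in zip(texts, labels) if l == "federal"]
majority = [(t, l) for t, l in zip(texts, labels) if l == "cnn"]
upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
balanced = majority + upsampled

X = [t for t, _ in balanced]
y = [l for _, l in balanced]
print(Counter(y))  # both classes now have 80 samples

# 80:20 train/test split, stratified so each split keeps the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```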
Model Selection: We tried out nine different models and performed K-fold cross-validation to track each model's performance. From this experiment, we found that logistic regression performed best, as measured by F1 score. We also tried different regularization strengths and types, and found that the default value of C=1 gave the best results for logistic regression.
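The C sweep above can be sketched as follows; the synthetic features stand in for our real TF-IDF vectors, and the fold count of 5 is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in features; in the real pipeline these would be vectorized articles.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Compare regularization strengths with 5-fold CV, scored by F1.
for C in (0.01, 0.1, 1.0, 10.0):
    clf = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
    print(f"C={C}: mean F1 = {scores.mean():.3f}")
```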
Bonus Challenge
The wildest guess we could come up with: Twenty reasons why we should adopt a cat
Challenges we ran into
- We had trouble understanding what each of the embeddings meant. At first, our team wanted to build a model to recreate the embeddings as best we could. However, given the small amount of data provided, the limited number of API requests allowed, and the many possible customizations for preprocessing and vectorization, this idea was close to impossible to build. We also briefly thought about finding patterns in the embedding values and looking more into the behavior of the tanh function, but we decided that comparing the similarity between the embeddings was the best approach to take.
- We decided to use cosine similarity and variations of Euclidean distance to determine the best-matching embeddings for each mystery embedding. We then needed a way to find similarities within the top five articles returned for each mystery embedding. In essence, this boils down to a clustering problem, but we weren't familiar with clustering textual data. As a result, we tried out LDA, NMF, and LSA models for topic modeling.
- We also had trouble deciding which model to use for our Challenge 2 problem statement, since several models produced fairly similar accuracy and F1 scores.
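The similarity ranking described above can be sketched like this; the embedding matrix, dimensionality, and mystery vector are all hypothetical stand-ins for the challenge data:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

rng = np.random.default_rng(0)
article_embeddings = rng.normal(size=(50, 128))  # hypothetical: 50 known articles
mystery = rng.normal(size=(1, 128))              # one mystery embedding

# Rank articles by cosine similarity (higher = closer) ...
cos = cosine_similarity(mystery, article_embeddings)[0]
top5_cos = np.argsort(cos)[::-1][:5]

# ... and by Euclidean distance (lower = closer) as a cross-check.
dist = euclidean_distances(mystery, article_embeddings)[0]
top5_euc = np.argsort(dist)[:5]

print("cosine top 5:", top5_cos)
print("euclidean top 5:", top5_euc)
```

Comparing the two rankings is a quick sanity check: large disagreements usually mean the vectors differ mostly in magnitude rather than direction.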
Accomplishments that we're proud of
- We're a team spread across multiple regions: Malaysia, Iowa, and New York. We are proud to have been able to work together virtually across vastly different time zones.
- We're proud of finding answers to these challenges. They may be incorrect, but we gave it our best shot!
What we learned
- We learned that LDA and NMF only find locally optimal solutions, since both have non-convex cost functions. We were super confused when we got different topics from multiple runs of the same algorithm.
- We learned about preprocessing techniques for textual data, such as stemming, lemmatization, removing stopwords, etc.
- We learned about the importance of word embeddings and how to go from a piece of text to an embedding from different vectorizers.
What's next for Bloomberg Challenge
- We thought it would be cool to build a knowledge graph representing the relationships between the entities discussed in each of these articles. It could potentially provide leads into similar/related topics!
- Exclude specific parts of speech from topic modeling, such as proper nouns