Inspiration

We found our first dataset on Kaggle. The topic hit close to home for many of us because healthcare in the US is expensive, and we wanted to be able to predict how much treatment for certain conditions would cost.

Our second dataset was more of a recreational pick. Who doesn't love to sit down to a nice game of chess?

What it does

Our code analyzes the different chess openings that can occur and the chances you have of winning as a result of each one.

How we built it

We started by preprocessing the two datasets separately, since they dealt with very different domains — healthcare and chess. For the chess dataset, we dropped several features that didn’t contribute meaningful predictive value, such as opening_eco (which was just an identifier for chess openings), as well as id, white_id, and black_id. We also cleaned time-based columns like start_time and end_time, since they weren’t relevant for our win prediction task. Once cleaned, we one-hot encoded categorical variables and standardized the relevant numeric features to prepare for model training.
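
The chess preprocessing steps above can be sketched roughly like this (column names such as `opening_eco`, `id`, `white_id`, `black_id`, `start_time`, and `end_time` come from our dataset; the `winner` target name and the exact pipeline details here are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess_chess(df: pd.DataFrame) -> pd.DataFrame:
    # Drop identifier and time columns that carry no predictive signal.
    df = df.drop(
        columns=["opening_eco", "id", "white_id", "black_id",
                 "start_time", "end_time"],
        errors="ignore",
    )
    # One-hot encode the remaining categorical columns (except the target).
    categorical = [c for c in df.columns
                   if df[c].dtype == "object" and c != "winner"]
    df = pd.get_dummies(df, columns=categorical)
    # Standardize numeric features to zero mean and unit variance.
    numeric_cols = df.select_dtypes("number").columns
    df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])
    return df
```
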

Challenges we ran into

One of the biggest challenges we faced was with the healthcare dataset. The treatment cost data was heavily grouped by disease type, which caused issues with generalization. For example, while "cancer" was listed as a single category, the actual costs varied widely depending on the type and severity of the cancer. However, the model would often just learn an average value and assume all cancer cases fell somewhere in the middle, leading to poor predictions for extreme cases. This made it clear how important more granular data is in real-world applications.

On the chess side, we also had to deal with a large number of unique openings and ensure that the model didn’t overfit to rare ones. Simplifying and encoding these effectively took some careful preprocessing.
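
One simple way to keep rare openings from blowing up the encoding is to lump anything below a frequency threshold into a single "Other" bucket before one-hot encoding. A minimal sketch (the `opening_name` column and the `min_games` cutoff are assumptions for illustration):

```python
import pandas as pd

def collapse_rare_openings(df: pd.DataFrame, col: str = "opening_name",
                           min_games: int = 50) -> pd.DataFrame:
    # Count how often each opening appears, then lump rare ones together
    # so one-hot encoding doesn't create hundreds of near-empty columns.
    counts = df[col].value_counts()
    rare = counts[counts < min_games].index
    df = df.copy()
    df.loc[df[col].isin(rare), col] = "Other"
    return df
```
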

Accomplishments that we're proud of

We got started with data — and that’s something we’re proud of. Even though it might feel like we didn’t accomplish much on the surface, taking that first step and getting hands-on experience was a huge milestone.

We created a basic healthcare cost model that helped us understand the pipeline from data preprocessing to training, even if the result was something as simple as a line of best fit. More excitingly, we built a chess prediction model that reached 93% accuracy, and that felt like a real win. We’re proud of what we learned and what we built — and even more excited for what comes next.
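
That "line of best fit" amounts to a simple linear regression. A toy sketch of the idea (the age-vs-cost numbers below are made up for illustration and are not our actual dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-in for the healthcare data: patient age vs. treatment cost.
ages = np.array([[25], [35], [45], [55], [65]])
costs = np.array([2000.0, 3500.0, 5000.0, 6500.0, 8000.0])

# Fit a line of best fit and predict the cost for a new age.
model = LinearRegression().fit(ages, costs)
predicted = model.predict([[50]])
```
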

What we learned

This project gave us our first real look into working with data, and we’re proud of how far we’ve come. We learned how to explore datasets, clean them, and — most importantly — evaluate whether a dataset is suitable for predictive modeling.

We also gained hands-on experience with one of the most important parts of data science: turning raw, messy data into something a model can understand. From encoding categorical variables to dropping irrelevant columns and starting feature engineering, we now feel much more confident in handling the early stages of any data project.

In addition, we learned how to go beyond just accuracy by using other evaluation metrics — such as precision, recall, and F1 score — to test the robustness of our models and better understand where they succeed or fall short.
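
Computing those metrics is a few lines with scikit-learn. A small sketch with hypothetical labels (1 = white wins, 0 = otherwise; the toy predictions are illustrative):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground truth and model predictions for eight games.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
prec = precision_score(y_true, y_pred)  # of predicted wins, how many were real
rec = recall_score(y_true, y_pred)      # of real wins, how many were caught
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
```

Looking at precision and recall separately shows whether a model is over-predicting or under-predicting the positive class, which plain accuracy hides.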

What's next for From Scrubs to Squares

Next, we want to keep exploring more datasets and keep building. For the chess model, we’re excited to dig deeper into feature engineering to boost accuracy. One challenge we ran into was the moves column — while it holds valuable information, one-hot encoding all possible sequences would make the data explode in size.

A possible solution we’re considering is to extract and encode just the first few moves of each game. These opening moves are often the most strategic and could have a strong correlation with game outcomes, without overwhelming the model. We’re excited to try it out and see how much more accurate we can get.
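
A rough sketch of that idea, assuming `moves` is a space-separated move string (as in the Lichess-style data we used) and keeping only the first `n` moves as separate columns before one-hot encoding:

```python
import pandas as pd

def encode_first_moves(df: pd.DataFrame, n: int = 4) -> pd.DataFrame:
    # Split the move string and keep only the first n moves, giving each
    # its own column (move_1 ... move_n) so one-hot encoding stays small.
    df = df.copy()
    first = df["moves"].str.split().str[:n]
    for i in range(n):
        df[f"move_{i + 1}"] = first.str[i]
    return pd.get_dummies(df.drop(columns=["moves"]),
                          columns=[f"move_{i + 1}" for i in range(n)])
```
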

Additionally, we plan to continue practicing our data science skills by working with a variety of different datasets. The more we explore, the better we’ll get at asking the right questions, cleaning messy data, and building meaningful models.
