Inspiration

The Olympic Games are a symbol of hope, courage, and determination. The athletes who participate in these games represent the spirit of humanity, pushing the limits of what is possible and inspiring us all to be better. With this in mind, we embarked on a journey to understand the Olympics from a data perspective, and to uncover the hidden stories behind the games.

Our team used the collection of data from 120 years of Olympic history and analyzed it to gain insights into this world of sports. We wanted to see what sets the medalists apart from the rest of the competitors, and what factors contribute to their success.

We trained multiple models on the data, including Logistic Regression, Decision Tree, Naive Bayes, K-Nearest Neighbors, Random Forest, and SVM classifiers, as well as K-Means clustering. Each of these models helped us uncover various aspects of the data and gave us a more complete understanding of what makes a successful Olympic athlete.

But we did not stop there. We also built a tool that analyzes the features of athletes and compares them to the other Olympic competitors. With this tool, you can see how similar your features are to the athletes who have competed and get a sense of what it takes to be a champion.

At the heart of our project is a message of hope and inspiration. Our goal is to inspire and motivate individuals to work towards their own athletic goals, as the data demonstrates that anything is possible with determination and hard work.

Results/Analysis

Classifier Analysis

Decision Tree: The accuracy of 0.8180 is decent, but not the highest among the models tested. This could be because decision trees are prone to overfitting, meaning they may not generalize well to unseen data. Additionally, decision trees may struggle to capture complex relationships between features and target outputs.

Naive Bayes: The accuracy of 0.8640 is very good, particularly given the simplicity of the Naive Bayes algorithm. Naive Bayes is a fast and efficient algorithm based on Bayes' theorem that assumes the features are independent of one another given the class. When this assumption holds reasonably well, Naive Bayes can perform well. It is often underrated, yet with limited computing resources it gives some of the other, far more resource-demanding classifiers a run for their money.

K-Nearest Neighbors: The accuracy of 0.8598 is also very good, although not quite as high as Naive Bayes. KNN is a non-parametric algorithm that makes predictions based on the closest neighbors in the feature space. It can be effective in cases where there is a clear and well-defined notion of similarity between samples, as with the athlete features in the Olympic dataset.

Random Forest: The accuracy of 0.8555 is good, but not as high as some of the other models. Random Forest is an ensemble algorithm that combines multiple decision trees to produce a final prediction. The idea behind this approach is that by combining multiple trees, the algorithm can reduce the overfitting that can occur with individual trees. By itself, random forest did a good job.

K-Means: The accuracy of 0.2014 is very low, which is not surprising. K-Means is an unsupervised clustering algorithm and is not well suited to prediction tasks like this one. We tried to cluster the data into 4 clusters (Gold, Silver, Bronze, and None), but the data varied in density, since a specific set of athletic features does not guarantee a specific medal outcome. This may point to an element of randomness in Olympic results.

SVM: The accuracy of 0.8668 is exceptionally good, and it is not surprising that SVM performed well on this dataset. SVM is a powerful and versatile algorithm that can handle both linear and non-linear decision boundaries. Also, SVM can be effective when there are a few features and many samples, as was the case with the Olympic dataset.

LightGBM: The accuracy of 0.8724 is excellent, and it is the highest accuracy among all the models tested. LightGBM is a gradient boosting algorithm designed to handle large datasets and has been shown to be effective in various applications. It uses gradient-based one-side sampling and exclusive feature bundling to make training more efficient, so LightGBM is likely to be a desirable choice for future prediction tasks involving similar data.

Multi-Layer Perceptron: The accuracy of 0.8670 is very good, and it is not surprising that a neural network performed well on this dataset. Neural networks are powerful algorithms that can capture complex relationships between features and target outputs. However, in our case it was resource-demanding given the wide range of data, and it is probably not the best choice for real-time prediction.

How we built it

Olympic medal classification

We built this machine learning project in Python, using the Pandas, Matplotlib, scikit-learn (Sklearn), and LightGBM libraries to analyze the Olympic dataset.

We trained and tested several classifiers such as Logistic Regression, K-Nearest Neighbors, Decision Trees, Random Forest, Support Vector Machines, LightGBM, Multi-layer Perceptron, Gaussian Mixture Model, and Naive Bayes to predict the medals won by athletes based on their features.
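To illustrate the train-then-test loop behind the accuracy figures, here is a minimal sketch using a hand-written K-Nearest Neighbors classifier. It is not our project code: the project used scikit-learn, while this version uses only the standard library so it runs anywhere. The function names and the toy rows (height in cm, weight in kg, age) are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Return the majority label among the k training rows closest to query."""
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of held-out rows whose predicted label matches the true label."""
    correct = sum(
        knn_predict(train_X, train_y, x, k) == y
        for x, y in zip(test_X, test_y)
    )
    return correct / len(test_y)


# Hypothetical toy rows: (height cm, weight kg, age). Not taken from the dataset.
train_X = [(190, 85, 24), (188, 82, 26), (192, 88, 23),
           (165, 60, 30), (168, 62, 19), (163, 58, 33)]
train_y = ["Medal", "Medal", "Medal", "None", "None", "None"]
test_X = [(189, 84, 25), (166, 61, 28)]
test_y = ["Medal", "None"]

score = accuracy(train_X, train_y, test_X, test_y)
print(f"accuracy: {score:.4f}")
```

On real data the features would normally be scaled first, since raw distances let the feature with the largest numeric range dominate.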

We also used feature selection to determine the most important features in the dataset and improve the accuracy of our predictions. Using formal reasoning and a correlation map, we found that ID, Name, Games, City, Team, Year, and Event were not strong contributors to the prediction model.
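The idea behind a correlation map can be sketched in a few lines. This is an illustrative example with made-up values, not our actual analysis: it computes the Pearson correlation of each numeric feature with a 0/1 medal indicator and ranks the features by absolute correlation. The helper names `pearson` and `rank_features` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_features(features, target):
    """Sort (name, correlation) pairs by absolute correlation, strongest first."""
    scored = [(name, pearson(values, target)) for name, values in features.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Made-up values for six athletes; 1 means the athlete won a medal.
features = {
    "ID": [1, 2, 3, 4, 5, 6],
    "Height": [170, 180, 175, 190, 185, 165],
}
medal = [0, 1, 0, 1, 1, 0]

ranked = rank_features(features, medal)
for name, r in ranked:
    print(f"{name}: {r:+.3f}")
```

In this toy data, an arbitrary row identifier such as ID correlates only weakly with the target, which is the kind of signal that suggests dropping a column.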

Olympic Standard Analyzer

We used HTML, CSS, and Python to develop a website that delivers the user experience side of this project. As a user, you can not only review the analysis from our ML project but also get insight into how your athletic standard compares to that of the world champions. This is to show that Olympians are not out of reach for the general public: for example, the ages in this data set range from 11 to 71 years old.

Everything is hosted on GitHub, including this website (huge thanks to domain.com for the free domain).

Challenges we ran into

Data parsing and normalization: Parsing the Olympic dataset was straightforward, but normalizing the data was challenging due to the presence of NaN values. To handle this, we used various data imputation techniques to fill in missing values in a meaningful way.
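As one example of imputation, the sketch below fills a missing value with the median of the athlete's group. The choice of technique is an assumption for illustration: the write-up does not specify which methods we used. The column names, the grouping by Sex, and the helper `impute_group_median` are likewise hypothetical.

```python
from collections import defaultdict
from statistics import median


def impute_group_median(rows, column, group_key):
    """Return copies of rows with missing (None) values in `column`
    replaced by the median of that column within the row's group."""
    observed = defaultdict(list)
    for row in rows:
        if row[column] is not None:
            observed[row[group_key]].append(row[column])
    medians = {group: median(values) for group, values in observed.items()}

    filled = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        if new_row[column] is None:
            # Stays None if the whole group has no observed value.
            new_row[column] = medians.get(new_row[group_key])
        filled.append(new_row)
    return filled


# Made-up rows; None marks a missing (NaN) height.
rows = [
    {"Sex": "M", "Height": 180},
    {"Sex": "M", "Height": None},
    {"Sex": "M", "Height": 190},
    {"Sex": "F", "Height": 165},
    {"Sex": "F", "Height": None},
    {"Sex": "F", "Height": 170},
]

filled = impute_group_median(rows, "Height", "Sex")
print([row["Height"] for row in filled])
```

Grouping before taking the median keeps the filled value closer to comparable athletes than a single global median would.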

Determining data validity: One of the biggest challenges was determining how much data to use. Older data may not be as relevant in today's world, so it was crucial to determine a threshold for the data's validity. We decided to use data only as far back as a certain cutoff year, and used trial and error together with the number of athletes per year to determine the best cutoff.
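A rough sketch of that cutoff process follows. It assumes the cutoff drops editions older than the chosen year. The rows and helper names are made up for illustration, and the actual cutoff year we chose is not shown here.

```python
from collections import Counter


def rows_per_year(rows):
    """Count how many athlete rows each Games year contributes."""
    return Counter(row["Year"] for row in rows)


def apply_cutoff(rows, cutoff_year):
    """Keep only rows from cutoff_year onward (assumed direction of the cutoff)."""
    return [row for row in rows if row["Year"] >= cutoff_year]


# Made-up rows for illustration only.
rows = [
    {"Year": 1900}, {"Year": 1900},
    {"Year": 1996},
    {"Year": 2000}, {"Year": 2000},
    {"Year": 2016},
]

counts = rows_per_year(rows)
recent = apply_cutoff(rows, 1996)
print(sorted(counts.items()))
print(len(recent), "rows kept")
```

Inspecting the per-year counts first makes it easier to see which candidate cutoffs would discard too much data before retraining and comparing accuracy.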

Country names: Another challenge was the inconsistent naming of countries, particularly when it came to the "Team" feature. Some countries had changed their names or combined with others over time, which made it difficult to categorize the data. We found that the National Olympic Committee (NOC) 3-letter code was a reliable indicator of a country and used this instead of the "Team" feature.

Multicollinearity: In some cases, we found that features were highly correlated with each other, leading to multicollinearity issues. This was particularly evident when using a neural network, where it resulted in poor performance. To overcome this challenge, we chose to use a multi-layer perceptron (MLP) as our classifier.

Feature selection: Finally, we faced the challenge of determining which features were most important for accurate predictions. This involved a trial-and-error process of trying different feature combinations and evaluating their impact on the classifier's accuracy.

Accomplishments that we're proud of

We ended up with 224,864 unique rows of data after cleaning. With the use of feature-rich data and machine learning algorithms such as LightGBM, we achieved an impressive accuracy rate of 87%. More specific details on each classifier are in the Results/Analysis section.

Additionally, we created a full user interface to show that data is not only an object of analysis but can also be interactive for each user. This truly brought our work with the dataset to life, as we were able to repurpose it to show the percentiles of height, age, and weight based on a user's personal specifications.
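The percentile comparison can be sketched as follows. This is a simplified illustration, not the website's code. It defines the percentile as the share of athletes at or below the user's value, and both the `percentile_rank` helper and the sample heights are hypothetical.

```python
from bisect import bisect_right


def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    ordered = sorted(values)
    return 100.0 * bisect_right(ordered, x) / len(ordered)


# Made-up athlete heights in cm, for illustration only.
heights = [160, 165, 170, 175, 180, 185, 190, 195, 200, 205]

user_height = 180
rank = percentile_rank(heights, user_height)
print(f"A height of {user_height} cm is at the {rank:.0f}th percentile")
```

The same function works unchanged for age and weight by passing in the corresponding column.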

Our results have real-world applications: they can be applied to improve sports management and tracking systems for athletes, helping them to advance and reach their full potential. They extend beyond the Olympic population to other athletic competitions. This project serves as a testament to the incredible impact that machine learning and data analysis can have on the sports world, and we look forward to continuing our work in this field.

What's next for The Olympic Standard

We would like to try other techniques: combining multiple models to achieve better predictions, and more robust feature engineering (given time, we believe we can further optimize our features to reach the 0.9 accuracy threshold).
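One simple way to combine multiple models is hard majority voting, sketched below. This is a possible direction rather than something we have built. The `majority_vote` helper and the per-model predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by per-sample majority.

    predictions_per_model holds one equal-length list of labels per model.
    On a tie, the label predicted by the earliest-listed model wins, because
    Counter.most_common keeps first-seen order among equal counts.
    """
    combined = []
    for sample_predictions in zip(*predictions_per_model):
        winner = Counter(sample_predictions).most_common(1)[0][0]
        combined.append(winner)
    return combined


# Hypothetical predictions from three models for four athletes.
model_a = ["Gold", "None", "None", "Silver"]
model_b = ["Gold", "None", "Bronze", "None"]
model_c = ["None", "None", "Bronze", "Silver"]

ensemble = majority_vote([model_a, model_b, model_c])
print(ensemble)
```

Voting tends to help most when the individual models make different kinds of mistakes, so listing the strongest model first also gives it the tie-break.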

Furthermore, a real-time predictive feature that predicts a medal outcome could be a fun addition.

Collaboration with experts: by working alongside sports scientists and medical professionals, we believe we can gain a deeper understanding of our domain and in turn improve the accuracy. Overall, we are looking to make our resources and research available to a wider audience. We are here to prove a point: anyone can be an athlete.
