Gaussian Curveballs

Contributors: Calvin Aberg, Emma Gruber, Eleazar Martin, Will Miraglia

We were inspired to create our project by the Rice Datathon 2024 Astros challenge, where we were tasked with investigating the relationship between MLB team travel and performance. We broke our analysis down into three major aspects:

Overall league trends
Player performance differences
Team predictive capabilities

Added Data

We decided to augment the provided data with several data sources to get a more comprehensive view of games, performance, and travel.

Location Data: we manually collected data from batchgeo.com on the latitudes, longitudes, and time zones of the cities in which games took place. This was added to the game data to calculate travel metrics.
Pitch-by-Pitch Data: we gathered pitch-by-pitch Statcast data from Baseball Savant through pybaseball. This data included all regular season pitches since 2008 and helped us in our player performance evaluations.
Betting odds data: we gathered game betting odds for all games since 2008 through web scraping oddsportal.com with Selenium. This data helped us in our evaluation of travel's predictive power.

Travel Definitions

To analyze the trends created by travel, we looked at several travel factors:

Distance traveled to game/series
Time since last travel
Time zones crossed since last game/series
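
A minimal sketch of how two of these factors could be computed from the location data, assuming each venue is stored with a latitude, longitude, and UTC offset. The Venue class, the function names, and the example coordinates are hypothetical and are not taken from our project code; great-circle (haversine) distance is used here as a simple stand-in for actual travel distance.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


@dataclass
class Venue:
    """Hypothetical record for a game location."""
    lat: float        # degrees north
    lon: float        # degrees east (negative = west)
    utc_offset: int   # hours from UTC, standard time (assumption)


def haversine_miles(a: Venue, b: Venue) -> float:
    """Great-circle distance between two venues, in miles."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def time_zones_crossed(a: Venue, b: Venue) -> int:
    """Signed time zone change going from a to b (negative = westward)."""
    return b.utc_offset - a.utc_offset


# Approximate city coordinates, for illustration only.
houston = Venue(lat=29.757, lon=-95.355, utc_offset=-6)
seattle = Venue(lat=47.591, lon=-122.332, utc_offset=-8)

distance = haversine_miles(houston, seattle)
zones = time_zones_crossed(houston, seattle)
print(f"{distance:.0f} miles, {zones} time zones")
```

Keeping the time zone change signed preserves the direction of travel, so eastward and westward trips can be analyzed separately if needed.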

Overall League Trends

First, we looked into overall trends in distance traveled by teams, the impact of time zones crossed since the last game on win percentage, and several other exploratory factors, paying particular attention to travel differences caused by the MLB schedule change in 2023.

Player Performance Differences

To explore whether travel affects player performance, we chose pitching as the aspect to dive into. We worked through the pitch-by-pitch data in a drill-down style, starting broad and narrowing in, so that trends appearing only in small aspects of performance could surface.

First, we created a logistic regression model for each pitch type, where we defined a successful pitch as one where the batter swung and missed. We thought of this outcome as the one outcome that always favors the pitcher, and decided to evaluate its causes. From these models, we evaluated feature importances in the Statcast data. These features and their explanations can be found here. An example of this is the Fastball model, whose most important features were vertical acceleration and vertical velocity.

With the feature importances established, we drilled into each feature in every model and evaluated its correlation with travel. An example of this is vertical acceleration, where we found that pitcher time zone difference had a -0.019 Pearson correlation, with a p-value very close to 0. For many of our impact measurements, we found low to very low correlations with very small p-values. To us, this suggests that, even if the effect is minuscule, individual performance is likely correlated with travel measures.
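
To illustrate why a correlation this small can still come with a p-value near 0, here is a minimal sketch of the Pearson correlation and an approximate two-sided p-value. The sample sizes below are hypothetical (the exact number of pitches per model is not stated here), and the p-value uses a normal approximation to the t statistic, which is only reasonable for large samples.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for H0: no correlation.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) and treats t as standard
    normal, which is an approximation suited to large n.
    """
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return math.erfc(abs(t) / math.sqrt(2))


r = -0.019                               # correlation reported above
p_large = approx_p_value(r, 1_000_000)   # hypothetical pitch count
p_small = approx_p_value(r, 1_000)       # hypothetical small sample
print(f"n=1,000,000: p={p_large:.3g}   n=1,000: p={p_small:.3g}")
print(f"share of variance explained: {r * r:.6f}")
```

With a pitch-level sample size, the same r that is indistinguishable from noise at n = 1,000 becomes highly significant, while the share of variance explained (r squared) stays tiny. The p-value speaks to whether a relationship exists, not to how large it is.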

Team Predictive Capabilities

The last aspect that we looked at was the ability of travel measurements to predict the outcomes of games. To do this, we decided to predict the betting line implied probability rather than actual win outcomes. In our minds, this approach allowed us to gauge ourselves against a tested and extremely accurate measure while filtering out some of the outcome bias and noise.
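
For readers unfamiliar with implied probability, the conversion can be sketched as follows. This assumes American moneyline odds and a simple proportional normalization to remove the bookmaker's margin; the odds format of the scraped data and any margin adjustment are assumptions here, and the function names are hypothetical.

```python
def implied_probability(moneyline: int) -> float:
    """Raw implied win probability from American moneyline odds."""
    if moneyline < 0:
        # Favorite: risk |moneyline| to win 100.
        return -moneyline / (-moneyline + 100)
    # Underdog: risk 100 to win moneyline.
    return 100 / (moneyline + 100)


def no_vig_probabilities(home_line: int, away_line: int):
    """Rescale both sides so the two probabilities sum to 1."""
    home = implied_probability(home_line)
    away = implied_probability(away_line)
    total = home + away  # exceeds 1 by the bookmaker's margin
    return home / total, away / total


# Illustrative lines, not taken from our data.
home_p, away_p = no_vig_probabilities(-150, +130)
print(f"home {home_p:.3f}, away {away_p:.3f}")
```

The raw probabilities for a game sum to slightly more than 1 because of the bookmaker's margin, so some form of normalization is needed before the values can be compared with win rates.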

As we believed there would likely be some complex relationships between travel metrics and wins, we decided to create a random forest regression model to predict the betting line. Random forests are also relatively robust to noisy columns, which helps them separate signal from noise.

To create a basic model to layer travel data onto, we predicted second-half-of-season betting lines based on aggregate stats from the first half of the season.

In our modeling, we found that adding travel data produced no increase in model performance, and the travel features had consistently low feature importance. However, with just our basic aggregated stats, we were able to predict the betting line very closely.

With these results, we concluded that, although there are some observable correlations between travel and player and team performance, we found no predictive evidence that travel impacts team winning.