CSV Downloads of Retrosheet Data / Who Controls the Count?

Collecting Retrosheet Data

One reason for the many contributions to baseball research is the availability of data. Retrosheet was founded in 1989 with the purpose of computerizing baseball game records, and its game-by-game and play-by-play data are available for download from retrosheet.org. The contribution by Retrosheet was remarkable and deserves special recognition. David W. Smith received the Henry Chadwick Award from SABR for his work with Retrosheet. The availability of this data motivated the writing of the first edition of Analyzing Baseball Data with R.

There has been one hurdle in making play-by-play Retrosheet data available in a spreadsheet format suitable for R and other languages. The raw Retrosheet data need to be processed by the Chadwick programs; this process has been described in several posts including here. A number of folks have struggled with the Chadwick programs, and that motivated my thought that it would be simpler if these Retrosheet files were available in csv form, ready to import into a statistics package.

GREAT NEWS! The Retrosheet folks have recently addressed this concern, and many of the Retrosheet files are now available in csv format. (One does not need to use the Chadwick programs.) Specifically, this page contains “parsed play by play” Retrosheet files for every season where this data is available. You can download csv data for individual seasons. You can even download a single file that contains play-by-play data for all available seasons; this file has 15,852,865 records from almost 200K games.

I’ve played with these new Retrosheet files — here are some comments.

  • For fun, I downloaded the single file containing Retrosheet play-by-play data for all available seasons. Using the fread() function from the data.table package, I was able to read this file into R as a single data frame with over 15 million records. But it was hard to work with this data frame in R in real time, so I wouldn’t suggest doing this. (If you are interested in working with large datafiles, I’d suggest reading Chapters 11 and 12 of ABDWR.)
  • But it was quick to download a Retrosheet file for a single season such as 2024. The file is downloaded as a compressed zip file that is easy to unzip to create a single csv file and this file is quickly read into R by use of the read_csv() function in the readr package.
  • There are several differences between this Retrosheet csv file and the one that I had earlier processed with the Chadwick programs. One difference is that the csv file has more records since it includes playoff games and the All-Star game. It was easy to filter out the other dates to get a dataset for the regular season.
  • A more significant difference is that the number of variables and the variable names in the csv files differ from those in the files produced by the Chadwick programs. The variable names for the new Retrosheet files are listed on the same page containing the links for the data. I think it is straightforward to use these new variable names. But the functions that use Retrosheet data, such as the ones supplied in our ABDWR book, need to be adjusted to work with the new variables. For example, it would be straightforward to adjust the R work from Chapter 5 in ABDWR to compute the run expectancy matrix using the new files.
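To make the date filtering concrete, here is a minimal base R sketch. It runs on a tiny toy file rather than the real download, and several things in it are assumptions to check against the variable list on the Retrosheet page: the column names gid and date, the yyyymmdd date format, and the three 2024 calendar dates used as cutoffs. With the real file, read_csv() from readr could replace read.csv().

```r
# Toy stand-in for a single-season csv file (made-up game ids, four rows).
# ASSUMPTION: the real file has a date column stored as yyyymmdd.
toy <- data.frame(
  gid  = c("G1", "G2", "G3", "G4"),
  date = c(20240328, 20240716, 20240929, 20241025)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the csv file and convert the yyyymmdd value to a Date
plays <- read.csv(csv_path)
plays$game_date <- as.Date(as.character(plays$date), format = "%Y%m%d")

# ASSUMPTION: illustrative 2024 cutoffs; verify them before using
season_start <- as.Date("2024-03-20")
season_end   <- as.Date("2024-09-30")
allstar_date <- as.Date("2024-07-16")

# Keep plays inside the regular-season window, dropping the All-Star date
regular <- subset(
  plays,
  game_date >= season_start & game_date <= season_end &
    game_date != allstar_date
)
regular$gid
```

Filtering on dates mirrors what I describe above; if the file carries a game-type variable, filtering on that would be more direct.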

Who Controls the Count?

Here’s an application of Retrosheet play-by-play data for the 2024 season. In an earlier post, I asked the question:

Given a particular outcome of a plate appearance or a batted ball or a pitch, how much of the variability of that outcome is due to the batter and how much to the pitcher?

In this earlier post, I looked at a number of outcome variables such as SO, BB, HBP and several Statcast variables such as launch speed and launch angle. For each variable, I fit a nonnested multilevel model and I compared the variation of the pitchers with the variation of the batters. We obtained some interesting findings — for example, the variability in launch speeds is primarily attributed to the batter but the variability in launch angles is primarily attributed to the pitcher.

Let’s consider a different outcome — the count. Suppose you turn on a ball game and the current count of the plate appearance is 3-0 — does that say something about the pitcher or is this count more due to the batter?

For a particular count, say 3-0, we record 1 if that count happens in a PA and 0 otherwise. Let p denote the probability that we see a 3-0 count in a PA. We fit a multilevel model of the form

logit(p) = constant + pitcher_j + batter_k

where the different 2024 pitchers are assigned effects {pitcher_j} that are normal with mean 0 and standard deviation sigma_p and the batters are assigned effects {batter_k} that are normal with mean 0 and standard deviation sigma_b. We fit this model and get estimates of the two standard deviations. We compute the fraction

F = sigma_p^2 / (sigma_p^2 + sigma_b^2)

The fraction F is the fraction of the total variation (or variance) that is attributable to the pitcher. If this fraction is large (close to 1), then the occurrence of the 3-0 count is more a function of the pitcher. If the fraction is close to 0, the occurrence of this particular count is due more to the variation among batters.
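Here is a minimal sketch of this type of fit on simulated data, not the actual 2024 analysis. The sample sizes, the intercept, and the true standard deviations are made-up values, and the names pa, c30, and F_hat are chosen just for illustration (F itself is avoided as a name since it is shorthand for FALSE in R).

```r
library(lme4)

set.seed(2024)

# Made-up simulation settings (not estimates from real data)
n_pitchers <- 40
n_batters  <- 40
n_pa       <- 6000
sigma_p    <- 0.6   # true sd of pitcher effects
sigma_b    <- 0.3   # true sd of batter effects

pitcher_eff <- rnorm(n_pitchers, 0, sigma_p)
batter_eff  <- rnorm(n_batters, 0, sigma_b)

# Each simulated PA pairs a random pitcher with a random batter
pa <- data.frame(
  pitcher = sample(n_pitchers, n_pa, replace = TRUE),
  batter  = sample(n_batters, n_pa, replace = TRUE)
)

# logit(p) = constant + pitcher_j + batter_k
eta <- -2 + pitcher_eff[pa$pitcher] + batter_eff[pa$batter]
pa$c30 <- rbinom(n_pa, 1, plogis(eta))
pa$pitcher <- factor(pa$pitcher)
pa$batter  <- factor(pa$batter)

# Nonnested (crossed) random effects for pitcher and batter
fit <- glmer(c30 ~ 1 + (1 | pitcher) + (1 | batter),
             data = pa, family = binomial)

# Pull out the two variance estimates and form the fraction
vc    <- as.data.frame(VarCorr(fit))
var_p <- vc$vcov[vc$grp == "pitcher"]
var_b <- vc$vcov[vc$grp == "batter"]
F_hat <- var_p / (var_p + var_b)
F_hat
```

Since the simulation sets sigma_p larger than sigma_b, the estimated fraction should land well above one half, which is a handy check that the code is extracting the right variance components.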

I fit this multilevel model for each of the 11 possible counts in a plate appearance (every count other than 0-0, which occurs in every PA). In each model fit, I estimate the fraction F. Below I graph the value of F for all counts.

What do we learn from this graph?

  • The outcome of the initial pitch (either 1-0 or 0-1) is really due to the pitcher and not the batter. This makes sense since the batter rarely swings at the first pitch — the pitcher controls the outcome of the first pitch.
  • Interestingly, although the outcomes 0-2 and 2-0 in a PA depend on the pitcher, the occurrence of the 1-1 count in a PA depends more on the batter.
  • As the PA progresses, the occurrences of later counts like 3-1, 2-2 and 3-2 depend more on the batter than the pitcher. Hitters like Juan Soto, Kyle Schwarber and Mike Trout are very patient and tend to have long counts. There is a relatively large variation in batter discipline and that is reflected by the higher variance (large values of the standard deviation sigma_b) for later counts.

How is this graph helpful for baseball teams? A player should try to excel in the attributes that he has control over. A pitcher has much of the control over the outcome of the first pitch, so he should try to get a first-pitch strike. Similarly, a batter needs to have good discipline since the graph tells me that the occurrence of later counts depends more on the batter than the pitcher.

R Notes: Using the Retrosheet csv file, I renamed the pitches variable in that dataset to pitch_seq_tx. Then I used the retrosheet_add_counts() function from the abdwr3edata package to create the count indicator variables c00, c01, c10, etc. Then the glmer() function from the lme4 package was used to do the fitting of the nonnested multilevel models.
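For readers curious about what a count indicator involves, here is a small base R sketch that walks through a pitch sequence string and records every count reached. This is not the code inside retrosheet_add_counts(). The function name counts_reached() is hypothetical, and the grouping of pitch codes into balls, strikes, and fouls is my reading of the Retrosheet pitch-code conventions, so treat it as an assumption.

```r
# ASSUMPTION: grouping of Retrosheet pitch codes; other characters
# (pickoffs, balls in play, markers like "*" or ">") are ignored.
ball_codes   <- c("B", "I", "P", "V")
strike_codes <- c("C", "S", "K", "M", "Q", "T")
foul_codes   <- c("F", "L", "O", "R")

# Return every count reached in one plate appearance, starting at 0-0
counts_reached <- function(pitch_seq) {
  balls <- 0L
  strikes <- 0L
  reached <- "0-0"
  for (ch in strsplit(pitch_seq, "")[[1]]) {
    if (ch %in% ball_codes) {
      balls <- balls + 1L
    } else if (ch %in% strike_codes) {
      strikes <- strikes + 1L
    } else if (ch %in% foul_codes && strikes < 2L) {
      # a foul only adds a strike when there are fewer than two
      strikes <- strikes + 1L
    } else {
      next
    }
    # ball four or strike three ends the PA; it is not a count
    if (balls > 3L || strikes > 2L) break
    reached <- c(reached, paste0(balls, "-", strikes))
  }
  unique(reached)
}

# Indicator for the 3-0 count on three made-up pitch sequences
seqs <- c("BBBC", "CFFX", "BCBX")
c30 <- as.integer(
  vapply(seqs, function(s) "3-0" %in% counts_reached(s), logical(1))
)
c30
```

Only the first sequence passes through 3-0, so its indicator is 1 and the other two are 0. The same function gives the indicator for any other count by swapping the "3-0" string.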
