{"id":9188,"date":"2026-03-19T08:02:37","date_gmt":"2026-03-19T12:02:37","guid":{"rendered":"https:\/\/datacolada.org\/?p=9188"},"modified":"2026-03-19T08:00:15","modified_gmt":"2026-03-19T12:00:15","slug":"9188","status":"publish","type":"post","link":"https:\/\/datacolada.org\/134","title":{"rendered":"[134] Figuring Out Figure 1"},"content":{"rendered":"<style>\nbody, p, li, ol, ul {<br \/>    font-family: \"Helvetica Neue\", Helvetica, Arial, sans-serif !important;<br \/>}<br \/><\/style>\n<p style=\"text-align: justify;\"><span style=\"font-family: Helvetica;\">A few years ago our Journal Club discussed an interesting methods paper entitled, \u201cPutting Psychology to the Test: Rethinking Model Evaluation Through Benchmarking and Prediction\u201d (<a href=\"https:\/\/journals.sagepub.com\/doi\/full\/10.1177\/25152459211026864\">.htm<\/a>). This post describes my attempt to understand what\u2019s happening in Figure 1 of that paper, which shows that extremely simple experiments can generate extremely <em>negative <\/em>R<sup>2<\/sup>s. I learned a lot, much of it unexpected (at least to me) and interesting (at least to me). In this post, I\u2019ll share what I learned [<a href=\"#footnote_0_9188\" id=\"identifier_0_9188\" class=\"footnote-link footnote-identifier-link\" title=\"I make no claims that what I learned is new to the world; it was new to me, and so may be new to some of our readers.\">1<\/a>]. The data and code for this post are here: <a href=\"https:\/\/researchbox.org\/6202\">https:\/\/researchbox.org\/6202<\/a>.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-family: Helvetica;\">Before I show you the figure that perplexed me, some background is necessary.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-family: Helvetica;\">The paper makes many points, one of which is that our standard statistical procedures \u2013 like running basic OLS regressions on our full datasets \u2013 often convey an extremely optimistic impression of how good our models are at predicting new observations.<\/span><\/p>\n<p style=\"text-align: justify;\"><span style=\"font-family: Helvetica;\">To help make that case, the authors analyzed data from Many Labs, an effort by many different labs around the world to try to replicate some published findings in psychology (<a href=\"https:\/\/econtent.hogrefe.com\/doi\/10.1027\/1864-9335\/a000178\">.htm<\/a>). The figure that sparked this post focused on replications of a sunk cost effect, in which participants were randomly assigned to one of two conditions and then submitted a rating on a 9-point scale (the details aren\u2019t critical, but if you want them they are here: [<a href=\"#footnote_1_9188\" id=\"identifier_1_9188\" class=\"footnote-link footnote-identifier-link\" title=\"Participants imagined having tickets to see their favorite football team on a day on which it is freezing outside. They imagined that the tickets had been free or that they had paid for them, and they rated how likely they would be to go to the game, where 1 = definitely stay at home and 9 = definitely go to the game. Many Labs replicated the original finding: People rated themselves as more likely to go to the game when they paid for the tickets. (Across all labs the median cohen&rsquo;s d was .31). Because people indicated being more likely to sit in the freezing cold if they had paid for the tickets than if they had gotten them for free, this is taken as evidence that people honor sunk costs.\">2<\/a>]). 
Specifically, it plots the results from 15 labs that found a statistically significant effect of the manipulation on the dependent variable [3]. The models producing these results could not be simpler: regressing ratings on experimental condition (0 = control; 1 = treatment). That's it. No additional variables.
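To make the setup concrete, here is a minimal sketch of that model in Python with scikit-learn. The numbers below (sample size, effect, noise) are simulated stand-ins I made up for illustration; the real data are linked at the top of the post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simulated stand-in for one lab's data: a binary condition
# (0 = control, 1 = treatment) and a 1-9 rating with a modest effect.
n = 113
condition = rng.integers(0, 2, size=(n, 1))
rating = np.clip(np.round(6.5 + 1.0 * condition[:, 0] + rng.normal(0, 2, n)), 1, 9)

# The Figure 1 model: rating regressed on condition. Nothing else.
model = LinearRegression().fit(condition, rating)

# .score() returns the ordinary in-sample R², the kind of estimate
# the labs themselves reported.
print(model.score(condition, rating))
```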
The figure compares the R² estimates that those labs obtained using OLS regression (orange dots) to the R² estimates that these authors obtained using *10-fold cross-validation* (purple dots). It lists each of the 15 Many Labs sites on the y-axis and represents the R²s on the x-axis. Here it is, along with my annotations; the figure note is theirs:

![Annotated Figure 1: OLS R² (orange) vs. 10-fold cross-validation R² (purple) for the 15 sites](https://datacolada.org/wp-content/uploads/Annotated-Figure-1-scaled-e1753548014903-1024x519.png)

The takeaway is that the cross-validation estimates are always lower than the OLS estimates, and in many cases much lower. For example, the median OLS R² is .051, and the median cross-validation R² is -.128.

The authors interpret this as evidence that the "R²s differ widely because of overfitting" and that the OLS estimates are "overly optimistic." But that is not obvious to me. I had at least two questions.

1. Most of the cross-validation R²s are *negative*. Indeed, one of them – for the "ku" sample – is *very* negative. What??
2. Overfitting happens when a model fits (in-sample) data in a way that does not generalize to other (out-of-sample) data. This problem is much more likely when models are complex, contorted to fit the data they are analyzing. But this model is built on a single binary predictor, a situation in which overfitting seems almost impossible. What??

I am no expert on cross-validation, and I wanted to understand these things. To do so, I dug into the authors' posted (and helpfully immaculate) code. But before I tell you what I found, there are two things you need to understand first: (1) What is 10-fold cross-validation? and (2) How can R² be negative?

**What is 10-fold cross-validation?**

To perform cross-validation, you build a model on one subset of data – called the *training sample* – and then use it to predict data in a different subset – called the *test sample*.

To perform *10-fold* cross-validation, you divide the sample into 10 equally sized subsamples called *folds*. So if your dataset has a total of 100 observations, you divide the sample into 10 folds of 10 observations each.

You then use Fold 1 as the test sample and build a model using the remaining 90% of the data (Folds 2-10). You use that model to predict the Fold 1 observations and measure how well it performs. You repeat this process nine more times, each time holding out a different fold as the test set.

The assessment of model performance can take many forms, but what the authors did was calculate an R² for each fold – yielding 10 R² values – and then average them. Those across-fold averages are what appear in Figure 1 [4].
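Here is a minimal sketch of that procedure, again on simulated, made-up data. Per footnote 4 the authors used Python's sklearn, which reports per-fold scores; the snippet computes both the per-fold average that appears in Figure 1 and the pooled alternative discussed in that footnote.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 1))   # condition (0/1)
y = np.clip(np.round(6.5 + 1.0 * X[:, 0] + rng.normal(0, 2, 100)), 1, 9)

fold_r2s, all_true, all_pred = [], [], []
for train, test in KFold(n_splits=10, shuffle=True, random_state=1).split(X):
    model = LinearRegression().fit(X[train], y[train])  # build on 9 folds
    pred = model.predict(X[test])                       # predict the held-out fold
    fold_r2s.append(r2_score(y[test], pred))            # one R² per fold
    all_true.extend(y[test])
    all_pred.extend(pred)

print(np.mean(fold_r2s))              # average of per-fold R²s (as in Figure 1)
print(r2_score(all_true, all_pred))   # pooled R² (the footnote-4 alternative)
```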
**How can R² be negative?**

Inconveniently, R² means different things in different contexts.

For everyday OLS regressions, R² represents the percentage of variance explained. And since you can't explain a negative percentage of the variance, this R² cannot be negative [5]. The authors refer to this as *in-sample R²*. I will refer to it as *OLS R²* since it's coming from our OLS regressions.

In the context of cross-validation, R² is defined this way:

R² = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)²

where the yᵢ are the observed values, the ŷᵢ are the model's predictions, and ȳ is the mean of the observed values.

This *cross-validation R²* is inherently *comparative*. It *compares* how well you'd do if you used your model to predict a set of values vs. how well you'd do if you simply predicted the mean every time. It is positive if your model does better than the mean, and negative if it does worse than the mean. In general, these R² values can range from −∞ to +1 [6], [7].
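Because the comparison is against the mean of the observed values, nothing stops this quantity from going negative. A quick illustration with made-up numbers:

```python
from sklearn.metrics import r2_score

# Hypothetical test observations that cluster tightly; their own
# mean (8.4) predicts them very well.
y_test = [8, 9, 9, 9, 7]

# A constant prediction of 7.0 isn't crazy, but it loses badly
# to the mean benchmark: R² = 1 - 13/3.2 ≈ -3.06.
print(r2_score(y_test, [7.0] * 5))
```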
**Figuring Out Fold #9**

In trying to understand Figure 1, I decided to focus on the outlier at the far left: the "ku" sample's (n = 113) very negative cross-validation R² value of -.88. ("Ku" stands for Koç University in Istanbul. Go Rams.) I wanted to see exactly where that came from. So I ran the authors' code, and I looked at the R² values generated in each of the 10 folds. Here they are:

![Per-fold cross-validation statistics for the "ku" sample](https://datacolada.org/wp-content/uploads/Fold-Stats-For-KU-Sample-300x272.png)

Check out Fold #9 and its gigantically negative R² value of -8.077. Let's see what's going on there.

My first thought was that the model used to estimate the Fold #9 values must be a lot worse than the model used to estimate, say, the Fold #1 values. Let's take a look at those models:

*Fold #1 model: y = 6.66 + 1.15 × treatment*

*Fold #9 model: y = 6.39 + 1.19 × treatment*

Um, those aren't very different.

Fold #1's model estimates a mean difference of +1.15, while Fold #9's model estimates a nearly identical mean difference of +1.19. And yet the former has an R² of .019 while the latter has an R² of -8.077.

What??

To make sense of this, remember that these R² values are comparative. They are comparing how well the model is doing against how well the mean is doing. This R² value of -8.077 is not telling us that the model is doing terribly in an absolute sense. It is telling us that it is doing terribly in a relative sense. The *mean* is doing a lot better at predicting the Fold #9 values than the model is.
To understand why, let's take a look at the training sample data (n = 102) and test sample data (n = 11) used in Fold #9:

![Fold #9 for the "ku" site: training data (n = 102) vs. test data (n = 11)](https://datacolada.org/wp-content/uploads/KU-Site-Fold-9-Train-vs.-Test-Data-1024x512.png)

You can see that in the training sample, the values range from 1-9. But in the Fold #9 test sample, there is almost no variation. No values are below 7, most are equal to 9, and the mean is very close to 9.

This led me to ask a naïve question: When we compare the model's performance to "the mean", which mean are we using: the training sample mean (6.91) or the test sample mean (8.73)?

I was surprised to learn that in the land of cross-validation, the benchmark is the test-sample mean. That's kinda brutal for the model, because the test-sample mean is a cheat: it is calculated from the very observations we are trying to predict. It effectively gets to peek at the answers.

By analogy, imagine I flip 10 coins, you predict 5 heads, but it turns out to be 7 heads. Your R² will be negative, because although you (sensibly) predicted 50% heads, you had no chance against the observed mean of 70% heads. You'll be punished for not foreseeing the unusual outcome.

You can't beat the mean when every observation is equal to the mean.
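Both of these points are easy to check numerically. The first calculation below scores the (real) Fold #9 model against a hypothetical test fold with the same flavor as the real one (11 made-up values, none below 7, most equal to 9); the second scores the coin-flip prediction against the observed flips:

```python
from sklearn.metrics import r2_score

# A made-up test fold mimicking Fold #9: clustered near the top of the scale.
condition = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_fold = [9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 7]
pred = [6.39 + 1.19 * t for t in condition]   # the actual Fold #9 model
print(r2_score(y_fold, pred))                 # about -7.5: the model's errors
                                              # dwarf the tiny spread around
                                              # the test-sample mean

# The coin-flip analogy: 10 flips, 7 heads; you predicted 50% heads.
flips = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(r2_score(flips, [0.5] * 10))            # about -0.19
```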
That cross-validation R² behaves this way is not a mistake. It's simply how it is defined. It isn't telling you that the model is *generally* bad or that the OLS R² is extremely optimistic. It is telling you that, *in this particular sample*, a model built on 90% of the observations does worse than the observed mean at predicting the remaining 10% of observations. It is telling you something super specific, not something general about model performance.

Once you understand the machinery of all of this, you can see that, by the metric of R², models stand very little chance when samples are really small. This is because small samples can be very unusual, containing observations that are unusually clustered, and thus very close to that sample's mean. When I flip 10 coins, I might observe 70% heads. But when I flip 1,000 coins, I won't. Similarly, when I observe 11-12 observations in the "ku" sample, I might not see any values below 7, even though they range from 1-9 in the whole sample. But when I task the model with predicting a larger sample of observations, that happenstance clustering of values is less likely, the mean's advantage shrinks, and the model has more of a fighting chance.

To illustrate this, what if instead of doing *10-fold* cross-validation on the "ku" sample – which has us predicting folds containing 11-12 observations – we do *3-fold* cross-validation – which has us predicting folds containing 33-34 observations? Note this is still an extremely small sample, and our model can still look bad. But we sure see a lot of improvement, with an R² of .027 instead of -.881:

![3-fold cross-validation results per fold for the "ku" sample](https://datacolada.org/wp-content/uploads/KU-3-fold-CV-results-per-fold-300x232.png)

So making a small change to how cross-validation is done drastically changes the results.

Though I have focused on the small "ku" sample, it is worth noting that all but two of the 15 samples are small, containing fewer than 300 observations. Not coincidentally, the two large samples – mturk and pi – had cross-validation R²s that were pretty similar to the OLS R²s. But, in general, you can see that if we rebuild Figure 1 using 3-fold cross-validation, the R²s are better, and in fact mostly positive:

![Figure 1 rebuilt: 3-fold vs. 10-fold cross-validation R²s](https://datacolada.org/wp-content/uploads/Figure-1-3-Fold-vs.-10-Fold-1024x768.png)
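I can't rerun their exact analysis here, but a simulation sketch conveys the mechanics: the same made-up data and the same one-predictor model, scored once with 10 folds and once with 3 (exact values will vary with the seed):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

def average_cv_r2(X, y, k, seed=1):
    """Average per-fold cross-validation R² with k folds."""
    scores = []
    for train, test in KFold(n_splits=k, shuffle=True, random_state=seed).split(X):
        model = LinearRegression().fit(X[train], y[train])
        scores.append(r2_score(y[test], model.predict(X[test])))
    return float(np.mean(scores))

rng = np.random.default_rng(7)
X = rng.integers(0, 2, size=(113, 1))   # a 'ku'-sized simulated sample
y = np.clip(np.round(6.5 + 1.0 * X[:, 0] + rng.normal(0, 2, 113)), 1, 9)

# Smaller test folds give the test-sample mean a bigger structural edge,
# so the 10-fold average tends to come out lower than the 3-fold average.
print(average_cv_r2(X, y, k=10))
print(average_cv_r2(X, y, k=3))
```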
There is something peculiar about this. In 10-fold cross-validation the models are built on 90% of the data, but in 3-fold cross-validation they are built on only 67% of the data. And yet, the models look better under 3-fold cross-validation than under 10-fold cross-validation. That is, the models look better when you build them on smaller samples. This strange fact emerges entirely because R² is comparative and the thing it compares against – the test-sample mean – has a bigger edge when the test samples are smaller (as in 10-fold cross-validation) than when they are larger (as in 3-fold cross-validation).

So the negative cross-validation R²s in Figure 1 don't show that those OLS R² estimates are wildly optimistic or that these simple models are somehow severely overfitting. Rather, they largely reflect how difficult it is for those models to outperform a structurally advantaged sample mean when test folds are very small.

**So Are The Models Actually Overfitting?**

R² can behave strangely in small samples. So instead of asking "how much variance is explained," let's ask a simpler question: how wrong are the predictions?

If a model is overfitting, it will be a lot worse at predicting new observations than at predicting the data it was trained on. For each of the 15 samples, we can use cross-validation to compare how far off the model's predictions are when predicting new observations (out-of-sample error) vs. how far off they are when predicting the data it was trained on (in-sample error).

We can do this using two common performance metrics: Root Mean Squared Error – which gives greater weight to larger errors – and Mean Absolute Error – which simply represents the average mistake. The results are in the figure below. Note that the diagonal line marks the points at which the in-sample and out-of-sample errors are identical. Points above that line are consistent with overfitting (more error when predicting new observations than in-sample observations) and points below are consistent with underfitting (*less* error when predicting new observations):

![In-sample vs. out-of-sample RMSE (left panel) and MAE (right panel) for the 15 samples](https://datacolada.org/wp-content/uploads/RMSE_and_MAE_Plots-1024x512.png)

Across all 15 samples, those errors are nearly identical. In fact, the left panel shows that in 13 out of the 15 samples, the model made slightly *smaller* squared errors when predicting the new data than when predicting the data it was built on, more consistent with *under*fitting than with overfitting. The right panel shows slightly larger absolute errors in the out-of-sample than in the in-sample predictions, but the largest gap is only .06 points on this 9-point scale, hardly something to care about.
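Here is a sketch of that comparison for a single simulated sample (again with made-up data); the real version just loops this over the 15 sites:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(113, 1))
y = np.clip(np.round(6.5 + 1.0 * X[:, 0] + rng.normal(0, 2, 113)), 1, 9)

in_rmse, out_rmse, in_mae, out_mae = [], [], [], []
for train, test in KFold(n_splits=10, shuffle=True, random_state=1).split(X):
    model = LinearRegression().fit(X[train], y[train])
    pred_in, pred_out = model.predict(X[train]), model.predict(X[test])
    in_rmse.append(np.sqrt(mean_squared_error(y[train], pred_in)))
    out_rmse.append(np.sqrt(mean_squared_error(y[test], pred_out)))
    in_mae.append(mean_absolute_error(y[train], pred_in))
    out_mae.append(mean_absolute_error(y[test], pred_out))

# Overfitting would show out-of-sample error well above in-sample error;
# for this one-predictor model the two come out nearly identical.
print(np.mean(in_rmse), np.mean(out_rmse))
print(np.mean(in_mae), np.mean(out_mae))
```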
This evidence is much more consistent with the view that these simple models are not meaningfully overfitting than with the claim that they are.

**Conclusion**

In their paper, the authors make some important points. For example, in-sample R² is generally optimistic about how well a model will perform on new data, behavioral scientists should care about how well our models forecast new observations, and we should adopt methods that allow us to evaluate that performance.

But that paper also included Figure 1, which, to a naïve reader like me, makes it look like even these simple models suffer from extreme overfitting, and that, if anything, those models are performing so poorly as to have no predictive value at all. That impression, I think, is misleading. Figure 1's results look so dire because the authors are using R² to assess cross-validated predictions in a context in which samples are very small. When you rely on different performance metrics – ones that don't structurally disadvantage the model in very small test samples – you can see that these models are performing about as well as you'd expect given the modest effect sizes that the traditional analyses reveal.

From this post, you might take away some specific lessons about cross-validation or R². But for me this exercise reinforced a broader lesson: Everything that scientists say hinges on the specifics of what they actually *did*. If you can't (or don't) evaluate the details, you can't evaluate the science. I'm grateful to the authors for providing access to those details.

---

**Author feedback**
I shared an earlier draft of this post with the authors last summer, and they provided me with extremely helpful feedback that led me to significantly revise the post. I am very grateful for their feedback. I shared the updated draft with them more recently and they did not offer any additional feedback.
If they do decide to comment or reply, I will post it here.

---

**Footnotes.**

1. I make no claims that what I learned is new to the world; it was new to me, and so may be new to some of our readers.
2. Participants imagined having tickets to see their favorite football team on a day on which it is freezing outside. They imagined that the tickets had been free or that they had paid for them, and they rated how likely they would be to go to the game, where 1 = definitely stay at home and 9 = definitely go to the game. Many Labs replicated the original finding: People rated themselves as more likely to go to the game when they paid for the tickets. (Across all labs the median Cohen's d was .31.) Because people indicated being more likely to sit in the freezing cold if they had paid for the tickets than if they had gotten them for free, this is taken as evidence that people honor sunk costs.
3. There were 36 labs in total, of which 15 found a significant effect. This post will focus only on those 15, since only those 15 results are contained in the figure.
4. Really, you shouldn't be averaging across them. You should be computing R² by pooling over all predicted values. You don't compute the standard deviation of a sample by chopping it into subsamples and averaging across those subsamples, because that's not going to give you the same answer as computing the standard deviation of the whole sample. The authors did it this way because they used the sklearn package in Python, and that's the way it does it. If you do the pooling instead of the averaging, Figure 1 looks a lot different: the lowest R² is -.151 (instead of -.881) and the median R² is -.031 (instead of -.128). The fact that this matters so much supports a thesis of this post – minor cross-validation decisions can severely influence the results – but going forward I'm going to ignore it.
5. *Adjusted* R²s can be negative if the model is bad and there are many predictors that aren't helping. We aren't in that situation here – the models are significant and there is just one predictor – so we'll ignore this.
6. You get −∞ when all of the test observations are equal to the mean; in this case the mean perfectly predicts every value, and thus exhibits no error.
7. Interestingly, and perhaps concerningly, when you use the most common cross-validation R package (caret), you never get negative R²s, because within each fold it computes the correlation between predicted and actual values and then squares it. So even if the predicted values negatively correlate with the actual values, the R²s will be positive. This means that you'll get different answers if you use the most commonly used Python package vs. the most commonly used R package. Fun stuff.
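For the curious, the difference described in footnote 7 is easy to see numerically. This illustration uses made-up numbers and is written in Python rather than R; "caret-style" here just means squaring the correlation, as that footnote describes:

```python
import numpy as np
from sklearn.metrics import r2_score

actual = [9, 8, 7, 9, 8]
pred = [7.0, 7.5, 8.0, 7.0, 7.5]   # perfectly *negatively* related to actual

# sklearn-style R² can be negative (here, about -2.39):
print(r2_score(actual, pred))

# Squaring the correlation can never be negative; these perfectly
# wrong predictions score a perfect 1.0.
print(np.corrcoef(actual, pred)[0, 1] ** 2)
```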