
All your p-values are wrong

Or they don’t mean what you think, or they are not interpretable in most situations (Wagenmakers, 2007; Kruschke, 2013). Why is that? Let’s consider how a p-value is calculated. For simplicity, we focus on a one-sample one-sided t-test. Imagine we collected this sample of 50 observations.

The mean is indicated by the vertical solid line. Imagine our hypothesis is a mean of 70% correct. The t value is 1.52, and the p-value is 0.0670. We obtain the p-value by comparing our observed t value to a hypothetical distribution of t values obtained from imaginary experiments we will never carry out. The default approach is to assume a world in which there is no effect and we sample from a normal distribution. In each imaginary experiment, we get a sample of n=50 observations from our null distribution, and calculate t. Over an infinite number of imaginary experiments, we get this sampling distribution:
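To make the recipe concrete, here is a minimal Python sketch of this procedure (the blog's own code, linked at the end, is in R). The data are hypothetical stand-ins for the 50 scores, drawn from an arbitrary beta distribution; only the logic matters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data standing in for the 50 accuracy scores (the original
# sample is not reproduced here): proportions from a beta distribution.
sample = rng.beta(7, 2.5, size=50)  # population mean ~0.74
mu0 = 0.70  # null hypothesis: mean of 70% correct

# Observed t value for a one-sample test
n = sample.size
tobs = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# Null sampling distribution: imaginary experiments under normality
nsim = 100_000
null = rng.normal(mu0, sample.std(ddof=1), size=(nsim, n))
tnull = (null.mean(axis=1) - mu0) / (null.std(axis=1, ddof=1) / np.sqrt(n))

# One-sided p-value: proportion of imaginary t values at least as large
p_sim = np.mean(tnull >= tobs)
p_theory = stats.t.sf(tobs, df=n - 1)  # theoretical equivalent
```

The simulated and theoretical p-values agree up to Monte Carlo error, because under normal sampling the t statistic follows a t distribution with n-1 degrees of freedom exactly.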

The p-value is the tail area highlighted in red, corresponding to the probability of observing imaginary t values at least as extreme as our observed t value under our model. Essentially, a p-value is a measure of surprise, which can be expressed as an s-value in bits (Greenland, 2019). For a p-value of 0.05, the s-value is -log2(0.05) = 4.32. That’s equivalent to flipping a coin 4 times and getting 4 heads in a row. A p-value can also be described as a continuous measure of compatibility between our model and our data, ranging from 0 for complete incompatibility to 1 for complete compatibility (Greenland et al. 2016). This is key: p-values are not absolute; they change with the data and the model, and even with the experimenter’s intentions.
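The s-value conversion is a one-liner; a small helper for illustration:

```python
import math

def s_value(p):
    """Surprisal of a p-value, in bits (Greenland, 2019)."""
    return -math.log2(p)

s_value(0.05)  # about 4.32: as surprising as 4 heads in 4 coin flips
```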

Let’s unpack this fundamental property of p-values. Our sample of n=50 scores is associated with a p-value of 0.067 under our model. This model includes:

  • a hypothesis of 70% correct;
  • sampling from a normal distribution;
  • independent samples;
  • fixed sample size of n=50;
  • a fixed statistical model, that is, a t-test is applied every time.

(The full model includes other assumptions, for instance that our sample is unbiased, that we have precise measurements, that our measure of interest is informative in the context of a causal model linking data and theory (Meehl, 1997), but we will ignore these aspects here.)

In practice some of these assumptions are incorrect, making the interpretation of p-values difficult.

Data-generating process

The scores in our sample do not come from a normal distribution. Like many proportion measures bounded between 0 and 1, they are well described by a beta distribution. Here is the population our data came from:

A boxplot suggests the presence of one outlier:

A Q-Q plot suggests some deviations from normality, but a Shapiro-Wilk test fails to reject:

Should we worry, or rely on a reassuring appeal to the central limit theorem and some hand-wavy statement about the robustness of the t-test and ANOVA? In practice, it is a bad idea to assume that empirical t distributions will match theoretical ones, because skewness and outliers can substantially distort them, even for relatively large sample sizes (Wilcox, 2022). In the one-sample case, ignoring skewness and outliers can lead to inflated false positives and low power (Wilcox & Rousselet, 2023; Rousselet & Wilcox, 2020).

In our case, we can simulate the t distribution under normality (“Normal (sim)” in the figure below), and compare it to the t distribution obtained when sampling from our beta population (“Beta (sim)”). As a reference, the figure also shows the theoretical, non-simulated t distribution (“Normal (theory)”). The simulation involved 100,000 iterations with samples of n=50.

Let’s zoom in to better compare the right side of the distributions:

The simulated t distribution under normality is a good approximation of the theoretical distribution. In practice, the t distribution obtained by sampling from our beta population is not accessible, because we typically don’t know exactly how the data were generated. Here we have full knowledge of the data-generating process, so we can derive the correct t distribution for our data. Remember that the p-value from the standard t-test was 0.0670. Using our simulation under normality, the p-value is 0.0675. Using the t distribution obtained by sampling from the correct beta distribution, the p-value is 0.0804.
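A Python sketch of this comparison, with hypothetical beta parameters chosen so the population mean equals the hypothesised 70% (the actual parameters behind the figures are in the blog's R code):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, nsim = 50, 100_000
a, b = 7, 3  # hypothetical beta parameters; population mean a/(a+b) = 0.7

def tvals(samples, mu):
    """One-sample t statistics, one per row of samples."""
    return (samples.mean(axis=1) - mu) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# t distribution when sampling from the (skewed) beta population,
# centred on the TRUE beta mean so the null hypothesis holds
t_beta = tvals(rng.beta(a, b, size=(nsim, n)), a / (a + b))
# t distribution when sampling from a normal population
t_norm = tvals(rng.normal(0, 1, size=(nsim, n)), 0)

tobs = 1.52  # observed t value from the original sample
print(stats.t.sf(tobs, df=n - 1))  # theoretical p
print(np.mean(t_norm >= tobs))     # simulated p under normality
print(np.mean(t_beta >= tobs))     # simulated p under the beta model
```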

In most situations, p-values are calculated using inappropriate theoretical sampling distributions of t values. This might not affect the observed p-value much, but the correct p-value is unknown.

Independence

The independence assumption is violated whenever data-dependent exclusion is applied to the sample. For instance, it is very common for outliers to be identified and removed before applying a t-test or another frequentist inferential test. This is often done using a non-robust method, such as flagging observations more than 2 SD from the mean. A more robust method could also be used, such as a boxplot rule or a MAD-median rule (Wilcox & Rousselet, 2023). Whatever the method, if the outliers are identified using the sample we want to analyse, the remaining observations are no longer independent, which affects the standard error of the test.

This is well documented in the case of inferences about trimmed means (Tukey & McLaughlin, 1963; Yuen, 1974; Wilcox, 2022). Trimmed means are robust estimators of central tendency that can boost statistical power in the presence of skewness and outliers. To calculate a 20% trimmed mean, we sort the data, remove the lowest 20% and the highest 20% (so 40% of observations in total), and average the remaining observations. This introduces a dependency among the remaining observations, which is taken into account in the calculation of the standard error. In other words, removing observations in a data-dependent manner, and then using a t-test as if the new, smaller sample size were the one intended, is inappropriate.

To illustrate the problem, we can run a simulation in which we sample from a normal or a beta population, each time taking a sample of n=50, trimming 20% of observations from each end of the distribution, and either applying the incorrect t-test to the remaining n=30 observations, or applying the t formula from Tukey & McLaughlin (1963; Wilcox, 2022) to the full sample. Here are the results:
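For illustration, here is a Python sketch of the one-sample test on a trimmed mean, following the Tukey & McLaughlin approach as presented in Wilcox (2022): the standard error is based on the winsorized variance of the full sample, not on the naive standard error of the observations left after trimming.

```python
import numpy as np
from scipy import stats

def trimmed_t(x, mu=0.0, prop=0.2):
    """One-sample t-test on a trimmed mean (Tukey & McLaughlin, 1963).

    A minimal sketch after Wilcox (2022): the winsorized variance of the
    full sample accounts for the dependency created by trimming.
    Returns the t value and a one-sided (greater) p-value.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    g = int(np.floor(prop * n))     # observations trimmed from each end
    tm = x[g:n - g].mean()          # trimmed mean
    xw = x.copy()                   # winsorized sample: clamp the tails
    xw[:g] = x[g]
    xw[n - g:] = x[n - g - 1]
    se = xw.std(ddof=1) / ((1 - 2 * prop) * np.sqrt(n))
    tval = (tm - mu) / se
    df = n - 2 * g - 1
    return tval, stats.t.sf(tval, df)
```

With prop=0 the formula reverts to the standard one-sample t-test, as noted below.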

T values computed on observations left after trimming are far too large. The discrepancy depends on the amount of trimming. Elegantly, the equation of the t-test on trimmed means reverts to the standard equation if we trim 0%. Of course, the amount of trimming should be pre-registered and not chosen after seeing the data.

If we apply a t-test on means to our beta sample after trimming 20%, the (incorrect) p-value is 0.0007. The t-test on trimmed means returns p = 0.0256. That’s a large difference! Using the t distribution from the simulation in which we sampled from the correct beta population, we get p = 0.0329. Note also that with the original t-test on means (p = 0.0670) we fail to reject at the 0.05 level, whereas we do reject for an inference on 20% trimmed means. In general, inferences on trimmed means tend to be more powerful than inferences on means in the presence of skewness or outliers. However, keep in mind that means and trimmed means are not interchangeable: they ask different questions about the populations. Sample means are used to make inferences about population means, and both are non-robust measures of central tendency. Sample trimmed means are used to make inferences about population trimmed means.

Now the problem is more complicated, and somewhat intractable, if instead of trimming a pre-registered amount of data, we apply an outlier detection method. In that case, independence is violated, but correcting the standard error is difficult because the number of removed observations is a random variable: it will change between experiments.

In our sample, we detect one outlier. Removing the outlier and applying a t-test on means, pretending that our sample size was always n-1, is inappropriate, although very common in practice. The standard error could be corrected using an equation similar to that used in the trimmed mean t-test. However, in other experiments we might reject a different number of outliers. Remember that p-values are not about our experiment: they reflect what could happen in other similar experiments that we will never carry out. In our example, we can run a simulation to derive a sampling distribution that matches the data generation and analysis steps. For each sample of n=50 observations from the beta distribution, we apply a boxplot rule, remove any outliers, and then compute a t value. If we simply remove the outlier from our sample, the t-test returns p = 0.0213. If instead we compute the p-value using the simulated sampling distribution, we get p = 0.0866. That p-value reflects the correct data-generating process and the fact that, in other experiments, we could have rejected a different number of outliers. Indeed, in the simulation the median number of rejected outliers is zero, the 3rd quartile is 1, and the maximum is 9.

In practice, we don’t have access to the correct t sampling distribution. However, we can get a good approximation by using a percentile bootstrap that incorporates the outlier detection and rejection step after sampling with replacement from the full dataset, and before calculating the statistic of interest (Rousselet, Pernet & Wilcox, 2021).
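A sketch of that bootstrap logic in Python, using a 1.5 IQR boxplot rule for concreteness (the paper's implementation is in R), with the detection step repeated inside each bootstrap iteration:

```python
import numpy as np

def boot_ci_mean_after_outlier_removal(x, nboot=5000, seed=0):
    """Percentile bootstrap CI for the mean, with boxplot outlier
    rejection repeated inside each iteration, so that the variability
    of the rejection step is propagated into the interval
    (after Rousselet, Pernet & Wilcox, 2021)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    boot = np.empty(nboot)
    for b in range(nboot):
        xs = rng.choice(x, size=x.size, replace=True)  # resample FULL data
        q1, q3 = np.percentile(xs, [25, 75])
        iqr = q3 - q1
        keep = (xs >= q1 - 1.5 * iqr) & (xs <= q3 + 1.5 * iqr)
        boot[b] = xs[keep].mean()   # statistic computed after rejection
    return np.percentile(boot, [2.5, 97.5])  # 95% percentile interval
```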

If outliers are expected and common, a good default strategy is to make inferences about trimmed means. Another approach is to make inferences about M-estimators in conjunction with a percentile bootstrap (Wilcox, 2022). M-estimators adjust the amount of trimming based on the data, instead of removing a pre-specified amount. Yet another approach is to fit a distribution with a tail parameter that can account for outliers (Kruschke, 2013). Or it might well be that what looks like an outlier is a perfectly legitimate member of a skewed or heavy-tailed distribution: use more appropriate models that account for rich distributional differences (Rousselet, Pernet, Wilcox, 2017; Farrell & Lewandowsky, 2018; Lindeløv, 2019).

Fixed sample size?

The t-test assumes that the sample size is fixed. This seems obvious, but in practice it is often not the case. As we saw in the previous example, sample sizes can depend on outlier rejection, a very common procedure that makes p-values uninterpretable. In general, data-dependent analyses will mess up traditional frequentist inferences (Gelman & Loken, 2014). Sample sizes can also be affected by certain inclusion criteria. For instance, data are included in the final analyses only for participants who scored high enough on an attention check. Deriving correct p-values would require simulating sampling distributions that incorporate the inclusion check. In other situations, sample sizes vary for reasons outside the experimenters’ control. For instance, data are collected in an online experiment until a deadline. In that case the final sample size is a surprise revealed at the end of the experiment, and is thus a random variable. Consequently, deriving a sampling distribution for a statistic of interest requires another sampling distribution of plausible sample sizes that could have been obtained. The sampling distribution, say for a t value, would be calculated by integrating over the sampling distribution of sample sizes, and over any other sources of variability, such as the different numbers of outliers that could have been removed, even if none were removed in our sample. Failure to account for these sources of variability leads to incorrect p-values. It gets even more complicated in some situations: p-values also depend on our sampling intentions.

Imagine this scenario inspired by Kruschke (2013), in which a supervisor asked two research assistants to collect data from n=8 participants in total. They misunderstood the instructions, and instead collected n=8 each, so a total of n=16. The plan was to do a one-sample t-test. What sample size should the research team use to compute the degrees of freedom: 8 or 16? So 7 df or 15 df? Here is a plot of the p-values as a function of the critical t values in the two situations.

The answer depends on the sampling distribution matching the data acquisition process, including the probability that the instructions are misunderstood (Kruschke, 2013). If we assume that a misunderstanding leading to this specific error could occur in 10% of experiments, then the matching curve is the dashed one in the figure below, obtained by mixing the two curves for n=8 and n=16.
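Under this mixture model, the one-sided p-value is simply a weighted average of the tail areas of the two t distributions:

```python
from scipy import stats

def mixture_p(tval, p_misunderstand=0.10):
    """One-sided p-value for the Kruschke (2013) scenario: in 90% of
    imaginary experiments n=8 (7 df); in 10% the instructions are
    misunderstood and n=16 (15 df)."""
    return (1 - p_misunderstand) * stats.t.sf(tval, df=7) \
        + p_misunderstand * stats.t.sf(tval, df=15)

# The mixture p-value lies between the two pure cases, closer to 7 df
t = 2.0
print(stats.t.sf(t, df=7), mixture_p(t), stats.t.sf(t, df=15))
```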

That’s right: even though the sample size is n=16, because it was obtained by accident and we intended to collect n=8, the critical t and the p-value are obtained from a distribution that lies in between the distributions for n=8 and n=16, but closer to n=8. This correct distribution reflects the long-run perspective of conducting imaginary experiments, in the majority of which we would have obtained n=8. Again, the p-value is not about the current experiment. This scenario reveals that p-values depend on intentions, which has consequences in many situations. In practice, all the points raised so far demonstrate that p-values in most situations are necessarily inaccurate and very difficult to interpret.

Conditional analyses

Another common way to mess up the interpretation of our analyses is to condition one analysis on another. For instance, it is common practice to run a test of normality on the data: if it rejects, apply a rank-based test; if it fails to reject, apply a t-test. Testing for normality is a bad idea for many reasons, including because it makes the subsequent statistical tests conditional on the outcome of the normality test. Again, unless we can simulate the appropriate conditional sampling distribution for our statistic, our p-value will be incorrect. Similarly, anyone tempted to use such an approach would need to justify sample sizes using a power simulation that includes the normality step, and any other step that affects the sampling distribution of the statistic. In my experience, such conditional steps are typically ignoreded in power analyses and pre-registrations. It’s not just p-values: all power analyses are wrong too.
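Checking such a conditional procedure requires simulating the full pipeline. A sketch in Python, using a Shapiro-Wilk test to decide between a Wilcoxon signed-rank test and a t-test; the specific tests are placeholders for whatever combo is used in practice, and here we only record the realized false positive rate under a normal null:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, nsim = 30, 20_000

# Simulate the two-step conditional procedure under a normal null:
# test normality first, then choose the inferential test accordingly.
reject = 0
for _ in range(nsim):
    x = rng.standard_normal(n)
    if stats.shapiro(x).pvalue < 0.05:
        p = stats.wilcoxon(x).pvalue        # rank-based test
    else:
        p = stats.ttest_1samp(x, 0).pvalue  # standard t-test
    reject += p < 0.05

print(reject / nsim)  # realized false positive rate of the combo
```

Under non-normal populations, or with other combos, the realized rate can drift away from the nominal alpha; the simulation is the only way to find out.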

No measurement error?

It gets worse. Often, t-tests and similar models are applied to data that have been averaged over repetitions, for instance accuracy or reaction times averaged over trials in each condition and participant. In this common situation, the t-test ignores measurement error, because the trial-level variability has been wiped out. In such situations, mixed-effects (hierarchical) models should be used instead (DeBruine & Barr, 2021). Using a t-test instead of a mixed-effects model is equivalent to using a mixed-effects model in which the trial-level data have been copied and pasted an infinite number of times, such that measurement precision becomes infinite. This is powerfully illustrated here.

Conclusion

In most articles, the p-values are wrong. How they would change using appropriate sampling distributions is hard to determine, and ultimately a futile exercise. Even if the p-values changed very little, the uncertainty makes the obsession with declaring “statistical significance” whenever p<0.05, no matter how close the p-value is to the threshold, all the more ridiculous. So the next time you read in an article that “there was a trend towards significance, p=0.06”, or some other nonsense, in addition to asking the authors if they pre-registered a threshold for a trend, and suggesting they also write “a trend towards non-significance, p=0.045”, point out that the p-value matching their design, analyses, and data-generating process is likely to be different from the one reported.

What can we do? A plan of action, from easy to hard:

[1] Take a chill pill, and consider p-values as just one of many outputs, without a special status (Vasishth & Gelman, 2021). Justify your choices in the methods section, unlike the traditional article in which tests pop up out of the blue in the results section, with irrational focus on statistical significance.

[2] Use bootstrap methods to derive more appropriate sampling distributions. Bootstrap methods, combined with robust estimators, can boost statistical power and help you answer more interesting questions. These methods also let you include preprocessing steps in the analyses, unlike standard parametric methods.

[3] Pre-register everything, along with careful justifications of models, pre-processing steps, and matching power simulations.

[4] Abandon the chase for statistical significance. Instead of focusing on finding effects, focus on a model-centric approach (Devezer & Buzbas, 2023). The goal is to contrast models that capture different hypotheses or mechanisms by assessing how they explain or predict data (Farrell & Lewandowsky, 2018; Gelman, Hill & Vehtari, 2020; James et al., 2021; McElreath, 2020; Yarkoni & Westfall, 2017). What is the explanatory power of the models? What is their predictive accuracy?

Code

https://github.com/GRousselet/blog-pwrong

References

DeBruine, L., & Barr, D. (2021). Understanding Mixed-Effects Models Through Data Simulation. Advances in Methods and Practices in Psychological Science, 4(1), 2515245920965119. https://doi.org/10.1177/2515245920965119

Devezer, B., & Buzbas, E. O. (2023). Rigorous exploration in a model-centric science via epistemic iteration. Journal of Applied Research in Memory and Cognition, 12(2), 189–194. https://doi.org/10.1037/mac0000121

Farrell, S., & Lewandowsky, S. (2018). Computational Modeling of Cognition and Behavior. Cambridge University Press. https://doi.org/10.1017/CBO9781316272503

Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and Other Stories. Cambridge University Press. https://doi.org/10.1017/9781139161879

Gelman, A., & Loken, E. (2014). The Statistical Crisis in Science. American Scientist, 102(6), 460–465. https://www.jstor.org/stable/43707868

Greenland, S. (2019). Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values. The American Statistician, 73(sup1), 106–114. https://doi.org/10.1080/00031305.2018.1529625

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R. Springer US. https://doi.org/10.1007/978-1-0716-1418-1

Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology. General, 142(2), 573–603. https://doi.org/10.1037/a0029146

Lindeløv, J. K. (2019). Reaction time distributions: An interactive overview. https://lindeloev.github.io/shiny-rt/

McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and STAN (2nd edn). Chapman and Hall/CRC. https://doi.org/10.1201/9780429029608

Meehl, P. E. (1997). The Problem is Epistemology, Not Statistics: Replace Significance Tests by Confidence Intervals and Quantify Accuracy of Risky Numerical Predictions. In L. L. H. Steiger Stanley A. Mulaik, James H. (Ed.), What If There Were No Significance Tests? Psychology Press. https://meehl.umn.edu/sites/meehl.umn.edu/files/files/169problemisepistemology.pdf

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46(2), 1738–1748. https://doi.org/10.1111/ejn.13610

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2021). The Percentile Bootstrap: A Primer With Step-by-Step Instructions in R. Advances in Methods and Practices in Psychological Science, 4(1), 2515245920911881. https://doi.org/10.1177/2515245920911881

Rousselet, G. A., & Wilcox, R. R. (2020). Reaction Times and other Skewed Distributions: Problems with the Mean and the Median. Meta-Psychology, 4. https://doi.org/10.15626/MP.2019.1630

Tukey, J. W., & McLaughlin, D. H. (1963). Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 25(3), 331–352. JSTOR. https://www.jstor.org/stable/25049278

Vasishth, S., & Gelman, A. (2021). How to embrace variation and accept uncertainty in linguistic and psycholinguistic data analysis. Linguistics, Linguistics, 59(5), 1311–1342. https://doi.org/10.1515/ling-2019-0051

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804. https://doi.org/10.3758/BF03194105

Wilcox, R. R. (2022). Introduction to Robust Estimation and Hypothesis Testing (5th edn). Academic Press.

Wilcox, R. R., & Rousselet, G. A. (2023). An Updated Guide to Robust Statistical Methods in Neuroscience. Current Protocols, 3(3), e719. https://doi.org/10.1002/cpz1.719

Yarkoni, T., & Westfall, J. (2017). Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning. Perspectives on Psychological Science, 12(6), 1100–1122. https://doi.org/10.1177/1745691617693393

Yuen, K. K. (1974). The Two-Sample Trimmed t for Unequal Population Variances. Biometrika, 61(1), 165–170. https://doi.org/10.2307/2334299

You get the symptoms of a replication crisis even when there isn’t one: considering power

Many methods have been proposed to assess the success of a replication (Costigan et al. 2024; Cumming & Maillardet 2006; Errington et al. 2021; LeBel et al. 2019; Ly et al. 2019; Mathur & VanderWeele 2020; Muradchanian et al., 2021; Patil et al. 2016; Spence & Stanley 2024; Verhagen & Wagenmakers 2014). The most common method, also used to determine if results from similar experimental designs are consistent across studies, is consistency in statistical significance: do the two studies report a p value less than some (usually) arbitrary threshold? This approach can be misleading for many reasons, for instance when two studies report the same group difference, but different confidence intervals: one including the null, the other one excluding it. Even though the group differences are the same, sampling error combined with statistical significance would lead us to conclude that the two studies disagree. There is a very nice illustration of the issue in Figure 1 of Amrhein, Greenland & McShane (2019).

More generally:

“if the alternative is correct and the actual power of two studies is 80%, the chance that the studies will both show P ≤ 0.05 will at best be only 0.80(0.80) = 64%; furthermore, the chance that one study shows P ≤ 0.05 and the other does not (and thus will be misinterpreted as showing conflicting results) is 2(0.80)0.20 = 32% or about 1 chance in 3.”
Greenland et al. 2016 (see also Amrhein, Trafimow & Greenland 2019)

So, in the long run, even if two studies always sample from the same population (even assuming all unmeasured sources of variability are the same across labs; Gelman et al. 2023), the literature would look like there is a replication crisis when none exists.

Let’s expand the single values from the example by Greenland et al. (2016) and plot the probability of finding consistent and inconsistent results as a function of power:
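The curves follow directly from the binomial logic of two independent studies with the same power:

```python
import numpy as np

# Probability of consistent / inconsistent significance decisions for two
# independent studies, as a function of their (common) true power.
power = np.arange(0.05, 1.0, 0.05)
p_both = power ** 2                        # both studies reach p <= 0.05
p_inconsistent = 2 * power * (1 - power)   # exactly one study does

# Greenland et al. (2016) example, power = 0.80:
# both significant: 0.80 * 0.80 = 0.64; conflicting: 2 * 0.80 * 0.20 = 0.32
```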

When deciding about consistency between experiments using the statistical significance criterion, the probability of reaching the correct decision depends on power, and unless power is very high, we will often be wrong.

In the previous figure, why consider power as low as 5%? If that seems unrealistic, a search for n=3 or n=4 in Nature and Science magazines will reveal recent experiments carried out with very small sample sizes in the biological sciences. Also, in psychology, interactions require much larger sample sizes than typically used, for instance when comparing correlation coefficients (Rousselet, Pernet & Wilcox, 2023). So very low power is still a real concern.

In practice, the situation is probably worse, because power analyses are typically performed assuming parametric assumptions are met; so the real power of a line of research will be lower than expected — see simulations in Rousselet & Wilcox (2020); Wilcox & Rousselet (2023); Rousselet, Pernet & Wilcox (2023).

To provide an illustration of the effect of skewness on power, and in turn, on replication success based on statistical significance, let’s use g-and-h distributions — see details in Rousselet & Wilcox (2020) and Yan & Genton (2019). Here we consider h=0 and vary g from 0 (normal distribution) to 1 (shifted lognormal distribution):
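For reference, samples from a g-and-h distribution can be generated by transforming standard normal draws; a Python sketch of the standard construction (see Yan & Genton, 2019):

```python
import numpy as np

def ghdist(n, g=0.0, h=0.0, rng=None):
    """Sample from a g-and-h distribution: g controls skewness,
    h controls tail thickness. g=h=0 gives a standard normal;
    g=1, h=0 gives a shifted lognormal.

    X = ((exp(gZ) - 1) / g) * exp(h * Z^2 / 2), with Z standard normal.
    """
    rng = np.random.default_rng(rng)
    z = rng.standard_normal(n)
    x = z if g == 0 else (np.exp(g * z) - 1) / g
    return x * np.exp(h * z**2 / 2)
```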

Now, let’s do a simulation in which we vary g, take samples of n=20, and perform a one-sample t-test on means, 10% trimmed means and 20% trimmed means. The code is on GitHub. To assess power, a constant is added to each sample, assuming a power of 80% when sampling from a standard normal population (g=h=0). Alpha is set to the arbitrary value of 0.05. The simulation includes 100,000 iterations.

Here are the results for false positives, showing a non-linear increase as a function of g, with the one-sample t-test much more affected when using means than trimmed means:

And the true positive results, showing lower power for trimmed means under normality, but much more resilience to increasing skewness than for the mean.

These results are well known (see for instance Rousselet & Wilcox, 2020). The novelty here is to consider the impact on the probability of a positive outcome in both experiments.

If we assume normality and determine our sample size to achieve 80% power in the long run, skewness can considerably lower the probability of observing two studies both showing p<0.05 if we employ a one-sample t-test on means. Trimmed means are much less affected by skewness. Other robust methods will perform even better (Wilcox & Rousselet, 2023).

In the same setting, here is the probability of a positive outcome in one experiment and a negative outcome in the other one:

Let’s consider h = 0.1, so that outliers are more likely than in the previous simulation:

In the presence of outliers, false positives increase even more with g for the mean:

And power is overall reduced for all methods:

This reduction in power leads to even lower probability of consistent results than in the previous simulation:

And here are the results on the probability of observing inconsistent results:

So in the presence of skewness and outliers, the situation is overall even worse than suggested by Greenland et al. (2016). For this and other reasons, consistency in statistical significance should not be used to infer the success of a replication.

References

Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical significance. Nature, 567(7748), 305. https://doi.org/10.1038/d41586-019-00857-9

Amrhein, V., Trafimow, D., & Greenland, S. (2019). Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication. The American Statistician, 73(sup1), 262–270. https://doi.org/10.1080/00031305.2018.1543137

Costigan, S., Ruscio, J., & Crawford, J. T. (2024). Performing Small-Telescopes Analysis by Resampling: Empirically Constructing Confidence Intervals and Estimating Statistical Power for Measures of Effect Size. Advances in Methods and Practices in Psychological Science, 7(1), 25152459241227865. https://doi.org/10.1177/25152459241227865

Cumming, G., & Maillardet, R. (2006). Confidence intervals and replication: Where will the next mean fall? Psychological Methods, 11(3), 217–227. https://doi.org/10.1037/1082-989X.11.3.217

Errington, T. M., Mathur, M., Soderberg, C. K., Denis, A., Perfito, N., Iorns, E., & Nosek, B. A. (2021). Investigating the replicability of preclinical cancer biology. eLife, 10, e71601. https://doi.org/10.7554/eLife.71601

Gelman, A., Hullman, J., & Kennedy, L. (2023). Causal Quartets: Different Ways to Attain the Same Average Treatment Effect. The American Statistician. https://www.tandfonline.com/doi/full/10.1080/00031305.2023.2267597

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

LeBel, E. P., Vanpaemel, W., Cheung, I., & Campbell, L. (2019). A Brief Guide to Evaluate Replications. Meta-Psychology, 3. https://doi.org/10.15626/MP.2018.843

Ly, A., Etz, A., Marsman, M., & Wagenmakers, E.-J. (2019). Replication Bayes factors from evidence updating. Behavior Research Methods, 51(6), 2498–2508. https://doi.org/10.3758/s13428-018-1092-x

Mathur, M. B., & VanderWeele, T. J. (2020). New Statistical Metrics for Multisite Replication Projects. Journal of the Royal Statistical Society Series A: Statistics in Society, 183(3), 1145–1166. https://doi.org/10.1111/rssa.12572

Muradchanian, J., Hoekstra, R., Kiers, H., & van Ravenzwaaij, D. (2021). How best to quantify replication success? A simulation study on the comparison of replication success metrics. Royal Society Open Science, 8(5), 201697. https://doi.org/10.1098/rsos.201697

Patil, P., Peng, R. D., & Leek, J. T. (2016). What should we expect when we replicate? A statistical view of replicability in psychological science. Perspectives on Psychological Science : A Journal of the Association for Psychological Science, 11(4), 539–544. https://doi.org/10.1177/1745691616646366

Rousselet, G., Pernet, C. R., & Wilcox, R. R. (2023). An introduction to the bootstrap: A versatile method to make inferences by using data-driven simulations. Meta-Psychology, 7. https://doi.org/10.15626/MP.2019.2058

Rousselet, G. A., & Wilcox, R. R. (2020). Reaction Times and other Skewed Distributions: Problems with the Mean and the Median. Meta-Psychology, 4. https://doi.org/10.15626/MP.2019.1630

Spence, J. R., & Stanley, D. J. (2024). Tempered Expectations: A Tutorial for Calculating and Interpreting Prediction Intervals in the Context of Replications. Advances in Methods and Practices in Psychological Science, 7(1), 25152459231217932. https://doi.org/10.1177/25152459231217932

Verhagen, J., & Wagenmakers, E.-J. (2014). Bayesian tests to quantify the result of a replication attempt. Journal of Experimental Psychology. General, 143(4), 1457–1475. https://doi.org/10.1037/a0036731

Wilcox, R. R., & Rousselet, G. A. (2023). An Updated Guide to Robust Statistical Methods in Neuroscience. Current Protocols, 3(3), e719. https://doi.org/10.1002/cpz1.719

Yan, Y., & Genton, M. G. (2019). The Tukey g-and-h distribution. Significance, 16(3), 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x

How to test for variance homogeneity

It’s simple: don’t do it! (Or do it but for other reasons — see below)

“A Levene’s test was used to test if the distributions have equal variances.”

“To establish normality, we used a Shapiro-Wilk test with p > 0.05; and for equal variances we used Levene’s test.”

This approach of checking assumptions and then deciding on the test to perform suffers from the limitations described in my previous post. In that post, I described why you should avoid normality tests, and how you could instead use them in a statistics class to cover a lot of core topics. The same goes for tests of variance homogeneity, which are often used in conjunction with normality tests. In particular, methods to detect differences in variance between groups can have low power and inaccurate type I error rates (false positives), which are the topics of this post. But how to capture variance differences is worth looking at, because in some situations a legitimate question is whether distributions differ in spread, or more specifically in variance. There is a wide range of applications of tests of variance differences in fields such as neuroscience, economics, genetics, and quality control (Conover et al. 1981; Li et al. 2015; Ritchie et al. 2018; Patil & Kulkarni, 2022). Thus, it is useful to consider the false positive rate and power of these methods in different situations.

Before exploring error rates for methods aimed at detecting variance differences, it's worth pointing out a paper by Zimmerman (2004), which looked at the error rates for a combo approach, in which a test of variance homogeneity is performed first, followed by a standard t-test or a Welch t-test depending on the outcome. This approach of conducting a preliminary check is not recommended: false positives (type I errors) and power of the combo depend on the relative and absolute sample sizes, as well as the magnitude of the variance difference between groups, leading to poor performance in realistic situations. What works best is to use methods for unequal variances by default (heteroscedastic methods); a recommendation echoed in more recent work (Delacre et al. 2017). In the presence of skewness or outliers, or both, power can be boosted by using trimmed means in conjunction with parametric or bootstrap methods (Wilcox & Rousselet, 2023). In general, it is a bad idea to rely on methods that make strong assumptions about the sampled populations and to hope for the best.
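The heteroscedastic default is cheap to implement. Here is a minimal Python sketch of Welch's t statistic with its Welch–Satterthwaite degrees of freedom (illustrative only; the p value uses a normal approximation to stay dependency-free, so it is accurate only for moderate to large degrees of freedom):

```python
import math
import statistics
from statistics import NormalDist

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.
    Unlike Student's t-test, this does not assume equal variances."""
    n1, n2 = len(x), len(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)  # sample variances
    se2 = v1 / n1 + v2 / n2  # squared standard error of the mean difference
    t = (statistics.mean(x) - statistics.mean(y)) / math.sqrt(se2)
    df = se2**2 / ((v1 / n1)**2 / (n1 - 1) + (v2 / n2)**2 / (n2 - 1))
    return t, df

def welch_p_normal_approx(t):
    """Two-sided p value via a normal approximation (fine for large df)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))
```

In practice there is no need to roll your own: R's t.test uses the Welch version by default, and in Python scipy.stats.ttest_ind(x, y, equal_var=False) does the same with the exact t distribution.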

Now let’s look at power and type I error rates for a few methods aimed at comparing variances. There are so many methods available it is hard to decide on a few candidate approaches. For instance, Conover et al. (1981) compared 56 tests of variance homogeneity. And many more tests have been proposed since then. Conover et al. (1981) recommended several tests: the Brown-Forsythe test, which is the same as Levene’s test, but using the median instead of the mean, and the Fligner-Killeen test. So we’ll use these three tests here. Zimmerman (2004) used Levene’s test. The three tests belong to a family of tests in which absolute or squared distances between observations and a measure of central tendency are compared using parametric (t-tests, ANOVAs) or rank-based methods (Conover et al. 1981; 2018). Levene’s test uses the mean to centre the distributions, whereas the Brown-Forsythe and Fligner-Killeen tests use the median. In addition, we’ll look at Bartlett’s test, which is known to be very sensitive to departures from normality, and a percentile bootstrap method to compare variances (Wilcox 2002). As we will see, all these tests perform poorly in the presence of skewness, for reasons explained in Conover et al. (2018), who go on to suggest better ones.

The code to reproduce the simulations and the figures is available on GitHub. The simulations involved 10,000 iterations, with 1,000 samples for the bootstrap method.
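The bootstrap condition resamples within each group. Wilcox's (2002) method is a modified percentile bootstrap, but the core logic can be sketched with a plain percentile bootstrap of the variance difference (an illustrative Python simplification of the R code on GitHub, not Wilcox's exact method):

```python
import random
import statistics

def bootstrap_var_diff_ci(x, y, nboot=1000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for var(x) - var(y).
    Reject equality of variances if the CI excludes zero."""
    rng = random.Random(seed)
    boot = []
    for _ in range(nboot):
        bx = [rng.choice(x) for _ in x]  # resample with replacement, within group
        by = [rng.choice(y) for _ in y]
        boot.append(statistics.variance(bx) - statistics.variance(by))
    boot.sort()
    lo = boot[int(alpha / 2 * nboot)]
    hi = boot[int((1 - alpha / 2) * nboot) - 1]
    return lo, hi
```

The simulations below suggest why the plain version needs modification: its actual alpha drifts away from the nominal level in small samples.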

Simulation 1: normality

Let’s first look at what happens under normality, an unlikely situation in many fields, but a useful benchmark. We consider only 2 groups. One population has a standard deviation of 1; the other population has a standard deviation that varies from 1 to 2 in steps of 0.1. Sample size varies from 10 to 100, in steps of 10. Here are the populations we sample from:

The results for the 5 methods are summarised in the next figure:

The results for SD=1 correspond to the type I error rate (false positives), which should be around 5%, because that is the arbitrary threshold I chose here. This is the case for Bartlett, but notice how Levene overshoots, and how BF and FK undershoot at the lowest sample sizes. The percentile bootstrap method is systematically under 0.05 at all sample sizes. This is a reminder that your actual alpha is not necessarily what you think it is. The differences among methods for SD=1 (false positives) are easier to see in this figure:

As the standard deviation in the second population increases, power increases for all methods, as expected. Notice the large sample sizes required to detect variance differences. How do the methods compare for the largest population difference? Bartlett performed best, FK came last:

Simulation 2: skewness

Now, what happens when the populations are skewed? Here we consider g-and-h distributions, with parameters g=1 and h=0, which gives the same shape as a lognormal distribution, but with median zero. The g-and-h distributions are nicely described and illustrated in Yan & Genton (2019). We vary the standard deviation from 1 to 2, leading to these populations:

This is not to suggest that such distributions will often be encountered in applied work. Rather, methods that can maintain false positive rates near nominal level and high power when dealing with these distributions should be able to handle a large variety of situations and can therefore be recommended. Also, fun fact, the distributions above have the same median (0) and skewness, but differ in mean and variance.
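g-and-h samples are obtained by transforming standard normal draws, following the Tukey transformation described by Yan & Genton (2019). A minimal Python sketch (with g=1 and h=0 this reduces to exp(z)-1, a shifted lognormal with median zero, matching the populations above):

```python
import math
import random

def ghdist(n, g=0.0, h=0.0, seed=None):
    """Sample from a g-and-h distribution by transforming N(0,1) draws.
    g controls asymmetry, h controls tail thickness; g=h=0 is standard normal."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        if g == 0.0:
            x = z  # limit of (exp(g*z)-1)/g as g -> 0
        else:
            x = (math.exp(g * z) - 1.0) / g
        out.append(x * math.exp(h * z * z / 2.0))
    return out
```

With g=1, the sample median stays near 0 while the mean is pulled to the right, the mean/variance difference mentioned above.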

Here is what we get when we sample from skewed populations:

Compared to results obtained under normality, the maximum power is lower (more illustrations of that later), but look at what happens to the false positives (SD=1): they skyrocket for Bartlett and Levene, and increase less dramatically for FK. The BF test handles skewness very well, whereas the percentile bootstrap is a bit liberal. Let’s compare the false positives of the different methods in one figure:

The same for maximum power (SD=2):

Obviously, the higher power of Bartlett and FK cannot be trusted given their huge false positive rates.

Simulation 3: false positives as a function of skewness

Here we sample from g-and-h distributions that vary from g=0 to g=1. So we explore the space between the two previous simulations parametrically. First, we consider false positives (no difference in variance). Here are the populations we sample from:

And the results for the 5 methods:

Comparison of the 5 methods for g=1:

BF and the percentile bootstrap tests are clearly more robust to skewness than the other methods. Bartlett is useless, but why is this test available in stat packages? In R, bartlett.test doesn’t even come with a warning.

Simulation 4: true positives as a function of skewness

We proceed as in simulation 3, but now the populations always differ by one standard deviation:

All the methods are strongly affected by skewness, with power dropping more for methods that are better at preserving the type I error rate at the nominal level (BF and percentile bootstrap):

A reminder that if you rely on power calculators that assume normality, your power is probably lower than you think.

Conclusion

None of the methods considered here were satisfactory. Only the Brown-Forsythe test and Wilcox’s bootstrap method controlled the type I error rate under non-normality, but their power was strongly affected by skewness. Conover et al. (2018) have proposed alternative methods to maintain high power in the presence of skewness. They recommend methods in which distributions are centred using the global mean or median (across groups), a simple step that improves performance considerably over the subtraction of the mean or median separately in each group (as used here and by default in R). See their discussion for an explanation of the lack of robustness to skewness of standard tests when individual group means or medians are used instead of the global ones. Conover et al. (2018) also considered the lognormal distribution, which corresponds to the g-and-h distribution with g=1 studied here, so their proposed methods should perform much better than the ones we considered. And there are plenty more tests on the market. For instance, Patil & Kulkarni (2022) have proposed a new method that promises high power in a range of situations. Please post a comment if you know of R packages that implement modern robust methods.

Finally, comparing variances is a very specific question. More broadly, one might be interested in differences in spread between distributions. For this more general question, other tools are available, relying on robust measures of scale (Wilcox, 2017, chapter 5) and quantile approaches (Rousselet et al. 2017). The distinction between variance and spread is important, because differences in variance could be driven by or masked by outliers or skewness, which might not affect a robust estimator of scale.

References

Conover, W.J., Johnson, M.E., & Johnson, M.M. (1981) A Comparative Study of Tests for Homogeneity of Variances, with Applications to the Outer Continental Shelf Bidding Data. Technometrics, 23, 351–361.

Conover, W.J., Guerrero-Serrano, A.J., & Tercero-Gómez, V.G. (2018) An update on ‘a comparative study of tests for homogeneity of variance.’ Journal of Statistical Computation and Simulation, 88, 1454–1469.

Delacre, M., Lakens, D., & Leys, C. (2017) Why Psychologists Should by Default Use Welch’s t-test Instead of Student’s t-test. International Review of Social Psychology, 30(1).

Li, X., Qiu, W., Morrow, J., DeMeo, D.L., Weiss, S.T., Fu, Y., & Wang, X. (2015) A Comparative Study of Tests for Homogeneity of Variances with Application to DNA Methylation Data. PLoS One, 10, e0145295.

Patil, K.P. & Kulkarni, H.V. (2022) An uniformly superior exact multi-sample test procedure for homogeneity of variances under location-scale family of distributions. Journal of Statistical Computation and Simulation, 92, 3931–3957.

Ritchie, S.J., Cox, S.R., Shen, X., Lombardo, M.V., Reus, L.M., Alloza, C., Harris, M.A., Alderson, H.L., Hunter, S., Neilson, E., Liewald, D.C.M., Auyeung, B., Whalley, H.C., Lawrie, S.M., Gale, C.R., Bastin, M.E., McIntosh, A.M., & Deary, I.J. (2018) Sex Differences in the Adult Human Brain: Evidence from 5216 UK Biobank Participants. Cereb Cortex, 28, 2959–2975.

Rousselet, G.A., Pernet, C.R., & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748.

Wilcox, R.R. (2002) Comparing the variances of two independent groups. Br J Math Stat Psychol, 55, 169–175.

Wilcox, R.R. (2017) Introduction to Robust Estimation and Hypothesis Testing, 4th edition. edn. Academic Press.

Wilcox, R.R. & Rousselet, G.A. (2023) An Updated Guide to Robust Statistical Methods in Neuroscience. Current Protocols, 3, e719.

Yan, Y. & Genton, M.G. (2019) The Tukey g-and-h distribution. Significance, 16, 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x

Zimmerman, D.W. (2004) A note on preliminary tests of equality of variances. Br J Math Stat Psychol, 57, 173–181.

Why normality tests are great…

…as a teaching example and should be avoided in research.

These statements are common in the psychology and neuroscience literature:

“In order to assess the normal distribution of the population in terms of age, BV% and CSF%, the Lilliefors-corrected Kolmogorov–Smirnov test was performed” (Porcu et al. 2019)

“The Kolmogorov–Smirnov-Test revealed a normal distribution (p = 0.82).” (Knolle et al. 2019)

“The distribution was not normal (P < 0.01 with the Shapiro–Wilk test).” (Beaudu-Lange et al. 2001)

“Assumptions of the one-way anova for normality were also confirmed with the Shapiro–Wilk test.” (Holloway et al. 2015)

“The Shapiro-Wilk-W-test (P < 0.05) revealed that all distributions could be assumed to be Gaussian as a prerequisite for the application of a t-test.” (Dicke et al. 2008)

“Given the non-normal distribution of such data (Shapiro–Wilk’s p < .05), we applied a nonparametric one-sample t test (the one-sample Wilcoxon signed rank test).” (Zapparoli et al. 2019)

A common recipe goes like this:

  • apply a normality test;
  • if p>0.05, conclude that the data are normally distributed and proceed with a parametric test;
  • if p<0.05, conclude that the data are not normally distributed and proceed with a non-parametric test (or transform the data to try to achieve normality).

It is a useful exercise or class activity to consider the statements above with the goal of identifying all the underlying issues. It could take several hours of teaching to do justice to the rich topics we need to cover to properly understand these issues.

Here is a succinct and non-exhaustive list of issues, with references for follow-up readings:

[1] In the general context of linear regression, the normality assumption applies to the residuals, not the marginal distributions. The main solution involves graphical checks of the residuals (Ernst & Albers, 2017; Vanhove, 2018).

Resources for graphical checks:

Visualization of Regression Models Using visreg

Visualizing regression model predictions

Extracting and visualizing tidy residuals from Bayesian models

Other solutions involve model comparison, to contrast models making different assumptions, and using models robust to assumption violations (Bürkner, 2017; Kruschke, 2013; Wilcox & Rousselet, 2018).

[2] The p value from standard frequentist tests, such as normality tests, cannot be used to accept the null (Rouder et al., 2016; Kruschke, 2018). Because the p value is computed assuming that the null is true, it cannot in turn be used to support the null: that would be circular. To find support for the null, we need an alternative hypothesis (to compute a Bayes Factor; Rouder et al., 2016; Wagenmakers et al., 2020) or a Region of Practical Equivalence (ROPE, to compute a test of equivalence; Freedman et al., 1984; Kruschke, 2018; Lakens, 2017; Campbell & Gustafson, 2022). Setting an alternative hypothesis is also crucial to get a consistent test (Rouder et al., 2016; Kruschke & Liddell, 2018). Tests of normality, like all Point Null Hypothesis Significance Tests (PNHST), are inconsistent: given alpha = 0.05, even if normality holds, 5% of tests will be positive no matter how large the sample size is.
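The equivalence route can be made concrete. Here is a minimal two one-sided tests (TOST) sketch for a single mean, using a large-sample normal approximation instead of the t distribution (an illustrative simplification; delta defines the ROPE bounds and is chosen by the analyst; see Lakens, 2017 for the exact procedures):

```python
import math
import statistics
from statistics import NormalDist

def tost_mean(x, delta, alpha=0.05):
    """Two one-sided tests: H0 |mu| >= delta vs H1 |mu| < delta.
    Returns (p value, equivalence claimed?) using a z approximation."""
    n = len(x)
    m = statistics.mean(x)
    se = statistics.stdev(x) / math.sqrt(n)
    p_lower = 1 - NormalDist().cdf((m + delta) / se)  # test of H0: mu <= -delta
    p_upper = NormalDist().cdf((m - delta) / se)      # test of H0: mu >= delta
    p = max(p_lower, p_upper)
    return p, p < alpha
```

Unlike a non-significant normality test, rejecting both one-sided hypotheses is positive evidence that the effect lies inside the ROPE.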

[3] Failure to reject (p>0.05) doesn’t mean data were sampled from a normal distribution. Another function could fit the data equally well (for instance a shifted lognormal distribution). This point follows directly from [2]. Since our alternative hypothesis is extremely vague, the possibility of another distribution being a plausible data generation process is ignored: the typical test considers only a point null hypothesis versus “anything else”. So when we ask a very vague question, we can only get a very vague answer (there is no free lunch in inference – Rouder et al., 2016).

[4] Failure to reject (p>0.05) could be due to low power. This is well known but usually ignored. Here are the results of simulations to illustrate this point. The code is available on GitHub. We sample from g-and-h distributions (Yan & Genton, 2019), which let us vary asymmetry (parameter g) and tail-thickness (parameter h, which also affects how peaky the distribution is). We start by varying g, keeping a constant h=0.

g-and-h populations used in the simulation in which we vary parameter g

Here are results for the Shapiro-Wilk test, based on a simulation with 10,000 iterations.

The Shapiro-Wilk test has low power unless the departure from normality is pronounced or sample sizes are large. With small departures from normality (say g=0.1, g=0.2), achieving high power won't be possible with typical sample sizes in psychology and neuroscience. For g=0, normality holds, and the proportion of positive tests sits at the expected 5% false positive rate.

The Kolmogorov-Smirnov test is dramatically less powerful than the Shapiro-Wilk test (Yap & Sim, 2011).

What happens if we sample from symmetric distributions that are more prone to outliers than the normal distribution? By varying the h parameter, keeping a constant g=0, we can consider distributions that are progressively more kurtotic than the normal distribution.

g-and-h populations used in the simulation in which we vary parameter h

Are the tests considered previously able to detect such deviations from normality? Here is how the Shapiro-Wilk test behaves.

And here are the catastrophic results for the Kolmogorov-Smirnov test.

[5] As the sample size increases, progressively smaller and smaller deviations from normality can be detected, eventually reaching absurd levels of precision, such that tiny differences of no practical relevance will be flagged. This point applies to all PNHST and again follows from [2]: because in PNHST no alternative is considered, tests are biased against the null (Rouder et al., 2016; Wagenmakers et al., 2020). Even when p<0.05, contrasting two hypotheses could reveal that a normal distribution and a non-normal distribution are equally plausible, given our data. Also, because PNHST is not consistent, even when the null is true, 5% of tests will be positive.

[6] Choosing a model conditional on the outcome of a preliminary check affects sampling distributions and thus p values and confidence intervals. The same problem arises when doing balance tests. If a t-test is conditional on a normality test, the p value of the t-test will be different (but unknown) from the one obtained if a t-test is performed without a preliminary check. That's because p values depend on sampling distributions of imaginary experiments, which in turn depend on sampling and testing intentions (Wagenmakers, 2007; Kruschke & Liddell, 2018). This dependence can make p values difficult to interpret, because unless we simulate the world of possibilities that led to our p value, the sampling distribution for our statistic (say the t statistic) is unknown.

[7] When non-normality is detected or suspected, a classic alternative to the two-sample t-test is the Wilcoxon-Mann-Whitney (WMW) test. However, in general different tests or models address different hypotheses — they are not interchangeable. For instance, the WMW U statistic is related to the distribution of all pairwise differences between two independent groups; unlike the t-test, it doesn't involve a comparison of the marginal means. Similarly, if instead of the mean we use a trimmed mean, a robust measure of central tendency, our inferences are about the population trimmed mean, not the population mean.
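The link between the U statistic and pairwise differences is easy to make explicit: U counts, over all (x, y) pairs, how often one group beats the other, so U divided by the number of pairs estimates P(X > Y), the probability of superiority. An illustrative Python sketch:

```python
def mann_whitney_U(x, y):
    """U statistic for group x: number of (x_i, y_j) pairs with x_i > y_j,
    counting ties as 1/2."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5  # ties split evenly between the two groups
    return u

def prob_superiority(x, y):
    """Estimate of P(X > Y): 0.5 means neither group tends to dominate."""
    return mann_whitney_U(x, y) / (len(x) * len(y))
```

This makes the point of [7] concrete: the hypothesis tested concerns pairwise ordering, not a difference between marginal means.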

[8] In most cases, researchers know the answer to the normality question before conducting the experiment. For instance, we know that reaction times, accuracy and questionnaire data are not normally distributed. Testing for normality when we already know the answer is unnecessary and falls into the category of tautological tests. Since we know the answer in most situations, it is better practice to use appropriate models and drop the checks altogether. For instance, accuracy data follow beta-binomial distributions (Jaeger, 2008; Kruschke, 2014); questionnaire data can be modelled using ordinal regression (Liddell & Kruschke, 2018; Bürkner & Vuorre, 2019; Taylor et al., 2022); reaction time data can be modelled using several families of skewed distributions (Lindeløv, 2019).
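For instance, overdispersed accuracy data are straightforward to simulate: each participant gets their own accuracy drawn from a beta distribution, and their number of correct trials is then binomial. A minimal Python sketch with hypothetical parameters (a=8, b=2 give a mean accuracy of 0.8):

```python
import random

def beta_binomial_sample(m, n_trials, a, b, seed=None):
    """m participants: accuracy p_i ~ Beta(a, b), then the number of correct
    trials ~ Binomial(n_trials, p_i). The counts are overdispersed relative
    to a single binomial with the same mean."""
    rng = random.Random(seed)
    counts = []
    for _ in range(m):
        p = rng.betavariate(a, b)                           # participant accuracy
        k = sum(rng.random() < p for _ in range(n_trials))  # correct trials
        counts.append(k)
    return counts
```

Fitting a plain binomial (or running a t-test on percent correct) ignores the participant-level variability that this two-stage process builds in.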

References

Bürkner, P.-C. (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01

Bürkner, P.-C. & Vuorre, M. (2019) Ordinal Regression Models in Psychology: A Tutorial. Advances in Methods and Practices in Psychological Science, 2, 77–101. https://journals.sagepub.com/doi/full/10.1177/2515245918823199

Campbell, H. & Gustafson, P. (2021) re:Linde et al. (2021): The Bayes factor, HDI-ROPE and frequentist equivalence tests can all be reverse engineered – almost exactly – from one another. https://arxiv.org/abs/2104.07834

Ernst AF, Albers CJ. 2017. Regression assumptions in clinical psychology research practice—a systematic review of common misconceptions. PeerJ 5:e3323 https://doi.org/10.7717/peerj.3323

Freedman, L.S., Lowe, D., & Macaskill, P. (1984) Stopping rules for clinical trials incorporating clinical opinion. Biometrics, 40, 575–586.

Jaeger, T.F. (2008) Categorical Data Analysis: Away from ANOVAs (transformation or not) and towards Logit Mixed Models. J Mem Lang, 59, 434–446.

Kruschke, J.K. (2013) Bayesian estimation supersedes the t test. J Exp Psychol Gen, 142, 573–603.

Kruschke, J.K. (2014) Doing Bayesian Data Analysis, 2nd Edition. edn. Academic Press.

Kruschke, J.K. (2018) Rejecting or Accepting Parameter Values in Bayesian Estimation. Advances in Methods and Practices in Psychological Science, 1, 270–280.

Kruschke, J.K. & Liddell, T.M. (2018) The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychon Bull Rev, 25, 178–206.

Lakens, D. (2017). Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177

Liddell, T.M. & Kruschke, J.K. (2018) Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328–348.

Lindeløv, J.K. (2019) Reaction time distributions: an interactive overview
https://lindeloev.github.io/shiny-rt/

Rouder, J.N., Morey, R.D., Verhagen, J., Province, J.M. and Wagenmakers, E.-J. (2016), Is There a Free Lunch in Inference?. Top Cogn Sci, 8: 520-547. https://doi.org/10.1111/tops.12214

Taylor, J.E., Rousselet, G.A., Scheepers, C. et al. Rating norms should be calculated from cumulative link mixed effects models. Behav Res (2022). https://doi.org/10.3758/s13428-022-01814-7


Vanhove (2018) Checking model assumptions without getting paranoid. https://janhove.github.io/analysis/2018/04/25/graphical-model-checking

Wagenmakers, E.-J. (2007) A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.

Wagenmakers, E.-J., Lee, M.D., Rouder, J.N., & Morey, R.D. (2020) The Principle of Predictive Irrelevance or Why Intervals Should Not be Used for Model Comparison Featuring a Point Null Hypothesis. In Gruber, C.W. (ed), The Theory of Statistics in Psychology: Applications, Use, and Misunderstandings. Springer International Publishing, Cham, pp. 111–129.

Wilcox RR, Rousselet GA. A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci. 2018 Jan 22;82:8.42.1-8.42.30. doi: 10.1002/cpns.41. PMID: 29357109.

Yan, Y. & Genton, M.G. (2019) The Tukey g-and-h distribution. Significance, 16, 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x

Yap, B.W. & Sim, C.H. (2011) Comparisons of various types of normality tests. Journal of Statistical Computation and Simulation, 81, 2141–2155. DOI: 10.1080/00949655.2010.520163

Planning for measurement precision, an alternative to power analyses

When we estimate power curves, we ask this question: given some priors about the data generating process, the nature of the effect and the measurement variance, what is the probability of detecting an effect with a given statistical test (say using an arbitrary p<0.05 threshold), for various sample sizes and effect sizes? While there are very good reasons to focus on power estimation, it is not the only or the most important aspect of an experimental procedure to consider (Gelman & Carlin, 2014). Indeed, finding the number of observations needed so that we get p<0.05 in say 87% of experiments is not the most exciting part of designing an experiment.

The relevant question is not “What is the power of a test?” but rather is “What might be expected to happen in studies of this size?” (Gelman & Carlin, 2014)

A related but more important question is that of measurement precision: given some priors and a certain number of participants, how close can we get to the unknown population value (Maxwell et al., 2008; Schönbrodt & Perugini, 2013; Peters & Crutzen, 2018; Trafimow, 2019)? Not surprisingly, measurement precision depends on sample size. As we saw in previous posts, sampling distributions get narrower with increasing sample sizes:

And with narrower sampling distributions, measurement precision increases. To illustrate, let’s consider an example from a lexical decision task – hundreds of reaction times (RT) were measured in hundreds of participants who had to distinguish between words and non-words presented on a computer screen.

Here are examples of RT distributions from 100 participants for each condition:

Reaction time distributions from 100 participants. Participants were randomly selected among 959. Distributions are shown for the same participants (colour coded) in the Word (A) and Non-Word (B) conditions.

If we save the median of each distribution, for each participant and condition, we get these positively skewed group level distributions:

[Figure: positively skewed group-level distributions of median reaction times in the Word and Non-Word conditions]

The distribution of pairwise differences between medians is also positively skewed:

[Figure: distribution of pairwise differences between the Word and Non-Word medians]

Notably, most participants have a positive difference: 96.4% of participants are faster in the Word than the Non-Word condition – a potential case of stochastic dominance (Rouder & Haaf, 2018; see also this summary blog post).

Now let's say we want to estimate the group difference between conditions. Because of the skewness at each level of analysis (within and across participants), we estimate the central tendency at each level using the median: that is, we compute the median for each participant and each condition, then, for each condition, compute the median of these medians across participants (a more detailed assessment could be obtained by performing hierarchical modelling or multiple quantile estimation, for instance).

Then we can assess measurement precision at the group level by performing a multi-level simulation. In this simulation, we can ask, for instance, how often the group estimate is no more than 10 ms from the population value across many experiments. To simplify, in each iteration of the simulation, we draw 200 trials per condition and participant, compute the median and save the Non-Word – Word difference. Group estimation of the difference is then based on a random sample of 10 to 300 participants, with the group median computed across participants' differences between medians. Because the dataset is very large at the two levels of analysis, we can pretend we have access to the population values, and define them by first computing, for each condition, the median across all available trials for each participant, and second by computing, across all participants, the median of the pairwise differences.
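For readers who want the gist of such a simulation without the full dataset, here is a simplified Python sketch (the actual simulation, in R on figshare, resamples the lexical decision data at both levels; here participant-level differences are simply drawn from a normal distribution centred on the 78 ms population difference, with a made-up between-participant spread, so the numbers are illustrative only):

```python
import random
import statistics

def precision(pop_value, part_sd, n_participants, max_dist,
              niter=2000, seed=42):
    """Proportion of simulated experiments in which the group median
    lands within max_dist of the population value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(niter):
        # one difference score (Non-Word - Word) per participant
        diffs = [rng.gauss(pop_value, part_sd) for _ in range(n_participants)]
        if abs(statistics.median(diffs) - pop_value) <= max_dist:
            hits += 1
    return hits / niter
```

Comparing, say, precision(78, 40, 10, 10) with precision(78, 40, 100, 10) reproduces the qualitative pattern below: precision improves with the number of participants, but with diminishing returns.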

Having defined population values (the truth we’re trying to estimate, here a group difference of 78 ms), we can calculate measurement precision as the proportion of experiments in which the group estimate is no more than X ms from the population value, with X varying from 5 to 40 ms. Here are the results:

Group measurement precision for the difference between the Non-Word and Word conditions. Measurement precision was estimated by using a simulation with 10,000 iterations, 200 trials per condition and participant, and varying numbers of participants.

Not surprisingly, the proportion of estimates close to the population value increases with the number of participants. More interestingly, the relationship is non-linear: a larger gain in precision is achieved by increasing the sample size from 10 to 20 participants than from 90 to 100.

The results also let us answer useful questions for planning experiments (see the black arrows in the above figure):

  • So that in 70% of experiments the group estimate of the median is no more than 10 ms from the population value, we need to test at least 56 participants.

  • So that in 90% of experiments the group estimate of the median is no more than 20 ms from the population value, we need to test at least 38 participants.

Obviously, this is just an example, about a narrow problem related to lexical decisions. Other aspects could be considered too, for instance the width of the confidence intervals (Maxwell, Kelley & Rausch, 2008; Peters & Crutzen, 2017; Rothman & Greenland, 2018). And for your particular case, most likely, you won’t have access to a large dataset from which to perform a data driven simulation. In this case, you can get estimates about plausible effect sizes and their variability from various sources (Gelman & Carlin 2014):

  • related data;

  • (systematic) literature review;

  • meta-analysis;

  • outputs of a hierarchical model;

  • modelling.

To model a range of plausible effect sizes and their consequences on repeated measurements, you need priors about a data generating process and how distributions differ between conditions. For instance, you could use exGaussian distributions to simulate RT data. For research on new effects, it is advised to consider a large range of potential effects, with their plausibility informed by the literature and psychological/biological constraints.  

Although relying on the literature alone can lead to over-optimistic expectations because of the dominance of small n studies and a bias towards significant results (Yarkoni 2009; Button et al. 2013), methods are being developed to overcome these limitations (Anderson, Kelley & Maxwell, 2017). In the end, the best cure against effect size over-estimation is a combination of pre-registration/registered reports (to diminish literature bias) and data sharing (to let anyone do their own calculations and meta-analyses).

Code

The code is on figshare: the simulation can be reproduced using the flp_sim_precision notebook, the illustrations of the distributions can be reproduced using flp_illustrate_dataset.

Shiny app by Malcolm Barrett (@malco_barrett)

https://malcolmbarrett.shinyapps.io/precisely/

References

Anderson, S.F., Kelley, K. & Maxwell, S.E. (2017) Sample-Size Planning for More Accurate Statistical Power: A Method Adjusting Sample Effect Sizes for Publication Bias and Uncertainty. Psychol Sci, 28, 1547-1562.

Bland, J.M. (2009) The tyranny of power: is there a better way to calculate sample size? BMJ, 339, b3985. https://www.bmj.com/content/339/bmj.b3985

Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S. & Munafo, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews. Neuroscience, 14, 365-376.

Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Meot, A., Augustinova, M. & Pallier, C. (2010) The French Lexicon Project: lexical decision data for 38,840 French words and 38,840 pseudowords. Behav Res Methods, 42, 488-496.

Gelman, A. & Carlin, J. (2014) Beyond Power Calculations: Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspect Psychol Sci, 9, 641-651.

Maxwell, S.E., Kelley, K. & Rausch, J.R. (2008) Sample size planning for statistical power and accuracy in parameter estimation. Annu Rev Psychol, 59, 537-563.

Peters, G.-J.Y. & Crutzen, R. (2017) Knowing exactly how effective an intervention, treatment, or manipulation is and ensuring that a study replicates: accuracy in parameter estimation as a partial solution to the replication crisis. PsyArXiv. doi:10.31234/osf.io/cjsk2.

Rothman, K.J. & Greenland, S. (2018) Planning Study Size Based on Precision Rather Than Power. Epidemiology, 29, 599-603.

Rouder, J.N. & Haaf, J.M. (2018) Power, Dominance, and Constraint: A Note on the Appeal of Different Design Traditions. Advances in Methods and Practices in Psychological Science, 1, 19-26.

Rousselet, G.A. & Wilcox, R.R. (2018) Reaction times and other skewed distributions: problems with the mean and the median. bioRxiv. doi: https://doi.org/10.1101/383935

Rousselet, G.; Wilcox, R. (2018): Reaction times and other skewed distributions: problems with the mean and the median. figshare. Fileset. https://doi.org/10.6084/m9.figshare.6911924.v1

Schönbrodt, F.D. & Perugini, M. (2013) At what sample size do correlations stabilize? J Res Pers, 47, 609-612.

Trafimow, D. (2019) Five Nonobvious Changes in Editorial Practice for Editors and Reviewers to Consider When Evaluating Submissions in a Post p < 0.05 Universe, The American Statistician, 73:sup1, 340-345, DOI: 10.1080/00031305.2018.1537888

Yarkoni, T. (2009) Big Correlations in Little Studies: Inflated fMRI Correlations Reflect Low Statistical Power‚ Commentary on Vul et al. (2009). Perspectives on Psychological Science, 4, 294-298.

Power estimation for correlation analyses

Following the previous posts on small n correlations [post 1][post 2][post 3], in this post we’re going to consider power estimation (if you do not care about power, but you’d rather focus on estimation, this post is for you). 

To get started, let’s look at examples of n=1000 samples from bivariate populations with known correlations (rho), with rho increasing from 0.1 to 0.9 in steps of 0.1. For each rho, we draw a random sample and plot Y as a function of X. The variance of Y does not depend on X – there is homoscedasticity. Later we will look at heteroscedasticity, when the variance of Y varies with X.
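Correlated normal samples like these are easy to generate without any specialised library. Here is a minimal Python sketch (the post’s own simulations are in R; `bivariate_sample` is a name I made up for illustration):

```python
import math
import random

def bivariate_sample(n, rho, rng=random):
    # Draw n pairs with population correlation rho and homoscedasticity:
    # y = rho*x + sqrt(1 - rho^2)*e, with x and e independent standard normals,
    # so var(y|x) does not depend on x.
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * a + math.sqrt(1 - rho * rho) * rng.gauss(0, 1) for a in x]
    return x, y

x, y = bivariate_sample(1000, 0.5, random.Random(0))
```

The sample correlation of `x` and `y` then fluctuates around rho, which is the starting point for all the power simulations below.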

demo_homo_dist

For the same distributions illustrated in the previous figure, we compute the proportion of positive Pearson’s correlation tests for different sample sizes. This gives us power curves (here based on simulations with 50,000 samples). We also include rho = 0 to determine the proportion of false positives.

figure_power_homo

Power increases with sample size and with rho. When rho = 0, the proportion of positive tests is the proportion of false positives. It should be around 0.05 for a test with alpha = 0.05. This is the case here, as Pearson’s correlation is well behaved for bivariate normal data.

For a given expected population correlation and a desired long run power value, we can use interpolation to find out the matching sample size.

To achieve at least 80% power given an expected population rho of 0.4, the minimum sample size is 46 observations.

To achieve at least 90% power given an expected population rho of 0.3, the minimum sample size is 118 observations.

figure_power_homo_arrows

Alternatively, for a given sample size and a desired power, we can determine the minimum effect size we can hope to detect. For instance, given n = 40 and a desired power of at least 90%, the minimum effect size we can detect is 0.49.
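The power-to-sample-size mapping can also be approximated analytically, without simulation, using the Fisher z transformation. Here is a Python sketch (the post’s simulations are in R, and `n_for_power` is my own name); because this is a large-sample approximation rather than the exact simulation, the numbers differ slightly from those above:

```python
import math
from statistics import NormalDist

def n_for_power(rho, power, alpha=0.05):
    # Fisher z approximation: atanh(r) is roughly normal with SE 1/sqrt(n-3),
    # so solve atanh(rho)*sqrt(n-3) = z_{1-alpha/2} + z_{power} for n.
    z = NormalDist().inv_cdf
    n = ((z(1 - alpha / 2) + z(power)) / math.atanh(rho)) ** 2 + 3
    return math.ceil(n)

print(n_for_power(0.4, 0.80))  # 47, close to the 46 found by simulation
print(n_for_power(0.3, 0.90))  # 113, a bit below the simulation-based 118
```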

So far, we have only considered situations where we sample from bivariate normal distributions. However, Wilcox (2012 p. 444-445) describes 6 aspects of data that affect Pearson’s r:

  • outliers

  • the magnitude of the slope around which points are clustered

  • curvature

  • the magnitude of the residuals

  • restriction of range

  • heteroscedasticity

The effect of outliers on Pearson’s and Spearman’s correlations is described in detail in Pernet et al. (2012) and Rousselet et al. (2012).

Next we focus on heteroscedasticity. Let’s look at Wilcox’s heteroscedasticity example (2012, p. 445). If we correlate variable X with variable Y, heteroscedasticity means that the variance of Y depends on X. Wilcox considers this example:

“X and Y have normal distributions with both means equal to zero. […] X and Y have variance 1 unless |X|>0.5, in which case Y has standard deviation |X|.”

Here is an example of such data:

demo_wilcox_dist

Next, Wilcox (2012) considers the effect of this heteroscedastic situation on false positives. We superimpose results for the homoscedastic case for comparison. In the homoscedastic case, as expected for a test with alpha = 0.05, the proportion of false positives is very close to 0.05 at every sample size. In the heteroscedastic case, instead of 5%, the proportion of false positives is between 12% and 19%. The proportion of false positives actually increases with sample size! That’s because the standard T statistic associated with Pearson’s correlation assumes homoscedasticity, so the formula is incorrect when there is heteroscedasticity.

figure_power_hetero_wilcox

As a consequence, when Pearson’s test is positive, it doesn’t always imply the existence of a correlation. There could be dependence due to heteroscedasticity, in the absence of a correlation.
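Dependence without correlation is easy to see in action. Here is a Python sketch of Wilcox’s example (the post’s simulations are in R; this version uses a normal approximation to the t distribution, which is fine for n = 100, and the function names are my own):

```python
import math
import random
from statistics import NormalDist

random.seed(1)

def wilcox_sample(n):
    # Wilcox (2012, p. 445): X, Y standard normal, except that the
    # standard deviation of Y is |X| whenever |X| > 0.5.
    xs, ys = [], []
    for _ in range(n):
        x = random.gauss(0, 1)
        sd = abs(x) if abs(x) > 0.5 else 1.0
        xs.append(x)
        ys.append(random.gauss(0, sd))
    return xs, ys

def pearson_pval(x, y):
    # Standard T test for Pearson's r; two-sided p value via a normal
    # approximation to the t distribution (adequate for n >= 100).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    return 2 * (1 - NormalDist().cdf(abs(t)))

nsim, n = 2000, 100
fp = sum(pearson_pval(*wilcox_sample(n)) < 0.05 for _ in range(nsim)) / nsim
print(fp)  # well above the nominal 0.05, despite zero correlation
```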

Let’s consider another heteroscedastic situation, in which the variance of Y increases linearly with X. This could correspond for instance to situations in which cognitive performance or income are correlated with age – we might expect the variance amongst participants to increase with age.

We keep rho constant at 0.4 and increase the maximum variance from 1 (homoscedastic case) to 9. That is, the variance of Y increases linearly from 1 to the maximum variance as a function of X.

demo_hetero_dist

For rho = 0, we can compute the proportion of false positives as a function of both sample size and heteroscedasticity. In the next figure, variance refers to the maximum variance. 

figure_power_hetero_rho0

From 0.05 for the homoscedastic case (max variance = 1), the proportion of false positives increases to 0.07-0.08 for a max variance of 9. This relatively small increase in the number of false positives could have important consequences if hundreds of labs are engaged in fishing expeditions and they publish everything with p<0.05. However, it seems we shouldn’t worry much about linear heteroscedasticity as long as sample sizes are sufficiently large and we report estimates with appropriate confidence intervals. An easy way to build confidence intervals when there is heteroscedasticity is to use the percentile bootstrap (see Pernet et al. 2012 for illustrations and Matlab code).
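For completeness, here is a minimal Python sketch of a percentile bootstrap confidence interval for Pearson’s r (Pernet et al. 2012 provide a full Matlab implementation; `pb_ci` is a name of my own, and the demo data are made up):

```python
import math
import random

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pb_ci(x, y, nboot=2000, alpha=0.05, seed=42):
    # Percentile bootstrap: resample (x, y) PAIRS with replacement, then read
    # the confidence interval off the quantiles of the bootstrap r values.
    # No homoscedasticity assumption is involved.
    rng = random.Random(seed)
    n = len(x)
    rs = []
    for _ in range(nboot):
        idx = [rng.randrange(n) for _ in range(n)]
        rs.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    rs.sort()
    return rs[int(alpha / 2 * nboot)], rs[int((1 - alpha / 2) * nboot) - 1]

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(50)]
y = [xi + rng.gauss(0, 1) * (1 + abs(xi)) for xi in x]  # var(y) grows with |x|
lo, hi = pb_ci(x, y)
print(round(lo, 2), round(hi, 2))
```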

Finally, we can run the same simulation for rho = 0.4. Power progressively decreases with increasing heteroscedasticity. Put another way, with larger heteroscedasticity, larger sample sizes are needed to achieve the same power.

figure_power_hetero_rho04

We can zoom in:

figure_power_hetero_rho04_zoom

The vertical bars mark an increase of roughly 13 observations needed to keep power at 0.8 when going from a max variance of 1 (homoscedastic case) to 9. This decrease in power can be avoided by using the percentile bootstrap or robust correlation techniques, or both (Wilcox, 2012).

Conclusion

The results presented in this post are based on simulations. You could also use a sample size calculator for correlation analyses – for instance this one.

But running simulations has huge advantages. For instance, you can compare multiple estimators of association in various situations. In a simulation, you can also include as much information as you have about your target populations. For instance, if you want to correlate brain measurements with response times, there might be large datasets you could use to perform data-driven simulations (e.g. UK biobank), or you could estimate the shape of the sampling distributions to draw samples from appropriate theoretical distributions (maybe a gamma distribution for brain measurements and an exGaussian distribution for response times).

Simulations also put you in charge, instead of relying on a black box, which most likely will only cover Pearson’s correlation in ideal conditions, and not robust alternatives when there are outliers or heteroscedasticity or other potential issues.

The R code to reproduce the simulations and the figures is on GitHub.

References

Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.

Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Small n correlations + p values = disaster

Previously, we saw that with small sample sizes, correlation estimation is very uncertain, which implies that small n correlations cannot be trusted: the observed value in any experiment could be very far from the population value, and the sign could be wrong too. In addition to the uncertainty associated with small sample sizes, the selective reporting of results based on p values < 0.05 (or some other threshold) can lead to massively inflated correlation estimates in the literature (Yarkoni, 2009 ☜ if you haven’t done so, you really should read this excellent paper).

Let’s illustrate the problem (code is on GitHub). First, we consider a population rho = 0. Here is the sampling distribution as a function of sample size, as we saw in an earlier post. 

figure_rpval_ori

Figure 1: Sampling distribution for rho=0.

Now, here is the sampling distribution conditional on p < 0.05. The estimates are massively inflated and the problem gets worse with smaller sample sizes, because the smaller the sample size, the larger the correlations must be by chance for them to be significant.

figure_rpval_cond

Figure 2: Sampling distribution for rho=0, given p<0.05

So no, don’t get too excited when you see a statistically significant correlation in a paper…
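The selection effect is easy to reproduce. Here is a Python sketch (the post’s code is in R): we simulate experiments with rho = 0 and n = 10, then keep only the “significant” correlations, using the known critical value |r| > 0.632 for a two-sided test at alpha = 0.05 with df = 8:

```python
import math
import random

random.seed(2)

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

n, nsim = 10, 5000
r_crit = 0.632  # critical |r| for p < 0.05, two-sided, n = 10 (df = 8)
rs = [pearson_r([random.gauss(0, 1) for _ in range(n)],
                [random.gauss(0, 1) for _ in range(n)]) for _ in range(nsim)]
sig = [r for r in rs if abs(r) > r_crit]
print(sum(rs) / len(rs))                     # near 0: unbiased before selection
print(sum(abs(r) for r in sig) / len(sig))   # massively inflated after selection
```

By construction, every published-looking correlation in `sig` exceeds 0.632 in absolute value, even though the population value is exactly zero.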

Let’s do the same exercise when the population correlation is relatively large. With rho = 0.4, the sampling distribution looks like this:

figure_rpval_ori_04

Figure 3: Sampling distribution for rho=0.4.

If we report only those correlations associated with p < 0.05, the distribution looks like this:

figure_rpval_cond_04

Figure 4: Sampling distribution for rho=0.4, given p<0.05

Again, with small sample sizes, the estimates are inflated, albeit in the correct direction. There is nevertheless a small number of large negative correlations (see the small purple bump around -0.6 to -0.8). Indeed, in 0.77% of simulations, even though the population value was 0.4, a large negative correlation with p < 0.05 was obtained.

Small n correlations cannot be trusted

This post illustrates two important effects of sample size on the estimation of correlation coefficients: lower sample sizes are associated with increased variability and lower probability of replication. This is not specific to correlations, but here we’re going to have a detailed look at what it means when using the popular Pearson’s correlation (similar results are obtained using Spearman’s correlation, and the same problems arise with regression). The R code is available on GitHub.


UPDATE: 2018-06-02

In the original post, I mentioned non-linearities in some of the figures. Jan Vanhove replied on Twitter that he was not getting any, and suggested a different code snippet. I’ve updated the simulations using his code, and now the non-linearities are gone! So thanks Jan!

Johannes Algermissen mentioned on Twitter that his recent paper covered similar issues. Have a look! He also reminded me about this recent paper that makes points very similar to those in this blog.

Gjalt-Jorn Peters mentioned on Twitter that “you can also use the Pearson distribution in package suppdists. Also see pwr.confintR to compute the required sample size for a given desired accuracy in parameter estimation (AIPE), which can also come in handy when planning studies”.

Wolfgang Viechtbauer‏ mentioned on Twitter “that one can just compute the density of r directly (no need to simulate). For example: link. Then everything is nice and smooth”.

UPDATE: 2018-06-30

Frank Harrell wrote on Twitter: “I’ll also push the use of precision of correlation coefficient estimates in justifying sample sizes. Need n > 300 to estimate r. BBR Chapter 8”


Let’s start with an example, shown in the figure below. It is common to see such an array of scatterplots in articles (though confidence intervals are typically not reported). In my experience, the accompanying description goes like this:

“There was a significant correlation in group/condition 5 (p < 0.05); however, there was no association in the other groups/conditions (p>0.05).”

Of course there are many problems with this description:

– there is no mention of estimator (Pearson correlation is the default, but this should be explicit);
– there is no acknowledgment that Pearson correlation is sensitive to features of the data other than the presence of an association (same goes for OLS regression);
– there is no control for multiple comparisons;
– correlations are not explicitly compared – an example of interaction fallacy;
– there is no acknowledgment that p values near 0.05 typically only provide weak evidence against the null;
– authors have committed the fallacy of assuming that the lack of evidence (p>0.05) is the same as evidence for a lack of effect;
– …

Finally, to bring us back to the topic of this blog: researchers tend to forget that promising looking correlations are easily obtained by chance when sample sizes are small.

unnamed-chunk-4-1

The data in the scatterplots were sampled from a bivariate population with zero correlation and a bit of skewness to create more realistic examples (you can play with the code to see what happens in different situations). I suspect a lot of published correlations might well fall into that category. Nothing new here, false positives and inflated effect sizes are a natural outcome of small n experiments, and the problem gets worse with questionable research practices and incentives to publish positive new results. 

To understand the problem with estimation from small n experiments, we can perform a simulation in which we draw samples of different sizes from a normal population with a known Pearson’s correlation (rho) of zero. The sampling distributions of the estimates of rho for different sample sizes look like this: 

figure_sampling_distributions

Sampling distributions tell us about the behaviour of a statistic in the long run, if we did many experiments. Here, with increasing sample sizes, the sampling distributions are narrower, which means that in the long run, we get more precise estimates. However, a typical article reports only one correlation estimate, which could be completely off. So what sample size should we use to get a precise estimate? The answer depends on:

  • the shape of the univariate and bivariate distributions (if outliers are common, consider robust methods);
  • the expected effect size (the larger the effect, the fewer trials are needed – see below);
  • the precision we want to afford.

For the sampling distributions in the previous figure, we can ask this question for each sample size:

What is the proportion of correlation estimates that are within +/- a certain number of units from the true population correlation? For instance:

  • for 70% of estimates to be within +/- 0.1 of the true correlation value (between -0.1 and 0.1), we need at least 109 observations;
  • for 90% of estimates to be within +/- 0.2 of the true correlation value (between -0.2 and 0.2), we need at least 70 observations. 

These values are illustrated in the next figure using black lines and arrows. The figure shows the proportion of estimates near the true value, for different sample sizes, and for different levels of precision. The bottom-line is that even if we’re willing to make imprecise measurements (up to 0.2 from the true value), we need a lot of observations to be precise enough and often enough in the long run.  

figure_precision
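The proportions in the figure come from large R simulations; here is a stripped-down Python sketch of the same idea for the +/- 0.1 precision level (with only 2,000 iterations, so the proportions are rougher than in the figure, and `sample_r` is my own helper name):

```python
import math
import random

random.seed(4)

def sample_r(n, rho=0.0):
    # Pearson's r for one simulated experiment (bivariate normal, given rho)
    x = [random.gauss(0, 1) for _ in range(n)]
    y = [rho * a + math.sqrt(1 - rho * rho) * random.gauss(0, 1) for a in x]
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

nsim = 2000
props = {}
for n in (30, 70, 110):
    rs = [sample_r(n) for _ in range(nsim)]
    props[n] = sum(abs(r) <= 0.1 for r in rs) / nsim
print(props)  # proportion of estimates within +/- 0.1 of the true rho = 0
```

With n around 110, roughly 70% of estimates land within +/- 0.1 of the true value, consistent with the ~109 observations quoted above.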

The estimation uncertainty associated with small sample sizes leads to another problem: effects are not likely to replicate. A successful replication can be defined in several ways. Here I won’t consider the relatively trivial case of finding a statistically significant (p<0.05) effect going in the same direction in two experiments. Instead, let’s consider how close two estimates are. We can determine, given a certain level of precision, the probability to observe similar effects in two consecutive experiments. In other words, we can find the probability that two measurements differ by at most a certain amount. Not surprisingly, the results follow the same pattern as those observed in the previous figure: the probability to replicate (y-axis) increases with sample size (x-axis) and with the uncertainty we’re willing to accept (see legend with colour coded difference conditions).  

figure_replication

In the figure above, the black lines indicate that for 80% of replications to be at most 0.2 apart, we need at least 83 observations.

So far, we have considered samples from a population with zero correlation, such that large correlations were due to chance. What happens when there is an effect? Let’s see what happens for a fixed sample size of 30, as illustrated in the next figure.

figure_sampling_distributions_rho

As a sanity check, we can see that the modes of the sampling distributions progressively increase with increasing population correlations. More interestingly, the sampling distributions also get narrower with increasing effect sizes. As a consequence, the larger the true effect we’re trying to estimate, the more precise our estimations. Or put another way, for a given level of desired precision, we need fewer trials to estimate a true large effect. The next figure shows the proportion of estimates close to the true estimate, as a function of the population correlation, and for different levels of precision, given a sample size of 30 observations.

figure_precision_rho

Overall, in the long run, we can achieve more precise measurements more often if we’re studying true large effects. The exact values will depend on priors about expected effect sizes, shape of distributions and desired precision or achievable sample size. Let’s look in more detail at the sampling distributions for a generous rho = 0.4.

figure_sampling_distributions_rho04

The sampling distributions for n<50 appear to be negatively skewed, which means that in the long run, experiments might tend to give biased estimates of the population value; in particular, experiments with n=10 or n=20 are more likely than others to get the sign wrong (long left tail) and to overestimate the true value (distribution mode shifted to the right). From the same data, we can calculate the proportion of correlation estimates close to the true value, as a function of sample size and for different precision levels.

figure_precision_rho04

We get these approximate results:

  • for 70% of estimates to be within +/- 0.1 of the true correlation value (between 0.3 and 0.5), we need at least 78 observations;
  • for 90% of estimates to be within +/- 0.2 of the true correlation value (between 0.2 and 0.6), we need at least 50 observations. 

You could repeat this exercise using the R code to get estimates based on your own priors and the precision you want to afford.

Finally, we can look at the probability to observe similar effects in two consecutive experiments, for a given precision. In other words, what is the probability that two measurements differ by at most a certain amount? The next figure shows results for differences ranging from 0.05 (very precise) to 0.4 (very imprecise). The black arrow illustrates that for 80% of replications to be at most 0.2 apart, we need at least 59 observations.

figure_replication_rho04

We could do the same analyses presented in this post for power. However, I don’t really see the point of looking at power if the goal is to quantify an effect. The precision of our measurements and of our estimations should be a much stronger concern than the probability to flag any effect as statistically significant (McShane et al. 2018).

There is a lot more to say about correlation estimation and I would recommend in particular these papers from Ed Vul and Tal Yarkoni, from the voodoo correlation era. More recently, Schönbrodt & Perugini (2013) looked at the effect of sample size on correlation estimation, with a focus on precision, similarly to this post. Finally, this more general paper (Forstmeier, Wagenmakers & Parker, 2016) about false positives is well worth reading.

Reaction times and other skewed distributions: problems with the mean and the median (part 4/4)

This is part 4 of a 4 part series. Part 1 is here.

In this post, I look at median bias in a large dataset of reaction times from participants engaged in a lexical decision task. The dataset was described in a previous post.

After removing a few participants who didn’t pay attention to the task (low accuracy or too many very late responses), we’re left with 959 participants to play with. Each participant had between 996 and 1001 trials for each of two conditions, Word and Non-Word.

Here is an illustration of reaction time distributions from 100 randomly sampled participants in the Word condition:

figure_flp_w_100_kde

Same in the Non-Word condition:

figure_flp_nw_100_kde

Skewness tended to be larger in the Word than the Non-Word condition. Based on the standard parametric definition of skewness, that was the case in 80% of participants. If we use a non-parametric estimate instead (mean – median), it was the case in 70% of participants.
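For reference, the two skewness estimates could be computed as in this Python sketch (the post’s analyses are in R, and the exact parametric estimator used there may differ slightly, e.g. in its bias correction; the RT values below are made up):

```python
from statistics import mean, median, pstdev

def skew_moment(x):
    # standard parametric skewness: average cubed standardised deviation
    m, s = mean(x), pstdev(x)
    return sum((v - m) ** 3 for v in x) / (len(x) * s ** 3)

def skew_nonparam(x):
    # non-parametric proxy used in the post: mean minus median
    return mean(x) - median(x)

rt = [320, 350, 360, 380, 400, 420, 450, 500, 640, 900]  # toy right-skewed RTs
print(skew_moment(rt), skew_nonparam(rt))  # both positive for a right-skewed sample
```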

If we save the median of every individual distribution, we get the two following group distributions, which display positive skewness:

figure_flp_all_p_median

The same applies to distributions of means:

figure_flp_all_p_mean

So we have to worry about skewness at 2 levels:

  • individual distributions

  • group distributions

Here I’m only going to explore estimation bias as a result of skewness and sample size in individual distributions. From what we learnt in previous posts, we can already make predictions: because skewness tended to be stronger in the Word than in the Non-Word condition, the bias of the median will be stronger in the former than the latter for small sample sizes. That is, the median in the Word condition will tend to be more over-estimated than the median in the Non-Word condition. As a consequence, the difference between the median of the Non-Word condition (larger RT) and the median of the Word condition (smaller RT) will tend to be under-estimated. To check this prediction, I estimated bias in every participant using a simulation with 2,000 iterations. I assumed that the full sample was the population, from which we can compute population means and population medians. Because the Non-Word condition is the least skewed, I used it as the reference condition, which always had 200 trials. The Word condition had 10 to 200 trials, with 10 trial increments. In the simulation, single RTs were sampled with replacement among the roughly 1,000 trials available per condition and participant, so that each iteration is equivalent to a fake experiment.

Let’s look at the results for the median. The figure below shows the bias in the long run estimation of the difference between medians (Non-Word – Word), as a function of sample size in the Word condition. The Non-Word condition always had 200 trials. All participants are superimposed and shown as coloured traces. The average across participants is shown as a thicker black line. 

figure_flp_bias_diff_md

As expected, bias tended to be negative with small sample sizes. For the smallest sample size, the average bias was -11 ms. That’s probably substantial enough to seriously distort estimation in some experiments. Also, variability is high, with an 80% highest density interval of [-17.1, -2.6] ms. Bias decreases rapidly with increasing sample size. For n=60, it is only 1 ms.

But inter-participant variability remains high, so we should be cautious interpreting results with large numbers of trials but few participants. To quantify the group uncertainty, we could measure the probability of being wrong, given a level of desired precision, as demonstrated here for instance.

After bootstrap bias correction (with 200 bootstrap resamples), the average bias drops to roughly zero for all sample sizes:

figure_flp_bias_diff_md_bc

Bias correction also reduced inter-participant variability. 
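The bootstrap bias correction itself is simple: estimate the bias as the mean of the bootstrap medians minus the sample median, then subtract it from the sample median. A Python sketch (the post’s code is in R; `bc_median` is my own name):

```python
import random
from statistics import mean, median

def bc_median(x, nboot=200, seed=0):
    # Bootstrap bias correction: bias ~ mean(bootstrap medians) - median(x),
    # so the corrected estimate is 2*median(x) - mean(bootstrap medians).
    rng = random.Random(seed)
    boot = [median(rng.choices(x, k=len(x))) for _ in range(nboot)]
    return 2 * median(x) - mean(boot)

x = [1, 1, 2, 2, 3, 3, 4, 5, 8, 20]  # toy right-skewed sample
print(median(x), round(bc_median(x), 2))
```

As the post notes, this works well in the long run, but for a single sample there is no guarantee the corrected value is closer to the truth.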

As we saw in the previous post, the sampling distribution of the median is skewed, so the standard measure of bias (taking the mean across simulation iterations) does not provide a good indication of the bias we can expect in a typical experiment. If instead of the mean, we compute the median bias, we get the following results:

figure_flp_mdbias_diff_md

Now, at the smallest sample size, the average bias is only -2 ms, and it drops to near zero for n=20. This result is consistent with the simulations reported in the previous post and confirms that in the typical experiment, the average bias associated with the median is negligible.

What happens with the mean?

figure_flp_bias_diff_m

The average bias of the mean is near zero for all sample sizes. Individual bias values are also much less variable than median values. This difference in bias variability does not reflect a difference in variability among participants for the two estimators of central tendency. In fact, the distributions of differences between Non-Word and Word conditions are very similar for the mean and the median. 

figure_flp_all_p_diff

Estimates of spread are also similar between distributions:

IQR: mean RT = 78; median RT = 79

MAD: mean RT = 57; median RT = 54

VAR: mean RT = 4507; median RT = 4785

This suggests that the inter-participant bias differences are due to the shape differences observed in the first two figures of this post. 

Finally, let’s consider the median bias of the mean.

figure_flp_mdbias_diff_m

For the smallest sample size, the average bias across participants is 7 ms. This positive bias can be explained easily from the simulation results of post 3: because of the larger skewness in the Word condition, the sampling distribution of the mean was more positively skewed for small samples in that condition compared to the Non-Word condition, with the bulk of the bias estimates being negative. As a result, the mean tended to be more under-estimated in the Word condition, leading to larger Non-Word – Word differences in the typical experiment. 

I have done a lot more simulations and was planning even more, using other datasets, but it’s time to move on! Of particular note, it appears that in difficult visual search tasks, skewness can differ dramatically among set size conditions – see for instance data posted here.

Concluding remarks

The data-driven simulations presented here confirm results from our previous simulations:

  • if we use the standard definition of bias, for small sample sizes, mean estimates are not biased, median estimates are biased;

  • however, in the typical experiment (median bias), mean estimates can be more biased than median estimates;

  • bootstrap bias correction can be an effective tool to reduce bias.

Given the large differences in inter-participant variability between the mean and the median, an important question is how to spend your money: more trials or more participants (Rouder & Haaf 2018)? An answer can be obtained by running simulations, either data-driven or assuming generative distributions (for instance exGaussian distributions for RT data). Simulations that take skewness into account are important to estimate bias and power. Assuming normality can have disastrous consequences.

Despite the potential larger bias and bias variability of the median compared to the mean, for skewed distributions I would still use the median as a measure of central tendency, because it provides a more informative description of the typical observations. Large sample sizes will reduce both bias and estimation variability, such that high-precision single-participant estimation should be easy to obtain in many situations involving non-clinical samples. For group estimations, much larger samples than commonly used are probably required to improve the precision of our inferences.

Although the bootstrap bias correction seems to work very well in the long run, for a single experiment there is no guarantee it will get you closer to the truth. One possibility is to report results with and without bias correction. 

For group inferences on the median, traditional techniques use incorrect estimations of the standard error, so consider modern parametric or non-parametric techniques instead (Wilcox & Rousselet, 2018). 

References

Miller, J. (1988) A warning about median reaction time. J Exp Psychol Hum Percept Perform, 14, 539-543.

Rouder, J.N. & Haaf, J.M. (2018) Power, Dominance, and Constraint: A Note on the Appeal of Different Design Traditions. Advances in Methods and Practices in Psychological Science, 1, 19-26.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8 42 41-48 42 30.

Cohen’s d is biased

The R notebook associated with this post is available on GitHub.

Cohen’s d is a popular measure of effect size. In the one-sample case, d is simply computed as the mean divided by the standard deviation (SD). For repeated measures, the same formula is applied to difference scores (see detailed presentation and explanation of variants in Lakens, 2013). 
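In code, the computation is a one-liner. A Python sketch (the post’s notebook is in R; the function names are my own):

```python
from statistics import mean, stdev

def cohens_d_one_sample(x):
    # one-sample Cohen's d: mean divided by the sample SD (n - 1 denominator)
    return mean(x) / stdev(x)

def cohens_d_paired(pre, post):
    # repeated measures: the one-sample formula applied to difference scores
    return cohens_d_one_sample([b - a for a, b in zip(pre, post)])

print(cohens_d_one_sample([1, 2, 3]))  # 2.0: mean 2 divided by SD 1
```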

Because d relies on a non-robust measure of central tendency (the mean), and a non-robust measure of dispersion (SD), it is a non-robust measure of effect size, meaning that a single observation can have a dramatic effect on its value, as explained here. Cohen’s d also makes very strict assumptions about the data, so it is only appropriate in certain contexts. As a consequence, it should not be used as the default measure of effect size, and more powerful and informative alternatives should be considered – see a few examples here. For comparisons across studies and meta-analyses, nothing will beat data-sharing though.

Here we look at another limitation of Cohen’s d: it is biased when we draw small samples. Bias is covered in detail in another post. In short, in the one-sample case, when Cohen’s d is estimated from a small sample, in the long run it tends to be larger than the population value. This over-estimation is due to a bias of SD, which in the long run tends to be lower than the population SD. Because the mean is not biased, dividing it by an under-estimated SD leads to an over-estimated measure of effect size. The bias of SD is explained in intro stat books, in the section describing Student’s t. Not surprisingly, it is never mentioned in discussions of small n studies as a limitation of effect size estimation…
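The direction of the SD bias is easy to verify with a quick simulation. This Python sketch samples from an ex-Gaussian distribution; the parameters (mu=300, sigma=20, tau=300) are my guess at the most skewed distribution used in the post, chosen because they reproduce its population mean of 600 and SD of about 301 (see the table below):

```python
import math
import random
from statistics import mean, stdev

random.seed(5)

# Hypothetical ex-Gaussian parameters: population mean = mu + tau = 600,
# population SD = sqrt(sigma^2 + tau^2) ~ 301.
mu, sigma, tau = 300, 20, 300
pop_sd = math.sqrt(sigma ** 2 + tau ** 2)

def exgauss():
    # ex-Gaussian deviate: normal plus exponential component
    return random.gauss(mu, sigma) + random.expovariate(1 / tau)

nsim, n = 5000, 10
sds = [stdev([exgauss() for _ in range(n)]) for _ in range(nsim)]
print(round(mean(sds), 1), round(pop_sd, 1))  # mean sample SD < population SD
```

With n = 10, the mean sample SD falls clearly below the population SD, which is what inflates d when the mean is divided by it.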

In this demonstration, we sample with replacement 10,000 times from the ex-Gaussian distributions below, for various sample sizes, as explained here:

figure_miller_distributions

The table below shows the population values for each distribution. For comparison, we also consider a robust equivalent to Cohen’s d, in which the mean is replaced by the median, and SD is replaced by the percentage bend mid-variance (pbvar, Wilcox, 2017). As we will see, this robust alternative is also biased – there is no magic solution I’m afraid.

m:      600  600  600  600  600  600  600  600  600  600  600  600
md:     509  512  524  528  540  544  555  562  572  579  588  594
m-md:    92   88   76   72   60   55   45   38   29   21   12    6
m.den:  301  304  251  255  201  206  151  158  102  112   54   71
md.den: 216  224  180  190  145  157  110  126   76   95   44   68
m.es:   2.0  2.0  2.4  2.4  3.0  2.9  4.0  3.8  5.9  5.4 11.1  8.5
md.es:  2.4  2.3  2.9  2.8  3.7  3.5  5.0  4.5  7.5  6.1 13.3  8.8

m = mean; md = median; den = denominator; es = effect size; m.es = Cohen’s d; md.es = md / pbvar

Let’s look at the behaviour of d as a function of skewness and sample size.

figure_es_m_es

Effect size d tends to decrease with increasing skewness, because SD tends to increase with skewness. Effect size also increases with decreasing sample size. This bias is stronger for samples from the least skewed distributions. This is counterintuitive, because one would think estimation tends to get worse with increased skewness. Let’s find out what’s going on.

Computing the bias normalises the effect sizes across skewness levels, revealing large bias differences as a function of skewness. Even with 100 observations, the bias (mean of 10,000 simulation iterations) is still slightly larger than zero for the least skewed distributions. This bias is not due to the mean, because the sample mean is an unbiased estimator of the population mean.

figure_es_m_es_bias

Let’s check to be sure:

figure_es_m_num

So the problem must be with the denominator:

figure_es_m_den

Unlike the mean, the denominator of Cohen’s d, SD, is biased. Let’s look at bias directly.

figure_es_m_den_bias

SD is most strongly biased for small sample sizes and bias increases with skewness. Negative values indicate that sample SD tends to under-estimate the population values. This is because the sampling distribution of SD is increasingly skewed with increasing skewness and decreasing sample sizes. This can be seen in this plot of the 80% highest density intervals (HDI) for instance:

figure_m_den_hdi80
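The small-sample under-estimation of SD is easy to reproduce. A sketch assuming a lognormal population (not the exact distributions simulated above), whose population SD is known in closed form:

```python
import math
import random
import statistics

# Sketch: with a skewed population, the sample SD under-estimates
# the population SD in small samples.
random.seed(2)
n, nsim = 10, 20000
pop_sd = math.sqrt((math.e - 1) * math.e)  # SD of lognormal(0, 1), ~2.16
sds = [statistics.stdev([random.lognormvariate(0, 1) for _ in range(n)])
       for _ in range(nsim)]

print(round(sum(sds) / nsim / pop_sd, 2))  # well below 1: downward bias
```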

The sampling distribution of SD is increasingly skewed and variable with increasing skewness and decreasing sample sizes. As a result, the sampling distribution of Cohen’s d is also skewed. The bias is strongest in absolute terms for the least skewed distributions because the sample SD is overall smaller for these distributions, resulting in overall larger effect sizes. Although SD is most biased for the most skewed distributions, SD is also overall much larger for them, resulting in much smaller effect sizes than those obtained for less skewed distributions. This strong attenuation of effect sizes with increasing skewness swamps the absolute differences in SD bias. This explains the counter-intuitive lower d bias for more skewed distributions.

As we saw previously, bias can be corrected using a bootstrap approach. Applied to Cohen’s d, this technique does reduce bias, but it remains a concern:

figure_es_m_es_bias_after_bc
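For reference, here is what a bootstrap bias correction can look like (a Python sketch of the standard “subtract the estimated bias” approach, not the blog’s original code): the bias is estimated as the mean of the bootstrap effect sizes minus the sample effect size, and then subtracted from the sample estimate.

```python
import random
import statistics

def cohen_d(x):
    # one-sample effect size: mean / SD
    return statistics.mean(x) / statistics.stdev(x)

def bias_corrected_d(x, nboot=2000, rng=None):
    # Standard bootstrap bias correction:
    # corrected = d_hat - (mean(boot) - d_hat) = 2 * d_hat - mean(boot)
    rng = rng or random.Random(3)
    d_hat = cohen_d(x)
    boot = []
    while len(boot) < nboot:
        s = rng.choices(x, k=len(x))
        if len(set(s)) > 1:  # skip degenerate resamples (SD would be 0)
            boot.append(cohen_d(s))
    return 2 * d_hat - statistics.mean(boot)

rng = random.Random(0)
x = [rng.gauss(1.0, 1.0) for _ in range(10)]
print(round(cohen_d(x), 2), round(bias_corrected_d(x), 2))
```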

Finally, let’s look at the behaviour of a robust equivalent to Cohen’s d, the median normalised by the percentage bend mid-variance.

figure_es_md_es

The median effect size shows a similar profile to the mean effect size. It is overall larger than the mean effect size because it uses a robust measure of spread, which is less sensitive to the long right tails of the skewed distributions we sample from.

figure_es_md_bias

The bias disappears quickly with increasing sample sizes, and more quickly than for the mean effect size.

However, unlike what we observed for d, in this case the bias correction does not work for small samples, because the repetition of the same observations in some bootstrap samples leads to very large values of the denominator. The correction is ok for n>=15, but bias is relatively small there anyway, so at least based on these simulations, I wouldn’t use bias correction for this robust effect size.

figure_es_md_bias_after_bc
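For completeness, here is a sketch of the robust effect size (Python; an assumed implementation of Wilcox’s percentage bend midvariance with the default beta = 0.2). Note that although the legend above writes md / pbvar, the denominator must be on the same scale as SD, so this sketch divides by the square root of the midvariance:

```python
import statistics

def pbvar(x, beta=0.2):
    # Percentage bend midvariance, following Wilcox (2017).
    n = len(x)
    med = statistics.median(x)
    w = sorted(abs(v - med) for v in x)
    omega = w[int((1 - beta) * n + 0.5) - 1]   # ~(1-beta) quantile of |x - med|
    y = [(v - med) / omega for v in x]
    psi = [max(-1.0, min(1.0, v)) for v in y]  # "bend": clamp extreme values
    count = sum(1 for v in y if abs(v) < 1)
    return n * omega ** 2 * sum(v * v for v in psi) / count ** 2

def robust_es(x):
    # md.es: median divided by a robust measure of spread (on the SD scale)
    return statistics.median(x) / pbvar(x) ** 0.5
```

Because deviations beyond the (1-beta) quantile are clamped, a single extreme value barely moves pbvar, whereas it can inflate the variance dramatically.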

Conclusion

Beware of small sample sizes: they are associated with increased variability (see discussion in a clinical context here) and can accentuate the bias of some effect size estimates. If effect sizes tend to be reported more often when they pass some arbitrary threshold, for instance p < 0.05, then the literature will tend to over-estimate them (see demonstration here), a phenomenon exacerbated by small sample sizes (Button et al. 2013).

Can’t say it enough: small n is bad for science if the goal is to provide accurate estimates of effect sizes.

To determine how the precision and accuracy of your results depend on sample size, the best approach is to perform simulations, providing some assumptions about the shape of the population distributions.
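Such a simulation can be very simple. A sketch: assume a plausible population shape (here, an arbitrary lognormal), then estimate how the variability of your estimator shrinks with n:

```python
import random
import statistics

# Sketch: precision of the sample mean as a function of sample size,
# assuming a lognormal population (swap in whatever shape is plausible).
random.seed(6)
nsim = 5000
sd_of_mean = {}
for n in (10, 20, 50, 100):
    means = [statistics.mean([random.lognormvariate(0, 1) for _ in range(n)])
             for _ in range(nsim)]
    sd_of_mean[n] = statistics.stdev(means)
    print(n, round(sd_of_mean[n], 3))
```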

References

Button, K.S., Ioannidis, J.P., Mokrysz, C., Nosek, B.A., Flint, J., Robinson, E.S. & Munafo, M.R. (2013) Power failure: why small sample size undermines the reliability of neuroscience. Nature reviews. Neuroscience, 14, 365-376.

Lakens, D. (2013) Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Front Psychol, 4, 863.

Wilcox, R.R. (2017) Introduction to Robust Estimation and Hypothesis Testing. Academic Press, 4th edition., San Diego, CA.