All your p-values are wrong

Or they don’t mean what you think, or they are not interpretable in most situations (Wagenmakers, 2007; Kruschke, 2013). Why is that? Let’s consider how a p-value is calculated. For simplicity, we focus on a one-sample one-sided t-test. Imagine we collected this sample of 50 observations.

The mean is indicated by the vertical solid line. Imagine our hypothesis is a mean of 70% correct. The t value is 1.52, and the p-value is 0.0670. We obtain the p-value by comparing our observed t value to a hypothetical distribution of t values obtained from imaginary experiments we will never carry out. The default approach is to assume a world in which there is no effect and we sample from a normal distribution. In each imaginary experiment, we get a sample of n=50 observations from our null distribution, and calculate t. Over an infinite number of imaginary experiments, we get this sampling distribution:
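The blog's analyses are in R (see the Code section below); as an illustrative sketch, here are the two routes to the p-value in Python. The sample is made up for the demonstration, so the numbers will differ from those in the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Made-up sample of n=50 percent-correct scores (not the blog's data)
x = rng.normal(loc=72, scale=10, size=50)
mu0 = 70  # hypothesised population mean: 70% correct

n = x.size
t_obs = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))

# Route 1: one-sided p-value from the theoretical t distribution (n-1 df)
p_theory = stats.t.sf(t_obs, df=n - 1)

# Route 2: brute force -- simulate many "null" experiments in which the
# population mean really is mu0, and count t values at least as extreme
nsim = 20000
null_t = np.empty(nsim)
for i in range(nsim):
    s = rng.normal(loc=mu0, scale=10, size=n)
    null_t[i] = (s.mean() - mu0) / (s.std(ddof=1) / np.sqrt(n))
p_sim = np.mean(null_t >= t_obs)
```

The two routes agree to Monte Carlo error, and the first reproduces `scipy.stats.ttest_1samp(x, mu0, alternative='greater')`.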

The p-value is the tail area highlighted in red, corresponding to the probability of observing imaginary t values at least as extreme as our observed t value under our model. Essentially, a p-value is a measure of surprise, which can be expressed as an s-value in bits (Greenland, 2019). For a p-value of 0.05, the s-value = -log2(0.05) = 4.32. That’s equivalent to flipping a coin 4 times and getting 4 heads in a row. A p-value can also be described as a continuous measure of compatibility between our model and our data, ranging from 0 for complete incompatibility, to 1 for complete compatibility (Greenland et al., 2016). This is key: p-values are not absolute, they change with the data and the model, and even with the experimenter’s intentions.
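The p-value to s-value conversion is a one-liner; a minimal sketch:

```python
import numpy as np

def s_value(p):
    """Surprisal of a p-value in bits (Greenland, 2019)."""
    return -np.log2(p)

s05 = s_value(0.05)  # about 4.32 bits: like 4 heads in a row from a fair coin
```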

Let’s unpack this fundamental property of p-values. Our sample of n=50 scores is associated with a p-value of 0.067 under our model. This model includes:

  • a hypothesis of 70% correct;
  • sampling from a normal distribution;
  • independent samples;
  • fixed sample size of n=50;
  • a fixed statistical model, that is, a t-test is applied every time.

(The full model includes other assumptions, for instance that our sample is unbiased, that we have precise measurements, that our measure of interest is informative in the context of a causal model linking data and theory (Meehl, 1997), but we will ignore these aspects here.)

In practice some of these assumptions are incorrect, making the interpretation of p-values difficult.

Data-generating process

The scores in our sample do not come from a normal distribution. Like many proportion data, they are better described by a beta distribution. Here is the population our data came from:

A boxplot suggests the presence of one outlier:

A Q-Q plot suggests some deviations from normality, but a Shapiro-Wilk test fails to reject:

Should we worry, or rely on the reassuring call to the central limit theorem and some hand-wavy statement about the robustness of the t-test and ANOVA? In practice, it is a bad idea to assume that empirical t distributions will match theoretical ones, because skewness and outliers can significantly mess things up, even for relatively large sample sizes (Wilcox, 2022). In the one-sample case, ignoring skewness and outliers can lead to inflated false positives and low power (Wilcox & Rousselet, 2023; Rousselet & Wilcox, 2020).

In our case, we can simulate the t distribution under normality–Normal (sim) in the figure below, and compare it to the t distribution obtained when sampling from our beta population–Beta (sim). As a reference, the figure also shows the theoretical, non-simulated t distribution–Normal (theory). The simulation involved 100,000 iterations with n=50 samples.

Let’s zoom in to better compare the right side of the distributions:

The simulated t distribution under normality is a good approximation of the theoretical distribution. In practice, the t distribution obtained by sampling from our beta population is not accessible, because we typically don’t know exactly how the data were generated. Here we have full knowledge, so we can derive the correct t distribution for our data. Remember that the p-value from the standard t-test was 0.0670. Using our simulation under normality, the p-value is 0.0675. When using the t distribution obtained by sampling from the correct beta distribution, the p-value is now 0.0804.
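A sketch of that simulation in Python. The blog's exact beta parameters are not reproduced here; Beta(7, 3), rescaled to 0–100, is an assumption that gives a population mean of 70% and a negative skew:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, nsim = 50, 50000

# Assumed beta population for percent-correct scores: Beta(7, 3) rescaled
# to 0-100 has mean 70 and is negatively skewed
a, b = 7, 3
pop_mean = 100 * a / (a + b)

# Sampling distribution of t when the null is true but the data are beta
samples = 100 * rng.beta(a, b, size=(nsim, n))
sim_t = ((samples.mean(axis=1) - pop_mean)
         / (samples.std(axis=1, ddof=1) / np.sqrt(n)))

t_obs = 1.52  # the observed t value from the example above
p_norm = stats.t.sf(t_obs, df=n - 1)  # theoretical t distribution
p_beta = np.mean(sim_t >= t_obs)      # simulated beta-based distribution
```

Because the assumed beta population is negatively skewed, the right tail of the simulated t distribution is heavier than the theoretical one, so `p_beta` exceeds `p_norm`, mirroring the 0.0804 versus 0.0670 contrast in the text.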

In most situations, p-values are calculated using inappropriate theoretical sampling distributions of t values. This might not affect the observed p-value much, but the correct p-value is unknown.

Independence

The independence assumption is violated whenever data-dependent exclusion is applied to the sample. For instance, it is very common for outliers to be identified and removed before applying a t-test or other frequentist inferential test. This is often done using a non-robust method, such as flagging observations more than 2 SD from the mean. A more robust method could also be used, such as a boxplot rule or a MAD-median rule (Wilcox & Rousselet, 2023). Whatever the method, if the outliers are identified using the sample we want to analyse, the remaining observations are no longer independent, which affects the standard error of the test. This is well documented in the case of inferences about trimmed means (Tukey & McLaughlin, 1963; Yuen, 1974; Wilcox, 2022). Trimmed means are robust estimators of central tendency that can boost statistical power in the presence of skewness and outliers. To calculate a 20% trimmed mean, we sort the data, and remove the lowest 20% and the highest 20% (so 40% of observations in total), and average the remaining observations. This introduces a dependency among the remaining observations, which is taken into account in the calculation of the standard error. In other words, removing observations in a data-dependent manner, and then using a t-test as if the new, lower, sample size was the one intended is inappropriate. To illustrate the problem, we can do a simulation in which we sample from a normal or a beta population, each time take a sample of n=50, trim 20% of observations from each end of the distribution, and either apply the incorrect t-test to the remaining n=30 observations, or apply the t formula from Tukey & McLaughlin (1963; Wilcox, 2022) to the full sample. Here are the results:

T values computed on observations left after trimming are far too large. The discrepancy depends on the amount of trimming. Elegantly, the equation of the t-test on trimmed means reverts to the standard equation if we trim 0%. Of course, the amount of trimming should be pre-registered and not chosen after seeing the data.
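A minimal Python sketch of the trimmed-mean t-test, assuming the standard formulation given in Wilcox (2022): the winsorized variance of the full sample enters the standard error, and the degrees of freedom are h - 1, where h is the number of observations left after trimming. With gamma = 0 it reduces to the ordinary t-test:

```python
import numpy as np
from scipy import stats

def trimmed_t(x, mu, gamma=0.2):
    """One-sample t-test on a trimmed mean (Tukey & McLaughlin, 1963).

    The standard error uses the winsorized variance of the full sample,
    which accounts for the dependency introduced by trimming.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    g = int(np.floor(gamma * n))   # observations trimmed from each tail
    h = n - 2 * g                  # observations left after trimming
    tmean = x[g:n - g].mean()      # trimmed mean
    xw = x.copy()                  # winsorize: clamp, don't remove, the tails
    xw[:g] = x[g]
    xw[n - g:] = x[n - g - 1]
    sw = xw.std(ddof=1)            # winsorized standard deviation
    se = sw / ((1 - 2 * gamma) * np.sqrt(n))
    tval = (tmean - mu) / se
    return tval, stats.t.sf(tval, df=h - 1)  # one-sided, df based on h

rng = np.random.default_rng(3)
x = rng.normal(70, 10, 50)               # made-up data for the demonstration
t20, p20 = trimmed_t(x, 65, gamma=0.2)   # 20% trimming: h=30, so df=29
t00, p00 = trimmed_t(x, 65, gamma=0.0)   # no trimming: ordinary t-test
```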

If we apply a t-test on means to our beta sample after trimming 20%, the (incorrect) p-value is 0.0007. The t-test on trimmed means returns p = 0.0256. That’s a large difference! Using the t distribution from the simulation in which we sampled from the correct beta population, now p = 0.0329. Also, with a t-test on means applied to the full sample we fail to reject, whereas we do reject with an inference on 20% trimmed means. In general, inferences on trimmed means tend to be more powerful than inferences on means in the presence of skewness or outliers. However, keep in mind that means and trimmed means are not interchangeable: they ask different questions about the populations. Sample means are used to make inferences about population means, and both are non-robust measures of central tendency. Sample trimmed means are used to make inferences about population trimmed means.

Now the problem is more complicated, and somewhat intractable, if instead of trimming a pre-registered amount of data, we apply an outlier detection method. In that case, independence is violated, but correcting the standard error is difficult because the number of removed observations is a random variable: it will change between experiments.

In our sample, we detect one outlier. Removing the outlier and applying a t-test on means, pretending that our sample size was always n-1, is inappropriate, although very common in practice. The standard error could be corrected using a similar equation to that used in the trimmed mean t-test. However, in other experiments we might reject a different number of outliers. Remember that p-values are not about our experiment, they reflect what could happen in other similar experiments that we will never carry out. In our example, we can do a simulation to derive a sampling distribution that matches the data generation and analysis steps. For each sample of n=50 observations from the beta distribution, we apply a boxplot rule, remove any outliers, and then compute a t value. If we simply remove the outlier from our sample, the t-test returns p = 0.0213. If instead we compute the p-value by using the simulated sampling distribution, we get p = 0.0866. That p-value reflects the correct data-generating process and the fact that, in other experiments, we could have rejected a different number of outliers. Actually, in the simulation the median number of rejected outliers is zero, the 3rd quartile is 1, and the maximum is 9.
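Here is a sketch of that simulation, again with an assumed Beta(7, 3) population (rescaled to 0–100) and a hypothetical observed t value standing in for the blog's numbers:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, nsim = 50, 20000
a, b = 7, 3                    # assumed beta population, mean 70 after rescaling
pop_mean = 100 * a / (a + b)

def boxplot_clean(x):
    """Keep observations inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]

# Null sampling distribution of the full "clean, then test" pipeline
sim_t = np.empty(nsim)
n_removed = np.empty(nsim, dtype=int)
for i in range(nsim):
    s = 100 * rng.beta(a, b, size=n)
    clean = boxplot_clean(s)
    n_removed[i] = n - clean.size
    sim_t[i] = ((clean.mean() - pop_mean)
                / (clean.std(ddof=1) / np.sqrt(clean.size)))

t_obs = 2.1  # hypothetical observed t value after outlier removal
p_naive = stats.t.sf(t_obs, df=n - 1)   # pretends no cleaning happened
p_pipeline = np.mean(sim_t >= t_obs)    # honours the whole pipeline
```

Because the boxplot rule mostly removes low values from this left-skewed population, the pipeline's t distribution is shifted and its right tail is heavier than the theoretical one, so the naive p-value is too small.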

In practice, we don’t have access to the correct t sampling distribution. However, we can get a good approximation by using a percentile bootstrap that incorporates the outlier detection and rejection step after sampling with replacement from the full dataset, and before calculating the statistic of interest (Rousselet, Pernet & Wilcox, 2021).
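A sketch of that bootstrap strategy, with the boxplot rule as the (assumed) outlier detection step and the mean as the statistic of interest:

```python
import numpy as np

rng = np.random.default_rng(5)

def boxplot_clean(x):
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]

def boot_mean_after_cleaning(x, nboot=5000, alpha=0.05):
    """Percentile bootstrap CI for the mean, applying outlier rejection
    inside every bootstrap iteration, so that the resampling mimics the
    full analysis pipeline."""
    x = np.asarray(x, dtype=float)
    boot = np.empty(nboot)
    for i in range(nboot):
        b = rng.choice(x, size=x.size, replace=True)  # resample FULL dataset
        boot[i] = boxplot_clean(b).mean()             # then clean, then estimate
    return np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])

x = rng.normal(70, 10, 50)  # made-up data for the demonstration
lo, hi = boot_mean_after_cleaning(x)
```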

If outliers are expected and common, a good default strategy is to make inferences about trimmed means. Another approach is to make inferences about M-estimators in conjunction with a percentile bootstrap (Wilcox, 2022). M-estimators adjust the amount of trimming based on the data, instead of removing a pre-specified amount. Yet another approach is to fit a distribution with a tail parameter that can account for outliers (Kruschke, 2013). Or it might well be that what looks like an outlier is a perfectly legitimate member of a skewed or heavy-tailed distribution: use more appropriate models that account for rich distributional differences (Rousselet, Pernet, Wilcox, 2017; Farrell & Lewandowsky, 2018; Lindeløv, 2019).

Fixed sample size?

The t-test assumes that the sample size is fixed. This seems obvious, but in practice it is not the case. As we saw in the previous example, sample sizes can depend on outlier rejection, a very common procedure that makes p-values uninterpretable. In general, data-dependent analyses will mess up traditional frequentist inferences (Gelman & Loken 2014). Sample sizes can also be affected by certain inclusion criteria. For instance, data are included in the final analyses only for participants who scored high enough on a control attention check. Deriving correct p-values would require simulations of sampling distributions that incorporate the inclusion check. In other situations, the sample sizes vary for reasons outside the experimenters’ control. For instance, data are collected in an online experiment until a deadline. In that case the final sample size is a surprise revealed at the end of the experiment and is thus a random variable. Consequently, deriving a sampling distribution for a statistic of interest requires another sampling distribution of plausible sample sizes that could have been obtained. The sampling distribution, say for a t value, would be calculated by integrating over the sampling distribution of sample sizes, and any other sources of variability, such as different plausible numbers of outliers that could have been removed, even if they were not in our sample. Failure to account for these sources of variability leads to incorrect p-values. It gets even more complicated in some situations: p-values also depend on our sampling intentions.

Imagine this scenario inspired by Kruschke (2013), in which a supervisor asked two research assistants to collect data from n=8 participants in total. They misunderstood the instructions, and instead collected n=8 each, so a total of n=16. The plan was to do a one-sample t-test. What sample size should the research team use to compute the degrees of freedom: 8 or 16? So 7 df or 15 df? Here is a plot of the p-values as a function of the critical t values in the two situations.

The answer depends on the sampling distribution matching the data acquisition process, including the probability that the instructions are misunderstood (Kruschke, 2013). If we assume that a misunderstanding leading to this specific error could occur in 10% of experiments, then the matching curve is the dashed one in the figure below, obtained by mixing the two curves for n=8 and n=16.

That’s right, even though the sample size is n=16, because it was obtained by accident and we intended to collect n=8, the critical t and the p-value are obtained from a distribution that is in between the two for n=8 and n=16, but closer to n=8. This correct distribution reflects the long-run perspective of conducting imaginary experiments in which the majority would have led to n=8. Again, the p-value is not about the current experiment. This scenario reveals that p-values depend on intentions, which has consequences in many situations. In practice, all the points raised so far demonstrate that p-values in most situations are necessarily inaccurate and very difficult to interpret.
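The mixture logic is easy to express in code. Under the assumed 10% error rate, the p-value for an observed t is a 0.9/0.1 mixture of the tail areas under 7 and 15 degrees of freedom:

```python
from scipy import stats

# Mixture sampling model: with probability 0.9 the experiment yields n=8
# (7 df); with probability 0.1 the instructions are misunderstood and we
# end up with n=16 (15 df). The 10% error rate is the assumption above.
def p_mixture(t_obs, p_error=0.1):
    return ((1 - p_error) * stats.t.sf(t_obs, df=7)
            + p_error * stats.t.sf(t_obs, df=15))

t_obs = 2.0  # arbitrary observed t value
p8 = stats.t.sf(t_obs, df=7)
p16 = stats.t.sf(t_obs, df=15)
pm = p_mixture(t_obs)
```

`pm` falls between `p16` and `p8`, and much closer to `p8`, which is exactly the in-between-but-closer-to-n=8 behaviour of the dashed curve.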

Conditional analyses

Another common way to mess up the interpretation of our analyses is to condition one analysis on another one. For instance, it is common practice to conduct a test of data normality: if it rejects, apply a rank-based test; if it fails to reject, apply a t-test. Testing for normality of the data is a bad idea for many reasons, including because it makes the subsequent statistical tests conditional on the outcome of the normality test. Again, unless we can simulate the appropriate conditional sampling distribution for our statistic, our p-value will be incorrect. Similarly, anyone tempted to use such an approach would need to justify sample sizes using a power simulation that includes the normality step, and any other step that affects the sampling distribution of the statistic. In my experience, all conditional steps are typically ignored in power analyses and pre-registrations. It’s not just p-values: all power analyses are wrong too.
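To make the conditional procedure concrete, here is a sketch that simulates the two-step pipeline (Shapiro-Wilk first, then a t-test or a signed-rank test); the population, sample size and test choices are illustrative assumptions, not the blog's:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)

def conditional_test(x, mu0):
    """Two-step procedure: normality pre-test, then choice of test.
    Shown here so it can be simulated, not because it is recommended."""
    if stats.shapiro(x).pvalue < 0.05:
        # normality rejected -> rank-based signed-rank test on x - mu0
        return stats.wilcoxon(x - mu0, alternative='greater').pvalue
    return stats.ttest_1samp(x, mu0, alternative='greater').pvalue

# Simulate the conditional sampling distribution of the p-value under a
# normal null; with skewed populations the distortions grow
nsim = 4000
pvals = np.array([conditional_test(rng.normal(70, 10, 30), 70)
                  for _ in range(nsim)])
fp_rate = np.mean(pvals < 0.05)
```

Replacing the normal population with a skewed one is the interesting exercise: the conditional distribution of p-values then departs from what either test alone would deliver.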

No measurement error?

It gets worse. Often, t-tests and similar models are applied to data that have been averaged over repetitions, for instance mean accuracy or reaction times averaged over trials in each condition and participant. In this common situation, the t-test ignores measurement error, because all trial-level variability has been wiped out. Obviously, in such situations, mixed-effects (hierarchical) models should be used (DeBruine & Barr, 2021). Using a t-test instead of a mixed-effects model is equivalent to using a mixed-effects model in which the trial-level data have been copied and pasted an infinite number of times, such that measurement precision becomes infinite. This is powerfully illustrated here.

Conclusion

In most articles, the p-values are wrong. How they would change using appropriate sampling distributions is hard to determine, and ultimately a futile exercise. Even if the p-values changed very little, the uncertainty makes the obsession with declaring “statistical significance” whenever p<0.05, no matter how close the p-value is to the threshold, all the more ridiculous. So the next time you read in an article that “there was a trend towards significance, p=0.06”, or some other nonsense, in addition to asking the authors if they pre-registered a threshold for a trend, and asking them to also write “a trend towards non-significance, p=0.045”, also point out that the p-value matching their design, analyses, and data generation process is likely to be different from the one reported.

What can we do? A plan of action, from easy to hard:

[1] Take a chill pill, and consider p-values as just one of many outputs, without a special status (Vasishth & Gelman, 2021). Justify your choices in the methods section, unlike the traditional article in which tests pop up out of the blue in the results section, with an irrational focus on statistical significance.

[2] Use bootstrap methods to derive more appropriate sampling distributions. Bootstrap methods, combined with robust estimators, can boost statistical power and help you answer more interesting questions. These methods also let you include preprocessing steps in the analyses, unlike standard parametric methods.

[3] Pre-register everything, along with careful justifications of models, pre-processing steps, and matching power simulations.

[4] Abandon the chase for statistical significance. Instead of focusing on finding effects, focus on a model-centric approach (Devezer & Buzbas, 2023). The goal is to contrast models that capture different hypotheses or mechanisms by assessing how they explain or predict data (Farrell & Lewandowsky, 2018; Gelman, Hill & Vehtari, 2020; James et al., 2021; McElreath, 2020; Yarkoni & Westfall, 2017). What is the explanatory power of the models? What is their predictive accuracy?

Code

https://github.com/GRousselet/blog-pwrong

References

DeBruine, L., & Barr, D. (2021). Understanding Mixed-Effects Models Through Data Simulation. Advances in Methods and Practices in Psychological Science, 4(1), 2515245920965119. https://doi.org/10.1177/2515245920965119

Devezer, B., & Buzbas, E. O. (2023). Rigorous exploration in a model-centric science via epistemic iteration. Journal of Applied Research in Memory and Cognition, 12(2), 189–194. https://doi.org/10.1037/mac0000121

Farrell, S., & Lewandowsky, S. (2018). Computational Modeling of Cognition and Behavior. Cambridge University Press. https://doi.org/10.1017/CBO9781316272503

Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and Other Stories. Cambridge University Press. https://doi.org/10.1017/9781139161879

Gelman, A., & Loken, E. (2014). The Statistical Crisis in Science. American Scientist, 102(6), 460–465. https://www.jstor.org/stable/43707868

Greenland, S. (2019). Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values. The American Statistician, 73(sup1), 106–114. https://doi.org/10.1080/00031305.2018.1529625

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R. Springer US. https://doi.org/10.1007/978-1-0716-1418-1

Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology. General, 142(2), 573–603. https://doi.org/10.1037/a0029146

Lindeløv, J. K. (2019). Reaction time distributions: An interactive overview. https://lindeloev.github.io/shiny-rt/

McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and STAN (2nd edn). Chapman and Hall/CRC. https://doi.org/10.1201/9780429029608

Meehl, P. E. (1997). The Problem is Epistemology, Not Statistics: Replace Significance Tests by Confidence Intervals and Quantify Accuracy of Risky Numerical Predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What If There Were No Significance Tests? Psychology Press. https://meehl.umn.edu/sites/meehl.umn.edu/files/files/169problemisepistemology.pdf

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46(2), 1738–1748. https://doi.org/10.1111/ejn.13610

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2021). The Percentile Bootstrap: A Primer With Step-by-Step Instructions in R. Advances in Methods and Practices in Psychological Science, 4(1), 2515245920911881. https://doi.org/10.1177/2515245920911881

Rousselet, G. A., & Wilcox, R. R. (2020). Reaction Times and other Skewed Distributions: Problems with the Mean and the Median. Meta-Psychology, 4. https://doi.org/10.15626/MP.2019.1630

Tukey, J. W., & McLaughlin, D. H. (1963). Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 25(3), 331–352. JSTOR. https://www.jstor.org/stable/25049278

Vasishth, S., & Gelman, A. (2021). How to embrace variation and accept uncertainty in linguistic and psycholinguistic data analysis. Linguistics, 59(5), 1311–1342. https://doi.org/10.1515/ling-2019-0051

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804. https://doi.org/10.3758/BF03194105

Wilcox, R. R. (2022). Introduction to Robust Estimation and Hypothesis Testing (5th edn). Academic Press.

Wilcox, R. R., & Rousselet, G. A. (2023). An Updated Guide to Robust Statistical Methods in Neuroscience. Current Protocols, 3(3), e719. https://doi.org/10.1002/cpz1.719

Yarkoni, T., & Westfall, J. (2017). Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning. Perspectives on Psychological Science, 12(6), 1100–1122. https://doi.org/10.1177/1745691617693393

Yuen, K. K. (1974). The Two-Sample Trimmed t for Unequal Population Variances. Biometrika, 61(1), 165–170. https://doi.org/10.2307/2334299

Pre/Post design: the fallacy of comparing difference scores

Pre/post designs are common in medicine, pre-clinical animal research and in psychology: you measure something at baseline, then randomly allocate participants to 2 or more groups, each receiving different interventions, after which you measure the same thing again. In psychology, a pre/post design could be used to look at the impact of different types of meditation techniques on some measure of well-being, or the impact of different types of leaflets on recycling practices. In brain imaging, we could consider the impact of different types of physical exercise on markers of brain activity or structure.


For the discussion here, assume we measure a continuous or pseudo-continuous variable, like blood pressure, or a score [0-100] from a questionnaire. Here is a classic example from Vickers & Altman (2001) looking at pain scores. I couldn’t find the original dataset so I read the values from their Figure 1 using WebPlotDigitizer. The text mentions 52 patients, but when reading values from the graph I got 56 patients, though the overall pattern is similar. The main point is to illustrate results from a standard experimental design. All figures in this post can be reproduced using the R code on GitHub. The code also contains the analyses reported in Vickers & Altman (2001). I’m skipping the details here.

It is very common for data from such pre/post designs to be analysed by looking at change scores: subtract the baseline score from the post-intervention score and compare these differences between the two groups. Just in the last few days I saw several articles in psychology, animal neuroscience and brain imaging using this approach. Don’t do it. Instead, directly compare the post-intervention scores between the two groups, including the pre-treatment scores as a covariate (Vickers & Altman, 2001; Senn, 2006; Clifton & Clifton, 2019). With this ANCOVA approach, we ask the question: post-intervention, by how much do the groups differ, after adjusting for baseline differences?

In the illustration above, this corresponds to fitting two regression lines described by this equation:

post = 44.9 + 0.6 * baseline - 13.4 * group
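In code, the ANCOVA is just a linear model with baseline as a covariate. A sketch on simulated data (the generating coefficients are assumptions chosen to loosely mimic the fitted equation, not the Vickers & Altman data):

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulated pre/post data: a true group effect of -13 points and a
# baseline slope of 0.6
n = 60
baseline = rng.normal(60, 12, n)
group = np.repeat([0.0, 1.0], n // 2)  # 0 = control, 1 = treatment
post = 45 + 0.6 * baseline - 13 * group + rng.normal(0, 8, n)

# ANCOVA as a linear model: post ~ 1 + baseline + group
X = np.column_stack([np.ones(n), baseline, group])
coef, *_ = np.linalg.lstsq(X, post, rcond=None)
intercept, slope, group_effect = coef  # group_effect: adjusted difference
```

The coefficient on `group` is the adjusted post-intervention difference between groups, the quantity of interest.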

There are several reasons why the difference score approach is problematic (Harrell, 2017). A broad issue is the linearity assumption, which is very common in many fields, particularly in psychology and neuroscience. For instance, a 50 ms reaction time difference could be huge if we are dealing with fast responses in a simple perception task, but it could be small if we are dealing with slow responses in a complex decision task. The same logic applies to participants with differences in baseline measurements: a 50 ms difference could be impressive in fast participants, but not so much in slow participants. It gets worse because of floor and ceiling effects. So, in general, difference scores are not necessarily comparable. A deeper issue is the assumption of a linear mapping between our measurements and some more abstract quantity we are trying to estimate: for instance using the BOLD signal to estimate brain activity, percent correct to estimate memory representation or reaction times to estimate processing speed. There is no reason to assume linear mappings, yet this is the norm (Wagenmakers et al. 2012; Kellen et al. 2021).

Another important reason to avoid inferences on change scores in the context of pre/post designs is purely mathematical: change scores are necessarily correlated with baseline scores (Clifton & Clifton, 2019). As a consequence, imbalance between groups at baseline can lead to spurious effects. Let’s look at an example in which the data are sampled from a bivariate normal population with a correlation of 0.6 between baseline and post scores. Importantly, the marginal means are identical, such that there is no intervention effect.

By construction, baseline and post scores are correlated. Here we get a sample correlation of 0.57; population correlation is 0.6.

No matter the correlation between pre and post scores, there is always a correlation between baseline scores and the difference scores, even though in the data we created, there is no treatment effect (Clifton & Clifton, 2019). Here the correlation is -0.52.
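Those correlations are easy to reproduce with a simulated bivariate normal population. In the equal-variance case the correlation between baseline and change even has a closed form, -sqrt((1 - rho)/2), about -0.45 for rho = 0.6; the -0.52 quoted above reflects sampling variability:

```python
import numpy as np

rng = np.random.default_rng(7)
n, rho = 100000, 0.6

# Bivariate normal with equal variances and no treatment effect
cov = [[1.0, rho], [rho, 1.0]]
pre, post = rng.multivariate_normal([50, 50], cov, size=n).T
change = post - pre

r_prepost = np.corrcoef(pre, post)[0, 1]      # close to 0.6
r_prechange = np.corrcoef(pre, change)[0, 1]  # negative, despite no effect
expected = -np.sqrt((1 - rho) / 2)            # about -0.447
```

The derivation is one line: cov(pre, change) = rho - 1 and var(change) = 2(1 - rho), so the correlation is -(1 - rho)/sqrt(2(1 - rho)) = -sqrt((1 - rho)/2).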

Why is that correlation important? This figure shows the main point of this blog post. Baseline imbalance can lead to spurious group differences in difference scores, which will tend to be more problematic when groups have relatively small sample sizes. This is a form of regression to the mean: high baseline scores tend to be associated with lower post scores, whereas low baseline scores tend to be associated with higher post scores. And the situation gets worse when baseline and post measurements differ in variance, as demonstrated in the next figure. The data generating process is the same as the one used in the previous figure, but now the baseline standard deviation is twice that of the post scores.

Fortunately, the ANCOVA naturally accounts for differences in baseline scores. The code associated with this post also demonstrates that the Bland-Altman plot removes the trend, but only if pre-post variances are equal.

Of course it is more complicated than that, because the standard ANCOVA makes very strong assumptions: the trends in the two groups are linear and both groups share the same slope. Fortunately, we can relax the assumption of equal slopes by including an interaction between the baseline covariate and the group factor (Wan, 2020). We can also relax the linearity assumption by using non-parametric models that fit smooth curves to the data (Mair & Wilcox, 2019; James et al. 2021).

But what about testing for baseline imbalance? Can we not do a t-test on baseline scores to demonstrate that they are comparable between groups? And if the t-test is significant, we could apply the ANCOVA? This approach is inappropriate and misguided for several reasons (Sassenhagen & Alday, 2016; Vanhove, 2014). First, such a test is superfluous because we already know the answer: participants were randomly assigned to the two groups; any group differences are due to random sampling. Second, inferential statistics, like t-tests, are not about the sample at hand: they are about the populations we sampled from. Third, claiming that two populations do not differ based on p>0.05 is a statistical fallacy. In the frequentist realm, demonstrating population equivalence requires equivalence testing (Campbell & Gustafson, 2024; Riesthuis, 2024). Fourth, using a t-test to decide how to analyse the data leads to conditional sampling distributions that would need to be simulated to derive p-values – good luck with that (Rousselet, 2025; Vanhove, 2014). Fifth, using an ANCOVA by default is more powerful than doing an ANCOVA conditional on a baseline check (Vanhove, 2014).

Finally, the most important message in this post is not about using any specific model: instead, whatever model you use, please provide a clear justification in your method section.

References

Campbell, H., & Gustafson, P. (2024). The Bayes factor, HDI-ROPE, and frequentist equivalence tests can all be reverse engineered—Almost exactly—From one another: Reply to Linde et al. (2021). Psychological Methods, 29(3), 613–623. https://doi.org/10.1037/met0000507

Clifton, L., & Clifton, D. A. (2019). The correlation between baseline score and post-intervention score, and its implications for statistical analysis. Trials, 20(1), 43. https://doi.org/10.1186/s13063-018-3108-3

Harrell, F. (2017, April 8). Statistical Errors in the Medical Literature. Statistical Thinking. https://www.fharrell.com/post/errmed/#change

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R. Springer US. https://doi.org/10.1007/978-1-0716-1418-1

Kellen, D., Davis-Stober, C. P., Dunn, J. C., & Kalish, M. L. (2021). The Problem of Coordination and the Pursuit of Structural Constraints in Psychology. Perspectives on Psychological Science, 16(4), 767-778. https://doi.org/10.1177/1745691620974771

Mair, P., & Wilcox, R. (2019). Robust statistical methods in R using the WRS2 package. Behavior Research Methods. https://doi.org/10.3758/s13428-019-01246-w

Riesthuis, P. (2024). Simulation-Based Power Analyses for the Smallest Effect Size of Interest: A Confidence-Interval Approach for Minimum-Effect and Equivalence Testing. Advances in Methods and Practices in Psychological Science, 7(2), 25152459241240722. https://doi.org/10.1177/25152459241240722

Rousselet, G. A. (2025). Using Simulations to Explore Sampling Distributions: An Antidote to Hasty and Extravagant Inferences. eNeuro, 12(10). https://doi.org/10.1523/ENEURO.0339-25.2025

Sassenhagen, J., & Alday, P. M. (2016). A common misapplication of statistical inference: Nuisance control with null-hypothesis significance tests. Brain and Language, 162, 42–45. https://doi.org/10.1016/j.bandl.2016.08.001

Senn, S. (2006). Change from baseline and analysis of covariance revisited. Statistics in Medicine, 25(24), 4334–4344. https://doi.org/10.1002/sim.2682

Vanhove, J. (2014, September 26). Silly significance tests: Balance tests – Jan Vanhove:: Blog. https://janhove.github.io/posts/2014-09-26-balance-tests/

Vickers, A. J., & Altman, D. G. (2001). Analysing controlled trials with baseline and follow up measurements. BMJ, 323(7321), 1123–1124. https://doi.org/10.1136/bmj.323.7321.1123

Wagenmakers, E.-J., Krypotos, A.-M., Criss, A. H., & Iverson, G. (2012). On the interpretation of removable interactions: A survey of the field 33 years after Loftus. Memory & Cognition, 40(2), 145–160. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3267935/

Wan, F. (2020). Analyzing pre-post designs using the analysis of covariance models with and without the interaction term in a heterogeneous study population. Statistical Methods in Medical Research, 29(1), 189–204. https://doi.org/10.1177/0962280219827971

Tukey mean-difference plot

Distributions can differ in many ways, not just in central tendency (Rousselet, Pernet & Wilcox, 2017). While obvious, this statement is at odds with common statistical practices that focus exclusively on mean comparisons. Fortunately, there are many methods to make distributional inferences, for instance by applying generalized linear models to reaction time data (Lindeløv, 2019). Another informative approach is to make inferences about multiple quantiles, essentially an extension of q-q plots (quantile-quantile plots, Wilk & Gnanadesikan, 1968). Quantile inference methods appear to have been reinvented multiple times (Rousselet, 2018). In this earlier post, I illustrated four very similar types of plots: quantile plots, vincentile plots, delta plots and shift functions. As far as I know, the earliest method is the shift function, introduced by Doksum (1974; 1976). Recently, while reading about Bland-Altman plots (Altman & Bland, 1983), I came across a 5th type of quantile plot, the Tukey mean-difference plot (Cleveland, 1993).

Side story: it is unclear who the Tukey in question is, as Cleveland (1993) offers no reference for the plot. A very similar plot appears in an earlier book by Chambers, Cleveland, Kleiner & Tukey (1983). The Tukey in the 1983 book is Paul A. Tukey, not the famous John W. Tukey. Others have been down the same historical rabbit hole.

Surprisingly, several sources describe the Bland-Altman plot and the Tukey mean-difference plot as equivalent (Wikipedia; Wolfram). Equating the two methods is misleading because, even though they lead to graphs that can look similar, they ask very different questions about the data. The Bland-Altman plot is a kind of scatterplot used to assess the agreement between two methods (and more generally two paired measurements): the differences between paired observations are plotted as a function of their averages. In contrast, the Tukey mean-difference plot is a type of q-q plot used to assess distributional shape differences: the quantile differences are plotted as a function of the quantile averages. The general difference between a scatterplot and a q-q plot is vividly described in this quote from Chambers, Cleveland, Kleiner & Tukey, 1983:

“It is essential to understand the difference between a quantile-quantile plot and a scatter plot. Basically, a scatter plot is useful for studying such questions as “Is the monthly average temperature in Lincoln systematically related to the temperature in Newark?” or “If Newark has a hot month, is Lincoln likely to have hot weather in the same month?” On the other hand, the quantile-quantile plot is aimed at such questions as “Over a period of time, do the residents of Lincoln experience the same mixture of hot, mild, and cold weather as people living in Newark?” This question would be meaningful even if the two data sets spanned different years, or if we were comparing autumn temperatures in Newark with spring temperatures in Lincoln, or if Newark were in another galaxy. It is the kind of question that a home owner in Lincoln and one in Newark might be interested in if they were concerned about the cost of heating in the winter and air conditioning in the summer at the two places but had no interest in whether they experienced hot and cold spells at exactly the same time.”

In short, a scatterplot is about the relation between paired observations, whereas a q-q plot is about relative shape differences between two distributions of observations. A scatterplot can only be used when dealing with paired observations; for a q-q plot, pairing is irrelevant.

To better understand the two types of graphs, let’s consider a series of illustrations. The next four series of plots reflect situations where the Bland-Altman plot is a useful tool: we try to estimate some ground truth using two types of instruments, and then consider how well the two sets of measurements agree.

Dependent observations

Independent measurement errors, no bias

In this first example, we get two sets of n=200 measurements that are equal to a ground truth + random errors that are independent between the sets. Also, there is no bias. In this situation, the two measurement methods agree with each other. Any pattern is due to sampling variability. Increase the sample size in the code to see what happens.

For the Bland-Altman plot, a superimposed smoother is useful to reveal trends in the scatterplot.
In this example and the following one, the two plots reveal similar trends in differences. We can also see a difference in scale: the Bland-Altman plot shows the raw differences between pairs of observations, whereas the Tukey mean-difference plot shows differences between matching quantiles, or order statistics. In the case of equal sample sizes, these quantiles are simply the sorted observations, such that the smallest observation in one group is compared to the smallest observation in the other group, and so on in rank order.
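The code for these figures is in R on GitHub; to make the distinction concrete, here is a stdlib-only Python sketch of the two sets of plotted coordinates. The simulation settings are mine, chosen to mimic this first example, and a simple Gaussian error model stands in for the blog's data.

```python
import random

random.seed(1)

# Simulate n = 200 paired measurements: ground truth plus independent
# errors, no bias (as in the first example above). Settings are illustrative.
n = 200
truth = [random.gauss(0, 1) for _ in range(n)]
x = [t + random.gauss(0, 0.2) for t in truth]
y = [t + random.gauss(0, 0.2) for t in truth]

# Bland-Altman points: difference vs average of each PAIR (order matters).
ba_avg = [(xi + yi) / 2 for xi, yi in zip(x, y)]
ba_dif = [xi - yi for xi, yi in zip(x, y)]

# Tukey mean-difference points: difference vs average of matching QUANTILES.
# With equal sample sizes, the quantiles are simply the sorted observations.
xs, ys = sorted(x), sorted(y)
tmd_avg = [(xi + yi) / 2 for xi, yi in zip(xs, ys)]
tmd_dif = [xi - yi for xi, yi in zip(xs, ys)]
```

Shuffling one variable would change the Bland-Altman points but leave the Tukey mean-difference points intact, which is the whole difference between the two plots.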

Correlated measurement errors, no bias

In this situation, the error terms have a 0.5 correlation.

Additive bias

Now in addition to the correlated noise, there is a 0.1 additive bias applied to Y.

Additive and multiplicative bias

Same as above, now with the addition of multiplicative bias.

Independent observations

Now we consider completely independent observations to illustrate that the Tukey mean-difference plot captures shape differences, like other related quantile methods. In this scenario, the Bland-Altman plot makes no sense.

Illustrate populations

Consider pairs of marginal Beta distributions. In each panel, the reference condition is X = Beta(10,10), which is compared to Y = Beta(10,10); Y = Beta(3,3); Y = Beta(8,6); Y = Beta(4,2). Why Beta distributions? That’s how percent correct data are distributed across participants, for instance (Rousselet, 2025).

Simulate data: vary shape

Now imagine we take independent samples from these populations. In that situation the Bland-Altman plot is inappropriate, because any pairing of observations would be completely arbitrary. To make it more interesting, we can also use different sample sizes: n1=100 and n2=200.

Downsample to the deciles

Since we’re plotting quantiles, we don’t need to plot so many. We can make the same graphs using the deciles only.
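The deciles are easy to compute even with unequal sample sizes. Here is a stdlib-only Python sketch (the blog's code is in R), using Python's built-in interpolation quantiles rather than the Harrell-Davis estimator mentioned below; the Beta parameters match the last panel above.

```python
import random
import statistics

random.seed(42)

# Independent samples of unequal sizes, from Beta populations as above.
x = [random.betavariate(10, 10) for _ in range(100)]  # reference, n1 = 100
y = [random.betavariate(4, 2) for _ in range(200)]    # comparison, n2 = 200

# Deciles (9 cut points) of each sample: matching quantiles are defined
# even though pairing individual observations would be arbitrary.
dx = statistics.quantiles(x, n=10)
dy = statistics.quantiles(y, n=10)

# Tukey mean-difference coordinates at the deciles.
avg = [(a + b) / 2 for a, b in zip(dx, dy)]
dif = [b - a for a, b in zip(dx, dy)]

for q, (a, d) in enumerate(zip(avg, dif), start=1):
    print(f"decile {q}: average = {a:.3f}, difference = {d:.3f}")
```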

Downsample to the quartiles

Or even just the quartiles!

And to get confidence intervals in these plots, the percentile bootstrap would work very well, especially when combined with the Harrell-Davis quantile estimator (Rousselet, Pernet & Wilcox, 2017; Wilcox & Rousselet, 2024).
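Such intervals are easy to sketch. The snippet below runs a percentile bootstrap for the difference between the medians of two independent samples; for simplicity, a plug-in interpolation quantile stands in for the Harrell-Davis estimator recommended above, and all settings (sample sizes, Beta parameters, number of bootstrap samples) are illustrative.

```python
import random

random.seed(7)

def quant(a, q):
    """Simple interpolation quantile (a stand-in for Harrell-Davis)."""
    s = sorted(a)
    pos = q * (len(s) - 1)
    i = int(pos)
    j = min(i + 1, len(s) - 1)
    return s[i] + (pos - i) * (s[j] - s[i])

def boot_quantile_diff_ci(x, y, q=0.5, nboot=2000, alpha=0.05):
    """Percentile bootstrap CI for the difference between the q-th
    quantiles of two independent samples."""
    boots = sorted(
        quant(random.choices(y, k=len(y)), q) - quant(random.choices(x, k=len(x)), q)
        for _ in range(nboot)
    )
    lo_i = round(alpha / 2 * nboot)
    hi_i = round((1 - alpha / 2) * nboot) - 1
    return boots[lo_i], boots[hi_i]

x = [random.betavariate(10, 10) for _ in range(100)]
y = [random.betavariate(4, 2) for _ in range(200)]
lo, hi = boot_quantile_diff_ci(x, y, q=0.5)
print(f"median difference 95% CI: [{lo:.3f}, {hi:.3f}]")
```

The same function applied at q = 0.1, 0.2, …, 0.9 gives a confidence band for a decile version of the plot.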

In conclusion, the Tukey mean-difference plot is another great example of a quantile graphical method, and although in some situations it can reveal similar trends as the Bland-Altman plot, the two approaches are certainly not the same.

The R code is on GitHub.

Improving statistical reporting in psychology

Schubert et al. (2025) describe important steps to improve statistical reporting in psychology. I strongly encourage all psychologists (and neuroscientists) to read this goldmine of an article. There is no doubt that implementing their suggestions would improve the typical psychology article. However useful, some of the suggested steps are insufficient, and some of the proposed examples of good practice are at least missed opportunities to educate the community. Here are the main examples of good practice presented in Table 1 and some of the issues with these statements:

Hypothesis and design

“We hypothesized that participants in a high-load visual working memory
condition would have lower recall accuracy (and longer RTs) than those in a
low-load condition.”

Although more specific than just writing that the conditions will differ, it is unclear how that statement translates into a testable hypothesis. The most common but rarely justified approach is to compare group means. However, that approach makes strong assumptions about the symmetry of the distributions, or stochastic dominance, or both. The mean is also not robust and asks a very specific question about the data, often not the most interesting one. And what about individual participants? Is a significant group difference in means necessary and sufficient to support the theory? What if a large proportion of participants do not show the group effect?

Sample size justification

“Our a priori power analysis (80% power, α=0.05) for detecting a medium
effect size (d=0.50) required 52 participants. We oversampled to 60 in
anticipation of attrition.”

Obviously what is missing here is the statistical test for which the sample size is estimated. This is clearly emphasised in the main text of Schubert et al. (2025), a point that should be prominently featured in Table 1 too. In my experience, most articles that contain some sort of sample size justification omit the statistical test. Actually, statistical tests are often completely absent from the method section. Or sometimes one test is mentioned, but the results report other ones, often including a more complex ANOVA, for which the sample size will be insufficient. Schubert et al. (2025) address this critical point by suggesting “to base the sample size on the test that requires the largest sample to ensure adequate power across all analyses”. Typically that would be the most complex interaction. But in doing so, one must consider the shape of the interaction too:

Sommet, N., Weissman, D. L., Cheutin, N., & Elliot, A. J. (2023). How many participants do I need to test an interaction? Conducting an appropriate power analysis and achieving sufficient power to detect an interaction. Advances in Methods and Practices in Psychological Science, 6(3).

Schubert et al. (2025) do a great job at flagging other common issues with sample size justifications, including choosing a realistic expected effect size and dealing with multiple statistical tests. However, the notion that “sample size planning is technically straightforward for simple statistical modeling approaches” is misguided. Standard tests and matching power calculators all make assumptions that are necessarily violated by any data we encounter in psychology: all quantities we measure are bounded or skewed, usually both. In addition, ANOVAs are typically applied to data after aggregation across trials, which makes the common yet insane assumption of infinite measurement precision (aka no measurement error at the participant level). And the number of trials is usually not considered in the sample size justification. There is the added difficulty of analysis dependencies. For instance, some form of outlier removal is applied, which affects degrees of freedom, but this is not considered in the analyses or the sample size calculations–removing outliers and applying a t-test as if nothing had happened is inappropriate. For all these reasons, power analyses from standard calculators are wrong. The same goes for p values. For instance, in the example above, a sample size of 52 participants was estimated, but 60 participants were sampled; that extra over-sampling step affects the p values, yet it won’t be considered in the analyses, leading to inaccurate p values.
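One way out is simulation-based power analysis: simulate the entire planned pipeline, including every pre-processing step, and count rejections. Here is a minimal stdlib-only Python sketch of the principle; the numbers are hypothetical and a normal approximation replaces the t distribution to avoid dependencies, so treat the output as illustrative only.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(3)

def simulated_power(n=30, effect=0.5, alpha=0.05, niter=5000):
    """Estimate power by simulating the full analysis pipeline. Here the
    pipeline is a bare one-sample test; a realistic version would include
    every step applied to the real data (outlier removal, aggregation
    across trials, ...), so the estimate matches the actual analysis."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(niter):
        sample = [random.gauss(effect, 1) for _ in range(n)]
        z = mean(sample) / (stdev(sample) / n ** 0.5)
        hits += abs(z) > z_crit
    return hits / niter

power = simulated_power()
print(f"estimated power: {power:.3f}")
```

Adding the planned outlier-removal step inside the loop is exactly how the dependencies described above can be taken into account.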

Outliers and missing data

“In line with our preregistered plan, reaction times exceeding 3 SDs from
the condition mean were treated as outliers and removed from subsequent
analyses, affecting 2.5% of trials. Five participants who withdrew mid-
study were excluded from final analyses (final N=55).”

Because SD and the mean are not robust, this rejection rule is not robust: as outliers get larger, they are less likely to be detected. Also, the rule implies a symmetric distribution, which is unlikely. If participants were removed, authors should also explain how they calculated the p values conditional on this outlier removal procedure; the same goes for the power analysis. Good luck calculating a p value that properly reflects all analysis and pre-processing conditional steps:

Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603.

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804

In practice, every p value you have ever calculated or read is wrong because processing dependencies are ignored.

Schubert et al. (2025) make two excellent points related to the statement above: one, what looks like an outlier might be a legitimate member of a skewed or heavy-tailed distribution; two, an alternative to detecting outliers is to use robust methods that can handle them by default, such as making inferences about trimmed means. What Schubert et al. (2025) fail to mention, as noted in the previous section, is that removing outliers and applying standard methods to the remaining data is inappropriate, because that procedure affects the standard error (their reference 68 covers that important point). Of course, another option is to use an appropriate mixed-effect model, for instance involving shifted lognormal distributions for reaction time data.
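The non-robustness of the 3-SD rule quoted earlier is easy to demonstrate: in a sample of size n, no standardised score can exceed (n − 1)/√n, so with small samples even an absurd outlier cannot be flagged, because it inflates the very mean and SD used to detect it. A quick Python check with hypothetical reaction times:

```python
from statistics import mean, stdev

def flagged(data, k=3):
    """Observations more than k SDs from the mean (the quoted rule)."""
    m, s = mean(data), stdev(data)
    return [v for v in data if abs(v - m) > k * s]

# Nine well-behaved reaction times (seconds) plus one absurd outlier.
rts = [0.42, 0.45, 0.47, 0.50, 0.51, 0.53, 0.55, 0.58, 0.60, 20.0]
print(flagged(rts))  # [] -- the outlier masks itself

# Upper bound on |z| in a sample of size n: (n - 1) / sqrt(n).
n = len(rts)
print((n - 1) / n ** 0.5)  # about 2.85, below the threshold of 3
```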

Statistical model specification

“A 2 × 2 repeated-measures ANOVA (Condition: high-load vs. low-load;
Time: pre vs. post) on recall accuracy was conducted. The assumption of
normality was examined and met.”

To be complete, that statement should specify that the inference was on means (as opposed to trimmed means, medians or other quantiles). More crucially, ANOVAs on accuracy data are inappropriate. And testing normality assumptions is not a thing: accuracy data are never normally distributed; if you claim normality (p>0.05), you commit a classic statistical fallacy; the list goes on…

Schubert et al. (2025) make this comment in the main text: “If assumptions are violated, researchers should describe how these were addressed—for example, by […] using a nonparametric alternative”. This is a very common strategy, but unless we make strong assumptions about the populations, non-parametric (typically rank-based) statistics and their parametric counterparts do not test the same hypotheses. Similarly, inferences about means and about trimmed means ask different questions about the data.

Inferential statistics

“A significant Condition × Time interaction emerged for recall accuracy,
F(1, 54)= 4.37, p= 0.04, partial ω2=0.06 (95% CI [0.00, 0.23]). Post hoc
analyses revealed that accuracy decreased significantly from pre- to post-
test in the high-load condition (M difference=−0.15, SE= 0.04, p < 0.001,
d= 0.68 (95% CI [0.39, 0.97])), while accuracy did not significantly differ in
the low-load condition (M difference=−0.02, SE= 0.04, p= 0.630,
d= 0.09 (95% CI [−0.18, 0.36])). This pattern indicates that memory load
impaired performance over time, with a medium-to-large effect size for the
decline in the high-load condition.”

This statement contains several issues, most of which are actually addressed in the article: the claim about the lack of effect from a large p value is a statistical fallacy–use a test of equivalence to support the claim of a negligible effect; the categorisation of the effect size using a one-size-fits-all trichotomy is outdated and must be removed.

Other issues are not addressed. Implicitly, the statement suggests that so-called “post-hoc” tests are carried out after finding a significant interaction. In practice, if you have specific expectations, it is perfectly legitimate to pre-register any contrasts of interest you want, without running an omnibus ANOVA. More importantly, graphical representations and extra analyses should be used to support the interpretation of an interaction, to determine if it is removable. Interactions are typically reported assuming implicitly that the measure (here accuracy) maps linearly onto the construct of interest, without addressing the problem of coordination.

Null Results

“We observed no significant effect of load on RT, F(1, 87)= 0.01, p= 0.92,
partial ω2=0.00 (95% CI [0.00, 0.01]). We further ran an equivalence test
(TOST) using ±0.20 as our smallest effect size of interest, and the 90% CI
for d was fully contained within these bounds, suggesting the effect of load
on RT is practically negligible. Both the lower-bound test, t(87)= 2.08, one-
sided p= 0.020, and the upper-bound test, t(87)=−1.67, one-sided
p=0.049, were statistically significant.”

This is good practice, although the effect size of 0.2 would need to be justified. Also, there are more powerful tests of equivalence than TOST. And if using TOST, or the equivalent confidence-interval approach, keep in mind that standard methods are not robust.

In the article, Schubert et al. (2025) go on to tackle common reporting errors and misconceptions. Most of the points are excellent, except when they suggest using the cocor R package to compare correlations, in the context of the classic interaction fallacy. This package is not recommended because it uses parametric methods that assume bivariate normality. Use the percentile bootstrap instead.

Schubert et al. (2025) flag common misinterpretations of confidence intervals and p values. These errors cannot be flagged often enough. However, describing Bayesian credible intervals as having the probabilistic interpretation often assigned to confidence intervals is misguided: even a credible interval is conditional on the data; there is no magic in the Bayesian world.

Confidence interval definition: wrong again!

Kaiser & Herzog (2025) offer a very useful tutorial on generating distribution-free prediction intervals using cross-validation methods. However, in covering this important topic, they get the definition of a confidence interval wrong. This is all the more annoying because their article appears in AMPPS, an influential methods and stats journal in psychology.

Here are the problematic statements:

[1] “For estimating a population parameter, such as the mean, the sample estimate is often given a confidence interval (CI). Following a probabilistic interpretation of CIs, it can be expected with a certain probability that the population parameter lies within this interval”

[2] “A 95% CI, on the other hand, has a different interpretation: It would indicate the range in which the average job performance for individuals with an IQ of 120 and an integrity score of 40 is likely to fall.”

These statements reflect a common misinterpretation of confidence intervals (Greenland et al. 2016; Hoekstra et al. 2014): the coverage of say 95% does not apply to the interval obtained in that one experiment. This is easier to grasp with an illustration:

The figure shows the outcome of 20 experiments, along the y-axis, each sampling from the same population (a standard normal distribution). Along the x-axis, small dots show individual observations in each experiment. The black disk is the sample mean, and the horizontal line marks the bounds of the confidence interval. The vertical dashed line marks the population mean (zero). Because of sampling variability, the sample means and the associated confidence intervals vary across experiments. Confidence intervals in black include the population mean, those in grey exclude it. For a given experiment, there is no probability associated with the confidence interval: it contains the population value or not, so the outcome is one or zero, nothing in between. The coverage, that is the probability of including the population value, only applies to an infinite number of imaginary experiments carried out in the same way as our experiment. So coverage is a long-run property of a recipe: every time we collect data in a certain way and calculate a confidence interval, that interval either contains the population value or it does not, and the coverage is the proportion of times it does. Whether a given interval contains the population value is obviously unknown in practice, unless we carry out simulations in which we control the population value. Similarly, an experiment doesn’t have statistical power: it detects an effect or not. Power is the long-run property of a programme or area of research, considering an infinite number of imaginary experiments we will never actually carry out.
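This long-run reading is easy to verify by simulation. Here is a quick stdlib-only Python sketch; it uses normal-approximation intervals rather than t intervals, so the realised coverage will be close to, but not exactly, the nominal level.

```python
import random
from statistics import NormalDist

random.seed(11)

def coverage(n=50, conf=0.95, niter=10000):
    """Proportion of intervals containing the true mean (0) across many
    simulated experiments: a property of the recipe, not of one interval."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    hits = 0
    for _ in range(niter):
        sample = [random.gauss(0, 1) for _ in range(n)]
        m = sum(sample) / n
        sd = (sum((v - m) ** 2 for v in sample) / (n - 1)) ** 0.5
        # Does this interval contain the population mean (0)?
        hits += abs(m) <= z * sd / n ** 0.5
    return hits / niter

cov = coverage()
print(f"long-run coverage: {cov:.3f}")  # close to, but not exactly, 0.95
```

Each individual interval in the loop either contains zero or it does not; only the proportion across iterations approaches the nominal level.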

And of course, the coverage of the confidence interval is at the nominal level only under certain circumstances, and the same goes for prediction intervals. In practice, a 95% confidence interval is unlikely to have 95% coverage.

Following Greenland (2019), it is more intuitive to describe confidence intervals as compatibility intervals: such intervals suggest population values highly compatible with the data, given our model. For more on this, there is a very useful discussion about appropriate reporting of frequentist statistics here.

A warning about data-driven simulations

Simulations are essential to plan experiments and to learn about our data and our statistical methods (Rousselet, 2025). However, here I’d like to provide a quick word of caution about data-driven simulations. In this type of simulation, we treat a large sample as a population, from which we resample to create simulated experimental samples–you can see a detailed example of that approach for instance in Rousselet & Wilcox (2020). Using large datasets is great because they contain rich distributional information that we might over-simplify with synthetic data. There is an important limitation with this approach though. As described in Burns et al. (2025), data-driven simulations are affected by the relative size difference between the population and the sample. As sample sizes get closer to the population size, power estimation bias increases, and the sign of the bias depends on the effect size in the population. To better understand this phenomenon, let’s look at some sampling distributions. In Burns et al. (2025), we considered correlation data, but the problem is more general. So here we’ll consider reaction time data from a lexical decision task (Word / Non-Word discrimination task; Ferrand et al. 2010), which have been presented in detail in previous posts:

We start by illustrating the sampling distributions for different combinations of participant population sizes and sample sizes. For each participant, I calculated the 20% trimmed means for the two conditions, and saved the difference between the Non-Word and Word conditions. The full-size population was then defined as the one-sample distribution of 20% trimmed mean differences for all 959 participants. In each simulation iteration, populations of sizes 50, 100, …, 250 were created by sampling without replacement from the full sample. Then, for each simulated population size, experiments were simulated by sampling 20, 50 or 100 participants with replacement. It might seem strange to sample with replacement 100 participants from a population of 50, but I’ve seen that type of over-sampling in the wild, and it is worth checking in case one has only access to a small dataset. As we will see shortly, it is a bad idea. For each iteration, sample size and population size, we calculate the group 20% trimmed mean. Here are the results:

Sampling distributions of the group 20% trimmed mean differences between Non-Word and Word conditions, as a function of population size. Smaller populations were simulated by sampling without replacement from the full dataset of 959 differences estimated using the 20% trimmed mean. For each simulated smaller population, varying numbers of participants were sampled with replacement. The vertical lines indicate the group difference for the full-size population.



The code to reproduce the figures is on GitHub.

As expected (and it is always worth checking), the spread of the sampling distributions varies inversely with the sample size, here the number of participants. The important point here is that for a fixed sample size, we get broader sampling distributions for smaller populations, and the problem is worse if our samples are large relative to the population size. In the figure above, we see a larger difference between population sizes 50 and 250 when taking samples of 100 participants rather than 20 participants. This phenomenon is due to the presence, in some populations, of an over-representation of extreme values, which are themselves more likely to be picked up when sampling with replacement in a simulation. As a result, we get exaggerated tails, with important consequences for power analyses (Burns et al., 2025).
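Here is a stdlib-only Python sketch of the resampling scheme, using a synthetic skewed population in place of the lexical decision data (lognormal values and all sizes are illustrative, not the real dataset). Rerun it with different seeds to see how much the pseudo-population's tails drive the spread of the resampled means.

```python
import random

random.seed(2025)

# A skewed 'full population' standing in for the 959 trimmed-mean
# differences (lognormal values are illustrative, not the real data).
full = [random.lognormvariate(0, 1) for _ in range(1000)]
# A small pseudo-population, drawn without replacement as in the post.
small = random.sample(full, 50)

def sampling_sd(pop, n, niter=4000):
    """SD of the sampling distribution of the mean when drawing samples
    of size n with replacement from pop."""
    means = [sum(random.choices(pop, k=n)) / n for _ in range(niter)]
    m = sum(means) / niter
    return (sum((v - m) ** 2 for v in means) / (niter - 1)) ** 0.5

for n in (20, 100):
    print(f"n = {n}:",
          f"full population: {sampling_sd(full, n):.3f},",
          f"pseudo-population of 50: {sampling_sd(small, n):.3f}")
```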

Statistical power was estimated using a simulation with 20,000 iterations and the same procedure described to derive the sampling distributions. A one-sample t-test for 20% trimmed means was used (Tukey & McLaughlin, 1963; Wilcox, 2022), with a null value of 60 ms and the usual arbitrary alpha value of 0.05. By plotting power as a function of population size, separately for each sample size, we immediately see the massive impact of sample size, here the number of participants.

Power simulation showing results as a function of the number of participants and population size. The inference was on the population 20% trimmed mean difference, using a two-sided one-sample t-test equivalent, with a null hypothesis of 60 ms. All the simulations are based on 20,000 iterations.
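For readers who want to see the mechanics, here is a stdlib-only Python sketch of the Tukey-McLaughlin one-sample test for trimmed means (the actual analyses use R functions from Wilcox, 2022, and the data below are hypothetical). The p-value requires a Student's t distribution with n − 2g − 1 degrees of freedom, so the sketch returns only the trimmed mean, the t statistic and the degrees of freedom.

```python
import math

def trimmed_mean_test(x, null=0.0, trim=0.2):
    """One-sample Tukey-McLaughlin t-test for a trimmed mean.
    Returns (trimmed mean, t statistic, degrees of freedom)."""
    n = len(x)
    g = int(math.floor(trim * n))          # observations trimmed per tail
    s = sorted(x)
    tm = sum(s[g:n - g]) / (n - 2 * g)     # trimmed mean
    # Winsorized sample: trimmed values replaced by the nearest kept values.
    w = [min(max(v, s[g]), s[n - g - 1]) for v in s]
    mw = sum(w) / n
    winvar = sum((v - mw) ** 2 for v in w) / (n - 1)
    se = math.sqrt(winvar) / ((1 - 2 * trim) * math.sqrt(n))
    return tm, (tm - null) / se, n - 2 * g - 1

# Hypothetical Non-Word minus Word differences (ms), null value of 60 ms.
diffs = [35, 48, 52, 55, 60, 62, 64, 70, 75, 200]
tm, t, df = trimmed_mean_test(diffs, null=60)
print(f"20% trimmed mean = {tm}, t = {t:.3f}, df = {df}")
```

Note how the extreme value of 200 is trimmed away and the winsorized variance keeps the standard error stable, which is the point of using trimmed means.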


But also notice an unexpected pattern: for each sample size, the population size has different effects. It is easier to see what is going on by focusing on the extremes: for the smallest sample size (n=10 participants), increasing the population size lowers power. In other words, for this large reaction time effect, conducting a data-driven simulation using a small sample from a small population will tend to over-estimate statistical power. We get the opposite effect when we consider a larger sample size (n=120), as now a smaller population size leads to power under-estimation.

To put these results in perspective, let’s consider power as a function of the number of participants, plotted separately for each population size.

Same results as in the previous figure, with number of participants along the x-axis. The dashed horizontal line marks the target 83% power. Why 83% power? It is a prime number, as good a justification as any (McElreath, 2020).


The number of participants needed to reach 83% power when the population size is 100 is 93. However, the same power estimation when sampling from a larger population of size 300 suggests that we could reach the same power level with only 81 participants. That’s a difference of 12 participants! Using the full dataset of 959 participants, the required number of participants is 78. So when assessing the results of data-driven simulations, we need to carefully consider whether the dataset we use is large enough for our purpose (Burns et al., 2025).

References

Burns, C. D. G., Fracasso, A., & Rousselet, G. A. (2025). Bias in data-driven replicability analysis of univariate brain-wide association studies. Scientific Reports, 15(1), 6105. https://doi.org/10.1038/s41598-025-89257-w

Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Méot, A., Augustinova, M., & Pallier, C. (2010). The French Lexicon Project: Lexical decision data for 38,840 French words and 38,840 pseudowords. Behavior Research Methods, 42(2), 488–496. https://doi.org/10.3758/BRM.42.2.488

McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and STAN (2nd edn). Chapman and Hall/CRC. https://doi.org/10.1201/9780429029608

Rousselet, G. (2025). Using simulations to explore sampling distributions: An antidote to hasty and extravagant inferences. OSF. https://doi.org/10.31219/osf.io/f5q7r_v2

Rousselet, G. A., & Wilcox, R. R. (2020). Reaction Times and other Skewed Distributions: Problems with the Mean and the Median. Meta-Psychology, 4. https://doi.org/10.15626/MP.2019.1630

Tukey, J. W., & McLaughlin, D. H. (1963). Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 25(3), 331–352. JSTOR. https://www.jstor.org/stable/25049278

Wilcox, R. R. (2022). Introduction to Robust Estimation and Hypothesis Testing (5th edn). Academic Press.

The Declaration of Helsinki mandates pre-registration

Several times a month, I edit or review an article that contains two contradictory statements:

[1] The experiment was not pre-registered.

[2] The experiment complies with the Declaration of Helsinki.

Often, the two sentences are back-to-back. The problem is that researchers who make these statements haven’t read the DoH, because otherwise they would have noticed article 35, which mandates pre-registration:

Medical research involving human participants must be registered in a publicly accessible database before recruitment of the first participant.

When I point out the contradiction, researchers either remove the statement about the DoH, or point to an earlier version that doesn’t mention public registration. How convenient!

The same researchers then go on to report an avalanche of p values, making unwarranted claims about the presence or absence of effects. They would make us believe that they carefully thought about and planned 20-40+ inferential tests. And they agonise about these p values ever so close to some arbitrary magic threshold. Yet they forget that without pre-registration, p values are essentially not interpretable. More on this topic in these classic references:

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804. https://doi.org/10.3758/BF03194105

Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603. https://doi.org/10.1037/a0029146

In sum, most research in humans is unethical, at least according to the DoH, and researchers agonise over p values that cannot be properly interpreted because of lack of pre-registration, and other issues…

UPDATE: article 21, in the latest (2024) version of the declaration, also mentions rigour and avoiding research waste. More on this topic in this video from Darren Dahly: https://statsepi.substack.com/p/the-declaration-of-helsinki-now-says

Using cluster-based permutation tests to estimate MEG/EEG onsets: How bad is it?

Guillaume A. Rousselet

The answer to this important question is in this article in the European Journal of Neuroscience. It is an extended version of work presented in a blog post published over a year ago. Although the focus is on MEG/EEG timing estimation, the conclusions and the main points about statistical issues apply broadly to brain imaging and many other scientific endeavours. As the article is open access and relatively short, I won’t repeat the content, but there is a good snapshot in the graphical abstract:

I also made an animated version of Figure 1 using gganimate. The conclusion is to avoid FDR correction if the goal is to estimate onsets. More on this topic here. Figure 1a:

Figure 1b:

I also made an animated illustration of the change point algorithm (here limited to differences in means), using code from Rebecca Killick and gganimate:

Finally, a shout out to Benedikt Ehinger, who reproduced some of the results in the article as part of the reviewing process, using his own Julia implementation. Because I had a preprint of the article on bioRxiv, he posted the replication on his blog. One of the many benefits of open science…

Simulation-Based Power Analyses for the Smallest Effect Size of Interest: A Confidence-Interval Approach for Minimum-Effect and Equivalence Testing

Riesthuis (2024) is a very useful article that all psychologists and neuroscientists should read if they are not familiar with the very important notion of non-zero and interval hypotheses. Riesthuis (2024) provides an excellent summary of the many benefits of working with a minimum effect size of interest (MESOI; like others, Riesthuis calls it a “smallest effect size”, but “minimum” is a better term). A MESOI is required to go beyond null-hypothesis significance testing (NHST) and perform tests of inferiority, superiority and equivalence. A MESOI should be defined before the study is conducted, and used in power analyses, which will lead to higher sample sizes–the price to pay to ask more specific questions.

To help understand what we gain by using a MESOI, it is useful to remember some of the classic issues associated with NHST:

  • it cannot be used to support the null (doing so is circular and a statistical fallacy);
  • it is biased against the null, with p values near 0.05 associated with weak support against the null (Wetzels et al. 2011);
  • it is inconsistent (Rouder et al. 2016), meaning that the proportion of false positives (alpha) is constant with sample size–so when there is no effect, we keep rejecting at our alpha level even if we have very precise measurements near zero, afforded by very large sample sizes;
  • because of its inconsistency, as we increase our sample size, we will reject the null for ever smaller, trivial effect sizes (Wagenmakers, 2007; Meehl, 1997);
  • as another consequence of inconsistency, a theory might never die, because alpha % of experiments will inevitably find significant effects to support it–more than alpha in practice when using non-robust statistics.

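The inconsistency point is easy to verify by simulation. In this minimal sketch (my own Python code, for illustration only), the one-sample t-test keeps rejecting a true null at the alpha rate, no matter how large the samples get:

```python
import numpy as np
from scipy import stats

def fp_rate(n, nsim=10_000, seed=0):
    """Long-run proportion of p < 0.05 one-sample t-tests
    when the null is true (sampling from a standard normal)."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(0, 1, size=(nsim, n))
    _, p = stats.ttest_1samp(samples, 0, axis=1)
    return np.mean(p < 0.05)

# near 0.05 for every sample size: precision does not rescue NHST
for n in (10, 100, 1000):
    print(n, round(fp_rate(n), 3))
```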
There are several ways to achieve consistency (Rouder et al. 2016), one of which is equivalence testing. If you are skeptical about equivalence testing, it helps to realise that it is equivalent to the Bayes factor and to the ROPE + HDI procedure (Campbell & Gustafson, 2024), though with the usual drawbacks of frequentist inferences (Wagenmakers, 2007; Kruschke, 2013). Equivalence testing simply involves interval hypotheses instead of the ritualistic point null: define an interval around zero going from –MESOI to +MESOI; if a 90% compatibility interval (confidence is such a bad term) lies entirely outside that interval, reject the null; if the compatibility interval is contained inside the +/- MESOI interval, declare the results compatible with the null; otherwise remain undecided. Interval hypotheses can also be used to calculate second-generation p values, which alleviate many issues with standard p values (Blume et al. 2019).
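To make the three-way decision rule concrete, here is a minimal Python sketch (the function name and structure are mine, not code from any of the cited papers):

```python
import numpy as np
from scipy import stats

def equivalence_decision(x, mesoi, conf=0.90):
    """Three-way decision for a one-sample interval hypothesis:
    compare a compatibility interval for the mean to [-mesoi, +mesoi]."""
    n = len(x)
    lo, hi = stats.t.interval(conf, n - 1, loc=np.mean(x), scale=stats.sem(x))
    if lo > mesoi or hi < -mesoi:     # CI entirely outside the interval
        return "meaningful effect"
    if -mesoi < lo and hi < mesoi:    # CI entirely inside the interval
        return "compatible with the null"
    return "undecided"                # CI straddles a boundary

rng = np.random.default_rng(1)
print(equivalence_decision(rng.normal(0, 1, 500), mesoi=0.5))
print(equivalence_decision(rng.normal(2, 1, 500), mesoi=0.5))
```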

Riesthuis (2024) provides examples of definitions of MESOI from the literature, and acknowledges that the task can be difficult. However, this is not an excuse to use NHST by default, or to rely on standard, yet often meaningless benchmarks (Götz, Gosling & Rentfrow, 2024). Defining a MESOI requires a cost-benefit analysis, involving an appraisal of the practical / clinical / theoretical significance of effects. Abstract, context-free effect sizes plucked from a hat won’t do. But if “researchers rely on benchmarks because no other information is available, it may be more of an indication that the research is simply not ready yet to test hypotheses and that more exploratory research is necessary (see Scheel et al., 2021).” (Riesthuis, 2024).

In addition to a nice summary of the benefits of using a MESOI, Riesthuis (2024) provides examples and code to perform simulations using confidence intervals. The main take-home message is the need for larger sample sizes to conduct tests of equivalence or tests of superiority (minimum-effect testing). This point is well worth repeating, but hardly surprising: you need more evidence to ask more specific questions; or put another way, you need less evidence if you ask a vague question about a point null hypothesis, but you can only expect to get a vague answer (Rouder et al. 2016).
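The gist of the simulation-based approach can be conveyed in a few lines. This sketch (my own Python, not Riesthuis’s materials) estimates the power to declare equivalence, i.e. the proportion of simulated 90% compatibility intervals falling entirely inside ±MESOI, and shows how it grows with sample size:

```python
import numpy as np
from scipy import stats

def equivalence_power(n, mesoi, true_effect=0.0, nsim=5000, conf=0.90, seed=0):
    """Proportion of simulated experiments whose confidence interval
    lies entirely inside [-mesoi, +mesoi].
    Minimal sketch assuming normal data and inference on means."""
    rng = np.random.default_rng(seed)
    x = rng.normal(true_effect, 1, size=(nsim, n))
    m = x.mean(axis=1)
    se = x.std(axis=1, ddof=1) / np.sqrt(n)
    tcrit = stats.t.ppf(0.5 + conf / 2, n - 1)
    lo, hi = m - tcrit * se, m + tcrit * se
    return np.mean((-mesoi < lo) & (hi < mesoi))

# asking the more specific question demands more observations:
for n in (30, 60, 120):
    print(n, equivalence_power(n, mesoi=0.5))
```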

My only criticism of the simulations is that they assume normality and restrict inferences to means. This is a missed opportunity to educate researchers about robust statistics. Simulations incorporating robust estimators and alternative methods to build confidence intervals could easily be extended to consider a MESOI (Rousselet, Pernet & Wilcox, 2023).

Finally, figure 1 is misleading, because it suggests that confidence intervals contain distributional information: that values near the middle of the interval are more probable than values near the ends. As explained in Kruschke (2013), this interpretation is incorrect.

References

Blume, J. D., Greevy, R. A., Welty, V. F., Smith, J. R., & Dupont, W. D. (2019). An Introduction to Second-Generation p-Values. The American Statistician, 73(sup1), 157–167.

Campbell, H., & Gustafson, P. (2024). The Bayes factor, HDI-ROPE, and frequentist equivalence tests can all be reverse engineered–almost exactly–from one another: Reply to Linde et al. (2021). Psychological Methods. https://doi.org/10.1037/met0000507

Götz, F. M., Gosling, S. D., & Rentfrow, P. J. (2024). Effect sizes and what to make of them. Nature Human Behaviour, 8, 798–800. https://doi.org/10.1038/s41562-024-01858-z

Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603.

Meehl, P. E. (1997). The Problem is Epistemology, Not Statistics: Replace Significance Tests by Confidence Intervals and Quantify Accuracy of Risky Numerical Predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What If There Were No Significance Tests? Psychology Press.

Riesthuis, P. (2024). Simulation-Based Power Analyses for the Smallest Effect Size of Interest: A Confidence-Interval Approach for Minimum-Effect and Equivalence Testing. Advances in Methods and Practices in Psychological Science, 7(2).

Rouder, J. N., Morey, R. D., Verhagen, J., Province, J. M., & Wagenmakers, E.-J. (2016). Is There a Free Lunch in Inference? Topics in Cognitive Science, 8(3), 520–547.

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2023). An introduction to the bootstrap: A versatile method to make inferences by using data-driven simulations. Meta-Psychology, 7. https://doi.org/10.15626/MP.2019.2058

Scheel, A. M., Tiokhin, L., Isager, P. M., & Lakens, D. (2021). Why Hypothesis Testers Should Spend Less Time Testing Hypotheses. Perspectives on Psychological Science, 16(4), 744–755.

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804.

Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E.-J. (2011). Statistical Evidence in Experimental Psychology: An Empirical Comparison Using 855 t Tests. Perspectives on Psychological Science, 6(3), 291-298.

You get the symptoms of a replication crisis even when there isn’t one: considering power

Many methods have been proposed to assess the success of a replication (Costigan et al. 2024; Cumming & Maillardet 2006; Errington et al. 2021; LeBel et al. 2019; Ly et al. 2019; Mathur & VanderWeele 2020; Muradchanian et al., 2021; Patil et al. 2016; Spence & Stanley 2024; Verhagen & Wagenmakers 2014). The most common method, also used to determine if results from similar experimental designs are consistent across studies, is consistency in statistical significance: do the two studies report a p value less than some (usually arbitrary) threshold? This approach can be misleading for many reasons, for instance when two studies report the same group difference but different confidence intervals: one including the null, the other excluding it. Even though the group differences are the same, sampling error combined with statistical significance would lead us to conclude that the two studies disagree. There is a very nice illustration of the issue in Figure 1 of Amrhein, Greenland & McShane (2019).

More generally:

“if the alternative is correct and the actual power of two studies is 80%, the chance that the studies will both show P ≤ 0.05 will at best be only 0.80(0.80) = 64%; furthermore, the chance that one study shows P ≤ 0.05 and the other does not (and thus will be misinterpreted as showing conflicting results) is 2(0.80)0.20 = 32% or about 1 chance in 3.”
Greenland et al. 2016 (see also Amrhein, Trafimow & Greenland 2019)

So, in the long run, even if two studies always sample from the same population (and even assuming all unmeasured sources of variability are the same across labs; Gelman et al. 2023), the literature would look as if there were a replication crisis when none exists.

Let’s expand the single values from the example by Greenland et al. (2016) and plot the probability of finding consistent and inconsistent results as a function of power:
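The two curves are simple functions of per-study power: both studies reach p ≤ 0.05 with probability power squared, and they disagree with probability 2 × power × (1 − power). A quick check in Python (my own sketch, not the code used for the figure):

```python
import numpy as np

# Two independent studies with the same true power:
power = np.linspace(0.05, 1.0, 96)
p_both = power ** 2                    # both studies reach p <= 0.05
p_conflict = 2 * power * (1 - power)   # one does, the other does not

# conflicting verdicts are most likely at 50% power (probability 0.5);
# the Greenland et al. (2016) example, power = 0.80:
print(round(0.80 ** 2, 2), round(2 * 0.80 * 0.20, 2))  # 0.64 0.32
```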

When deciding about consistency between experiments using the statistical significance criterion, the probability of reaching the correct decision depends on power, and unless power is very high, we will often be wrong.

In the previous figure, why consider power as low as 5%? If that seems unrealistic, a search for n=3 or n=4 in Nature and Science magazines will reveal recent experiments carried out with very small sample sizes in the biological sciences. Also, in psychology, interactions require much larger sample sizes than typically used, for instance when comparing correlation coefficients (Rousselet, Pernet & Wilcox, 2023). So very low power is still a real concern.

In practice, the situation is probably worse, because power analyses typically assume that parametric assumptions are met, so the real power of a line of research will be lower than expected — see simulations in Rousselet & Wilcox (2020); Wilcox & Rousselet (2023); Rousselet, Pernet & Wilcox (2023).

To provide an illustration of the effect of skewness on power, and in turn, on replication success based on statistical significance, let’s use g-and-h distributions — see details in Rousselet & Wilcox (2020) and Yan & Genton (2019). Here we consider h=0 and vary g from 0 (normal distribution) to 1 (shifted lognormal distribution):
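The g-and-h transformation is easy to implement. Here is a minimal sketch (in Python; the original simulations used R), following the standard definition in Yan & Genton (2019): a standard normal Z is transformed with g controlling skewness and h controlling tail thickness.

```python
import numpy as np

def ghdist(n, g=0.0, h=0.0, rng=None):
    """Sample n observations from a Tukey g-and-h distribution.
    g controls skewness, h controls tail thickness;
    g = h = 0 returns a standard normal sample."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(n)
    x = z if g == 0 else (np.exp(g * z) - 1) / g
    return x * np.exp(h * z ** 2 / 2)

# g = 1, h = 0: a lognormal shifted to have median zero
x = ghdist(100_000, g=1, h=0, rng=np.random.default_rng(0))
```

With g = 1 and h = 0, the median stays at zero but the mean is pulled to the right, which is exactly what hurts inference on means.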

Now, let’s do a simulation in which we vary g, take samples of n=20, and perform a one-sample t-test on means, 10% trimmed means and 20% trimmed means. The code is on GitHub. To assess power, a constant is added to each sample, assuming a power of 80% when sampling from a standard normal population (g=h=0). Alpha is set to the arbitrary value of 0.05. The simulation includes 100,000 iterations.
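The simulation code on GitHub is in R; to illustrate the trimmed-mean test it relies on, here is a Python sketch of a one-sample t-test on a trimmed mean (the Tukey-McLaughlin approach covered in Wilcox & Rousselet, 2023); the function name is mine:

```python
import numpy as np
from scipy import stats

def trimmed_ttest(x, mu=0.0, gamma=0.2):
    """Two-sided one-sample t-test on a gamma-trimmed mean,
    using the winsorized variance for the standard error
    (Tukey-McLaughlin); gamma = 0 gives the ordinary t-test."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    k = int(np.floor(gamma * n))          # observations trimmed per tail
    tmean = x[k:n - k].mean()
    xw = np.clip(x, x[k], x[n - k - 1])   # winsorize: clamp tails, don't drop
    se = xw.std(ddof=1) / ((1 - 2 * gamma) * np.sqrt(n))
    tval = (tmean - mu) / se
    df = n - 2 * k - 1
    return tval, 2 * stats.t.sf(abs(tval), df)

rng = np.random.default_rng(3)
tval, pval = trimmed_ttest(np.exp(rng.standard_normal(20)), mu=1.0)
```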

Here are the results for false positives, showing a non-linear increase as a function of g, with the one-sample t-test much more affected when using means than trimmed means:

And the true positive results, showing lower power for trimmed means under normality, but much more resilience to increasing skewness than for the mean:

These results are well known (see for instance Rousselet & Wilcox, 2020). The novelty here is to consider their impact on the probability of a positive outcome in both experiments.

If we assume normality and determine our sample size to achieve 80% power in the long run, skewness can considerably lower the probability of observing two studies both showing p<0.05 if we employ a one-sample t-test on means. Trimmed means are much less affected by skewness. Other robust methods will perform even better (Wilcox & Rousselet, 2023).

In the same setting, here is the probability of a positive outcome in one experiment and a negative outcome in the other one:

Let’s consider h = 0.1, so that outliers are more likely than in the previous simulation:

In the presence of outliers, false positives increase even more with g for the mean:

And power is overall reduced for all methods:

This reduction in power leads to even lower probability of consistent results than in the previous simulation:

And here are the results on the probability of observing inconsistent results:

So in the presence of skewness and outliers, the situation is overall even worse than suggested by Greenland et al. (2016). For this and other reasons, consistency in statistical significance should not be used to infer the success of a replication.

References

Amrhein, V., Greenland, S., & McShane, B. (2019). Scientists rise up against statistical significance. Nature, 567(7748), 305. https://doi.org/10.1038/d41586-019-00857-9

Amrhein, V., Trafimow, D., & Greenland, S. (2019). Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication. The American Statistician, 73(sup1), 262–270. https://doi.org/10.1080/00031305.2018.1543137

Costigan, S., Ruscio, J., & Crawford, J. T. (2024). Performing Small-Telescopes Analysis by Resampling: Empirically Constructing Confidence Intervals and Estimating Statistical Power for Measures of Effect Size. Advances in Methods and Practices in Psychological Science, 7(1), 25152459241227865. https://doi.org/10.1177/25152459241227865

Cumming, G., & Maillardet, R. (2006). Confidence intervals and replication: Where will the next mean fall? Psychological Methods, 11(3), 217–227. https://doi.org/10.1037/1082-989X.11.3.217

Errington, T. M., Mathur, M., Soderberg, C. K., Denis, A., Perfito, N., Iorns, E., & Nosek, B. A. (2021). Investigating the replicability of preclinical cancer biology. eLife, 10, e71601. https://doi.org/10.7554/eLife.71601

Gelman, A., Hullman, J., & Kennedy, L. (2023). Causal Quartets: Different Ways to Attain the Same Average Treatment Effect. The American Statistician. https://www.tandfonline.com/doi/full/10.1080/00031305.2023.2267597

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

LeBel, E. P., Vanpaemel, W., Cheung, I., & Campbell, L. (2019). A Brief Guide to Evaluate Replications. Meta-Psychology, 3. https://doi.org/10.15626/MP.2018.843

Ly, A., Etz, A., Marsman, M., & Wagenmakers, E.-J. (2019). Replication Bayes factors from evidence updating. Behavior Research Methods, 51(6), 2498–2508. https://doi.org/10.3758/s13428-018-1092-x

Mathur, M. B., & VanderWeele, T. J. (2020). New Statistical Metrics for Multisite Replication Projects. Journal of the Royal Statistical Society Series A: Statistics in Society, 183(3), 1145–1166. https://doi.org/10.1111/rssa.12572

Muradchanian, J., Hoekstra, R., Kiers, H., & van Ravenzwaaij, D. (2021). How best to quantify replication success? A simulation study on the comparison of replication success metrics. Royal Society Open Science, 8(5), 201697. https://doi.org/10.1098/rsos.201697

Patil, P., Peng, R. D., & Leek, J. T. (2016). What should we expect when we replicate? A statistical view of replicability in psychological science. Perspectives on Psychological Science : A Journal of the Association for Psychological Science, 11(4), 539–544. https://doi.org/10.1177/1745691616646366

Rousselet, G., Pernet, C. R., & Wilcox, R. R. (2023). An introduction to the bootstrap: A versatile method to make inferences by using data-driven simulations. Meta-Psychology, 7. https://doi.org/10.15626/MP.2019.2058

Rousselet, G. A., & Wilcox, R. R. (2020). Reaction Times and other Skewed Distributions: Problems with the Mean and the Median. Meta-Psychology, 4. https://doi.org/10.15626/MP.2019.1630

Spence, J. R., & Stanley, D. J. (2024). Tempered Expectations: A Tutorial for Calculating and Interpreting Prediction Intervals in the Context of Replications. Advances in Methods and Practices in Psychological Science, 7(1), 25152459231217932. https://doi.org/10.1177/25152459231217932

Verhagen, J., & Wagenmakers, E.-J. (2014). Bayesian tests to quantify the result of a replication attempt. Journal of Experimental Psychology. General, 143(4), 1457–1475. https://doi.org/10.1037/a0036731

Wilcox, R. R., & Rousselet, G. A. (2023). An Updated Guide to Robust Statistical Methods in Neuroscience. Current Protocols, 3(3), e719. https://doi.org/10.1002/cpz1.719

Yan, Y., & Genton, M. G. (2019). The Tukey g-and-h distribution. Significance, 16(3), 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x