
All your p-values are wrong

Or they don’t mean what you think, or they are not interpretable in most situations (Wagenmakers, 2007; Kruschke, 2013). Why is that? Let’s consider how a p-value is calculated. For simplicity, we focus on a one-sample one-sided t-test. Imagine we collected this sample of 50 observations.

The mean is indicated by the vertical solid line. Imagine our hypothesis is a population mean of 70% correct. The t value is 1.52, and the p-value is 0.0670. We obtain the p-value by comparing our observed t value to a hypothetical distribution of t values obtained from imaginary experiments we will never carry out. The default approach is to assume a world in which there is no effect and we sample from a normal distribution. In each imaginary experiment, we get a sample of n=50 observations from our null distribution, and calculate t. Over an infinite number of imaginary experiments, we get this sampling distribution:

The p-value is the tail area highlighted in red, corresponding to the probability of observing imaginary t values at least as extreme as our observed t value under our model. Essentially, a p-value is a measure of surprise, which can be expressed as an s-value in bits (Greenland, 2019). For a p-value of 0.05, the s-value = -log2(0.05) = 4.32. That’s equivalent to flipping a coin 4 times and getting 4 heads in a row. A p-value can also be described as a continuous measure of compatibility between our model and our data, ranging from 0 for complete incompatibility, to 1 for complete compatibility (Greenland et al. 2016). This is key: p-values are not absolute, they change with the data and the model, and even with the experimenter’s intentions.
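To make this concrete, here is a minimal sketch in R of the procedure just described, using the observed t value of 1.52 and n=50 from our example (the raw sample is not reproduced here):

```
# Sketch: the p-value as the tail area of a simulated null t distribution,
# and the corresponding s-value in bits (Greenland, 2019).
set.seed(21)
n <- 50
tobs <- 1.52   # observed t value reported above
nsim <- 1e5
# imaginary experiments: sample n observations from a normal null population,
# compute a one-sample t value each time
tdist <- replicate(nsim, {
  y <- rnorm(n)                  # null world: no effect
  mean(y) / (sd(y) / sqrt(n))
})
pval <- mean(tdist >= tobs)      # one-sided p-value: proportion of t values >= tobs
sval <- -log2(pval)              # surprise value in bits
round(c(p = pval, s = sval), 3)  # roughly 0.067 and 3.9
```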

Let’s unpack this fundamental property of p-values. Our sample of n=50 scores is associated with a p-value of 0.067 under our model. This model includes:

  • a hypothesis of 70% correct;
  • sampling from a normal distribution;
  • independent samples;
  • fixed sample size of n=50;
  • a fixed statistical model, that is, a t-test is applied every time.

(The full model includes other assumptions, for instance that our sample is unbiased, that we have precise measurements, that our measure of interest is informative in the context of a causal model linking data and theory (Meehl, 1997), but we will ignore these aspects here.)

In practice some of these assumptions are incorrect, making the interpretation of p-values difficult.

Data-generating process

The scores in our sample do not come from a normal distribution. As is typical of proportion data, they follow a beta distribution. Here is the population our data came from:

A boxplot suggests the presence of one outlier:

A Q-Q plot suggests some deviations from normality, but a Shapiro-Wilk test fails to reject:

Should we worry or rely on the reassuring call to the central limit theorem and some hand-wavy statement about the robustness of the t-test and ANOVA? In practice, it is a bad idea to assume that empirical t distributions will match theoretical ones, because skewness and outliers can mess things up, even for relatively large sample sizes (Wilcox, 2022). In the one-sample case, ignoring skewness and outliers can lead to inflated false positives and low power (Wilcox & Rousselet, 2023; Rousselet & Wilcox, 2020).
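For reference, the checks above boil down to a few lines of R. Here is a minimal sketch, using a stand-in sample because the real values are not reproduced in this post (the beta parameters are illustrative only):

```
set.seed(21)
x <- rbeta(50, 8, 2)        # placeholder for the 50 observed proportions
boxplot(x)                  # flags values beyond 1.5 IQR from the quartiles
qqnorm(x); qqline(x)        # quantile-quantile plot against a normal reference
shapiro.test(x)             # Shapiro-Wilk test of normality
```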

In our case, we can simulate the t distribution under normality–Normal (sim) in the figure below–and compare it to the t distribution obtained when sampling from our beta population–Beta (sim). As a reference, the figure also shows the theoretical, non-simulated t distribution–Normal (theory). The simulation involved 100,000 iterations, each with a sample of n=50 observations.

Let’s zoom in to better compare the right side of the distributions:

The simulated t distribution under normality is a good approximation of the theoretical distribution. In practice, the t distribution obtained when sampling from our beta population is not accessible, because we typically don’t know exactly how the data were generated. Here we have full knowledge, so we can derive the correct t distribution for our data. Remember that the p-value from the standard t-test was 0.0670. Using our simulation under normality, the p-value is 0.0675. Using the t distribution obtained by sampling from the correct beta distribution, the p-value becomes 0.0804.
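Here is a minimal sketch of that comparison. The beta parameters below are placeholders with a mean equal to the null value of 0.7; the actual population parameters are not reproduced here, so the numbers will differ from those reported above.

```
set.seed(21)
n <- 50
nsim <- 1e5
tobs <- 1.52            # observed t value
null.mean <- 0.7
a <- 7; b <- 3          # beta(7, 3) has mean a/(a+b) = 0.7 (illustrative choice)

tval <- function(y, mu) (mean(y) - mu) / (sd(y) / sqrt(length(y)))

# the sd below is arbitrary: under normality the t value does not depend on it
t.norm <- replicate(nsim, tval(rnorm(n, mean = null.mean, sd = 0.1), null.mean))
t.beta <- replicate(nsim, tval(rbeta(n, a, b), null.mean))

mean(t.norm >= tobs)    # close to the theoretical t-test p-value
mean(t.beta >= tobs)    # p-value under this assumed beta data-generating process
```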

In most situations, p-values are calculated using inappropriate theoretical sampling distributions of t values. This might not affect the observed p-value much, but the correct p-value is unknown.

Independence

The independence assumption is violated whenever data-dependent exclusion is applied to the sample. For instance, it is very common for outliers to be identified and removed before applying a t-test or other frequentist inferential test. This is often done using a non-robust method, such as flagging observations more than 2 SD from the mean. A more robust method could also be used, such as a boxplot rule or a MAD-median rule (Wilcox & Rousselet, 2023). Whatever the method, if the outliers are identified using the sample we want to analyse, the remaining observations are no longer independent, which affects the standard error of the test.

This is well documented in the case of inferences about trimmed means (Tukey & McLaughlin, 1963; Yuen, 1974; Wilcox, 2022). Trimmed means are robust estimators of central tendency that can boost statistical power in the presence of skewness and outliers. To calculate a 20% trimmed mean, we sort the data, remove the lowest 20% and the highest 20% of observations (so 40% in total), and average the remaining observations. This introduces a dependency among the remaining observations, which is taken into account in the calculation of the standard error. In other words, removing observations in a data-dependent manner, and then using a t-test as if the new, lower, sample size were the one intended, is inappropriate.

To illustrate the problem, we can do a simulation in which we sample from a normal or a beta population, each time take a sample of n=50, trim 20% of observations from each end of the distribution, and either apply the incorrect t-test to the remaining n=30 observations, or apply the t formula from Tukey & McLaughlin (1963; Wilcox, 2022) to the full sample. Here are the results:

T values computed on observations left after trimming are far too large. The discrepancy depends on the amount of trimming. Elegantly, the equation of the t-test on trimmed means reverts to the standard equation if we trim 0%. Of course, the amount of trimming should be pre-registered and not chosen after seeing the data.
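For readers who want to experiment, here is a minimal sketch of the one-sample test on trimmed means used above; the same computations are implemented, for instance, in the trimci and trimse functions from Rand Wilcox's Rallfun code.

```
# One-sample t-test for a trimmed mean (Tukey & McLaughlin, 1963; Wilcox, 2022).
# With tr = 0 it reduces to the standard one-sample t-test.
trim.ttest <- function(x, null.value = 0, tr = 0.2){
  n <- length(x)
  g <- floor(tr * n)                         # observations trimmed from each side
  xs <- sort(x)
  xw <- xs
  xw[1:(g + 1)] <- xs[g + 1]                 # winsorize: pull in the tails
  xw[(n - g):n] <- xs[n - g]
  se <- sqrt(var(xw)) / ((1 - 2 * tr) * sqrt(n))   # standard error of the trimmed mean
  tm <- mean(x, trim = tr)                   # trimmed mean
  tval <- (tm - null.value) / se
  df <- n - 2 * g - 1                        # degrees of freedom
  c(tm = tm, t = tval, df = df, p = 2 * pt(-abs(tval), df))   # two-sided p-value
}
# example call: trim.ttest(x, null.value = 0.7, tr = 0.2)
```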

If we apply a t-test on means to our beta sample after trimming 20%, the (incorrect) p-value is 0.0007. The t-test on trimmed means returns p = 0.0256. That’s a large difference! Using the t distribution from the simulation in which we sampled from the correct beta population, p = 0.0329. Also, with a t-test on means we fail to reject (p = 0.0670), whereas we do reject with an inference on 20% trimmed means. In general, inferences on trimmed means tend to be more powerful than inferences on means in the presence of skewness or outliers. However, keep in mind that means and trimmed means are not interchangeable: they ask different questions about the populations. Sample means are used to make inferences about population means, and both are non-robust measures of central tendency. Sample trimmed means are used to make inferences about population trimmed means.

Now the problem is more complicated, and somewhat intractable, if instead of trimming a pre-registered amount of data, we apply an outlier detection method. In that case, independence is violated, but correcting the standard error is difficult because the number of removed observations is a random variable: it will change between experiments.

In our sample, we detect one outlier. Removing the outlier and applying a t-test on means, pretending that our sample size was always n-1, is inappropriate, although very common in practice. The standard error could be corrected using an equation similar to that used in the trimmed mean t-test. However, in other experiments we might reject a different number of outliers. Remember that p-values are not about our experiment: they reflect what could happen in other similar experiments that we will never carry out. In our example, we can do a simulation to derive a sampling distribution that matches the data generation and analysis steps. For each sample of n=50 observations from the beta distribution, we apply a boxplot rule, remove any outliers, and then compute a t value. If we simply remove the outlier from our sample, the t-test returns p = 0.0213. If instead we compute the p-value by using the simulated sampling distribution, we get p = 0.0866. That p-value reflects the correct data generating process and the fact that, in other experiments, we could have rejected a different number of outliers. Actually, in the simulation the median number of rejected outliers is zero, the 3rd quartile is 1, and the maximum is 9.
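Here is a minimal sketch of that simulation. The beta parameters are placeholders with a mean equal to the null value; they are not the actual population parameters, so the resulting p-value will differ from the 0.0866 reported above.

```
set.seed(21)
n <- 50
nsim <- 1e5
null.mean <- 0.7
t.out <- replicate(nsim, {
  y <- rbeta(n, 7, 3)                    # sample from an assumed beta null population
  out <- boxplot.stats(y)$out            # observations flagged by the boxplot rule
  z <- y[!y %in% out]                    # data-dependent exclusion
  (mean(z) - null.mean) / (sd(z) / sqrt(length(z)))
})
# p-value for an observed t value tobs under this data generation + analysis process:
# mean(t.out >= tobs)
```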

In practice, we don’t have access to the correct t sampling distribution. However, we can get a good approximation by using a percentile bootstrap that incorporates the outlier detection and rejection step after sampling with replacement from the full dataset, and before calculating the statistic of interest (Rousselet, Pernet & Wilcox, 2021).
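A minimal sketch of such a bootstrap, again with a stand-in sample and the mean as the statistic of interest:

```
set.seed(21)
x <- rbeta(50, 8, 2)                          # placeholder for the real sample
nboot <- 5000
boot.est <- replicate(nboot, {
  b <- sample(x, length(x), replace = TRUE)   # resample the full dataset
  out <- boxplot.stats(b)$out                 # outlier detection inside the bootstrap loop
  mean(b[!b %in% out])                        # statistic computed after rejection
})
quantile(boot.est, c(0.025, 0.975))           # 95% percentile bootstrap interval
```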

If outliers are expected and common, a good default strategy is to make inferences about trimmed means. Another approach is to make inferences about M-estimators in conjunction with a percentile bootstrap (Wilcox, 2022). M-estimators adjust the amount of trimming based on the data, instead of removing a pre-specified amount. Yet another approach is to fit a distribution with a tail parameter that can account for outliers (Kruschke, 2013). Or it might well be that what looks like an outlier is a perfectly legitimate member of a skewed or heavy-tailed distribution: use more appropriate models that account for rich distributional differences (Rousselet, Pernet, Wilcox, 2017; Farrell & Lewandowsky, 2018; Lindeløv, 2019).

Fixed sample size?

The t-test assumes that the sample size is fixed. This seems obvious, but in practice it is not the case. As we saw in the previous example, sample sizes can depend on outlier rejection, a very common procedure that makes p-values uninterpretable. In general, data-dependent analyses will mess up traditional frequentist inferences (Gelman & Loken 2014). Sample sizes can also be affected by certain inclusion criteria. For instance, data are included in the final analyses only for participants who scored high enough on a control attention check. Deriving correct p-values would require simulations of sampling distributions that incorporate the inclusion check. In other situations, the sample sizes vary for reasons outside the experimenters’ control. For instance, data are collected in an online experiment until a deadline. In that case the final sample size is a surprise revealed at the end of the experiment and is thus a random variable. Consequently, deriving a sampling distribution for a statistic of interest requires a distribution of the plausible sample sizes that could have been obtained. The sampling distribution, say for a t value, would be calculated by integrating over the distribution of sample sizes, and any other sources of variability, such as different plausible numbers of outliers that could have been removed, even if they were not in our sample. Failure to account for these sources of variability leads to incorrect p-values. It gets even more complicated in some situations: p-values also depend on our sampling intentions.
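As a toy illustration, here is a sketch of a null sampling distribution that integrates over uncertain sample sizes; the distribution of plausible sample sizes (Poisson around a target of 50) is an assumption made purely for illustration.

```
set.seed(21)
nsim <- 1e5
t.mix <- replicate(nsim, {
  n <- max(10, rpois(1, lambda = 50))   # final sample size as a random variable
  y <- rnorm(n)                         # null world
  mean(y) / (sd(y) / sqrt(n))
})
# p-value for an observed t value tobs: mean(t.mix >= tobs)
```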

Imagine this scenario inspired by Kruschke (2013), in which a supervisor asked two research assistants to collect data from n=8 participants in total. They misunderstood the instructions, and instead collected n=8 each, so a total of n=16. The plan was to do a one-sample t-test. What sample size should the research team use to compute the degrees of freedom: 8 or 16? So 7 df or 15 df? Here is a plot of the p-values as a function of the critical t values in the two situations.

The answer depends on the sampling distribution matching the data acquisition process, including the probability that the instructions are misunderstood (Kruschke, 2013). If we assume that a misunderstanding leading to this specific error could occur in 10% of experiments, then the matching curve is the dashed one in the figure below, obtained by mixing the two curves for n=8 and n=16.

That’s right, even though the sample size is n=16, because it was obtained by accident and we intended to collect n=8, the critical t and the p-value are obtained from a distribution that is in-between the two for n=8 and n=16, but closer to n=8. This correct distribution reflects the long-run perspective of conducting imaginary experiments in which the majority would have led to n=8. Again, the p-value is not about the current experiment. This scenario reveals that p-values depend on intentions, which has consequences in many situations. In practice, all the points raised so far demonstrate that p-values in most situations are necessarily inaccurate and very difficult to interpret.
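Here is a minimal sketch of that mixture distribution: with probability 0.9 the experiment yields n=8 (7 degrees of freedom), with probability 0.1 it yields n=16 (15 degrees of freedom).

```
# one-sided p-value from the mixture of t distributions
pmix <- function(t, w = 0.9, df1 = 7, df2 = 15){
  w * pt(t, df1, lower.tail = FALSE) + (1 - w) * pt(t, df2, lower.tail = FALSE)
}
pmix(2)   # p-value for an observed t value of 2
# critical t for alpha = 0.05, between qt(.95, 7) = 1.89 and qt(.95, 15) = 1.75,
# and closer to the n=8 value:
uniroot(function(t) pmix(t) - 0.05, c(1, 3))$root
```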

Conditional analyses

Another common way to mess up the interpretation of our analyses is to condition one analysis on another one. For instance, it is common practice to test the data for normality first: if we reject, we apply a rank-based test; if we fail to reject, we apply a t-test. Testing for normality of the data is a bad idea for many reasons, not least because it makes the subsequent statistical tests conditional on the outcome of the normality test. Again, unless we can simulate the appropriate conditional sampling distribution for our statistic, our p-value will be incorrect. Similarly, anyone tempted to use such an approach would need to justify sample sizes using a power simulation that includes the normality step, and any other step that affects the sampling distribution of the statistic. In my experience, such conditional steps are typically ignored in power analyses and pre-registrations. It’s not just p-values: all power analyses are wrong too.
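To see the problem in action, here is a sketch that estimates how often the two-step procedure rejects when the population mean equals the null value, sampling from a skewed population (a lognormal distribution, chosen only for illustration). Note that the two branches test different hypotheses, which is part of the problem.

```
set.seed(21)
n <- 30
nsim <- 10000
null.value <- exp(0.5)    # mean of a lognormal(0, 1) population
reject <- replicate(nsim, {
  y <- rlnorm(n)                                    # skewed population, mean = exp(0.5)
  if (shapiro.test(y)$p.value > 0.05) {
    p <- t.test(y, mu = null.value)$p.value         # normality not rejected: t-test
  } else {
    p <- wilcox.test(y, mu = null.value)$p.value    # normality rejected: rank-based test
  }
  p < 0.05
})
mean(reject)   # rejection rate of the conditional procedure
```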

No measurement error?

It gets worse. Often, t-tests and similar models are applied to data that have been averaged over repetitions, for instance mean accuracy or reaction times averaged over trials in each condition and participant. In this common situation, the t-test ignores measurement error, because the trial-level variability has been averaged away. Obviously, in such situations, mixed-effects (hierarchical) models should be used (DeBruine & Barr, 2021). Using a t-test instead of a mixed-effects model is equivalent to using a mixed-effects model in which the trial-level data have been copied and pasted an infinite number of times, such that measurement precision becomes infinite. This is powerfully illustrated here.
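As a minimal sketch, assuming trial-level data in a long-format data frame dat with columns rt, condition and participant (hypothetical names), and assuming the lme4 package is installed:

```
library(lme4)
# model the trials directly, with by-participant random intercepts and slopes
# (DeBruine & Barr, 2021), instead of averaging trials and running a t-test:
# fit <- lmer(rt ~ condition + (condition | participant), data = dat)
# summary(fit)
```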

Conclusion

In most articles, the p-values are wrong. How they would change using appropriate sampling distributions is hard to determine, and ultimately a futile exercise. Even if the p-values changed very little, the uncertainty makes the obsession with declaring “statistical significance” whenever p<0.05, no matter how close the p-value is to the threshold, all the more ridiculous. So the next time you read in an article that “there was a trend towards significance, p=0.06”, or some other nonsense, in addition to asking the authors if they pre-registered a threshold for a trend, and asking them to also write “a trend towards non-significance, p=0.045”, also point out that the p-value matching their design, analyses, and data generation process is likely to be different from the one reported.

What can we do? A plan of action, from easy to hard:

[1] Take a chill pill, and consider p-values as just one of many outputs, without a special status (Vasishth & Gelman, 2021). Justify your choices in the methods section, unlike the traditional article in which tests pop up out of the blue in the results section, with an irrational focus on statistical significance.

[2] Use bootstrap methods to derive more appropriate sampling distributions. Bootstrap methods, combined with robust estimators, can boost statistical power and help you answer more interesting questions. These methods also let you include preprocessing steps in the analyses, unlike standard parametric methods.

[3] Pre-register everything, along with careful justifications of models, pre-processing steps, and matching power simulations.

[4] Abandon the chase for statistical significance. Instead of focusing on finding effects, focus on a model-centric approach (Devezer & Buzbas, 2023). The goal is to contrast models that capture different hypotheses or mechanisms by assessing how they explain or predict data (Farrell & Lewandowsky, 2018; Gelman, Hill & Vehtari, 2020; James et al., 2021; McElreath, 2020; Yarkoni & Westfall, 2017). What is the explanatory power of the models? What is their predictive accuracy?

Code

https://github.com/GRousselet/blog-pwrong

References

DeBruine, L., & Barr, D. (2021). Understanding Mixed-Effects Models Through Data Simulation. Advances in Methods and Practices in Psychological Science, 4(1), 2515245920965119. https://doi.org/10.1177/2515245920965119

Devezer, B., & Buzbas, E. O. (2023). Rigorous exploration in a model-centric science via epistemic iteration. Journal of Applied Research in Memory and Cognition, 12(2), 189–194. https://doi.org/10.1037/mac0000121

Farrell, S., & Lewandowsky, S. (2018). Computational Modeling of Cognition and Behavior. Cambridge University Press. https://doi.org/10.1017/CBO9781316272503

Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and Other Stories. Cambridge University Press. https://doi.org/10.1017/9781139161879

Gelman, A., & Loken, E. (2014). The Statistical Crisis in Science. American Scientist, 102(6), 460–465. https://www.jstor.org/stable/43707868

Greenland, S. (2019). Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values. The American Statistician, 73(sup1), 106–114. https://doi.org/10.1080/00031305.2018.1529625

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R. Springer US. https://doi.org/10.1007/978-1-0716-1418-1

Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology. General, 142(2), 573–603. https://doi.org/10.1037/a0029146

Lindeløv, J. K. (2019). Reaction time distributions: An interactive overview. https://lindeloev.github.io/shiny-rt/

McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and STAN (2nd edn). Chapman and Hall/CRC. https://doi.org/10.1201/9780429029608

Meehl, P. E. (1997). The Problem is Epistemology, Not Statistics: Replace Significance Tests by Confidence Intervals and Quantify Accuracy of Risky Numerical Predictions. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What If There Were No Significance Tests? Psychology Press. https://meehl.umn.edu/sites/meehl.umn.edu/files/files/169problemisepistemology.pdf

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46(2), 1738–1748. https://doi.org/10.1111/ejn.13610

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2021). The Percentile Bootstrap: A Primer With Step-by-Step Instructions in R. Advances in Methods and Practices in Psychological Science, 4(1), 2515245920911881. https://doi.org/10.1177/2515245920911881

Rousselet, G. A., & Wilcox, R. R. (2020). Reaction Times and other Skewed Distributions: Problems with the Mean and the Median. Meta-Psychology, 4. https://doi.org/10.15626/MP.2019.1630

Tukey, J. W., & McLaughlin, D. H. (1963). Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 25(3), 331–352. JSTOR. https://www.jstor.org/stable/25049278

Vasishth, S., & Gelman, A. (2021). How to embrace variation and accept uncertainty in linguistic and psycholinguistic data analysis. Linguistics, 59(5), 1311–1342. https://doi.org/10.1515/ling-2019-0051

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804. https://doi.org/10.3758/BF03194105

Wilcox, R. R. (2022). Introduction to Robust Estimation and Hypothesis Testing (5th edn). Academic Press.

Wilcox, R. R., & Rousselet, G. A. (2023). An Updated Guide to Robust Statistical Methods in Neuroscience. Current Protocols, 3(3), e719. https://doi.org/10.1002/cpz1.719

Yarkoni, T., & Westfall, J. (2017). Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning. Perspectives on Psychological Science, 12(6), 1100–1122. https://doi.org/10.1177/1745691617693393

Yuen, K. K. (1974). The Two-Sample Trimmed t for Unequal Population Variances. Biometrika, 61(1), 165–170. https://doi.org/10.2307/2334299

A warning about data-driven simulations

Simulations are essential to plan experiments and to learn about our data and our statistical methods (Rousselet, 2025). However, here I’d like to provide a quick word of caution about data-driven simulations. In this type of simulation, we treat a large sample as a population, from which we resample to create simulated experimental samples–you can see a detailed example of that approach, for instance, in Rousselet & Wilcox (2020). Using large datasets is great because they contain rich distributional information that we might over-simplify with synthetic data. There is an important limitation to this approach though. As described in Burns et al. (2025), data-driven simulations are affected by the relative size difference between the population and the sample. As sample sizes get closer to the population size, power estimation bias increases, and the sign of the bias depends on the effect size in the population. To better understand this phenomenon, let’s look at some sampling distributions. In Burns et al. (2025), we considered correlation data, but the problem is more general. So here we’ll consider reaction time data from a lexical decision task (Word / Non-Word discrimination task; Ferrand et al. 2010), which have been presented in detail in previous posts:

We start by illustrating the sampling distributions for different combinations of participant population sizes and sample sizes. For each participant, I calculated the 20% trimmed means for the two conditions, and saved the difference between the Non-Word and Word conditions. The full-size population was then defined as the one-sample distribution of 20% trimmed mean differences for all 959 participants. In each simulation iteration, populations of sizes 50, 100, …, 250 were created by sampling without replacement from the full sample. Then, for each simulated population size, experiments were simulated by sampling 20, 50 or 100 participants with replacement. It might seem strange to sample with replacement 100 participants from a population of 50, but I’ve seen that type of over-sampling in the wild, and it is worth checking in case one only has access to a small dataset. As we will see shortly, it is a bad idea. For each iteration, sample size and population size, we calculate the group 20% trimmed mean. Here are the results:

Sampling distributions of the group 20% trimmed mean differences between Non-Word and Word conditions, as a function of population size. Smaller populations were simulated by sampling without replacement from the full dataset of 959 differences estimated using the 20% trimmed mean. For each simulated smaller population, varying numbers of participants were sampled with replacement. The vertical lines indicate the group difference for the full-size population.



The code to reproduce the figures is on GitHub.
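Here is a minimal sketch of the resampling scheme for one combination of population size and sample size; the vector of participant-level differences is a synthetic stand-in for the 959 trimmed mean differences, not the actual data.

```
set.seed(21)
diff.scores <- rnorm(959, mean = 80, sd = 60)     # placeholder for the full dataset
pop.size <- 100                                   # simulated population size
samp.size <- 20                                   # participants per simulated experiment
nsim <- 10000
sim.tm <- replicate(nsim, {
  pop <- sample(diff.scores, pop.size, replace = FALSE)   # simulated smaller population
  exp.samp <- sample(pop, samp.size, replace = TRUE)      # simulated experiment
  mean(exp.samp, trim = 0.2)                              # group 20% trimmed mean
})
hist(sim.tm, breaks = 50)   # sampling distribution for this combination
```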

As expected (and it is always worth checking), the spread of the sampling distributions varies inversely with the sample size, here the number of participants. The important point here is that for a fixed sample size, we get broader sampling distributions for smaller populations, and the problem is worse if our samples are large relative to the population size. In the figure above, we see a larger difference between population sizes 50 and 250 when taking samples of 100 participants rather than 20 participants. This phenomenon is due to the presence, in some populations, of an over-representation of extreme values, which are themselves more likely to be picked up when sampling with replacement in a simulation. As a result, we get exaggerated tails, with important consequences for power analyses (Burns et al., 2025).

Statistical power was estimated using a simulation with 20,000 iterations and the same procedure described to derive the sampling distributions. A one-sample t-test for 20% trimmed means was used (Tukey & McLaughlin, 1963; Wilcox, 2022), with a null value of 60 ms and the usual arbitrary alpha value of 0.05. By plotting power as a function of population size, separately for each sample size, we immediately see the massive impact of sample size, here the number of participants.

Power simulation results as a function of the number of participants and population size. The inference was on the population 20% trimmed mean difference, using a two-sided one-sample t-test equivalent, with a null hypothesis of 60 ms. All the simulations are based on 20,000 iterations.


But also notice an unexpected pattern: for each sample size, the population size has different effects. It is easier to see what is going on by focusing on the extremes: for the smallest sample size (n=10 participants), increasing the population size lowers power. In other words, for this large reaction time effect, conducting a data-driven simulation using a small sample from a small population will tend to over-estimate statistical power. We get the opposite effect when we consider a larger sample size (n=120), as now a smaller population size leads to power under-estimation.

To put these results in perspective, let’s consider power as a function of the number of participants, plotted separately for each population size.

Same results as in the previous figure, with number of participants along the x-axis. The dashed horizontal line marks the target 83% power. Why 83% power? It is a prime number, as good a justification as any (McElreath, 2020).


The number of participants needed to reach 83% power when the population size is 100 is 93. However, the same power estimation when sampling from a larger population of size 300 suggests that we could reach the same power level with only 81 participants. That’s a difference of 12 participants! Using the full dataset of 959 participants, the required number of participants is 78. So when assessing the results of data-driven simulations, we need to carefully consider whether the dataset we use is large enough for our purpose (Burns et al., 2025).
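For completeness, here is a sketch of the power estimation for one combination of population size and sample size, using the same placeholder data as in the previous sketch and a one-sample test on the 20% trimmed mean with a null of 60 ms.

```
set.seed(21)
# two-sided p-value for a one-sample test on a trimmed mean (Tukey & McLaughlin, 1963)
trim.pval <- function(x, null.value, tr = 0.2){
  n <- length(x); g <- floor(tr * n); xs <- sort(x)
  xs[1:(g + 1)] <- xs[g + 1]; xs[(n - g):n] <- xs[n - g]        # winsorize
  se <- sqrt(var(xs)) / ((1 - 2 * tr) * sqrt(n))
  tval <- (mean(x, trim = tr) - null.value) / se
  2 * pt(-abs(tval), n - 2 * g - 1)
}
diff.scores <- rnorm(959, mean = 80, sd = 60)     # placeholder full dataset
pop.size <- 100; samp.size <- 50; nsim <- 20000
pvals <- replicate(nsim, {
  pop <- sample(diff.scores, pop.size, replace = FALSE)
  trim.pval(sample(pop, samp.size, replace = TRUE), null.value = 60)
})
mean(pvals < 0.05)   # estimated power for this combination
```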

References

Burns, C. D. G., Fracasso, A., & Rousselet, G. A. (2025). Bias in data-driven replicability analysis of univariate brain-wide association studies. Scientific Reports, 15(1), 6105. https://doi.org/10.1038/s41598-025-89257-w

Ferrand, L., New, B., Brysbaert, M., Keuleers, E., Bonin, P., Méot, A., Augustinova, M., & Pallier, C. (2010). The French Lexicon Project: Lexical decision data for 38,840 French words and 38,840 pseudowords. Behavior Research Methods, 42(2), 488–496. https://doi.org/10.3758/BRM.42.2.488

McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and STAN (2nd edn). Chapman and Hall/CRC. https://doi.org/10.1201/9780429029608

Rousselet, G. (2025). Using simulations to explore sampling distributions: An antidote to hasty and extravagant inferences. OSF. https://doi.org/10.31219/osf.io/f5q7r_v2

Rousselet, G. A., & Wilcox, R. R. (2020). Reaction Times and other Skewed Distributions: Problems with the Mean and the Median. Meta-Psychology, 4. https://doi.org/10.15626/MP.2019.1630

Tukey, J. W., & McLaughlin, D. H. (1963). Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 25(3), 331–352. JSTOR. https://www.jstor.org/stable/25049278

Wilcox, R. R. (2022). Introduction to Robust Estimation and Hypothesis Testing (5th edn). Academic Press.

When is a 95% confidence interval not a 95% confidence interval?


In previous posts, we saw how skewness and outliers can affect false positives (type I errors) and true positives (power) in one-sample tests. In particular, when making inferences about the population mean, skewness tends to inflate false positives, and skewness and outliers can destroy power. Here we investigate a complementary perspective, looking at how confidence intervals are affected by skewness and outliers.

Spoiler alert: 95% confidence intervals most likely do not have a coverage of 95%. In fact, I’ll show you an example in which a 95% CI for the mean has an 80% coverage…

The R code for this post is on GitHub.


Back to the title of the post. Seems like a weird question? Not if we consider the definition of a confidence interval (CI). Let’s say we conduct an experiment to estimate quantity x from a sample, where x could be the median or the mean for instance. Then a 95% CI for the population value of x refers to a procedure whose behaviour is defined in the long run: CIs computed in the same way should contain the population value in 95% of exact replications of the experiment. For a single experiment, the particular CI does or does not contain the population value; there is no probability associated with it. A CI can also be described as the interval compatible with the data given our model — see definitions and common misinterpretations in Greenland et al. (2016).

So 95% refers to the (long-run) coverage of the CI; the exact values of the CI bounds vary across experiments. The CI procedure is associated with a certain coverage probability, in the long run, given the model. Here the model refers to how we collected data, data cleaning procedures (e.g. outlier removal), assumptions about the data distribution, and the methods used to compute the CI. Coverage can differ from the nominal level if model assumptions are violated or the model is just plain wrong.

Wrong models are extremely common, for instance when applying a standard t-test CI to percent correct data (Kruschke, 2014; Jaeger, 2008) or Likert scale data (Bürkner & Vuorre, 2019; Liddell & Kruschke, 2018).

For continuous data, CI coverage is not at the expected, nominal level, for instance when the model expects symmetric distributions and we’re actually sampling from skewed populations (which is the norm, not the exception, when we measure sizes, durations, latencies etc.). Here we explore this issue using g & h distributions that let us manipulate asymmetry.

Illustrate g & h distributions

All g & h distributions have a median of zero. The parameter g controls the asymmetry of the distribution, while the parameter h controls the thickness of the tails (Hoaglin, 1985; Yan & Genton, 2019). Let’s look at some illustrations to make things clear.

Examples in which we vary g from 0 to 1.

As g increases, the asymmetry of the distributions increases. Using negative g values would produce distributions with negative skewness.

Examples in which we vary h from 0 to 0.2.

As h increases, the tails are getting thicker, which means that outliers are more likely. 
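Here is a sketch of how g & h observations can be generated (Hoaglin, 1985): start from standard normal deviates and transform them. An equivalent ghdist function is available in Rand Wilcox's Rallfun code.

```
ghdist <- function(n, g = 0, h = 0){
  z <- rnorm(n)
  x <- if (g == 0) z else (exp(g * z) - 1) / g   # g controls asymmetry
  x * exp(h * z^2 / 2)                           # h controls tail thickness
}
set.seed(21)
hist(ghdist(10000, g = 1, h = 0.2), breaks = 100)   # skewed, heavy-tailed example
```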

Test with normal (g=h=0) distribution

Let’s run simulations to look at coverage probability in different situations and for different estimators. First, we sample with replacement from a normal population (g=h=0) 20,000 times (that’s 20,000 simulated experiments). Each sample has size n=30. Confidence intervals are computed for the mean, the 10% trimmed mean (tm), the 20% trimmed mean and the median using standard parametric methods (see details in the code on GitHub, and references for equations in Wilcox & Rousselet, 2018). The trimmed mean and the median are robust measures of central tendency. To compute a 10% trimmed mean, observations are sorted, the 10% lowest and 10% largest values are discarded (20% in total), and the remaining values are averaged. In this context, the mean is a 0% trimmed mean and the median is a 50% trimmed mean. Trimming the data attenuates the influence of the tails of the distributions and thus the effects of asymmetry and outliers on confidence intervals.
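Here is a minimal sketch of the coverage simulation for the mean; the same logic applies to trimmed means and the median with the appropriate CI formulas (see the code on GitHub for the full simulation).

```
set.seed(21)
ghdist <- function(n, g = 0, h = 0){   # g-and-h generator, as above
  z <- rnorm(n)
  x <- if (g == 0) z else (exp(g * z) - 1) / g
  x * exp(h * z^2 / 2)
}
g <- 0; h <- 0                         # normal case; try g=1 and/or h=0.2 for the cases below
n <- 30; nsim <- 20000
pop.mean <- mean(ghdist(1e6, g, h))    # numerical approximation of the population mean
res <- replicate(nsim, {
  x <- ghdist(n, g, h)
  ci <- t.test(x)$conf.int             # standard 95% CI for the mean
  c(cover = ci[1] <= pop.mean && pop.mean <= ci[2],
    width = ci[2] - ci[1],
    shift = ifelse(pop.mean < ci[1], 1, ifelse(pop.mean > ci[2], -1, 0)))
})
mean(res["cover", ])                        # coverage: close to 0.95 under normality
median(res["width", ])                      # median CI width
mean(res["shift", res["cover", ] == 0])     # shift index for CIs missing the population mean
```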

First, we look at coverage for the 4 estimators: the proportion of simulated experiments in which the CIs included the population value. As expected for the special case of a normal distribution, the coverage is close to nominal (95%) for every method:

Mean 10% tm 20% tm Median
0.949 0.948 0.943 0.947

In addition to coverage, we also look at the width of the CIs (upper bound minus lower bound). Across simulations, we summarise the results using the median width. CIs tend to be larger for trimmed means and the median relative to the mean, which implies lower power under normality for these methods (Wilcox & Rousselet, 2018).

Mean 10% tm 20% tm Median
0.737 0.761 0.793 0.889

For CIs that did not include the population value, the distribution is fairly balanced between the left and the right of that value. To see this, I computed a shift index: if a CI was located to the left of the population value, it received a score of -1; if it was located to the right, it received a score of 1. The shift index was then computed by averaging the scores only for those CIs excluding the population value.

Mean 10% tm 20% tm Median
0.046 0.043 0.009 0.013

Illustrate CIs that did not include the population

Out of 20,000 simulated experiments, about 1,000 CIs (roughly 5%) did not include the population value for each estimator. About the same number of CIs were shifted to the left and to the right of the population value, which is illustrated in the next figure. In each panel, the vertical line marks the population value (here it’s zero in all conditions because the population is symmetric). The CIs are plotted in the order of occurrence in the simulation. So the figure shows that if we miss the population value, we’re as likely to overshoot as to undershoot our estimation.

Across panels, the figure also shows that the more we trim (10%, 20%, median) the larger the CIs get. So for a strictly normal population, we more precisely estimate the mean than trimmed means and the median.

Test with g=1 & h=0 distribution

What happens for a skewed population? Three things happen for the mean:

  • coverage goes down
  • width increases
  • CIs not including the population value tend to be shifted to the left (negative average shift values)

The same effects are observed for the trimmed means, but less so the more we trim, because trimming alleviates the effects of the tails.

Measure Mean 10% tm 20% tm Median
Coverage 0.880 0.936 0.935 0.947
Width 1.253 0.956 0.879 0.918
Shift -0.962 -0.708 -0.661 0.017
# left 2350 1101 1084 521
# right 45 188 221 539

Illustrate CIs that did not include the population

The figure illustrates the strong imbalance between left and right CI shifts. If we try to estimate the mean of a skewed population, our CIs are likely to miss it more than 5% of the time, and when that happens, the CIs are most likely to be shifted towards the bulky part of the distribution (here the left for a right skewed distribution). Also, the right shifted CIs vary a lot in width and can be very large.

As we trim, the imbalance is progressively resolved. With 20% trimming, when CIs do not contain the population value, the distribution of left and right shifts is more balanced, although there are still far more left shifts. With the median we have roughly 50% left / 50% right shifts and CIs are narrower than for the mean.

Test with g=1 & h=0.2 distribution

What happens if we sample from a skewed distribution (g=1) in which outliers are likely (h=0.2)?

Measure Mean 10% tm 20% tm Median
Coverage 0.801 0.934 0.936 0.947
Width 1.729 1.080 0.934 0.944
Shift -0.995 -0.797 -0.709 0.018
# left 3967 1194 1086 521
# right 9 135 185 540

The results are similar to those observed for h=0, only exacerbated. Coverage for the mean is even lower, CIs are larger, and the shift imbalance is even more severe. I have no idea how often such situations occur, but I suspect that if you study clinical populations, they might be rather common. Anyway, the point is that it is a very bad idea to assume the distributions we study are normal, apply standard tools, and hope for the best. Reporting CIs as 95% or some other value, without checking, can be very misleading.

Simulations in which we vary g

We now explore CI properties as a function of g, which we vary from 0 to 1, in steps of 0.1. The parameter h is set to 0 (left column of next figure) or 0.2 (right column). Let’s look at column A first (h=0). For the median, coverage is unaffected by g. For the other estimators, there is a monotonic decrease in coverage with increasing g. The effect is much stronger for the mean than the trimmed means.

For all estimators, increasing g leads to monotonic increases in CI width. The effect is very subtle for the median and more pronounced the less we trim. Under normality, g=0, CIs are the shortest for the mean, explaining the larger power of mean-based methods relative to trimmed means in this unusual situation.

In the third panel, the zero line represents an equal proportion of left and right shifts, relative to the population, for CIs that did not include the population value. The values are consistently above zero for the median, with a few more right shifts than left shifts for all values of g. For the other estimators, the preponderance of left shifts increases markedly with g.

Now we look at results in panel B (h=0.2). When outliers are likely, coverage drops faster with g for the mean. Other estimators are resistant to outliers.

When outliers are common, CIs for the population mean are larger than for all other estimators, irrespective of g.

Again, there is a constant over-representation of right-shifted CIs for the median. For the other estimators, the left-shifted CIs dominate more and more with increasing g. The trend is more pronounced for the mean relative to the h=0 situation, with a sharper monotonic downward trajectory.

Conclusion

The answer to the question in the title is: most of the time! Simply because our models are wrong most of the time. So I would take all published confidence intervals with a pinch of salt. [Some would actually go further and say that if the sampling and analysis plans for an experiment were not clearly stipulated before running the experiment, then confidence intervals, like p-values, are not even defined (Wagenmakers, 2007). That is, we can compute a CI, but the coverage is meaningless, because exact repeated sampling might be impossible or contingent on external factors that would need to be simulated.] The best way forward is probably not to advocate for the use of trimmed means or the median over the mean in all cases, because different estimators address different questions about the data. And there are more estimators of central tendency than means, trimmed means and medians. There are also more interesting questions to ask about the data than their central tendencies (Rousselet, Pernet & Wilcox, 2017). For these reasons, we need data sharing to be the default, so that other users can ask different questions using different tools. The idea that the one approach used in a paper is the best to address the problem at hand is just silly.

To see what happens when we use the percentile bootstrap or the bootstrap-t to build confidence intervals for the mean, see this more recent post.

References

Bürkner, Paul-Christian, and Matti Vuorre. ‘Ordinal Regression Models in Psychology: A Tutorial’. Advances in Methods and Practices in Psychological Science 2, no. 1 (1 March 2019): 77–101. https://doi.org/10.1177/2515245918823199.

Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. ‘Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations’. European Journal of Epidemiology 31, no. 4 (1 April 2016): 337–50. https://doi.org/10.1007/s10654-016-0149-3.

Hoaglin, David C. ‘Summarizing Shape Numerically: The g-and-h Distributions’. In Exploring Data Tables, Trends, and Shapes, 461–513. John Wiley & Sons, Ltd, 1985. https://doi.org/10.1002/9781118150702.ch11.

Jaeger, T. Florian. ‘Categorical Data Analysis: Away from ANOVAs (Transformation or Not) and towards Logit Mixed Models’. Journal of Memory and Language 59, no. 4 (November 2008): 434–46. https://doi.org/10.1016/j.jml.2007.11.007.

Kruschke, John K. Doing Bayesian Data Analysis. 2nd Edition. Academic Press, 2014.

Liddell, Torrin M., and John K. Kruschke. ‘Analyzing Ordinal Data with Metric Models: What Could Possibly Go Wrong?’ Journal of Experimental Social Psychology 79 (1 November 2018): 328–48. https://doi.org/10.1016/j.jesp.2018.08.009.

Rousselet, Guillaume A., Cyril R. Pernet, and Rand R. Wilcox. ‘Beyond Differences in Means: Robust Graphical Methods to Compare Two Groups in Neuroscience’. European Journal of Neuroscience 46, no. 2 (1 July 2017): 1738–48. https://doi.org/10.1111/ejn.13610.

Rousselet, Guillaume A., and Rand R. Wilcox. ‘Reaction Times and Other Skewed Distributions: Problems with the Mean and the Median’. Preprint. PsyArXiv, 17 January 2019. https://doi.org/10.31234/osf.io/3y54r.

Wagenmakers, Eric-Jan. ‘A Practical Solution to the Pervasive Problems of p Values’. Psychonomic Bulletin & Review 14, no. 5 (1 October 2007): 779–804. https://doi.org/10.3758/BF03194105.

Wilcox, Rand R., and Guillaume A. Rousselet. ‘A Guide to Robust Statistical Methods in Neuroscience’. Current Protocols in Neuroscience 82, no. 1 (2018): 8.42.1-8.42.30. https://doi.org/10.1002/cpns.41.

Yan, Yuan, and Marc G. Genton. ‘The Tukey G-and-h Distribution’. Significance 16, no. 3 (2019): 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x.