
All your p-values are wrong

Or they don’t mean what you think, or they are not interpretable in most situations (Wagenmakers, 2007; Kruschke, 2013). Why is that? Let’s consider how a p-value is calculated. For simplicity, we focus on a one-sample one-sided t-test. Imagine we collected this sample of 50 observations.

The mean is indicated by the vertical solid line. Imagine our hypothesis is a mean 70% correct. The t value is 1.52, and the p-value is 0.0670. We obtain the p-value by comparing our observed t value to a hypothetical distribution of t values obtained from imaginary experiments we will never carry out. The default approach is to assume a world in which there is no effect and we sample from a normal distribution. In each imaginary experiment, we get a sample of n=50 observations from our null distribution, and calculate t. Over an infinite number of imaginary experiments, we get this sampling distribution:

The p-value is the tail area highlighted in red, corresponding to the probability of observing imaginary t values at least as extreme as our observed t value under our model. Essentially, a p-value is a measure of surprise, which can be expressed as an s-value in bits (Greenland, 2019). For a p-value of 0.05, the s-value = -log2(0.05) = 4.32. That’s equivalent to flipping a fair coin 4 times and getting 4 heads in a row. A p-value can also be described as a continuous measure of compatibility between our model and our data, ranging from 0 for complete incompatibility, to 1 for complete compatibility (Greenland et al., 2016). This is key: p-values are not absolute; they change with the data and the model, and even with the experimenter’s intentions.
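The calculations above are easy to check by simulation. The blog’s analyses are in R (code linked at the end); here is a minimal Python sketch, in which the observed t value of 1.52 and n=50 come from the example, and everything else is simulated under the null:

```python
import math
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

# Null sampling distribution of t: 100,000 imaginary experiments of n=50,
# each sampling from a standard normal distribution
n, niter = 50, 100_000
x = rng.normal(0, 1, (niter, n))
t_null = x.mean(axis=1) / (x.std(axis=1, ddof=1) / np.sqrt(n))

t_obs = 1.52                             # observed t value from the example
p_sim = np.mean(t_null >= t_obs)         # one-sided tail area
p_theory = stats.t.sf(t_obs, df=n - 1)   # theoretical counterpart

s_value = -math.log2(0.05)               # surprise of p = 0.05, in bits (= 4.32)
```

With enough iterations, `p_sim` closely matches `p_theory`, which is the point: the standard p-value is just the tail area of a simulation we never run.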

Let’s unpack this fundamental property of p-values. Our sample of n=50 scores is associated with a p-value of 0.067 under our model. This model includes:

  • a hypothesis of 70% correct;
  • sampling from a normal distribution;
  • independent samples;
  • fixed sample size of n=50;
  • a fixed statistical model: a t-test is applied every time.

(The full model includes other assumptions, for instance that our sample is unbiased, that we have precise measurements, that our measure of interest is informative in the context of a causal model linking data and theory (Meehl, 1997), but we will ignore these aspects here.)

In practice some of these assumptions are incorrect, making the interpretation of p-values difficult.

Data-generating process

The scores in our sample do not come from a normal distribution. As proportion data, they are bounded, and here they follow a beta distribution. Here is the population our data came from:

A boxplot suggests the presence of one outlier:

A Q-Q plot suggests some deviations from normality, but a Shapiro-Wilk test fails to reject the hypothesis of normality:

Should we worry, or rely on the reassuring call to the central limit theorem and some hand-wavy statement about the robustness of the t-test and ANOVA? In practice, it is a bad idea to assume that empirical t distributions will match theoretical ones, because skewness and outliers can seriously mess things up, even for relatively large sample sizes (Wilcox, 2022). In the one-sample case, ignoring skewness and outliers can lead to inflated false positives and low power (Wilcox & Rousselet, 2023; Rousselet & Wilcox, 2020).

In our case, we can simulate the t distribution under normality–Normal (sim) in the figure below, and compare it to the t distribution obtained when sampling from our beta population–Beta (sim). As a reference, the figure also shows the theoretical, non-simulated t distribution–Normal (theory). The simulation involved 100,000 iterations with n=50 samples.
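As a sketch of this comparison, here is some Python code (the original analyses are in R). The post does not give the parameters of the beta population, so Beta(7, 3), which has mean 0.7, is assumed here purely for illustration; the resulting p-value will therefore not match the 0.0804 reported below:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

# Hypothetical data-generating model: Beta(7, 3), mean = 0.7 (assumed, not
# the actual population used in the post)
n, niter, mu0 = 50, 100_000, 0.7
x = rng.beta(7, 3, (niter, n))
t_beta = (x.mean(axis=1) - mu0) / (x.std(axis=1, ddof=1) / np.sqrt(n))

t_obs = 1.52
p_beta = np.mean(t_beta >= t_obs)      # p-value under the beta model
p_norm = stats.t.sf(t_obs, df=n - 1)   # p-value under the normal model
```

The two p-values differ because the t sampling distribution under beta sampling is not the theoretical t distribution.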

Let’s zoom in to better compare the right side of the distributions:

The simulated t distribution under normality is a good approximation of the theoretical distribution. In contrast, the t distribution obtained by sampling from our beta population is not available in practice, because we typically don’t know exactly how the data were generated. Here we have full knowledge, so we can derive the correct t distribution for our data. Remember that the p-value from the standard t-test was 0.0670. Using our simulation under normality, the p-value is 0.0675. Using the t distribution obtained by sampling from the correct beta distribution, the p-value is now 0.0804.

In most situations, p-values are calculated using inappropriate theoretical sampling distributions of t values. This might not affect the observed p-value much, but the correct p-value is unknown.

Independence

The independence assumption is violated whenever data-dependent exclusion is applied to the sample. For instance, it is very common for outliers to be identified and removed before applying a t-test or other frequentist inferential test. This is often done using a non-robust method, such as flagging observations more than 2 SD from the mean. A more robust method could also be used, such as a boxplot rule or a MAD-median rule (Wilcox & Rousselet, 2023). Whatever the method, if the outliers are identified using the sample we want to analyse, the remaining observations are no longer independent, which affects the standard error of the test.

This is well documented in the case of inferences about trimmed means (Tukey & McLaughlin, 1963; Yuen, 1974; Wilcox, 2022). Trimmed means are robust estimators of central tendency that can boost statistical power in the presence of skewness and outliers. To calculate a 20% trimmed mean, we sort the data, remove the lowest 20% and the highest 20% of observations (so 40% in total), and average the remaining observations. This introduces a dependency among the remaining observations, which is taken into account in the calculation of the standard error. In other words, removing observations in a data-dependent manner, and then using a t-test as if the new, lower sample size were the one intended, is inappropriate.

To illustrate the problem, we can do a simulation in which we sample from a normal or a beta population, each time taking a sample of n=50, trimming 20% of observations from each end of the distribution, and either applying the incorrect t-test to the remaining n=30 observations, or applying the t formula from Tukey & McLaughlin (1963; Wilcox, 2022) to the full sample. Here are the results:

T values computed on observations left after trimming are far too large. The discrepancy depends on the amount of trimming. Elegantly, the equation of the t-test on trimmed means reverts to the standard equation if we trim 0%. Of course, the amount of trimming should be pre-registered and not chosen after seeing the data.
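For reference, here is a Python sketch of the one-sample t-test on a trimmed mean (Tukey & McLaughlin, 1963; Wilcox, 2022), in which the standard error is based on the winsorized variance and the full sample size. The final check illustrates the point that with 0% trimming the formula reverts to the standard t-test:

```python
import numpy as np
from scipy import stats

def trimmed_mean_ttest(x, mu=0.0, gamma=0.2):
    """One-sample t-test on a trimmed mean (Tukey & McLaughlin, 1963)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    g = int(np.floor(gamma * n))
    tm = x[g:n - g].mean()               # trimmed mean
    xw = x.copy()                        # winsorized sample: clamp the tails
    xw[:g] = x[g]
    xw[n - g:] = x[n - g - 1]
    sw = xw.std(ddof=1)                  # winsorized standard deviation
    se = sw / ((1 - 2 * gamma) * np.sqrt(n))
    t = (tm - mu) / se
    df = n - 2 * g - 1
    return t, 2 * stats.t.sf(abs(t), df)

# With gamma = 0, the formula reverts to the standard one-sample t-test
rng = np.random.default_rng(1)
x = rng.normal(0.5, 1, 50)
t0, p0 = trimmed_mean_ttest(x, mu=0, gamma=0.0)
t_std, p_std = stats.ttest_1samp(x, 0)
```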

If we apply a t-test on means to our beta sample after trimming 20%, the (incorrect) p-value is 0.0007. The t-test on trimmed means returns p = 0.0256. That’s a large difference! Using the t distribution from the simulation in which we sampled from the correct beta population, now p = 0.0329. Also, with the standard t-test on means (p = 0.0670) we fail to reject at the 0.05 level, whereas we do reject with the inference on 20% trimmed means. In general, inferences on trimmed means tend to be more powerful than inferences on means in the presence of skewness or outliers. However, keep in mind that means and trimmed means are not interchangeable: they ask different questions about the populations. Sample means are used to make inferences about population means, and both are non-robust measures of central tendency. Sample trimmed means are used to make inferences about population trimmed means.

Now the problem is more complicated, and somewhat intractable, if instead of trimming a pre-registered amount of data, we apply an outlier detection method. In that case, independence is violated, but correcting the standard error is difficult because the number of removed observations is a random variable: it will change between experiments.

In our sample, we detect one outlier. Removing the outlier and applying a t-test on means, pretending that our sample size was always n-1, is inappropriate, although very common in practice. The standard error could be corrected using an equation similar to that used in the trimmed mean t-test. However, in other experiments we might reject a different number of outliers. Remember that p-values are not about our experiment: they reflect what could happen in other similar experiments that we will never carry out. In our example, we can do a simulation to derive a sampling distribution that matches the data generation and analysis steps. For each sample of n=50 observations from the beta distribution, we apply a boxplot rule, remove any outliers, and then compute a t value. If we simply remove the outlier from our sample, the t-test returns p = 0.0213. If instead we compute the p-value using the simulated sampling distribution, we get p = 0.0866. That p-value reflects the correct data generating process and the fact that, in other experiments, we could have rejected a different number of outliers. Indeed, in the simulation the median number of rejected outliers is zero, the 3rd quartile is 1, and the maximum is 9.
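Such a simulation can be sketched as follows in Python. The beta parameters (Beta(7, 3), mean 0.7) and the 1.5×IQR boxplot rule are assumptions for illustration, so the numbers will not match those reported above:

```python
import numpy as np

rng = np.random.default_rng(42)

def boxplot_keep(x):
    # Standard 1.5 IQR boxplot rule: keep points within the whiskers
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]

# Sampling distribution of t when outlier removal is part of the pipeline.
# Beta(7, 3) (mean 0.7) is a hypothetical stand-in for the population.
n, mu0, niter = 50, 0.7, 10_000
tvals = np.empty(niter)
for i in range(niter):
    x = boxplot_keep(rng.beta(7, 3, n))
    tvals[i] = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(x.size))

# p-value for a hypothetical observed t of 1.52, under the removal-aware null
p_adj = np.mean(tvals >= 1.52)
```

The key point is that the outlier-rejection step is inside the simulation loop, so the number of removed observations varies from iteration to iteration, as it would across experiments.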

In practice, we don’t have access to the correct t sampling distribution. However, we can get a good approximation by using a percentile bootstrap that incorporates the outlier detection and rejection step after sampling with replacement from the full dataset, and before calculating the statistic of interest (Rousselet, Pernet & Wilcox, 2021).

If outliers are expected and common, a good default strategy is to make inferences about trimmed means. Another approach is to make inferences about M-estimators in conjunction with a percentile bootstrap (Wilcox, 2022). M-estimators adjust the amount of trimming based on the data, instead of removing a pre-specified amount. Yet another approach is to fit a distribution with a tail parameter that can account for outliers (Kruschke, 2013). Or it might well be that what looks like an outlier is a perfectly legitimate member of a skewed or heavy-tailed distribution: use more appropriate models that account for rich distributional differences (Rousselet, Pernet, Wilcox, 2017; Farrell & Lewandowsky, 2018; Lindeløv, 2019).

Fixed sample size?

The t-test assumes that the sample size is fixed. This seems obvious, but in practice it is often not the case. As we saw in the previous example, sample sizes can depend on outlier rejection, a very common procedure that makes p-values uninterpretable. In general, data-dependent analyses will mess up traditional frequentist inferences (Gelman & Loken, 2014). Sample sizes can also be affected by certain inclusion criteria. For instance, data are included in the final analyses only for participants who scored high enough on an attention check. Deriving correct p-values would require simulating sampling distributions that incorporate the inclusion check.

In other situations, the sample sizes vary for reasons outside the experimenters’ control. For instance, data are collected in an online experiment until a deadline. In that case the final sample size is a surprise revealed at the end of the experiment, and is thus a random variable. Consequently, deriving a sampling distribution for a statistic of interest requires another sampling distribution of plausible sample sizes that could have been obtained. The sampling distribution, say for a t value, would be calculated by integrating over the sampling distribution of sample sizes, and any other sources of variability, such as different plausible numbers of outliers that could have been removed, even if none were removed in our sample. Failure to account for these sources of variability leads to incorrect p-values. It gets even more complicated in some situations: p-values also depend on our sampling intentions.

Imagine this scenario inspired by Kruschke (2013), in which a supervisor asked two research assistants to collect data from n=8 participants in total. They misunderstood the instructions, and instead collected n=8 each, so a total of n=16. The plan was to do a one-sample t-test. What sample size should the research team use to compute the degrees of freedom: 8 or 16? So 7 df or 15 df? Here is a plot of the p-values as a function of the critical t values in the two situations.

The answer depends on the sampling distribution matching the data acquisition process, including the probability that the instructions are misunderstood (Kruschke, 2013). If we assume that a misunderstanding leading to this specific error could occur in 10% of experiments, then the matching curve is the dashed one in the figure below, obtained by mixing the two curves for n=8 and n=16.

That’s right, even though the sample size is n=16, because it was obtained by accident and we intended to collect n=8, the critical t and the p-value are obtained from a distribution that is in-between the two for n=8 and n=16, but closer to n=8. This correct distribution reflects the long-run perspective of conducting imaginary experiments in which the majority would have led to n=8. Again, the p-value is not about the current experiment. This scenario reveals that p-values depend on intentions, which has consequences in many situations. In practice, all the points raised so far demonstrate that p-values in most situations are necessarily inaccurate and very difficult to interpret.
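The mixture calculation can be sketched in a few lines of Python; the observed t value of 2 is arbitrary, chosen only to illustrate the weighting:

```python
from scipy import stats

def p_mixture(t, prob_error=0.10):
    # Mixture: n=8 intended (df=7), but the misunderstanding producing
    # n=16 (df=15) occurs in 10% of imaginary experiments
    return ((1 - prob_error) * stats.t.sf(t, df=7)
            + prob_error * stats.t.sf(t, df=15))

t_obs = 2.0  # arbitrary observed t value, for illustration
p8, p16, pm = stats.t.sf(t_obs, 7), stats.t.sf(t_obs, 15), p_mixture(t_obs)
```

As described above, the mixture p-value falls between the two pure-df p-values, but much closer to the df=7 value, because most imaginary experiments would have n=8.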

Conditional analyses

Another common way to mess up the interpretation of our analyses is to condition one analysis on another. For instance, it is common practice to conduct a test of data normality: if we reject, we apply a rank-based test; if we fail to reject, we apply a t-test. Testing for normality of the data is a bad idea for many reasons, including because it makes the subsequent statistical tests conditional on the outcome of the normality test. Again, unless we can simulate the appropriate conditional sampling distribution for our statistic, our p-value will be incorrect. Similarly, anyone tempted to use such an approach would need to justify sample sizes using a power simulation that includes the normality step, and any other step that affects the sampling distribution of the statistic. In my experience, such conditional steps are typically ignoreded in power analyses and pre-registrations. It’s not just p-values: these power analyses are wrong too.
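As an illustration, here is a Python sketch of such a conditional pipeline, and of a power simulation that includes the conditional step; the effect size, sample size, and number of iterations are arbitrary:

```python
import numpy as np
from scipy import stats

def conditional_pvalue(x, mu0):
    # The conditional pipeline criticised above: test normality first,
    # then pick the inferential test based on the outcome
    if stats.shapiro(x).pvalue > 0.05:
        return stats.ttest_1samp(x, mu0).pvalue
    return stats.wilcoxon(x - mu0).pvalue

# A matching power simulation must include the same conditional step
rng = np.random.default_rng(0)
pvals = [conditional_pvalue(rng.normal(0.2, 1, 30), 0.0) for _ in range(200)]
power = np.mean(np.array(pvals) < 0.05)
```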

No measurement error?

It gets worse. Often, t-tests and similar models are applied to data that have been averaged over repetitions, for instance mean accuracy or reaction times averaged over trials in each condition and participant. In this common situation, the t-test ignores measurement error, because all variability has been wiped out. Obviously, in such situations, mixed-effect (hierarchical) models should be used (DeBruine & Barr, 2021). Using a t-test instead of a mixed-effect model is equivalent to using a mixed-effect model in which the trial level data have been copied and pasted an infinite number of times, such that measurement precision becomes infinite. This is powerfully illustrated here.

Conclusion

In most articles, the p-values are wrong. How they would change using appropriate sampling distributions is hard to determine, and ultimately a futile exercise. Even if the p-values changed very little, this uncertainty makes the obsession with declaring “statistical significance” whenever p<0.05, no matter how close the p-value is to the threshold, all the more ridiculous. So the next time you read in an article that “there was a trend towards significance, p=0.06”, or some other nonsense, in addition to asking the authors if they pre-registered a threshold for a trend, and asking them to also write “a trend towards non-significance, p=0.045”, point out that the p-value matching their design, analyses, and data generation process is likely to be different from the one reported.

What can we do? A plan of action, from easy to hard:

[1] Take a chill pill, and consider p-values as just one of many outputs, without a special status (Vasishth & Gelman, 2021). Justify your choices in the methods section, unlike the traditional article in which tests pop up out of the blue in the results section, with irrational focus on statistical significance.

[2] Use bootstrap methods to derive more appropriate sampling distributions. Bootstrap methods, combined with robust estimators, can boost statistical power and help you answer more interesting questions. These methods also let you include preprocessing steps in the analyses, unlike standard parametric methods.

[3] Pre-register everything, along with careful justifications of models, pre-processing steps, and matching power simulations.

[4] Abandon the chase for statistical significance. Instead of focusing on finding effects, focus on a model-centric approach (Devezer & Buzbas, 2023). The goal is to contrast models that capture different hypotheses or mechanisms by assessing how they explain or predict data (Farrell & Lewandowsky, 2018; Gelman, Hill & Vehtari, 2020; James et al., 2021; McElreath, 2020; Yarkoni & Westfall, 2017). What is the explanatory power of the models? What is their predictive accuracy?
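Point [2] above can be sketched as follows in Python: a percentile bootstrap function that takes an estimator as an argument, so that preprocessing steps (trimming, outlier rejection) can be baked into the estimator itself; the data and estimator here are illustrative:

```python
import numpy as np
from scipy import stats

def pb_ci(x, estimator, nboot=2000, alpha=0.05, rng=None):
    """Percentile bootstrap CI; the estimator can embed preprocessing steps."""
    rng = rng if rng is not None else np.random.default_rng()
    n = len(x)
    boot = np.array([estimator(rng.choice(x, size=n, replace=True))
                     for _ in range(nboot)])
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])

# Example: 95% CI for a 20% trimmed mean of hypothetical proportion data
rng = np.random.default_rng(3)
x = rng.beta(7, 3, 50)
lo, hi = pb_ci(x, lambda s: stats.trim_mean(s, 0.2), rng=rng)
```

Because resampling happens before the estimator is applied, any data-dependent step inside the estimator is repeated on every bootstrap sample, which is the point made in Rousselet, Pernet & Wilcox (2021).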

Code

https://github.com/GRousselet/blog-pwrong

References

DeBruine, L., & Barr, D. (2021). Understanding Mixed-Effects Models Through Data Simulation. Advances in Methods and Practices in Psychological Science, 4(1), 2515245920965119. https://doi.org/10.1177/2515245920965119

Devezer, B., & Buzbas, E. O. (2023). Rigorous exploration in a model-centric science via epistemic iteration. Journal of Applied Research in Memory and Cognition, 12(2), 189–194. https://doi.org/10.1037/mac0000121

Farrell, S., & Lewandowsky, S. (2018). Computational Modeling of Cognition and Behavior. Cambridge University Press. https://doi.org/10.1017/CBO9781316272503

Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and Other Stories. Cambridge University Press. https://doi.org/10.1017/9781139161879

Gelman, A., & Loken, E. (2014). The Statistical Crisis in Science. American Scientist, 102(6), 460–465. https://www.jstor.org/stable/43707868

Greenland, S. (2019). Valid P-Values Behave Exactly as They Should: Some Misleading Criticisms of P-Values and Their Resolution With S-Values. The American Statistician, 73(sup1), 106–114. https://doi.org/10.1080/00031305.2018.1529625

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350. https://doi.org/10.1007/s10654-016-0149-3

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: With Applications in R. Springer US. https://doi.org/10.1007/978-1-0716-1418-1

Kruschke, J. K. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology. General, 142(2), 573–603. https://doi.org/10.1037/a0029146

Lindeløv, J. K. (2019). Reaction time distributions: An interactive overview. https://lindeloev.github.io/shiny-rt/

McElreath, R. (2020). Statistical Rethinking: A Bayesian Course with Examples in R and STAN (2nd edn). Chapman and Hall/CRC. https://doi.org/10.1201/9780429029608

Meehl, P. E. (1997). The Problem is Epistemology, Not Statistics: Replace Significance Tests by Confidence Intervals and Quantify Accuracy of Risky Numerical Predictions. In L. L. H. Steiger Stanley A. Mulaik, James H. (Ed.), What If There Were No Significance Tests? Psychology Press. https://meehl.umn.edu/sites/meehl.umn.edu/files/files/169problemisepistemology.pdf

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2017). Beyond differences in means: Robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46(2), 1738–1748. https://doi.org/10.1111/ejn.13610

Rousselet, G. A., Pernet, C. R., & Wilcox, R. R. (2021). The Percentile Bootstrap: A Primer With Step-by-Step Instructions in R. Advances in Methods and Practices in Psychological Science, 4(1), 2515245920911881. https://doi.org/10.1177/2515245920911881

Rousselet, G. A., & Wilcox, R. R. (2020). Reaction Times and other Skewed Distributions: Problems with the Mean and the Median. Meta-Psychology, 4. https://doi.org/10.15626/MP.2019.1630

Tukey, J. W., & McLaughlin, D. H. (1963). Less Vulnerable Confidence and Significance Procedures for Location Based on a Single Sample: Trimming/Winsorization 1. Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), 25(3), 331–352. JSTOR. https://www.jstor.org/stable/25049278

Vasishth, S., & Gelman, A. (2021). How to embrace variation and accept uncertainty in linguistic and psycholinguistic data analysis. Linguistics, 59(5), 1311–1342. https://doi.org/10.1515/ling-2019-0051

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779–804. https://doi.org/10.3758/BF03194105

Wilcox, R. R. (2022). Introduction to Robust Estimation and Hypothesis Testing (5th edn). Academic Press.

Wilcox, R. R., & Rousselet, G. A. (2023). An Updated Guide to Robust Statistical Methods in Neuroscience. Current Protocols, 3(3), e719. https://doi.org/10.1002/cpz1.719

Yarkoni, T., & Westfall, J. (2017). Choosing Prediction Over Explanation in Psychology: Lessons From Machine Learning. Perspectives on Psychological Science, 12(6), 1100–1122. https://doi.org/10.1177/1745691617693393

Yuen, K. K. (1974). The Two-Sample Trimmed t for Unequal Population Variances. Biometrika, 61(1), 165–170. https://doi.org/10.2307/2334299

Using cluster-based permutation tests to estimate MEG/EEG onsets: How bad is it?

Guillaume A. Rousselet

The answer to this important question is in this article at the European Journal of Neuroscience. It is an extended version of work presented in a blog post published over a year ago. Although the focus is on MEG/EEG timing estimation, the conclusions and the main points about statistical issues apply broadly to brain imaging and many other scientific endeavours. As the article is open access and relatively short, I won’t repeat the content, but there is a good snapshot in the graphical abstract:

I also made an animated version of Figure 1 using gganimate. The conclusion is to avoid FDR correction if the goal is to estimate onsets. More on this topic here. Figure 1a:

Figure 1b:

I also made an animated illustration of the change point algorithm (here limited to differences in means), using code from Rebecca Killick and gganimate:

Finally, a shout out to Benedikt Ehinger, who reproduced some of the results in the article as part of the reviewing process, using his own Julia implementation. Because I had a preprint of the article on bioRxiv, he posted the replication on his blog. One of the many benefits of open science…

Can we use cluster-based permutation tests to estimate MEG/EEG onsets?

[An updated version of the work in this post appears in this preprint]

Cluster-based statistics are a family of methods that aim to correct for multiple comparisons while preserving power when effects are distributed over time, space and other dimensions (Maris & Oostenveld, 2007; Smith & Nichols, 2009; Pernet et al., 2015; Rosenblatt et al., 2018; Fields & Kuperberg, 2020; Frossard & Renaud, 2022). These methods capitalise on the correlation of effects across measurements to control the family-wise error rate (FWER, the probability of at least one false positive): neighbouring effects that pass a threshold are lumped together and compared to a reference distribution, typically a null distribution established using a permutation or bootstrap approach (Pernet et al., 2015). These cheap multivariate techniques (I say cheap because, although they take into account the distribution of effects over at least one dimension, no explicit modelling is required) are particularly popular in fMRI, EEG and MEG research (for instance, according to Google Scholar, Smith & Nichols (2009) has been cited 4945 times, Maris & Oostenveld (2007) 7053 times). Cluster-based methods have the potential to be applied to a wide array of problems, including simpler ones involving only a few comparisons, as I’ve argued in another post (see for instance the application to TMS latencies in Dugué et al., 2011). I’ve also used cluster-based statistics extensively in my EEG and behavioural research (Rousselet et al., 2014; Bieniek et al., 2016; Jaworska et al., 2020).

However, cluster-based methods are not without issues. Most notably, Sassenhagen & Draschkow (2019) reminded the community about problems with the interpretation of cluster-based inferences and suggested guidelines on how to report such results. This useful reminder has been heard, as demonstrated by the number of citations: 196 according to the publisher’s page and 320 according to Google Scholar. Essentially, the main problem with cluster-based methods is that they provide p values at the cluster level only, or even for collections of clusters for hierarchical/meta-analysis methods such as TFCE (Smith & Nichols, 2009). There are no p values at individual testing points, and no error control for onsets and offsets. Consequently, cluster-based methods do not provide direct inferences about the localisation of effects. But does that mean we cannot use cluster-based methods to localise, for instance to estimate the onsets of effects?

Sassenhagen & Draschkow (2019) go on to report results from a simulation, which suggest that onset estimation using cluster-based inferences is very noisy. They used a truncated Gaussian as a template, to define a real onset to be estimated. Then, they compared 50 trials of noise only to 50 trials of noise + template, using cluster-sum F statistics – see cluster-sum explanation here. These cluster-sums were compared to a permutation distribution to determine statistical significance. For each simulation iteration, the first statistically significant point was declared the onset of the effect. Here is their figure of results, showing the distribution of estimated onsets across 10,000 simulation iterations:

The distribution of onsets reveals two important patterns:

  • Estimated onsets appear positively biased, with the mode higher than the real onset.
  • The distribution is asymmetric, with a thicker left tail than right tail.

Sassenhagen & Draschkow (2019) reported that “on >20% of runs, the effect onset was estimated too early; divergence of 40 ms or more were found at >10% of runs.” No bias estimate was reported.

So, how bad are the results? We don’t know because no other method was used to estimate onsets, so we cannot put the results in perspective. Imagine a study that aimed to test the prediction of larger variance in EEG signals in population X relative to population Y, but only tested participants from population X. Having observed large absolute variance in X, what can we conclude without measurements in Y?

Less importantly, Sassenhagen & Draschkow (2019) used EEG data from one human participant as noise to add to their template in the simulations. The data illustrated in supplementary information look heavily contaminated by alpha noise – I could be wrong, but having recorded from hundreds of participants, it looks like the data are from a very sleepy participant. That’s a problem, because the claim is that there is a general issue with cluster-based inferences, not one specific to a particular participant.

Let’s address the comparison problem by contrasting cluster-based inferences with inferences from two other popular methods, the false discovery rate correction for multiple comparisons (FDR, Benjamini & Hochberg, 1995; Nichols & Hayasaka, 2003) and the maximum statistics correction for multiple comparisons (MAX, Holmes et al., 1996; Nichols & Holmes, 2002). FDR methods provide a weak control of the FWER by controlling the number of false positives among the positive tests; when there is no effect, an FDR correction is equivalent to controlling the FWER; when effects are present, it is more powerful than controlling the FWER (Benjamini & Hochberg, 1995; Fields & Kuperberg, 2020). The MAX correction achieves strong control of the FWER by comparing each statistic, say a t or F value, to a distribution of maximum statistics obtained under the null. These methods, like cluster-based methods, do not explicitly estimate onsets (they were not designed for that purpose) and do not provide error control for onsets, but are routinely used to localise effects.
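As a sketch, here is a minimal Python implementation of the MAX correction for a two-sample comparison across time points; the data, effect size, and number of permutations are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

def max_correction(a, b, nperm=1000, seed=0):
    """MAX correction: compare each |t| to the permutation distribution
    of the maximum |t| across all time points."""
    rng = np.random.default_rng(seed)
    t_obs = np.abs(stats.ttest_ind(a, b, axis=0, equal_var=False).statistic)
    pooled = np.vstack([a, b])
    n_a = a.shape[0]
    maxes = np.empty(nperm)
    for i in range(nperm):
        idx = rng.permutation(pooled.shape[0])  # shuffle trial labels
        t_perm = stats.ttest_ind(pooled[idx[:n_a]], pooled[idx[n_a:]],
                                 axis=0, equal_var=False).statistic
        maxes[i] = np.abs(t_perm).max()
    # FWER-corrected p-value at each time point
    return np.mean(maxes[None, :] >= t_obs[:, None], axis=1)

# 50 trials x 100 time points; an effect is added to the second half
rng = np.random.default_rng(11)
a = rng.normal(0, 1, (50, 100))
b = rng.normal(0, 1, (50, 100))
b[:, 50:] += 1.0
p = max_correction(a, b)
```

Because every statistic is compared to the distribution of the maximum, the correction achieves strong FWER control, at the cost of conservativeness, which matters for the bias results reported below.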

Other methods have been proposed, for instance defining the onset as the time at which a certain proportion of the peak amplitude has been reached (Liesefeld, 2018; Fahrenfort, 2020). However, such methods depend on potentially subjective peak identification (especially when applied to individual participants), and they are guaranteed, by definition, to lead to positively biased onset estimates. These methods are not considered here but maybe they have some merit.

Another recent method, cluster depth tests, combines cluster inference with p values at every testing point, thus solving the main issue of cluster-based inferences as implemented so far (Frossard & Renaud, 2022). But this approach, like previous cluster-based methods and alternative corrections for multiple comparisons such as FDR or MAX, does not solve a fundamental issue that has been ignored so far: whether a p value is available at each testing point is irrelevant, because none of the methods routinely used to localise MEG and EEG effects explicitly assess onsets. That’s because onsets are second-order properties of the data. As such, they require establishing the lack of effects in some areas, the presence of effects in others, and a test of the interaction. However, none of the cluster-based methods, or any other standard methods used to localise effects in EEG/MEG/fMRI and other fields, explicitly test for the interaction. The interaction is an inference performed pretty much by eyeballing maps of thresholded statistics. So localisation attempts typically rely on an interaction fallacy, a problem well documented in neuroscience (Nieuwenhuis et al., 2011) and other fields (Gelman & Stern, 2006).

More explicit estimation of onsets could be achieved by using methods that go beyond mass-univariate tests and consider time-courses. For instance, there is a vast family of algorithms aimed at detecting change points in time-series that could be used to estimate onsets. Many packages in the R programming language (R Core Team, 2021) offer such algorithms, for instance RBeast (Zhao et al., 2019), mcp (Lindeløv, 2020), changepoint (Killick & Eckley, 2014). There is a very useful survey of such packages here. The Beast algorithm (Zhao et al., 2013) is also available in Python and Matlab, and I’m sure there are many others. Matlab offers several native solutions, including the findchangepts function. For comparison, here I used a basic change point detection method that relies on differences in mean and variance, as implemented in the changepoint package in R, and the built-in findchangepts function in Matlab.
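The packages above implement sophisticated algorithms; as a toy illustration of the underlying idea, here is a minimal Python sketch of single change point detection in the mean (the methods used in this post also consider variance, and detect two change points):

```python
import numpy as np

def changepoint_mean(y):
    """Single change point in the mean: choose the split that minimises
    the total within-segment sum of squares (minimal binary segmentation)."""
    best_k, best_cost = None, np.inf
    for k in range(2, len(y) - 1):
        cost = (((y[:k] - y[:k].mean())**2).sum()
                + ((y[k:] - y[k:].mean())**2).sum())
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# Hypothetical squared-t time-course with a step at index 60
rng = np.random.default_rng(7)
y = np.concatenate([rng.normal(1, 0.5, 60), rng.normal(5, 0.5, 40)])
k = changepoint_mean(y)
```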

So, let’s revisit the simulation results from Sassenhagen & Draschkow (2019) by performing our own simulations. Following their approach, I compared a noise only condition to a noise + signal condition. The signal was a truncated Gaussian defining an objective onset at 160 ms.

This approach is very similar to that of Sassenhagen & Draschkow (2019), except that I simulated only one electrode, with cluster inferences only in the time domain. There is no need to simulate extra dimensions (electrodes, time-frequencies), because the argument against cluster-based inferences is not specific to one particular dimension. Also, I used synthetic noise instead of data from one participant. The noise was generated by adding a collection of sinusoids at different frequencies (Yeung et al., 2004). Using 1/f noise instead gave very similar results.

The simulations involved 50 trials per condition and 10,000 iterations per simulation; the null distributions were estimated using permutations with 2,000 iterations. The simulations were performed twice, in R and in Matlab, with the code available on GitHub.

Here is an example of noise trials:

And 50 trials of signal + noise (grey), with the mean superimposed (black):

And an example of averages for the noise only and noise + signal conditions:

The vertical black line marks the 160 ms true onset. The two conditions were compared using two-sample t-tests with unequal variances. Here is the time-course of squared t values matching the previous figure.

We consider four methods to correct for multiple comparisons:

  • [FDR] The standard FDR correction (Benjamini & Hochberg, 1995).
  • [MAX] The maximum statistic correction (Holmes et al., 1996; Nichols & Holmes, 2002).
  • [CS] Cluster-based permutation inference, using a cluster-sum statistic of squared t values (Maris & Oostenveld, 2007).
  • [CP] Change point detection based on differences in mean and variance, applied to the time-course of squared t values. The number of change points was set to 2.
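For illustration, here is a minimal base R sketch of the CS method (my own simplified implementation, not the code used for the simulations): squared Welch t values are thresholded, summed within contiguous suprathreshold clusters, and the maximum observed cluster sum is compared to its permutation distribution.

```r
# Simplified cluster-sum permutation test (hypothetical names; per-cluster
# p values and onset extraction are omitted to keep the sketch short).
clustersum <- function(x, thresh) {
  # maximum sum of values within contiguous suprathreshold clusters
  above <- x > thresh
  if (!any(above)) return(0)
  runs <- rle(above)
  ends <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  max(mapply(function(s, e) sum(x[s:e]),
             starts[runs$values], ends[runs$values]))
}

cluster_test <- function(cond1, cond2, nperm = 2000, alpha = 0.05) {
  # squared Welch t values at every time point (trials in rows)
  tfun <- function(a, b) {
    (colMeans(a) - colMeans(b))^2 /
      (apply(a, 2, var) / nrow(a) + apply(b, 2, var) / nrow(b))
  }
  t2 <- tfun(cond1, cond2)
  # cluster-forming threshold: a common heuristic based on F(1, n1 + n2 - 2)
  thresh <- qf(0.95, 1, nrow(cond1) + nrow(cond2) - 2)
  alldata <- rbind(cond1, cond2)
  n1 <- nrow(cond1)
  permmax <- replicate(nperm, {  # permutation distribution of max cluster sums
    id <- sample(nrow(alldata))
    clustersum(tfun(alldata[id[1:n1], , drop = FALSE],
                    alldata[id[-(1:n1)], , drop = FALSE]), thresh)
  })
  obs <- clustersum(t2, thresh)
  list(t2 = t2, maxsum = obs,
       sig = obs > unname(quantile(permmax, 1 - alpha)))
}

set.seed(2)
cond1 <- matrix(rnorm(50 * 100), nrow = 50)  # 50 noise-only trials, 100 time points
cond2 <- matrix(rnorm(50 * 100), nrow = 50)
cond2[, 40:60] <- cond2[, 40:60] + 1         # 1 SD signal between points 40 and 60
cluster_test(cond1, cond2, nperm = 500)$sig
```

With 50 trials per condition and a 1 SD signal over 21 time points, the test easily flags the effect; the actual simulations additionally extract onsets from the significant clusters.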

Here are the onsets estimated by each method, using the data from the previous example:

In grey is the permutation distribution of squared t values. The solid vertical line indicates the true onset. The dotted vertical line marks the estimated onset. The horizontal dashed lines are permutation thresholds.

After 10,000 iterations, we get these distributions of onsets for our four methods, with the vertical line marking the true onset:

Each distribution has a shape similar to that reported by Sassenhagen & Draschkow (2019): overall a positive bias and some amount of under-estimation. Let’s quantify the results.

[1] Bias was quantified by subtracting the true onset from the median of each distribution. Because the distributions are asymmetric, the median provides a better reflection of the typical onset than the mean (Rousselet & Wilcox, 2020).

Bias:

  • FDR = 30 ms
  • MAX = 40 ms
  • CS = 30 ms
  • CP = 10 ms

Confirming our visual inspection of the distributions, all methods are biased, the least biased being CP, the most biased being MAX. This is not surprising, as MAX is by definition conservative, the price to pay for strong FWER control.

[2] The mean absolute error (MAE) indicates how close to the true onset each method gets.

MAE:

  • FDR = 42 ms
  • MAX = 44 ms
  • CS = 30 ms
  • CP = 19 ms

Again, CS doesn’t perform worse than other methods: CP was best, MAX was worst.

[3] As Sassenhagen & Draschkow (2019) did, let’s look at the proportion of onsets earlier than the true onset.

% early onsets:

  • FDR = 17.5%
  • MAX = 1.6%
  • CS = 0.5%
  • CP = 13.4%

[4] And the proportion of onsets at least 40 ms earlier than the true onset.

% onsets at least 40 ms early:

  • FDR = 14.7%
  • MAX = 1.3%
  • CS = 0.1%
  • CP = 3.7%

Again, there is nothing wrong with the cluster-sum approach: other methods that provide p values at individual testing points performed worse.

Using 1/f noise gave similar results, but the absolute results change substantially depending on whether white, pink or red noise is used (see the full simulations on GitHub).

So far we have considered results from one simulated participant. A typical experiment involves more than one participant. What happens if we try to estimate onsets from 20 participants instead? To find out, let’s assume that each participant has a random onset of 150, 160 or 170 ms (uniform distribution), so that the population onset remains 160 ms. There are 50 trials per condition. In each simulation iteration, after estimating the onset in each of the 20 participants, we use the median onset as our group estimate. Here are the results:

Again, there is nothing wrong with the cluster-sum results: they are very similar to the FDR results, better than MAX, worse than change point.

In numbers:

Bias (ms):

  • FDR = 30
  • MAX = 40
  • CS = 30
  • CP = 10

Mean absolute error (ms):

  • FDR = 27
  • MAX = 43
  • CS = 29
  • CP = 14

Proportion too early:

  • FDR = 0.04%
  • MAX = 0%
  • CS = 0%
  • CP = 0.08%

Note the very low proportion of under-estimation once results are combined across participants.

Of course, another way to reduce bias is to increase sample size. Here are the results from one participant and sample sizes ranging from 20 to 150 trials.

Bias can be reduced considerably by using larger sample sizes, but the MAX method remains overly conservative. Here is what happens for the mean absolute error:

And the proportion of early onsets:

Overall, the cluster-sum method appears to perform well, and the change point approach looks very promising.

Discussion

Onsets are inherently noisy because they correspond to the testing points with the lowest signal-to-noise ratio. But estimation variability can be reduced by increasing the number of trials and pooling estimates across participants. And based on the present simulation results, there is no reason to doubt localisation results obtained using cluster-based inferences any more than those obtained with other methods. Certainly, I'm not going to ditch the estimates of face ERP onsets I've published so far. They are estimates after all, obtained from a large number of participants, and with promising test-retest reliability across independent EEG sessions (Bieniek et al., 2016).

From the simulations, it seems that one should be more worried about onsets estimated using FDR and MAX corrections. And the change point approach seems very promising. Certainly here I’ve used a relatively simple algorithm and it remains to be determined what can be achieved using modern methods, for instance Bayesian approaches that allow hierarchical pooling of estimates (Lindeløv, 2020).

Currently, no method routinely used in MEG/EEG explicitly estimates onsets, whether a p value is available at every testing point or not. Whether correction for multiple comparisons is achieved using cluster-based statistics, FDR, MAX or any related method, even the new cluster depth tests (Frossard & Renaud, 2022), onset estimation relies on a statistical fallacy, because the implied interaction is never explicitly tested. Does it mean that nonsensical results are obtained? The simulations presented here clearly indicate that meaningful results can be obtained, and that cluster-based inferences can provide more accurate results than alternative methods.

Obviously, one experiment provides no guarantee, and that's why it is essential to consider sampling distributions, like the ones presented here, to get an idea of how much variability to expect in the long run (Rousselet, 2019). But do not fall victim to the statistical orthodoxy: plenty of methods are used to estimate aspects of the data that they were not developed to handle. When in doubt, simulations are essential to understand how good a job any of your tools can do – never trust the labels.

In addition to the onset estimation approach, it is worth considering other methodological aspects that can affect onsets. The most important factor is probably filtering, with high-pass filters introducing potentially severe signal distortions (Rousselet, 2012; Widmann & Schröger, 2012; Widmann et al., 2015; van Driel et al., 2021). If the goal is to estimate onsets, I recommend using causal filters to remove low frequencies and to avoid backpropagation of effects. On the stimulus front, screen luminance can strongly influence onsets (Bieniek et al., 2013).

In conclusion, can we use cluster-based permutation tests to estimate MEG/EEG onsets? Absolutely, yes, but:

  • Instead of estimating onsets at the group level, measure onsets in each participant and report group distributions.
  • Assess the test-retest reliability of the onsets, either within the same dataset (cross-validation) or across independent datasets.
  • Report results from more than one method, to assess computational consistency.

More advanced steps:

  • Use large datasets to compare onset estimation methods, including techniques, so far ignored, that explicitly consider time-series.
  • Use large datasets to provide benchmarks in your field.
  • Use simulations to compare methods. To go beyond the simple approach presented here, one could use algorithms that generate more realistic time-series (Barzegaran et al., 2019).

References

Barzegaran, E., Bosse, S., Kohler, P.J., & Norcia, A.M. (2019) EEGSourceSim: A framework for realistic simulation of EEG scalp data using MRI-based forward models and biologically plausible signals and noise. J. Neurosci. Methods, 328, 108377.

Benjamini, Y. & Hochberg, Y. (1995) Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B Methodol., 57, 289–300.

Bieniek, M., Frei, L., & Rousselet, G. (2013) Early ERPs to faces: aging, luminance, and individual differences. Front. Psychol., 4.

Bieniek, M.M., Bennett, P.J., Sekuler, A.B., & Rousselet, G.A. (2016) A robust and representative lower bound on object processing speed in humans. Eur. J. Neurosci., 44, 1804–1814.

Dugué, L., Marque, P., & VanRullen, R. (2011) Transcranial Magnetic Stimulation Reveals Attentional Feedback to Area V1 during Serial Visual Search. PLOS ONE, 6, e19712.

Fahrenfort, J.J. (2020) Multivariate Methods to Track the Spatiotemporal Profile of Feature-Based Attentional Selection Using EEG. In Pollmann, S. (ed), Spatial Learning and Attention Guidance, Neuromethods. Springer US, New York, NY, pp. 129–156.

Fields, E.C. & Kuperberg, G.R. (2020) Having your cake and eating it too: Flexibility and power with mass univariate statistics for ERP data. Psychophysiology, 57, e13468.

Frossard, J. & Renaud, O. (2022) The cluster depth tests: Toward point-wise strong control of the family-wise error rate in massively univariate tests with application to M/EEG. NeuroImage, 247, 118824.

Gelman, A. & Stern, H. (2006) The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant. Am. Stat., 60, 328–331.

Holmes, A.P., Blair, R.C., Watson, J.D., & Ford, I. (1996) Nonparametric analysis of statistic images from functional mapping experiments. J. Cereb. Blood Flow Metab. Off. J. Int. Soc. Cereb. Blood Flow Metab., 16, 7–22.

Jaworska, K., Yi, F., Ince, R.A.A., van Rijsbergen, N.J., Schyns, P.G., & Rousselet, G.A. (2020) Healthy aging delays the neural processing of face features relevant for behavior by 40 ms. Hum. Brain Mapp., 41, 1212–1225.

Killick, R. & Eckley, I.A. (2014) changepoint: An R Package for Changepoint Analysis. J. Stat. Softw., 58, 1–19.

Liesefeld, H.R. (2018) Estimating the Timing of Cognitive Operations With MEG/EEG Latency Measures: A Primer, a Brief Tutorial, and an Implementation of Various Methods. Front. Neurosci., 12.

Lindeløv, J.K. (2020) mcp: An R Package for Regression With Multiple Change Points. OSF Preprints.

Maris, E. & Oostenveld, R. (2007) Nonparametric statistical testing of EEG- and MEG-data. J. Neurosci. Methods, 164, 177–190.

Nichols, T.E. & Holmes, A.P. (2002) Nonparametric permutation tests for functional neuroimaging: A primer with examples. Hum. Brain Mapp., 15, 1–25.

Nichols, T. & Hayasaka, S. (2003) Controlling the familywise error rate in functional neuroimaging: a comparative review. Stat Methods Med Res, 12, 419–446.

Nieuwenhuis, S., Forstmann, B.U., & Wagenmakers, E.-J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat. Neurosci., 14, 1105–1107.

Pernet, C.R., Latinus, M., Nichols, T.E., & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. J. Neurosci. Methods, Cutting-edge EEG Methods, 250, 85–93.

R Core Team (2021) R: A Language and Environment for Statistical Computing.

Rosenblatt, J.D., Finos, L., Weeda, W.D., Solari, A., & Goeman, J.J. (2018) All-Resolutions Inference for brain imaging. NeuroImage, 181, 786–796.

Rousselet, G. (2012) Does Filtering Preclude Us from Studying ERP Time-Courses? Front. Psychol., 3.

Rousselet, G. (2019) Using simulations to explore sampling distributions: an antidote to hasty and extravagant inferences.

Rousselet, G.A., Ince, R.A.A., van Rijsbergen, N.J., & Schyns, P.G. (2014) Eye coding mechanisms in early human face event-related potentials. J. Vis., 14, 7.

Rousselet, G.A. & Wilcox, R.R. (2020) Reaction Times and other Skewed Distributions: Problems with the Mean and the Median. Meta-Psychol., 4.

Sassenhagen, J. & Draschkow, D. (2019) Cluster-based permutation tests of MEG/EEG data do not establish significance of effect latency or location. Psychophysiology, 56, e13335.

Smith, S.M. & Nichols, T.E. (2009) Threshold-free cluster enhancement: Addressing problems of smoothing, threshold dependence and localisation in cluster inference. NeuroImage, 44, 83–98.

Stevens, M.H.H. (2009) A Primer of Ecology with R, 2nd Printing. Springer, New York.

van Driel, J., Olivers, C.N.L., & Fahrenfort, J.J. (2021) High-pass filtering artifacts in multivariate classification of neural time series data. J. Neurosci. Methods, 352, 109080.

Widmann, A. & Schröger, E. (2012) Filter Effects and Filter Artifacts in the Analysis of Electrophysiological Data. Front. Psychol., 3.

Widmann, A., Schröger, E., & Maess, B. (2015) Digital filter design for electrophysiological data – a practical approach. J. Neurosci. Methods, Cutting-edge EEG Methods, 250, 34–46.

Yeung, N., Bogacz, R., Holroyd, C.B., & Cohen, J.D. (2004) Detection of synchronized oscillations in the electroencephalogram: An evaluation of methods. Psychophysiology, 41, 822–832.

Zhao, K., Valle, D., Popescu, S., Zhang, X., & Mallick, B. (2013) Hyperspectral remote sensing of plant biochemistry using Bayesian model averaging with variable and band selection. Remote Sens. Environ., 132, 102–119.

Zhao, K., Wulder, M.A., Hu, T., Bright, R., Wu, Q., Qin, H., Li, Y., Toman, E., Mallick, B., Zhang, X., & Brown, M. (2019) Detecting change-point, trend, and seasonality in satellite time series data to track abrupt changes and nonlinear dynamics: A Bayesian ensemble algorithm. Remote Sens. Environ., 232, 111181.

How to test for variance homogeneity

It’s simple: don’t do it! (Or do it but for other reasons — see below)

“A Levene’s test was used to test if the distributions have equal variances.”

“To establish normality, we used a Shapiro-Wilk test with p > 0.05; and for equal variances we used Levene’s test.”

This approach of checking assumptions and then deciding on the test to perform suffers from the limitations described in my previous post. In that post, I described why you should avoid normality tests, and instead could use them in a statistics class to cover a lot of core topics. The same goes for tests of variance homogeneity, which are often used in conjunction with normality tests. In particular, methods to detect differences in variance between groups can have low power and inaccurate type I error rates (false positives), which are the topics of this post. But how to capture variance differences is worth looking at, because in some situations a legitimate question is whether distributions differ in spread, or more specifically in variance. Tests of variance differences have a wide range of applications in fields such as neuroscience, economics, genetics and quality control (Conover et al. 1981; Li et al. 2015; Ritchie et al. 2018; Patil & Kulkarni, 2022). Thus, it is useful to consider the false positive rate and power of these methods in different situations.

Before exploring error rates for methods aimed at detecting variance differences, it's worth pointing out a paper by Zimmerman (2004), which looked at the error rates of a combo approach, in which a test of variance homogeneity is performed first, followed by a standard t-test or a Welch t-test depending on the outcome. This approach of conducting a preliminary check is not recommended: the false positives (type I errors) and power of the combo depend on the relative and absolute sample sizes, as well as the magnitude of the variance difference between groups, leading to poor performance in realistic situations. What works best is to use methods for unequal variances by default (heteroscedastic methods), a recommendation echoed in more recent work (Delacre et al. 2017). In the presence of skewness or outliers, or both, power can be boosted by using trimmed means in conjunction with parametric or bootstrap methods (Wilcox & Rousselet, 2023). In general, it is a bad idea to rely on methods that make strong assumptions about the sampled populations and to hope for the best.
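In R, defaulting to the heteroscedastic method requires no extra work, because t.test() already uses Welch's method unless equal variances are explicitly requested (the sample values below are made up for illustration):

```r
# Welch's t-test (unequal variances) is the default in R's t.test();
# Student's version must be requested explicitly.
set.seed(44)
g1 <- rnorm(30, mean = 0, sd = 1)
g2 <- rnorm(40, mean = 0.5, sd = 2)
welch   <- t.test(g1, g2)                    # default: var.equal = FALSE
student <- t.test(g1, g2, var.equal = TRUE)  # classic Student t-test
c(welch = welch$p.value, student = student$p.value)
```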

Now let’s look at power and type I error rates for a few methods aimed at comparing variances. There are so many methods available it is hard to decide on a few candidate approaches. For instance, Conover et al. (1981) compared 56 tests of variance homogeneity. And many more tests have been proposed since then. Conover et al. (1981) recommended several tests: the Brown-Forsythe test, which is the same as Levene’s test, but using the median instead of the mean, and the Fligner-Killeen test. So we’ll use these three tests here. Zimmerman (2004) used Levene’s test. The three tests belong to a family of tests in which absolute or squared distances between observations and a measure of central tendency are compared using parametric (t-tests, ANOVAs) or rank-based methods (Conover et al. 1981; 2018). Levene’s test uses the mean to centre the distributions, whereas the Brown-Forsythe and Fligner-Killeen tests use the median. In addition, we’ll look at Bartlett’s test, which is known to be very sensitive to departures from normality, and a percentile bootstrap method to compare variances (Wilcox 2002). As we will see, all these tests perform poorly in the presence of skewness, for reasons explained in Conover et al. (2018), who go on to suggest better ones.

The code to reproduce the simulations and the figures is available on GitHub. The simulations involved 10,000 iterations, with 1,000 samples for the bootstrap method.

Simulation 1: normality

Let’s first look at what happens under normality, an unlikely situation in many fields, but a useful benchmark. We consider only 2 groups. One population has a standard deviation of 1; the other population has a standard deviation that varies from 1 to 2 in steps of 0.1. Sample size varies from 10 to 100, in steps of 10. Here are the populations we sample from:

The results for the 5 methods are summarised in the next figure:

The results for SD=1 correspond to the type I error rate (false positives), which should be around 5%, because that is the arbitrary threshold I chose here. That is the case for Bartlett, but notice how Levene overshoots, and BF and FK undershoot, at the lowest sample sizes. The percentile bootstrap method is systematically under 0.05 at all sample sizes. This is a reminder that your alpha is not necessarily what you think it is. The differences among methods for SD=1 (false positives) are easier to see in this figure:

As the standard deviation in the second population increases, power increases for all methods, as expected. Notice the large sample sizes required to detect variance differences. How do the methods compare for the largest population difference? Bartlett performed best, FK came last:

Simulation 2: skewness

Now, what happens when the populations are skewed? Here we consider g-and-h distributions, with parameters g=1 and h=0, which gives the same shape as a lognormal distribution, but with median zero. The g-and-h distributions are nicely described and illustrated in Yan & Genton (2019). We vary the standard deviation from 1 to 2, leading to these populations:

This is not to suggest that such distributions will often be encountered in applied work. Rather, methods that can maintain false positive rates near nominal level and high power when dealing with these distributions should be able to handle a large variety of situations and can therefore be recommended. Also, fun fact, the distributions above have the same median (0) and skewness, but differ in mean and variance.
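For reference, sampling from a g-and-h distribution takes only a few lines of R (a sketch based on the standard definition: a standard normal variable transformed by g for asymmetry and by h for tail thickness; the function name is mine):

```r
# Sample from a g-and-h distribution: start from standard normal z,
# apply g (asymmetry) then h (tail thickness); g = h = 0 returns the normal.
ghdist <- function(n, g = 0, h = 0) {
  z <- rnorm(n)
  x <- if (g == 0) z else (exp(g * z) - 1) / g
  x * exp(h * z^2 / 2)
}
set.seed(3)
samp <- ghdist(1e5, g = 1, h = 0)  # lognormal shape, median zero
c(median = median(samp), mean = mean(samp))  # right skew: mean > median
```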

Here is what we get when we sample from skewed populations:

Compared to results obtained under normality, the maximum power is lower (more illustrations of that later), but look at what happens to the false positives (SD=1): they skyrocket for Bartlett and Levene, and increase less dramatically for FK. The BF test handles skewness very well, whereas the percentile bootstrap is a bit liberal. Let’s compare the false positives of the different methods in one figure:

The same for maximum power (SD=2):

Obviously, the higher power of Bartlett and FK cannot be trusted given their huge false positive rates.

Simulation 3: false positives as a function of skewness

Here we sample from g-and-h distributions that vary from g=0 to g=1. So we explore the space between the two previous simulations parametrically. First, we consider false positives (no difference in variance). Here are the populations we sample from:

And the results for the 5 methods:

Comparison of the 5 methods for g=1:

BF and the percentile bootstrap tests are clearly more robust to skewness than the other methods. Bartlett is useless, so why is this test still available in stat packages? In R, bartlett.test doesn't even come with a warning.

Simulation 4: true positives as a function of skewness

We proceed as in simulation 3, but now the populations always differ by one standard deviation:

All the methods are strongly affected by skewness, with power dropping more for methods that are better at preserving the type I error rate at the nominal level (BF and percentile bootstrap):

A reminder that if you rely on power calculators that assume normality, your power is probably lower than you think.

Conclusion

None of the methods considered here were satisfactory. Only the Brown-Forsythe test and Wilcox’s bootstrap method controlled the type I error rate under non-normality, but their power was strongly affected by skewness. Conover et al. (2018) have proposed alternative methods to maintain high power in the presence of skewness. They recommend methods in which distributions are centred using the global mean or median (across groups), a simple step that improves performance considerably over the subtraction of the mean or median separately in each group (as used here and by default in R). See their discussion for an explanation of the lack of robustness to skewness of standard tests when individual group means or medians are used instead of the global ones. Conover et al. (2018) also considered the lognormal distribution, which corresponds to the g-and-h distribution with g=1 studied here, so their proposed methods should perform much better than the ones we considered. And there are plenty more tests on the market. For instance, Patil & Kulkarni (2022) have proposed a new method that promises high power in a range of situations. Please post a comment if you know of R packages that implement modern robust methods.

Finally, comparing variances is a very specific question. More broadly, one might be interested in differences in spread between distributions. For this more general question, other tools are available, relying on robust measures of scale (Wilcox, 2017, chapter 5) and quantile approaches (Rousselet et al. 2017). The distinction between variance and spread is important, because differences in variance could be driven by or masked by outliers or skewness, which might not affect a robust estimator of scale.

References

Conover, W.J., Johnson, M.E., & Johnson, M.M. (1981) A Comparative Study of Tests for Homogeneity of Variances, with Applications to the Outer Continental Shelf Bidding Data. Technometrics, 23, 351–361.

Conover, W.J., Guerrero-Serrano, A.J., & Tercero-Gómez, V.G. (2018) An update on ‘a comparative study of tests for homogeneity of variance.’ Journal of Statistical Computation and Simulation, 88, 1454–1469.

Delacre, M., Lakens, D., & Leys, C. (2017) Why Psychologists Should by Default Use Welch’s t-test Instead of Student’s t-test. International Review of Social Psychology, 30(1).

Li, X., Qiu, W., Morrow, J., DeMeo, D.L., Weiss, S.T., Fu, Y., & Wang, X. (2015) A Comparative Study of Tests for Homogeneity of Variances with Application to DNA Methylation Data. PLoS One, 10, e0145295.

Patil, K.P. & Kulkarni, H.V. (2022) An uniformly superior exact multi-sample test procedure for homogeneity of variances under location-scale family of distributions. Journal of Statistical Computation and Simulation, 92, 3931–3957.

Ritchie, S.J., Cox, S.R., Shen, X., Lombardo, M.V., Reus, L.M., Alloza, C., Harris, M.A., Alderson, H.L., Hunter, S., Neilson, E., Liewald, D.C.M., Auyeung, B., Whalley, H.C., Lawrie, S.M., Gale, C.R., Bastin, M.E., McIntosh, A.M., & Deary, I.J. (2018) Sex Differences in the Adult Human Brain: Evidence from 5216 UK Biobank Participants. Cereb Cortex, 28, 2959–2975.

Rousselet, G.A., Pernet, C.R., & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. European Journal of Neuroscience, 46, 1738–1748.

Wilcox, R.R. (2002) Comparing the variances of two independent groups. Br J Math Stat Psychol, 55, 169–175.

Wilcox, R.R. (2017) Introduction to Robust Estimation and Hypothesis Testing, 4th edition. edn. Academic Press.

Wilcox, R.R. & Rousselet, G.A. (2023) An Updated Guide to Robust Statistical Methods in Neuroscience. Current Protocols, 3, e719.

Yan, Y. & Genton, M.G. (2019) The Tukey g-and-h distribution. Significance, 16, 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x

Zimmerman, D.W. (2004) A note on preliminary tests of equality of variances. Br J Math Stat Psychol, 57, 173–181.

Comparing two independent Pearson’s correlations

This post demonstrates two important facts: 

(1) the Fisher’s z method to compare two independent correlations can give very inaccurate results when sampling from distributions that are skewed (asymmetric) or heavy tailed (high probability of outliers) and the population correlation rho differs from zero;

(2) even when sampling from normal distributions, Fisher’s z as well as bootstrap methods require very large sample sizes to detect differences, much larger than commonly used in neuroscience & psychology.

We start with an example and then consider the results of a few simulations. The R code is available on GitHub. It is part of a larger project and I’ll report in other blog posts on the rest of the simulations I’m currently running.

UPDATE: the bootcorci R package implements bootstrap methods to compare correlation coefficients.

Example

Let’s look at an example of correlation analysis reported in Davis et al. (2008). This article had 898 citations on June 11th 2019 according to Google Scholar. In their figure 3, the authors reported 4 correlations: 2 correlations in each of 2 independent groups of participants. Across panels A and B, the frontal activity variable appears twice, so for each group of participants (old and young), the correlations in A and B are dependent and overlapping. Code on how to handle this case was covered in a previous post. Here we’re going to focus on the comparison of two independent correlations, which is like considering panel A or panel B on its own.

[Figure 3 from Davis et al. (2008)]

There are several problems with the analysis from Davis et al.:

  • within-participant variability was removed by averaging over trials and other variables, so equal weight is wrongly attributed to every observation;
  • the analyses are split into different correlations instead of attempting mixed-effect modelling with the groups of participants considered together;
  • association is assessed using Pearson’s correlation, which is not robust and suffers from many issues (it is sensitive to outliers, the magnitude of the slope around which points are clustered, curvature, the magnitude of the residuals, restriction of range, and heteroscedasticity — see details in Wilcox 2017);
  • sample sizes are very small (far too small to say anything meaningful actually, as described in this previous post);
  • if a correction for multiple comparisons is applied, nothing passes the magic 0.05 threshold;
  • the authors commit a classic interaction fallacy by reporting one P<0.05 correlation, the other not, without directly comparing them (Gelman & Stern 2006; Nieuwenhuis et al. 2011).


There is nothing special about the correlation analysis in Davis et al. (2008): in neuroscience and psychology these problems are very common. In the rest of this post we’re going to tackle only two of them: how to compare 2 independent Pearson’s correlations, and what sample sizes are required to tease them apart in the long-run.

Procedure 

The most common approach to compare 2 independent correlations is to use the Fisher’s r-to-z approach. Here is a snippet of R code for Fisher’s z, given r1 and r2 the correlations in group 1 and group 2, and n1 and n2 the corresponding sample sizes:

z1 <- atanh(r1)  # Fisher's r-to-z transform of each correlation
z2 <- atanh(r2)
# difference of z values, standardised by the combined standard error
zobs <- (z1 - z2) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
pval <- 2 * pnorm(-abs(zobs))  # two-sided p value

This approach is implemented in the cocor R package for instance (Diedenhofen & Musch, 2015). The problem with this approach is its normality assumption: it works well when sampling from a bivariate normal distribution or when the two distributions have a population correlation rho of zero, but it misbehaves when sampling from asymmetric distributions, or from distributions with heavy tails, when rho differs from zero (e.g. Duncan & Layard, 1973; Berry & Mielke, 2000). And all methods based on some variant of the z transform suffer the same issues.

Simulations

Simulations can be used to understand how a statistical procedure works in the long-run, over many identical experiments. Let’s imagine that we sample from bivariate g & h distributions, in which parameter g controls the asymmetry of the distributions and h controls the thickness of the tails  (Hoaglin, 1985). You can see univariate g & h distributions illustrated here. Below you can see bivariate samples with n = 5,000, for combinations of g and h parameters, for a population correlation rho of 0:

and for a population correlation rho of 0.5:

Using g=h=0 gives a normal bivariate distribution.

When h=0.2, the tails are thicker, which means that outliers are more likely.
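A bivariate version can be sketched by generating correlated standard normals and transforming each margin (a sketch with names of my own; note that rho here is the correlation of the underlying normals, and the marginal transforms attenuate the Pearson correlation of the resulting variables):

```r
# Bivariate g-and-h sample: correlated standard normals, each margin then
# transformed by g (asymmetry) and h (tail thickness).
ghtrans <- function(z, g = 0, h = 0) {
  x <- if (g == 0) z else (exp(g * z) - 1) / g
  x * exp(h * z^2 / 2)
}
rbivgh <- function(n, rho = 0, g = 0, h = 0) {
  z1 <- rnorm(n)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)  # Corr(z1, z2) = rho
  cbind(ghtrans(z1, g, h), ghtrans(z2, g, h))
}
set.seed(5)
xy <- rbivgh(5000, rho = 0.5, g = 0.2, h = 0.2)
cor(xy[, 1], xy[, 2])  # positive, but attenuated relative to 0.5
```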

There is a large parameter space to cover, so here we only look at a subset of results, using simulations with 4,000 iterations. We’ll see examples in which we vary rho, the population correlation, or we vary g for a given rho.

Vary rho

First, we consider false positives, before turning to true positives (power). 

False positives

For false positives, 2 independent random samples are taken from the same bivariate population, so on average the two correlations should not differ. Here are the results for different rhos and sample sizes for g=h=0, using Fisher’s z test and an alpha of 0.05:

[Figure: false positive rates as a function of rho and sample size, g=0, h=0, Fisher’s z]

So all is fine when sampling from normal distributions: the type I error rate, or proportion of false positives, is at the nominal 0.05 level (marked by a horizontal black line). The values are not exactly 0.05 because of the limited number of simulations, but they’re close enough.

What happens if we sample from slightly asymmetric populations instead (g=0.2)? Relative to the g=0 condition, the number of false positives increases a bit over 0.05 on average, and more so with increasing rho.

[Figure: false positive rates, g=0.2, h=0, Fisher’s z]

Things get worse if we sample from symmetric distributions in which outliers are likely (g=0, h=0.2):

[Figure: false positive rates, g=0, h=0.2, Fisher’s z]

If rho = 0 then the proportion of false positives is still around 0.05, but it increases rapidly with rho, an effect exacerbated by sample size (a pattern already reported in Duncan & Layard, 1973). That’s a very bad feature, because, as we will see below, very large sample sizes are required to detect differences between correlation coefficients. Unless of course our experimental samples come from normal distributions, but that’s unlikely and usually unknown, so why make such a gamble? (A cursory look at the literature suggests that skewed distributions and outliers are frequent in correlation analyses — Rousselet & Pernet, 2012.)

If we sample from asymmetric distributions in which outliers are likely (g=h=0.2), it just gets slightly worse than in the previous example:

g02h02_rho0_fish

What’s the alternative? According to Wilcox (2009), a percentile bootstrap can be used to compare Pearson’s correlations, leading to satisfactory proportions of false positives. If instead of Fisher’s z we use a percentile bootstrap to compare Pearson’s correlations when g=h=0.2, we get the following results:

g02h02_rho0_boot

Still above the expected 0.05 but much better than Fisher’s z!

To compare the two correlation coefficients using the percentile bootstrap, we proceed like this:

  • sample participants with replacement, independently in each group;
  • compute the two correlation coefficients based on the bootstrap samples;
  • save the difference between correlations;
  • execute the previous steps many times;
  • use the distribution of bootstrap differences to derive a confidence interval.


Usually a 95% confidence interval is defined as the 2.5th and 97.5th quantiles of the bootstrap distribution, but for Pearson’s correlation different quantiles are used depending on the sample size. This adjusted procedure is implemented in the R function twopcor() from Rand Wilcox. To compare two robust correlation coefficients, the adjustment is not necessary; that simpler procedure is implemented in twocor(). If for instance we use Spearman’s correlation when g=h=0.2, we get these results:
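The bootstrap steps above can be sketched as follows for Spearman’s correlation, using base R only (twocor() in Rand Wilcox's code is the full implementation; the function name here is illustrative):

```r
# Sketch: percentile bootstrap CI for the difference between two
# independent Spearman's correlations.
boot.cor.diff <- function(x1, y1, x2, y2, nboot = 2000, alpha = 0.05) {
  diff <- numeric(nboot)
  for (b in seq_len(nboot)) {
    id1 <- sample(length(x1), replace = TRUE)  # resample pairs, group 1
    id2 <- sample(length(x2), replace = TRUE)  # resample pairs, group 2
    diff[b] <- cor(x1[id1], y1[id1], method = "spearman") -
               cor(x2[id2], y2[id2], method = "spearman")
  }
  quantile(diff, probs = c(alpha / 2, 1 - alpha / 2))  # percentile CI
}

set.seed(44)
x1 <- rnorm(50); y1 <- 0.5 * x1 + rnorm(50)  # correlated pair
x2 <- rnorm(50); y2 <- rnorm(50)             # independent pair
boot.cor.diff(x1, y1, x2, y2)
```

If the confidence interval excludes zero, we declare the two correlations different.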

g02h02_rho0_spea

The proportion of false positives is at the nominal level and doesn’t change with rho! I’ll report simulations using the percentage bend correlation and the winsorised correlation, two other robust estimators, in another post.

True positives

To assess true positives, we first consider differences of 0.1: that is, if rho1 is 0, rho2 is 0.1, and so on for different rhos. Here are the results for g=h=0 using Fisher’s z:

diff01_g0h0_varyrho_fish

For a difference of 0.1, power is overall very low, and it depends strongly on rho. The dependence of power on rho follows logically from the reduced sampling variability with increasing rho, which is demonstrated in a previous post. That post also illustrates the asymmetry of the sampling distributions when rho is different from zero. More generally, bounded distributions tend to be asymmetric, which implies that normality assumptions are routinely violated in psychology.

Changing g or h to 0.2 lowers power. Here is what happens with h=0.2 for instance:

diff01_g0h02_varyrho_fish

But it’s difficult to make comparisons across figures. I would need to plot power differences, which gets tedious for a blog post. So instead we can directly investigate the effect of g and h for fixed rho values. Here we only consider g.

Vary g

False positives

First, we consider the case where rho1 = rho2 = 0.3, we vary g and keep h=0. Again, asymmetry has a strong effect on false positives from Fisher’s z test.

rho03_diff00_h0_varyg_fish

Using the bootstrap gives better results, but still with inflated false positives.

rho03_diff00_h0_varyg_boot

Spearman + bootstrap is better behaved:

rho03_diff00_h0_varyg_spea

True positives

What about power? Let’s consider rho1 = 0.3 and rho2 = 0.5 for Fisher’s z. Increasing asymmetry reduces power.

rho03_diff02_h0_varyg_fish

The percentile bootstrap gives even worse results:

rho03_diff02_h0_varyg_boot

Spearman + bootstrap is not affected by asymmetry at all it seems:

rho03_diff02_h0_varyg_spea

But again, notice the large number of trials required to detect an effect. Here we have a population difference of 0.2, and even assuming normality, 80% power requires well over 250 observations! And that’s if you’re ok with being wrong 20% of the time when there is an effect — seems quite a gamble. For 90% power you will need over 350 observations! But again that’s assuming normality… I haven’t looked at the effect of h on power in this scenario yet.

Finally, let’s consider a simulation in which the rhos are set to the values reported in Figure 3A of Davis et al. (2008): rho1 = 0, rho2 = 0.6. That’s a massive and unlikely difference, but let’s look at how many observations we need even in this improbable case.

rho0_diff06_h0_varyg_fish

To achieve at least 80% power given an expected population difference of 0.6 and g=0, the minimum sample size is 36 observations. For 90% power, the minimum sample size is 47 observations.

Now with g=1, to achieve at least 80% power the minimum sample size is 61 observations; to achieve at least 90% power the minimum sample size is 91 observations.

Contrast these results with the results obtained using a combination of Spearman + bootstrap:

rho0_diff06_h0_varyg_spea

To achieve at least 80% power given an expected population difference of 0.6 and g=0, the minimum sample size is 43 observations. For 90% power, the minimum sample size is 56 observations.

Now with g=1, to achieve at least 80% power the minimum sample size is 42 observations; to achieve at least 90% power the minimum sample size is 55 observations.

So under normality, Pearson + Fisher is more powerful than Spearman + bootstrap, but power can drop a lot for asymmetric distributions. Spearman is very resilient to asymmetry.

Pearson + bootstrap doesn’t fare well:

rho0_diff06_h0_varyg_boot

A good reminder that the bootstrap is not a magic recipe that can handle all situations well (Rousselet, Pernet & Wilcox, 2023).


UPDATE: 2019-07-23

For the normal case (g=h=0), we look at the proportion of true positives (power) for the difference between Pearson’s correlations using Fisher’s r-to-z transform. We systematically vary the sample size n, rho1, and the difference between rho1 and rho2. The title of each facet indicates rho1. The difference between rho1 and rho2 is colour coded. For instance, for rho1=0, a 0.1 difference means that rho2=0.1. For rho1=0.8, only a difference of 0.1 is considered, meaning that rho2=0.9. So, unless we deal with large differences, we need very large numbers of trials to detect differences between independent correlation coefficients.



Conclusion

Given the many problems associated with Pearson’s correlation, it’s probably wise to choose some other, robust alternative (Pernet et al. 2012). Whatever methods you choose, it seems that very large sample sizes are required to detect effects, certainly much larger than I expected before I ran the simulations. I’ll report on the behaviour of other methods in another post. Meanwhile, consider published correlation analyses from small samples with a grain of salt!

Given the present results, I would recommend using Spearman + bootstrap to compare correlation coefficients. You could also consider a skipped Spearman, which is robust to multivariate outliers, but with long computation times relative to the other techniques mentioned above (Pernet et al. 2012).

References

Berry, K.J. & Mielke, P.W. (2000) A Monte Carlo investigation of the Fisher Z transformation for normal and nonnormal distributions. Psychol Rep, 87, 1101-1114.

Davis, S.W., Dennis, N.A., Daselaar, S.M., Fleck, M.S. & Cabeza, R. (2008) Que PASA? The posterior-anterior shift in aging. Cereb Cortex, 18, 1201-1209.

Diedenhofen, B. & Musch, J. (2015) cocor: a comprehensive solution for the statistical comparison of correlations. PLoS One, 10, e0121945.

Duncan, G.T. & Layard, M.W.J. (1973) Monte-Carlo Study of Asymptotically Robust Tests for Correlation-Coefficients. Biometrika, 60, 551-558.

Gelman, A. & Stern, H. (2006) The difference between “significant” and “not significant” is not itself statistically significant. Am Stat, 60, 328-331.

Hoaglin, D.C. (1985) Summarizing Shape Numerically: The g-and-h Distributions. In Exploring Data Tables, Trends, and Shapes, pp. 461-513. John Wiley & Sons. https://doi.org/10.1002/9781118150702.ch11

Nieuwenhuis, S., Forstmann, B.U. & Wagenmakers, E.J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci, 14, 1105-1107.

Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.

Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.

Rousselet, G.A., Pernet, C.R. & Wilcox, R.R. (2023) An introduction to the bootstrap: A versatile method to make inferences by using data-driven simulations. Meta-Psychology, 7. https://doi.org/10.15626/MP.2019.2058

Wilcox, R.R. (2009) Comparing Pearson Correlations: Dealing with Heteroscedasticity and Nonnormality. Commun Stat-Simul C, 38, 2220-2234.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8.42.1-8.42.30.

R functions for the hierarchical shift function

The hierarchical shift function presented in the previous post is now available in the `rogme` R package. Here is a short demo.

Get the latest version of `rogme`:

# install.packages("devtools")
devtools::install_github("GRousselet/rogme")
library(rogme)
library(tibble)

Load data and compute hierarchical shift function:

df <- flp # get reaction time data - for details `help(flp)`
# Compute shift functions for all participants
out <- hsf(df, rt ~ condition + participant)

unnamed-chunk-21-1

Because of the large number of participants, the confidence intervals are too narrow to be visible. So let’s subset a random sample of participants to see what can happen with a smaller sample size:

set.seed(22) # subset a random sample of participants
id <- unique(df$participant)
df <- subset(df, participant %in% sample(id, 50, replace = FALSE))
out <- hsf(df, rt ~ condition + participant)
plot_hsf(out)

unnamed-chunk-25-1

Want to estimate the quartiles only?

out <- hsf(df, rt ~ condition + participant, qseq = c(.25, .5, .75))
plot_hsf(out)

unnamed-chunk-27-1

Want to reverse the comparison?

out <- hsf(df, rt ~ condition + participant, todo = c(2,1))
plot_hsf(out)

unnamed-chunk-26-1

P values are here:

out$pvalues

P values adjusted for multiple comparisons using Hochberg’s method:

out$adjusted_pvalues 

Percentile bootstrap version:

set.seed(8899)
out <- hsf_pb(df, rt ~ condition + participant)

Plot bootstrap highest density intervals – default:

plot_hsf_pb(out) 

unnamed-chunk-40-1

Plot distributions of bootstrap samples of group differences. Bootstrap distributions are shown in orange. Black dot marks the mode. Vertical black lines mark the 50% and 90% highest density intervals.

plot_hsf_pb_dist(out)

 

unnamed-chunk-41-1

For more examples, a vignette is available on GitHub.

Feedback would be much appreciated: don’t hesitate to leave a comment or to get in touch directly.

Hierarchical shift function: a powerful alternative to the t-test

In this post I introduce a simple yet powerful method to compare two dependent groups: the hierarchical shift function. The code is on GitHub. More details are in Rousselet & Wilcox (2019), with a reproducibility package on figshare.

Let’s consider different situations in a hierarchical setting: we’ve got trials from 2 conditions in several participants. Imagine we collected data from one participant and the results look like this:

unnamed-chunk-3-1

These fake reaction time data were created by sampling from ex-Gaussian distributions. Here the two populations are shifted by a constant, so we expect a uniform shift between the two samples. Later we’ll look at examples in which differences are strongest in early responses, in late responses, and in spread.

To better understand how the distributions differ, let’s look at a shift function, in which the differences between the deciles of the two conditions are plotted as a function of the deciles in condition 1 – see details in Rousselet et al. (2017). The decile differences are all negative, showing stochastic dominance of condition 2 over condition 1. The function is not flat because of random sampling and limited sample size.

unnamed-chunk-4-1
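If you want to try this yourself, a bare-bones shift function for one participant can be computed like this, using base R's quantile() for simplicity rather than the Harrell-Davis estimator used in Rousselet et al. (2017); the data are made up:

```r
# Sketch: decile differences between two conditions for one participant.
set.seed(3)
cond1 <- 300 + rexp(100, rate = 1/100)  # fake right-skewed reaction times (ms)
cond2 <- 350 + rexp(100, rate = 1/100)  # same shape, shifted by a constant
probs <- seq(0.1, 0.9, 0.1)             # deciles
dec1 <- quantile(cond1, probs, type = 8)
dec2 <- quantile(cond2, probs, type = 8)
dec1 - dec2  # plotted as a function of dec1 in a shift function
```

With a uniform shift the population decile differences are all -50 ms; sampling variability makes the empirical function bumpy, especially in the upper deciles.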

Now, let’s say we collected 100 trials per condition from 30 participants. How do we proceed? There are a variety of approaches available to quantify distribution differences. Ideally, such data would be analysed using a multi-level model, including for instance ex-Gaussian fits, random slopes and intercepts for participants, item analyses… This can be done using the lme4 or brms R packages. However, in my experience, in neuroscience and psychology articles, the most common approach is to collapse the variability across trials into a single number per participant and condition to be able to perform a paired t-test: typically, the mean is computed across trials for each condition and participant, then the means are subtracted, and the distribution of mean differences is entered into a one-sample t-test. Obviously, this strategy throws away a huge amount of information! And the results of such second-tier t-tests are difficult to interpret: a positive test leaves us wondering exactly how the distributions differ; a negative test is ambiguous – beside avoiding the ‘absence of evidence is not evidence of absence’ classic error, we also need to check if the distributions do not differ in other aspects than the mean. So what can we do?

Depending on how conditions differ, looking at other aspects of the data than the mean can be more informative. For instance, in Rousselet & Wilcox (2019), we consider group comparisons of individual medians. Considering that the median is the second quartile, looking at the other quartiles can be of theoretical interest to investigate effects in early or later parts of distributions. This could be done in several ways, for instance by making inferences on the first quartile (Q1) or the third quartile (Q3). If the goal is to detect differences anywhere in the distributions, a more systematic approach consists in quantifying differences at multiple quantiles. Here we consider the case of the deciles, but other quantiles could be used. First, for each participant and each condition, the sample deciles are computed over trials. Second, for each participant, condition 2 deciles are subtracted from condition 1 deciles – we’re dealing with a within-subject (repeated-measure) design. Third, for each decile, the distribution of differences is subjected to a one-sample test. Fourth, a correction for multiple comparisons is applied across the 9 one-sample tests. I call this procedure a hierarchical shift function. There are many options available to implement this procedure and the example used here is not the definitive answer: the goal is simply to demonstrate that a relatively simple procedure can be much more powerful and informative than standard approaches.

In creating a hierarchical shift function we need to make three choices: a quantile estimator, a statistical test to assess quantile differences across participants, and a correction for multiple comparisons technique. The deciles were estimated using type 8 from the base R quantile() function (see justification in Rousselet & Wilcox, 2019). The group comparisons were performed using a one-sample t-test on 20% trimmed means, which performs well in many situations, including in the presence of outliers. The correction for multiple comparisons employed Hochberg’s strategy (Hochberg, 1988), which guarantees that the probability of at least one false positive will not exceed the nominal level as long as the nominal level is not exceeded for each quantile. 
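The pipeline above can be sketched end-to-end in base R with fake data. The hsf() function in `rogme` is the real implementation; the one-sample trimmed-mean test below follows the standard Tukey-McLaughlin approach used by Wilcox, and the function name is illustrative:

```r
# One-sample test on a 20% trimmed mean (Tukey-McLaughlin approach).
trim.onesample.p <- function(x, tr = 0.2) {
  n <- length(x)
  g <- floor(tr * n)
  xs <- sort(x)
  xw <- c(rep(xs[g + 1], g), xs[(g + 1):(n - g)], rep(xs[n - g], g))  # winsorise
  se <- sqrt(var(xw)) / ((1 - 2 * tr) * sqrt(n))  # SE of the trimmed mean
  2 * pt(-abs(mean(x, trim = tr) / se), df = n - 2 * g - 1)
}

set.seed(5)
np <- 30                     # participants
probs <- seq(0.1, 0.9, 0.1)  # deciles
# Step 1 & 2: decile differences, one row per participant, one column per decile
dd <- t(replicate(np, {
  c1 <- rnorm(100, 500, 50)  # fake trials, condition 1
  c2 <- rnorm(100, 520, 50)  # fake trials, condition 2
  quantile(c1, probs, type = 8) - quantile(c2, probs, type = 8)
}))
# Step 3: one-sample test on 20% trimmed means at each decile
pvals <- apply(dd, 2, trim.onesample.p)
# Step 4: Hochberg's correction across the 9 tests
p.adjust(pvals, method = "hochberg")
```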

In Rousselet & Wilcox (2019), we consider power curves for the hierarchical shift function (HSF) and contrast them to other approaches: by design, HSF is sensitive to more types of differences than any standard approach using the mean or a single quantile. Another advantage of HSF is that the location of the distribution difference can be interrogated, which is impossible if inferences are limited to a single value.

Here is what the hierarchical shift function looks like for our uniform shift example:

unnamed-chunk-7-1

The decile differences between conditions are plotted for each participant (colour coded) and the group 20% trimmed means are superimposed in black. Differences are pretty constant across deciles, suggesting a uniform shift. Most participants have shift functions entirely negative – a case of stochastic dominance of one condition over the other. There is growing uncertainty as we consider higher deciles, which is expected from measurements of right skewed distributions.

We can add confidence intervals:

unnamed-chunk-9-1

P values are available in the GitHub code.

Instead of standard parametric confidence intervals, we can also consider percentile bootstrap confidence intervals (or highest density intervals), as done here:

unnamed-chunk-14-1

Distributions of bootstrap estimates can be considered cheap Bayesian posterior distributions. They also contain useful information not captured by simply reporting confidence intervals.

Here we plot them using geom_halfeyeh() from tidybayes. 

unnamed-chunk-15-1

The distributions of bootstrap estimates of the group 20% trimmed means are shown in orange, one for each decile. Along the base of each distribution, the black dot marks the mode and the vertical lines mark the 50% and 90% highest density intervals.

Nice hey?! Reporting a figure like that is dramatically more informative than reporting a P value and a confidence interval from a t-test!

A bootstrap approach can also be used to perform a cluster correction for multiple comparisons – see details on GitHub. Preliminary simulations suggest that the approach can provide a substantial increase in power over Hochberg’s correction – more on that in another post.

Let’s look at 3 more examples, just for fun…

Example 2: early difference

Example participant:

unnamed-chunk-17-1

Shift function:

unnamed-chunk-18-1

Hierarchical shift function with confidence intervals:

unnamed-chunk-22-1

Percentile bootstrap estimate densities:

unnamed-chunk-28-1

Example 3: difference in spread

Example participant:

unnamed-chunk-29-1

Shift function:

unnamed-chunk-30-1

Hierarchical shift function with confidence intervals:

unnamed-chunk-34-1

Percentile bootstrap estimate densities:

unnamed-chunk-40-1

Example 4: late difference

Example participant:

unnamed-chunk-41-1

Shift function:

unnamed-chunk-42-1

Hierarchical shift function with confidence intervals:

unnamed-chunk-46-1

Percentile bootstrap estimate densities:

unnamed-chunk-52-1

Conclusion

The hierarchical shift function can be used to achieve two goals: 

  • to screen data for potential distribution differences using p values, without limiting the exploration to a single statistic like the mean;
  • to illustrate and quantify how distributions differ.

I think of the hierarchical shift function as the missing link between t-tests and multi-level models. I hope it will help a few people make sense of their data and maybe nudge them towards proper hierarchical modelling.

R functions for the parametric hierarchical shift function are available in the rogme package. I also plan bootstrap functions. Then I’ll tackle the case of 2 independent groups, which requires a third level quantifying differences of differences.

 

Cluster correction for multiple dependent comparisons

In this post I explain the benefits of applying cluster based statistics, developed for brain imaging applications, to other experimental designs, in which tests are correlated. Here are some examples of such designs:

  • different groups of rats are tested with increasing concentrations of a molecule;
  • different groups of humans or the same humans are tested with stimulations of different intensities or durations (e.g. in neuro/psych it could be TMS, contrast, luminance, priming, masking, SOA);
  • pain thresholds are measured on contiguous patches of skin;
  • insects are sampled from neighbouring fields;
  • participants undergo a series of training sessions. 

In these examples, whatever is measured leads to statistical tests that are correlated in one or a combination of factors: time, space, stimulus parameters. In the frequentist framework, if the outcome of the family of tests is corrected for multiple comparisons using standard procedures (Bonferroni, Hochberg etc.), power will decrease with the number of tests. Cluster-based methods for multiple comparison correction can keep false positives at the nominal level (say 0.05) without compromising power.

These types of dependencies can also be explicitly modelled using Gaussian processes (for a Bayesian example, see McElreath, 2018, chapter 13). Cluster-based statistics are much simpler to use, but they do not provide the important shrinkage afforded by hierarchical methods…  

Cluster-based statistics

To get started, let’s consider an example involving a huge number of correlated tests. In this example, measurements are made at contiguous points in space (y axis) and time (x axis). The meaning of the axes is actually irrelevant – what matters is that the measurements are contiguous. In the figure below, left panel, we define our signal, which is composed of 2 clusters of high intensities among a sea of points with no effect (dark blue = 0). Fake measurements are then simulated by adding white noise to each point. By doing that 100 times, we obtain 100 noisy maps. The mean of these noisy maps is shown in the right  panel.

fig1

We also create 100 maps composed entirely of noise. Then we perform a t-test for independent groups at each point in the map (n=100 per group). 

fig2

What do we get? If we use a threshold of 0.05, we get two nice clusters of statistically significant tests where they are supposed to be. But we also get many false positives. If we try to get rid of the false positives by lowering the threshold, it works to some extent, but at the cost of removing true effects. Even with a threshold of 0.0005, there are still many false positives, and the clusters of true signal have been seriously truncated.

fig3

The problem is that lowering the alpha is a brute force technique that does not take into account information we have about the data: measurement points are correlated. There is a family of techniques that can correct for multiple comparisons by taking these dependencies into account: cluster-based statistics (for an introduction, see Maris & Oostenveld, 2007). These techniques control the family-wise error rate but maintain high power. The family-wise error rate (FWER) is the probability of obtaining at least one significant test among a family of tests when the null hypothesis is true.

When we use a frequentist approach and perform a family of tests, we increase the probability of reporting false positives. The multiple comparison problem is difficult to tackle in many situations because of the need to balance false positives and false negatives. Probably the best known and most widely applied correction for multiple comparison technique is Bonferroni, in which the alpha threshold is divided by the number of comparisons. However, this procedure is notoriously conservative, as it comes at the cost of lower power. Many other techniques have been proposed (I don’t know of a good review paper on this topic – please add a comment if you do).
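To make the inflation concrete, here is a quick base R illustration with made-up p values; Bonferroni and Hochberg are each one call to p.adjust():

```r
# With m independent tests at alpha = 0.05, the chance of at least one
# false positive is 1 - 0.95^m.
m <- 5
1 - 0.95^m  # about 0.226: much higher than 0.05

p <- c(0.001, 0.01, 0.02, 0.04, 0.2)  # made-up p values
p.adjust(p, method = "bonferroni")    # multiply each p by m (capped at 1)
p.adjust(p, method = "hochberg")      # step-up procedure, less conservative
```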

In the example below, two time-courses are compared point-by-point. Panel a shows the mean time-courses across participants. Panel b shows the time-course of the t-test for 2 dependent groups (the same participants were tested in each condition). Panel c shows time-points at which significant t-tests were observed. Without correction, a large cluster of significant points is observed, but also a collection of smaller clusters. We know from physiology that some of these clusters are too small to be true so they are very likely false positives.

fig4_maris2007

Figure 1 from Maris & Oostenveld, 2007.

If we change the significance threshold using the Bonferroni correction for multiple comparisons, in these examples we remove all significant clusters but the largest one. Good job?! The problem is that our large cluster has been truncated: it now looks like the effect starts later and ends sooner. The cluster-based inferences do not suffer from this problem.

Applied to our 2D example with two clusters embedded in noise, the clustering technique identifies 17,044 clusters of significant t-tests. After correction, only 2 clusters are significant!

fig6

So how do we compute cluster-based statistics? The next figure illustrates the different steps. At the top, we start with a time-course of F values, from a series of point-by-point ANOVAs. Based on some threshold, say the critical F value for alpha = 0.05, we identify 3 clusters. The clusters are formed based on contiguity. For each cluster we then compute a summary statistic: it could be its duration (length), its height (maximum), or its sum. Here we use the sum. Now we ask a different question: for each cluster, how likely is it to obtain that cluster sum by chance? To answer this question, we use non-parametric statistics to estimate the distribution expected by chance.

fig5

There are several ways to achieve this goal using permutation, percentile bootstrap or bootstrap-t methods (Pernet et al., 2015). Whatever technique we use, we simulate time-courses of F values expected by chance, given the data. For each of these simulated time-courses, we apply a threshold, identify clusters, take the sum of each cluster and save the maximum sum across clusters. If we do that 1,000 times, we end up with a collection of 1,000 cluster sums (shown in the top right corner of the figure). We then sort these values and identify a quantile of interest, say the 0.95 quantile. Finally, we use this quantile as our cluster-based threshold: each original cluster sum is then compared to that threshold. In our example, out of the 3 original clusters, the largest 2 are significant after cluster-based correction for multiple comparisons, whereas the smallest one is not. 
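The cluster-forming step itself is simple. Here is a minimal base R sketch (the function name is illustrative), leaving out the resampling loop that builds the null distribution of maximum cluster sums:

```r
# Sketch: threshold a vector of F values, form clusters of contiguous
# significant points, and compute each cluster's sum.
cluster.sums <- function(fvals, fcrit) {
  sig <- fvals > fcrit
  r <- rle(sig)                  # runs of contiguous TRUE/FALSE
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  keep <- which(r$values)        # runs of significant tests = clusters
  vapply(keep, function(k) sum(fvals[starts[k]:ends[k]]), numeric(1))
}

fvals <- c(1, 5, 6, 2, 1, 4, 7, 8, 1)
cluster.sums(fvals, fcrit = 3)   # two clusters: sums 11 and 19
```

In the full procedure, the same function would be applied to each resampled time-course, saving the maximum cluster sum each time to build the null distribution.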

Simulations

From the description above, it is clear that using cluster-based statistics requires a few choices:

  • a method to estimate the null distribution;
  • a method to form clusters;
  • a choice of cluster statistics;
  • a choice of statistic to form the null distribution (max across clusters for instance);
  • a number of resamples…

Given a set of choices, we need to check that our method does what it’s supposed to do. So let’s run a few simulations…

5 dependent groups

First we consider the case of 5 dependent groups. The 5 measurements are correlated in time or space or some other factor, such that clusters can be formed by simple proximity: 2 significant tests are grouped in a cluster if they are next to each other. Data are normally distributed, the population SD is 1, and the correlation between repeated measures is 0.75. Here is the FWER after 10,000 simulations, in which we perform 5 one-sample t-tests on means.

fig7_fwer

Without correction for multiple comparisons, the probability of getting at least one false positive is well above the nominal level (here 0.05). The grey area marks Bradley’s (1978) satisfactory range of false positives (between 0.025 and 0.075). Bonferroni’s and Hochberg’s corrections dramatically reduce the FWER, as expected. For n = 10, the FWER remains quite high, but drops within the acceptable range for higher sample sizes. But these corrections tend to be conservative, leading to FWER systematically under 0.05 from n = 30. Using a cluster-based correction, the FWER is near the nominal level at all sample sizes.

The cluster correction was done using a bootstrap-t procedure, in which the original data are first mean-centred, so that the null hypothesis is true, and t distributions expected by chance are estimated by sampling the centred data with replacement 1,000 times, and each time computing a series of t-test. For each bootstrap, a max cluster sum statistics was saved and the 95th quantile of this distribution was used to threshold the original clusters.

Next we consider power. We sample from a population with 5 dependent conditions: there is no effect in conditions 1 and 5 (mean = 0), the mean is 1 for condition 3, and the mean is 0.5 for conditions 2 and 4. We could imagine a TMS experiment where participants first receive a sham stimulation, then stimulation of half intensity, full, half, and sham again… Below is an illustration of a random sample of data from 30 participants.

fig8_5group_example

If we define power as the probability of observing a significant t-test simultaneously in conditions 2, 3 and 4 (the conditions with a true effect), we get these results:

fig9_power_all

Maximum power is always obtained in the condition without correction, by definition. The cluster correction always reaches maximum possible power, except for n = 10. In contrast, Bonferroni and Hochberg lead to lower power, with Bonferroni being the most conservative. For a desired long run power value, we can use interpolation to find out the matching sample size. To achieve at least 80% power, the minimum sample size is:

  • 39 observations for the cluster test;
  • 50 observations for Hochberg;
  • 57 observations for Bonferroni.

7 dependent groups

If we run the same simulation but with 7 dependent groups instead of 5, the pattern of results does not change, but the FWER increases if we do not apply any correction for multiple comparisons.

fig10_7_fwer

As for power, if we keep a cluster of effects with means 0.5, 1, 0.5 for conditions 3, 4 and 5, and zero effect for conditions 1, 2, 6 and 7, the power advantage of the cluster test increases. Now, to achieve at least 80% power, the minimum sample size is:

  • 39 observations for the cluster test;
  • 56 observations for Hochberg;
  • 59 observations for Bonferroni.

fig11_power_7_all

7 independent groups

Finally, we consider a situation with 7 independent groups. For instance, measurements were made in 7 contiguous fields. So the measurements are independent (done at different times), but there is spatial dependence between fields, so that we would expect that if a measurement is high in one field, it is likely to be high in the next field too. Here are the FWER results, showing a pattern similar to that in the previous examples:

fig12_7ind_fwer

The cluster correction does the best job at minimising false positives, whereas Bonferroni and Hochberg are too liberal for sample sizes 10 and 20.

To look at power, I created a simulation with a linear pattern: there is no effect at positions 1 and 2, then a linear increase up to a maximum effect size of 2 at position 7. Here is the sequence of effect sizes:

c(0, 0, 0.4, 0.8, 1.2, 1.6, 2)

And here is an example of a random sample with n = 70 measurements per group:

fig13_7ind_group_example

In this situation, again the cluster correction dominates the other methods in terms of power. For instance, to achieve at least 80% power, the minimum sample size is:

  • 50 observations for the cluster test;
  • 67 observations for Hochberg;
  • 81 observations for Bonferroni.

fig14_power_7ind_all

Conclusion

I hope the examples above have convinced you that cluster-based statistics could dramatically increase your statistical power relative to standard techniques used to correct for multiple comparisons. Let me know if you use a different correction method and would like to see how they compare. Or you could re-use the simulation code and give it a go yourself. 

Limitations: cluster-based methods make inferences about clusters, not about individual tests. Also, these methods require a threshold to form clusters, which is arbitrary and not convenient if you use non-parametric tests that do not come with p values. An alternative technique eliminates this requirement, instead forming a statistic that integrates across many potential cluster thresholds (TFCE, Smith & Nichols, 2009; Pernet et al. 2015). See a clear explanation in this blog post by Benedikt Ehinger. However, TFCE, like the cluster methods presented here, suffers from non-trivial inference issues. In the words of @ten_photos:

“I’m instead rather drawn to a pragmatic approach […] using a concrete interpretation of the conclusions drawn from rejecting one individual test:

– voxel-wise inference, reject null at a voxel, conclude signal present at that voxel;

– cluster-wise, reject null for that cluster, concluding that signal present at one or more voxels in the cluster;

– TFCE inference, reject the null at a voxel, conclude there exists a cluster containing that voxel for which we reject the null, concluding that signal present at one or more voxels in that contributing cluster.”

Code

Matlab code for ERP analyses is available on figshare and as part of the LIMO EEG toolbox. The code can be used for other purposes – just pretend you’re dealing with one EEG electrode and Bob’s your uncle.

R code to reproduce the simulations is available on github. I’m planning to develop an R package to cover different experimental designs, using t-tests on means and trimmed means. In the meantime, if you’d like to apply the method but can’t make sense of my code, don’t hesitate to get in touch and I’ll try to help.

References

Bradley, J. V. (1978). Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144–152. doi: 10.1111/j.2044-8317.1978.tb00581.x. 

Maris, E. & Oostenveld, R. (2007) Nonparametric statistical testing of EEG- and MEG-data. Journal of neuroscience methods, 164, 177-190.

McElreath, R. (2018) Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.

Oostenveld, R., Fries, P., Maris, E. & Schoffelen, J.M. (2011) FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci, 2011, 156869.

Pernet, C.R., Chauveau, N., Gaspar, C. & Rousselet, G.A. (2011) LIMO EEG: a toolbox for hierarchical LInear MOdeling of ElectroEncephaloGraphic data. Comput Intell Neurosci, 2011, 831409.

Pernet, C.R., Latinus, M., Nichols, T.E. & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of neuroscience methods, 250, 85-93.

Rousselet, G.A. (2016) Introduction to robust estimation of ERP data. figshare. Fileset. https://doi.org/10.6084/m9.figshare.3501728.v1

Smith, S.M. & Nichols, T.E. (2009) Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage, 44, 83-98.

Illustration of continuous distributions using quantiles

In this post I’m going to show you a few simple steps to illustrate continuous distributions. As an example, we consider reaction time data, which are typically positively skewed and can differ between conditions in multiple ways. Reaction time distributions are also a rich source of information to constrain cognitive theories and models, so unless the distributions are at least illustrated, this information is lost – which is typically the case when distributions are summarised using a single value such as the mean. Other approaches not covered here include explicit mathematical models of decision making and fitting functions to model the shape of the distributions (Balota & Yap, 2011).

For our current example, I made up data for 2 independent groups with four patterns of differences:

  • no clear differences;

  • uniform shift between distributions;

  • mostly late differences;

  • mostly early differences.

The R code is on GitHub.

Scatterplots

For our first visualisation, we use geom_jitter() from ggplot2. The 1D scatterplots give us a good idea of how the groups differ but they’re not the easiest to read. The main reason is probably that we need to estimate local densities of points in different regions and compare them between groups.

figure_scatter

For the purpose of this exercise, each group (g1 and g2) is composed of 1,000 observations, so the differences in shapes are quite striking. With smaller sample sizes the evaluation of these graphs could be much more challenging.

Kernel density plots

Relative to scatterplots, I find that kernel density plots make the comparisons between groups much easier.

figure_kde

Improved scatterplots

Scatterplots and kernel density plots can be combined by using beeswarm plots. Here we create scatterplots shaped by local density using the geom_quasirandom() function from the ggbeeswarm package. Essentially, the function creates violin plots in which the constituent points are visible. 

figure_scat_quant

To make the plots even more informative, I’ve superimposed quantiles – here deciles computed using the Harrell-Davis quantile estimator. The deciles are represented by vertical black lines, with medians shown as thicker lines. The median tells us about the location of the bulk of the observations, and comparing the lower to upper quantiles lets us appreciate the amount of asymmetry within distributions. Comparing quantiles between groups gives us a sense of the amount of relative compression/expansion on each side of the distributions. This information would be lost if we only compared the medians.
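The Harrell-Davis estimator can be sketched in a few lines of base R: each quantile is estimated as a weighted sum of the order statistics, with weights derived from a beta distribution. This is a minimal version (the post’s own code uses Rand Wilcox’s hd() function, which implements the same estimator; the RT-like data below are made up):

```r
hd <- function(x, q = 0.5) {
  # Harrell-Davis estimate of the q-th quantile:
  # a beta-weighted sum of the order statistics
  n <- length(x)
  a <- (n + 1) * q
  b <- (n + 1) * (1 - q)
  i <- seq_len(n)
  w <- pbeta(i / n, a, b) - pbeta((i - 1) / n, a, b)
  sum(w * sort(x))
}
set.seed(21)
g1 <- rgamma(1000, shape = 2, scale = 100) + 200  # made-up RT-like data
deciles <- sapply(seq(0.1, 0.9, 0.1), function(q) hd(g1, q))
```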

Quantile plots

If we remove the scatterplots and only show the quantiles, we obtain quantile plots, which provide a compact description of how distributions differ (please post a comment if you know of older references using quantile plots). Because the quantiles are superimposed, they are easier to compare than in the previous scatterplots. To help with the group comparisons, I’ve also added plots of the quantile differences, which emphasise the different patterns of group differences.

figure_qplot

Vincentile plots

An alternative to quantiles is provided by Vincentiles, which are computed by sorting the data and splitting them into equi-populated bins (the same number of observations in each bin). The mean is then computed for each bin (Balota et al. 2008; Jiang et al. 2004). Below, means were computed for 9 equi-populated bins. As expected from the way they are computed, quantile plots and Vincentile plots look very similar for our large samples from continuous variables.

figure_vinc
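The computation of Vincentiles can be sketched directly from the definition above (a minimal version; the bins are only approximately equi-populated when n is not a multiple of the bin count, and the data are made up):

```r
vincentiles <- function(x, nbins = 9) {
  xs <- sort(x)
  bins <- cut(seq_along(xs), breaks = nbins, labels = FALSE)
  tapply(xs, bins, mean)  # one mean per equi-populated bin
}
set.seed(21)
g1 <- rgamma(1000, shape = 2, scale = 100) + 200  # made-up RT-like data
v <- vincentiles(g1, 9)
```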

Group quantile and Vincentile plots can be created by averaging quantiles and Vincentiles across participants (Balota & Yap, 2011; Ratcliff, 1979). This will be the topic of another post.

Delta plots

Related to quantile plots and Vincentile plots, delta plots show the difference between conditions, bin by bin (for each Vincentile) along the y-axis, as a function of the mean across conditions for each bin along the x-axis (De Jong et al., 1994). Not surprisingly, these plots have very similar shapes to the quantile difference plots we considered earlier. 

figure_delta
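A delta plot can be sketched from two conditions’ Vincentile means: the difference per bin goes on the y-axis, the mean across conditions per bin on the x-axis (the RT data below are made up for illustration):

```r
binmeans <- function(x, nbins = 9) {
  # mean of each equi-populated bin of the sorted data
  xs <- sort(x)
  tapply(xs, cut(seq_along(xs), nbins, labels = FALSE), mean)
}
set.seed(1)
rt1 <- rgamma(1000, shape = 2, scale = 100) + 200  # condition 1 (made up)
rt2 <- rgamma(1000, shape = 2, scale = 120) + 220  # condition 2 (made up)
v1 <- binmeans(rt1)
v2 <- binmeans(rt2)
delta <- v2 - v1      # y-axis: condition difference per bin
avg <- (v1 + v2) / 2  # x-axis: mean across conditions per bin
```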

Negative delta plots (nDP, delta plots with a negative slope) have received particular attention because of their theoretical importance (Ellinghaus & Miller, 2018; Schwarz & Miller, 2012).

Shift function

Delta plots are related to the shift function, a powerful tool introduced in the 1970s: it consists in plotting the difference between the quantiles of two groups as a function of the quantiles in one group, with some measure of uncertainty around the difference (Doksum, 1974; Doksum & Sievers, 1976; Doksum, 1977). It was later refined by Rand Wilcox (Rousselet et al. 2017). This modern version is shown below, with deciles estimated using the Harrell-Davis quantile estimator, and percentile bootstrap confidence intervals of the quantile differences. The sign of the difference is colour-coded (purple for negative, orange for positive).

figure_shift
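The core of the shift function can be sketched with a percentile bootstrap of the decile differences. For the full implementation (with the Harrell-Davis estimator and adjusted confidence intervals), see the code accompanying Rousselet et al. (2017); here quantile() is used to keep the sketch self-contained, and the data are made up:

```r
set.seed(8)
g1 <- rnorm(200, mean = 500, sd = 80)   # made-up RT data
g2 <- rnorm(200, mean = 540, sd = 100)
probs <- seq(0.1, 0.9, 0.1)
qdiff <- quantile(g2, probs) - quantile(g1, probs)  # decile differences
nboot <- 1000
# bootstrap the decile differences by resampling each group
bootdiff <- replicate(nboot, {
  quantile(sample(g2, replace = TRUE), probs) -
    quantile(sample(g1, replace = TRUE), probs)
})
ci <- apply(bootdiff, 1, quantile, probs = c(0.025, 0.975))  # 2 x 9 matrix
```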

Unlike the other graphical quantile techniques presented here, the shift function affords statistical inferences because of its use of confidence intervals (the shift function also comes in a few Bayesian flavours). It is probably one of the easiest ways to compare entire distributions, without resorting to explicit models of the distributions. But the shift function and the other graphical methods demonstrated in this post are not meant to compete with hierarchical models. Instead, they can be used to better understand data patterns within and between participants, before modelling attempts. They also provide powerful alternatives to the mindless application of t-tests and bar graphs, helping to nudge researchers away from the exclusive use of the mean (or the median) and towards considering the rich information available in continuous distributions.

References

Balota, D.A. & Yap, M.J. (2011) Moving Beyond the Mean in Studies of Mental Chronometry: The Power of Response Time Distributional Analyses. Curr Dir Psychol Sci, 20, 160-166.

Balota, D.A., Yap, M.J., Cortese, M.J. & Watson, J.M. (2008) Beyond mean response latency: Response time distributional analyses of semantic priming. J Mem Lang, 59, 495-523.

Clarke, E. & Sherrill-Mix, S. (2016) ggbeeswarm: Categorical Scatter (Violin Point) Plots.

De Jong, R., Liang, C.C. & Lauber, E. (1994) Conditional and Unconditional Automaticity – a Dual-Process Model of Effects of Spatial Stimulus – Response Correspondence. J Exp Psychol Human, 20, 731-750.

Doksum, K. (1974) Empirical Probability Plots and Statistical Inference for Nonlinear Models in the two-Sample Case. Ann Stat, 2, 267-277.

Doksum, K.A. (1977) Some graphical methods in statistics. A review and some extensions. Statistica Neerlandica, 31, 53-68.

Doksum, K.A. & Sievers, G.L. (1976) Plotting with Confidence – Graphical Comparisons of 2 Populations. Biometrika, 63, 421-434.

Ellinghaus, R. & Miller, J. (2018) Delta plots with negative-going slopes as a potential marker of decreasing response activation in masked semantic priming. Psychol Res, 82, 590-599.

Jiang, Y., Rouder, J.N. & Speckman, P.L. (2004) A note on the sampling properties of the Vincentizing (quantile averaging) procedure. J Math Psychol, 48, 186-195.

Ratcliff, R. (1979) Group Reaction-Time Distributions and an Analysis of Distribution Statistics. Psychol Bull, 86, 446-461.

Rousselet, G.A., Pernet, C.R. & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. The European journal of neuroscience, 46, 1738-1748.

Schwarz, W. & Miller, J. (2012) Response time models of delta plots with negative-going slopes. Psychon B Rev, 19, 555-574.

Power estimation for correlation analyses

Following the previous posts on small n correlations [post 1][post 2][post 3], in this post we’re going to consider power estimation (if you do not care about power, but you’d rather focus on estimation, this post is for you). 

To get started, let’s look at examples of n=1000 samples from bivariate populations with known correlations (rho), with rho increasing from 0.1 to 0.9 in steps of 0.1. For each rho, we draw a random sample and plot Y as a function of X. The variance of Y does not depend on X – there is homoscedasticity. Later we will look at heteroscedasticity, in which the variance of Y varies with X.

demo_homo_dist
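Such samples can be drawn with MASS::mvrnorm (MASS ships with R); a minimal sketch for one value of rho, with unit variances:

```r
library(MASS)
set.seed(21)
rho <- 0.5
sigma <- matrix(c(1, rho, rho, 1), nrow = 2)  # covariance matrix, unit variances
xy <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = sigma)
cor(xy[, 1], xy[, 2])  # sample r should be close to 0.5
```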

For the same distributions illustrated in the previous figure, we compute the proportion of positive Pearson’s correlation tests for different sample sizes. This gives us power curves (here based on simulations with 50,000 samples). We also include rho = 0 to determine the proportion of false positives.

figure_power_homo

Power increases with sample size and with rho. When rho = 0, the proportion of positive tests is the proportion of false positives. It should be around 0.05 for a test with alpha = 0.05. This is the case here, as Pearson’s correlation is well behaved for bivariate normal data.
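The power simulation can be sketched as follows: for a given rho and n, count the proportion of significant Pearson tests over many iterations (2,000 here instead of the post’s 50,000, to keep the sketch fast; the function name is mine):

```r
library(MASS)
set.seed(21)
sim_power <- function(rho, n, nsim = 2000, alpha = 0.05) {
  sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
  pvals <- replicate(nsim, {
    xy <- mvrnorm(n, mu = c(0, 0), Sigma = sigma)
    cor.test(xy[, 1], xy[, 2])$p.value
  })
  mean(pvals <= alpha)  # proportion of positive tests
}
pw <- sim_power(rho = 0.4, n = 46)  # should be near 0.80
```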

For a given expected population correlation and a desired long run power value, we can use interpolation to find out the matching sample size.

To achieve at least 80% power given an expected population rho of 0.4, the minimum sample size is 46 observations.

To achieve at least 90% power given an expected population rho of 0.3, the minimum sample size is 118 observations.

figure_power_homo_arrows

Alternatively, for a given sample size and a desired power, we can determine the minimum effect size we can hope to detect. For instance, given n = 40 and a desired power of at least 90%, the minimum effect size we can detect is 0.49.
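The interpolation step can be done with base R’s approx(): given power estimates over a grid of sample sizes, find the smallest n reaching the target power. The power values below are illustrative placeholders, not simulation results:

```r
nvec <- seq(10, 100, 10)
# illustrative power values over the grid of sample sizes
power <- c(0.15, 0.32, 0.47, 0.60, 0.71, 0.79, 0.85, 0.89, 0.93, 0.95)
target <- 0.80
# linear interpolation: sample size as a function of power
n_min <- ceiling(approx(x = power, y = nvec, xout = target)$y)
n_min  # smallest whole sample size reaching the target power
```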

So far, we have only considered situations where we sample from bivariate normal distributions. However, Wilcox (2012 p. 444-445) describes 6 aspects of data that affect Pearson’s r:

  • outliers

  • the magnitude of the slope around which points are clustered

  • curvature

  • the magnitude of the residuals

  • restriction of range

  • heteroscedasticity

The effect of outliers on Pearson’s and Spearman’s correlations is described in detail in Pernet et al. (2012) and Rousselet et al. (2012).

Next we focus on heteroscedasticity. Let’s look at Wilcox’s heteroscedasticity example (2012, p. 445). If we correlate variable X with variable Y, heteroscedasticity means that the variance of Y depends on X. Wilcox considers this example:

“X and Y have normal distributions with both means equal to zero. […] X and Y have variance 1 unless |X|>0.5, in which case Y has standard deviation |X|.”

Here is an example of such data:

demo_wilcox_dist
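Generating data following Wilcox’s recipe is straightforward (a minimal sketch with made-up variable names):

```r
set.seed(21)
n <- 1000
x <- rnorm(n)
# sd of Y is 1, unless |X| > 0.5, in which case it is |X|
sdy <- ifelse(abs(x) > 0.5, abs(x), 1)
y <- rnorm(n, mean = 0, sd = sdy)
```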

Next, Wilcox (2012) considers the effect of this heteroscedastic situation on false positives. We superimpose results for the homoscedastic case for comparison. In the homoscedastic case, as expected for a test with alpha = 0.05, the proportion of false positives is very close to 0.05 at every sample size. In the heteroscedastic case, instead of 5%, the proportion of false positives is between 12% and 19%. The proportion of false positives actually increases with sample size! That’s because the standard t statistic associated with Pearson’s correlation assumes homoscedasticity, so the formula is incorrect when there is heteroscedasticity.

figure_power_hetero_wilcox

As a consequence, when Pearson’s test is positive, it doesn’t always imply the existence of a correlation. There could be dependence due to heteroscedasticity, in the absence of a correlation.

Let’s consider another heteroscedastic situation, in which the variance of Y increases linearly with X. This could correspond for instance to situations in which cognitive performance or income are correlated with age – we might expect the variance amongst participants to increase with age.

We keep rho constant at 0.4 and increase the maximum variance from 1 (the homoscedastic case) to 9. That is, the variance of Y increases linearly from 1 to the maximum variance as a function of X.

demo_hetero_dist
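One way to generate this kind of linear heteroscedasticity is to rescale the noise in Y as a function of the rank of X. Note that this simple rescaling also attenuates the nominal correlation somewhat; the simulation code on GitHub handles the details, so treat this as a rough sketch:

```r
library(MASS)
set.seed(21)
n <- 1000
rho <- 0.4
maxvar <- 9
xy <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2))
x <- xy[, 1]
# variance of Y grows linearly with the rank of X, from 1 to maxvar
v <- seq(1, maxvar, length.out = n)[rank(x)]
y <- xy[, 2] * sqrt(v)
```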

For rho = 0, we can compute the proportion of false positives as a function of both sample size and heteroscedasticity. In the next figure, variance refers to the maximum variance. 

figure_power_hetero_rho0

From 0.05 in the homoscedastic case (max variance = 1), the proportion of false positives increases to 0.07-0.08 for a max variance of 9. This relatively small increase could have important consequences if hundreds of labs are engaged in fishing expeditions and publish everything with p<0.05. However, it seems we shouldn’t worry much about linear heteroscedasticity as long as sample sizes are sufficiently large and we report estimates with appropriate confidence intervals. An easy way to build confidence intervals when there is heteroscedasticity is to use the percentile bootstrap (see Pernet et al. 2012 for illustrations and Matlab code).
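A basic percentile bootstrap for Pearson’s r resamples pairs with replacement, recomputes r each time, and takes percentiles of the bootstrap distribution. A minimal sketch (Wilcox recommends adjusted percentile bounds for Pearson’s r, omitted here for brevity; the data are made up):

```r
set.seed(21)
n <- 100
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)  # made-up correlated data
nboot <- 2000
bootr <- replicate(nboot, {
  i <- sample(n, replace = TRUE)  # resample pairs, not x and y separately
  cor(x[i], y[i])
})
ci <- quantile(bootr, probs = c(0.025, 0.975))  # percentile bootstrap CI
```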

Finally, we can run the same simulation for rho = 0.4. Power progressively decreases with increasing heteroscedasticity. Put another way, with larger heteroscedasticity, larger sample sizes are needed to achieve the same power.

figure_power_hetero_rho04

We can zoom in:

figure_power_hetero_rho04_zoom

The vertical bars mark an increase of roughly 13 observations needed to maintain 80% power when going from a max variance of 1 to a max variance of 9. This decrease in power can be avoided by using the percentile bootstrap or robust correlation techniques, or both (Wilcox, 2012).

Conclusion

The results presented in this post are based on simulations. You could also use a sample size calculator for correlation analyses – for instance this one.

But running simulations has huge advantages. For instance, you can compare multiple estimators of association in various situations. In a simulation, you can also include as much information as you have about your target populations. For instance, if you want to correlate brain measurements with response times, there might be large datasets you could use to perform data-driven simulations (e.g. UK biobank), or you could estimate the shape of the sampling distributions to draw samples from appropriate theoretical distributions (maybe a gamma distribution for brain measurements and an exGaussian distribution for response times).

Simulations also put you in charge, instead of relying on a black box, which most likely will only cover Pearson’s correlation in ideal conditions, and not robust alternatives when there are outliers or heteroscedasticity or other potential issues.

The R code to reproduce the simulations and the figures is on GitHub.

References

Pernet, C.R., Wilcox, R. & Rousselet, G.A. (2012) Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front Psychol, 3, 606.

Rousselet, G.A. & Pernet, C.R. (2012) Improving standards in brain-behavior correlation analyses. Frontiers in human neuroscience, 6, 119.

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.