
Can we use cluster-based permutation tests to estimate MEG/EEG onsets?

[An updated version of the work in this post appears in this preprint]

Cluster-based statistics are a family of methods that aim to correct for multiple comparisons while preserving power when effects are distributed over time, space and other dimensions (Maris & Oostenveld, 2007; Smith & Nichols, 2009; Pernet et al., 2015; Rosenblatt et al., 2018; Fields & Kuperberg, 2020; Frossard & Renaud, 2022). These methods capitalise on the correlation of effects across measurements to control the family-wise error rate (FWER, the probability of at least one false positive): neighbouring effects that pass a threshold are lumped together and compared to a reference distribution, typically a null distribution established using a permutation or bootstrap approach (Pernet et al., 2015). These cheap multivariate techniques (I say cheap because although they take into account the distribution of effects over at least one dimension, no explicit modelling is required) are particularly popular in fMRI, EEG and MEG research (for instance, according to Google Scholar, Smith & Nichols (2009) has been cited 4945 times and Maris & Oostenveld (2007) 7053 times). Cluster-based methods have the potential to be applied to a wide array of problems, including simpler ones involving only a few comparisons, as I’ve argued in another post (see for instance the application to TMS latencies in Dugué et al., 2011). I’ve also used cluster-based statistics extensively in my EEG and behavioural research (Rousselet et al., 2014; Bieniek et al., 2016; Jaworska et al., 2020).

However, cluster-based methods are not without issues. Most notably, Sassenhagen & Draschkow (2019) reminded the community about problems with the interpretation of cluster-based inferences and suggested guidelines on how to report such results. This useful reminder has been heard, as demonstrated by the number of citations: 196 times according to the article’s publisher page and 320 times according to Google Scholar. Essentially, the main problem with cluster-based methods is that they provide p values at the cluster level only, or even for collections of clusters in hierarchical/meta-analytic methods such as TFCE (Smith & Nichols, 2009). There are no p values at individual testing points, and no error control for onsets and offsets. Consequently, cluster-based methods do not provide direct inferences about the localisation of effects. But does that mean we cannot use cluster-based methods to localise, for instance to estimate the onsets of effects?

Sassenhagen & Draschkow (2019) go on to report results from a simulation, which suggest that onset estimation using cluster-based inference is very noisy. They used a truncated Gaussian as a template to define a real onset to be estimated. Then, they compared 50 trials of noise only to 50 trials of noise + template, using cluster-sum F statistics – see cluster-sum explanation here. These cluster-sums were compared to a permutation distribution to determine statistical significance. For each simulation iteration, the first statistically significant point was declared the onset of the effect. Here is their figure of results, showing the distribution of estimated onsets across 10,000 simulation iterations:

The distribution of onsets reveals two important patterns:

  • Estimated onsets appear positively biased, with the mode higher than the real onset.
  • The distribution is asymmetric, with a thicker left tail than right tail.

Sassenhagen & Draschkow (2019) reported that “on >20% of runs, the effect onset was estimated too early; divergence of 40 ms or more were found at >10% of runs.”
No bias estimate was reported.

So, how bad are the results? We don’t know because no other method was used to estimate onsets, so we cannot put the results in perspective. Imagine a study that aimed to test the prediction of larger variance in EEG signals in population X relative to population Y, but only tested participants from population X. Having observed large absolute variance in X, what can we conclude without measurements in Y?

Less importantly, Sassenhagen & Draschkow (2019) used EEG data from one human participant as noise to add to their template in the simulations. The data illustrated in supplementary information look heavily contaminated by alpha noise – I could be wrong, but having recorded from hundreds of participants, it looks like the data are from a very sleepy participant. That’s a problem, because the claim is that there is a general issue with cluster-based inferences, not one specific to a particular participant.

Let’s address the comparison problem by contrasting cluster-based inferences with inferences from two other popular methods, the false discovery rate correction for multiple comparisons (FDR, Benjamini & Hochberg, 1995; Nichols & Hayasaka, 2003) and the maximum statistics correction for multiple comparisons (MAX, Holmes et al., 1996; Nichols & Holmes, 2002). FDR methods provide a weak control of the FWER by controlling the number of false positives among the positive tests; when there is no effect, an FDR correction is equivalent to controlling the FWER; when effects are present, it is more powerful than controlling the FWER (Benjamini & Hochberg, 1995; Fields & Kuperberg, 2020). The MAX correction achieves strong control of the FWER by comparing each statistic, say a t or F value, to a distribution of maximum statistics obtained under the null. These methods, like cluster-based methods, do not explicitly estimate onsets (they were not designed for that purpose) and do not provide error control for onsets, but are routinely used to localise effects.
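As a concrete illustration, the Benjamini-Hochberg step-up procedure can be sketched in a few lines. This is a minimal Python sketch for illustration only – the simulations below were run in R and Matlab:

```python
def fdr_bh(pvals, q=0.05):
    """Benjamini-Hochberg (1995) step-up procedure.
    Returns a boolean list: True where the test is declared significant."""
    m = len(pvals)
    # Sort p values, keeping track of their original positions
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    # Find the largest rank k such that p_(k) <= (k / m) * q
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    # Reject all hypotheses with rank <= k_max
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

print(fdr_bh([0.001, 0.008, 0.039, 0.041, 0.60]))
# → [True, True, False, False, False]
```

The MAX correction is even simpler in spirit: a single threshold, the 95th quantile of the null distribution of maximum statistics, is applied to every testing point.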

Other methods have been proposed, for instance defining the onset as the time at which a certain proportion of the peak amplitude has been reached (Liesefeld, 2018; Fahrenfort, 2020). However, such methods depend on potentially subjective peak identification (especially when applied to individual participants), and they are guaranteed, by definition, to lead to positively biased onset estimates. These methods are not considered here but maybe they have some merit.
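For reference, the fractional-peak definition is straightforward to express in code. This is a hypothetical Python helper for illustration, not one of the methods evaluated below:

```python
def fractional_peak_onset(timecourse, times, frac=0.5):
    """Return the first time at which the time-course reaches a given
    fraction of its peak amplitude. By construction this point lies at or
    after the true onset, hence the built-in positive bias mentioned above."""
    peak = max(timecourse)
    for t, v in zip(times, timecourse):
        if v >= frac * peak:
            return t

# A ramp peaking at 4: half-peak (2) is first reached at time 20
print(fractional_peak_onset([0, 1, 2, 4, 2], [0, 10, 20, 30, 40]))  # → 20
```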

Another recent method, cluster depth tests, performs cluster-based inference while delivering p values at every testing point, thus solving the main issue of cluster-based inferences as implemented so far (Frossard & Renaud, 2022). But this approach, like previous cluster-based methods and alternative corrections for multiple comparisons such as FDR or MAX, does not solve a fundamental issue that has been ignored so far: whether a p value is available at each testing point or not is irrelevant, as none of the methods routinely used to localise MEG and EEG effects explicitly assesses onsets. That’s because onsets are second order properties of the data. As such, they require establishing the lack of effects in some areas, the presence of effects in others, and a test of the interaction. However, none of the cluster-based methods, or any other standard method used to localise effects in EEG/MEG/fMRI and other fields, explicitly tests for that interaction. The interaction is an inference performed pretty much by eyeballing maps of thresholded statistics. So localisation attempts typically rely on an interaction fallacy, a problem well documented in neuroscience (Nieuwenhuis et al., 2011) and other fields (Gelman & Stern, 2006).

More explicit estimation of onsets could be achieved by using methods that go beyond mass-univariate tests and consider whole time-courses. For instance, there is a vast family of algorithms aimed at detecting change points in time-series that could be used to estimate onsets. Many packages in the R programming language (R Core Team, 2021) offer such algorithms, for instance RBeast (Zhao et al., 2019), mcp (Lindeløv, 2020) and changepoint (Killick & Eckley, 2014). There is a very useful survey of such packages here. The Beast algorithm (Zhao et al., 2013) is also available in Python and Matlab, and I’m sure there are many others. Matlab offers several native solutions, including the findchangepts function. For comparison, here I used a basic change point detection method that relies on differences in mean and variance, as implemented in the changepoint package in R and the built-in findchangepts function in Matlab.
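To make the idea concrete, here is a minimal sketch of change point detection based on a Gaussian cost sensitive to changes in both mean and variance, using binary segmentation. Python for illustration only; the actual analyses used the changepoint R package and Matlab’s findchangepts, and the function names below are made up:

```python
import math

def segment_cost(x):
    """Gaussian cost of a segment: n * log(ML variance), up to constants.
    Low when the segment has a stable mean and variance."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    return n * math.log(max(var, 1e-12))

def best_split(x, min_size=3):
    """Return (gain, index) of the single split that most reduces the cost."""
    total = segment_cost(x)
    best = (0.0, None)
    for k in range(min_size, len(x) - min_size):
        gain = total - segment_cost(x[:k]) - segment_cost(x[k:])
        if gain > best[0]:
            best = (gain, k)
    return best

def find_changepoints(x, n_cpts=2):
    """Binary segmentation: greedily split the segment with the largest gain."""
    segments = [(0, len(x))]  # (start, end) pairs
    cpts = []
    for _ in range(n_cpts):
        gains = []
        for (a, b) in segments:
            g, k = best_split(x[a:b])
            if k is not None:
                gains.append((g, a + k, (a, b)))
        if not gains:
            break
        g, cp, seg = max(gains)
        cpts.append(cp)
        a, b = seg
        segments.remove(seg)
        segments += [(a, cp), (cp, b)]
    return sorted(cpts)

# A flat series that jumps in mean at index 20 and back at index 40
series = [0.0] * 20 + [5.0] * 20 + [0.0] * 20
print(find_changepoints(series, n_cpts=2))  # → [20, 40]
```

Applied to a time-course of squared t values, the first detected change point serves as the onset estimate.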

So, let’s revisit the simulation results from Sassenhagen & Draschkow (2019) by performing our own simulations. Following their approach, I compared a noise only condition to a noise + signal condition. The signal was a truncated Gaussian defining an objective onset at 160 ms.

This approach is very similar to that of Sassenhagen & Draschkow (2019), except that I simulated only one electrode, with cluster inferences only in the time domain. There is no need to simulate extra dimensions (electrodes, time-frequencies), because the argument against cluster-based inferences is not specific to one particular dimension. Also, I used synthetic noise instead of using data from one participant. The noise was generated by adding a collection of sinusoids at different frequencies (Yeung et al., 2004). Using instead 1/f noise gave very similar results.
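For illustration, here is a simplified Python sketch of the trial generation: noise as a sum of sinusoids with random phases, in the spirit of Yeung et al. (2004), plus a truncated Gaussian template. The 1/f amplitude profile, sampling rate and template parameters are illustrative assumptions, not the exact values used in the simulations:

```python
import math, random

def synthetic_eeg_noise(n_points, srate=500, max_freq=100, seed=None):
    """One noise trial: sum of sinusoids at 1..max_freq Hz, each with a
    random phase and a crude 1/f amplitude profile (assumption)."""
    rng = random.Random(seed)
    t = [i / srate for i in range(n_points)]
    trial = [0.0] * n_points
    for f in range(1, max_freq + 1):
        amp = 1.0 / f
        phase = rng.uniform(0, 2 * math.pi)
        for i, ti in enumerate(t):
            trial[i] += amp * math.sin(2 * math.pi * f * ti + phase)
    return trial

def truncated_gaussian_signal(n_points, srate=500, onset_ms=160,
                              peak_ms=250, sd_ms=30, amp=1.0):
    """Truncated Gaussian template: zero before onset_ms, Gaussian after."""
    sig = []
    for i in range(n_points):
        ms = 1000 * i / srate
        v = amp * math.exp(-0.5 * ((ms - peak_ms) / sd_ms) ** 2)
        sig.append(v if ms >= onset_ms else 0.0)
    return sig

noise = synthetic_eeg_noise(300, seed=1)
signal = truncated_gaussian_signal(300)
trial = [n + s for n, s in zip(noise, signal)]  # one noise + signal trial
```

Generating 50 such trials per condition, with fresh noise each time, gives one simulated participant.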

The simulations involved 50 trials per condition, 10,000 iterations per simulation, and the null distributions were estimated using a permutation with 2,000 iterations.
The simulations were performed twice, in R and in Matlab, with the code available on GitHub.

Here is an example of noise trials:

And 50 trials of signal + noise (grey), with the mean superimposed (black):

And an example of averages for the noise only and noise + signal conditions:

The vertical black line marks the 160 ms true onset. The two conditions were compared using two-sample t-tests with unequal variances. Here is the time-course of squared t values matching the previous figure.
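Concretely, Welch’s t statistic (unequal variances) at each time point can be sketched as follows. Python for illustration; the analyses themselves were run in R and Matlab:

```python
def welch_t(x, y):
    """Welch's two-sample t statistic (unequal variances assumed)."""
    nx, ny = len(x), len(y)
    mx = sum(x) / nx
    my = sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    return (mx - my) / ((vx / nx + vy / ny) ** 0.5)

def t2_timecourse(cond1, cond2):
    """Squared t value at each time point.
    cond1, cond2: lists of trials, each trial a list of time points."""
    n_time = len(cond1[0])
    return [welch_t([tr[t] for tr in cond1],
                    [tr[t] for tr in cond2]) ** 2
            for t in range(n_time)]
```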

We consider four methods (the first three correct for multiple comparisons; the fourth estimates change points directly):

  • [FDR] The standard 1995 FDR method.
  • [MAX] The maximum statistics correction.
  • [CS] Cluster-based + permutation inference, using a cluster-sum statistic of squared t values.
  • [CP] Change point detection based on differences in mean and variance, applied to the time-course of squared t values. The number of change points was set to 2.
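As a sketch of how the MAX correction works in practice, here is a hypothetical Python outline, where `stat_fn` computes a statistic per time point (e.g. squared t values):

```python
import random

def max_stat_threshold(cond1, cond2, stat_fn, n_perm=2000, alpha=0.05, seed=0):
    """MAX correction: build a null distribution of the maximum statistic
    across time points by permuting condition labels, then use its
    (1 - alpha) quantile as a single threshold for every time point.
    stat_fn maps (trials1, trials2) -> list of statistics over time."""
    rng = random.Random(seed)
    pooled = cond1 + cond2
    n1 = len(cond1)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_stats = stat_fn(pooled[:n1], pooled[n1:])
        maxima.append(max(perm_stats))
    maxima.sort()
    thresh = maxima[int((1 - alpha) * n_perm) - 1]  # crude quantile estimate
    observed = stat_fn(cond1, cond2)
    return [s >= thresh for s in observed], thresh
```

Because every time point is compared to the distribution of maxima, MAX gives strong FWER control, which is also why it tends to be conservative.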

Here are the onsets estimated by each method, using the data from the previous example:

In grey is the permutation distribution of squared t values. The solid vertical line indicates the true onset. The dotted vertical line marks the estimated onset. The horizontal dashed lines are permutation thresholds.

After 10,000 iterations, we get these distributions of onsets for our four methods, with the vertical line marking the true onset:

Each distribution has a shape similar to that reported by Sassenhagen & Draschkow (2019): overall a positive bias and some amount of under-estimation. Let’s quantify the results.

[1] Bias was quantified by subtracting the true onset from the median of each distribution. Because the distributions are asymmetric, the median provides a better reflection of the typical onset than the mean (Rousselet & Wilcox, 2020).

Bias:

  • FDR = 30 ms
  • MAX = 40 ms
  • CS = 30 ms
  • CP = 10 ms

Confirming our visual inspection of the distributions, all methods are biased, the least biased being CP, the most biased being MAX. This is not surprising, as MAX is by definition conservative, the price to pay for strong FWER control.

[2] The mean absolute error (MAE) indicates how close to the true onset each method gets.

MAE:

  • FDR = 42 ms
  • MAX = 44 ms
  • CS = 30 ms
  • CP = 19 ms

Again, CS doesn’t perform worse than other methods: CP was best, MAX was worst.

[3] As Sassenhagen & Draschkow (2019) did, let’s look at the proportion of onsets earlier than the true onset.

% early onsets:

  • FDR = 17.5%
  • MAX = 1.6%
  • CS = 0.5%
  • CP = 13.4%

[4] And the proportion of onsets at least 40 ms earlier than the true onset.

% onsets at least 40 ms early:

  • FDR = 14.7%
  • MAX = 1.3%
  • CS = 0.1%
  • CP = 3.7%

Again, there is nothing wrong with the cluster-sum approach at all: the other methods that provide p values at individual testing points performed worse.

Using 1/f noise gave similar results, but the absolute results changed substantially depending on whether white, pink or red noise was used – see full simulations on GitHub.

So far we have considered results from one simulated participant. A typical experiment involves more than one participant. What happens if we try to estimate onsets from 20 participants instead? To find out, let’s assume that each participant has a random onset of 150, 160 or 170 ms (uniform distribution), so that the population onset remains 160 ms. There are 50 trials per condition. In each simulation iteration, after estimating the onset in each of the 20 participants, we use the median onset as our group estimate. Here are the results:

Again, there is nothing wrong with the cluster-sum results: they are very similar to the FDR results, better than MAX, worse than change point.

In numbers:

Bias:

  • FDR = 30 ms
  • MAX = 40 ms
  • CS = 30 ms
  • CP = 10 ms

Mean absolute error:

  • FDR = 27 ms
  • MAX = 43 ms
  • CS = 29 ms
  • CP = 14 ms

Proportion too early:

  • FDR = 0.04%
  • MAX = 0%
  • CS = 0%
  • CP = 0.08%

Note the very low proportion of under-estimation once results are combined across participants.

Of course, another way to reduce bias is to increase sample size. Here are the results from one participant and sample sizes ranging from 20 to 150 trials.

Bias can be reduced considerably by using larger sample sizes, but the MAX method remains overly conservative. Here is what happens for the mean absolute error:

And the proportion of early onsets:

Overall, the cluster-sum method appears to perform well, and the change point approach looks very promising.

Discussion

Onsets are inherently noisy because they correspond to the testing points with the lowest signal to noise ratio. But estimation variability can be reduced by increasing the number of trials and pooling estimates across participants. And based on the present simulation results, there is no rationale for doubting localisation results obtained using cluster-based inferences any more than those obtained using other methods. Certainly, I’m not going to ditch the estimates of face ERP onsets I’ve published so far. They are estimates after all, obtained in a large number of participants, and with promising test-retest reliability across independent EEG sessions (Bieniek et al., 2016).

From the simulations, it seems that one should be more worried about onsets estimated using FDR and MAX corrections. And the change point approach seems very promising. Certainly here I’ve used a relatively simple algorithm and it remains to be determined what can be achieved using modern methods, for instance Bayesian approaches that allow hierarchical pooling of estimates (Lindeløv, 2020).

Currently, no method routinely used in MEG/EEG explicitly estimates onsets, whether a p value is available at every testing point or not. Whether correction for multiple comparisons is achieved using cluster-based statistics, FDR, MAX or any related method, even the new cluster depth tests (Frossard & Renaud, 2022), onset estimation relies on a statistical fallacy, because the implied interaction is never explicitly tested. Does that mean nonsensical results are obtained? The simulations presented here clearly indicate that meaningful results can be obtained, and that cluster-based inferences can provide more accurate results than alternative methods.

Obviously, one experiment provides no guarantee, and that’s why it is essential to consider sampling distributions, like the ones presented here, to get an idea of how much variability to expect in the long run (Rousselet, 2019). But do not fall victim to statistical orthodoxy: plenty of methods are used to estimate aspects of the data that they were not developed to handle. When in doubt, simulations are essential to understand how good a job any of your tools can do – never trust the labels.

In addition to the onset estimation approach, it is worth considering other methodological factors that can affect onsets. The most important one is probably filtering, with high-pass filters introducing potentially severe signal distortions (Rousselet, 2012; Widmann & Schröger, 2012; Widmann et al., 2015; van Driel et al., 2021). If the goal is to estimate onsets, I recommend using causal filters to remove low frequencies, to avoid effects propagating backwards in time. On the stimulus front, screen luminance can strongly influence onsets (Bieniek et al., 2013).

In conclusion, can we use cluster-based permutation tests to estimate MEG/EEG onsets? Absolutely, yes, but:

  • Instead of estimating onsets at the group level, measure onsets in each participant and report group distributions.
  • Assess the test-retest reliability of the onsets, either using the same data (cross-validation) or a different dataset.
  • Report results from more than one method, to assess computational consistency.

More advanced steps:

  • Use large datasets to compare onset estimation methods, including techniques, so far ignored, that explicitly consider time-series.
  • Use large datasets to provide benchmarks in your field.
  • Use simulations to compare methods. To go beyond the simple approach presented here, one could use algorithms that generate more realistic time-series (Barzegaran et al., 2019).

References

Barzegaran, E., Bosse, S., Kohler, P.J., & Norcia, A.M. (2019) EEGSourceSim: A framework for realistic simulation of EEG scalp data using MRI-based forward models and biologically plausible signals and noise. J. Neurosci. Methods, 328, 108377.

Benjamini, Y. & Hochberg, Y. (1995) Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B Methodol., 57, 289–300.

Bieniek, M., Frei, L., & Rousselet, G. (2013) Early ERPs to faces: aging, luminance, and individual differences. Front. Psychol., 4.

Bieniek, M.M., Bennett, P.J., Sekuler, A.B., & Rousselet, G.A. (2016) A robust and representative lower bound on object processing speed in humans. Eur. J. Neurosci., 44, 1804–1814.

Dugué, L., Marque, P., & VanRullen, R. (2011) Transcranial Magnetic Stimulation Reveals Attentional Feedback to Area V1 during Serial Visual Search. PLOS ONE, 6, e19712.

Fahrenfort, J.J. (2020) Multivariate Methods to Track the Spatiotemporal Profile of Feature-Based Attentional Selection Using EEG. In Pollmann, S. (ed), Spatial Learning and Attention Guidance, Neuromethods. Springer US, New York, NY, pp. 129–156.

Fields, E.C. & Kuperberg, G.R. (2020) Having your cake and eating it too: Flexibility and power with mass univariate statistics for ERP data. Psychophysiology, 57, e13468.

Frossard, J. & Renaud, O. (2022) The cluster depth tests: Toward point-wise strong control of the family-wise error rate in massively univariate tests with application to M/EEG. NeuroImage, 247, 118824.

Gelman, A. & Stern, H. (2006) The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant. Am. Stat., 60, 328–331.

Holmes, A.P., Blair, R.C., Watson, J.D., & Ford, I. (1996) Nonparametric analysis of statistic images from functional mapping experiments. J. Cereb. Blood Flow Metab. Off. J. Int. Soc. Cereb. Blood Flow Metab., 16, 7–22.

Jaworska, K., Yi, F., Ince, R.A.A., van Rijsbergen, N.J., Schyns, P.G., & Rousselet, G.A. (2020) Healthy aging delays the neural processing of face features relevant for behavior by 40 ms. Hum. Brain Mapp., 41, 1212–1225.

Killick, R. & Eckley, I.A. (2014) changepoint: An R Package for Changepoint Analysis. J. Stat. Softw., 58, 1–19.

Liesefeld, H.R. (2018) Estimating the Timing of Cognitive Operations With MEG/EEG Latency Measures: A Primer, a Brief Tutorial, and an Implementation of Various Methods. Front. Neurosci., 12.

Lindeløv, J.K. (2020) mcp: An R Package for Regression With Multiple Change Points. OSF Preprints.

Maris, E. & Oostenveld, R. (2007) Nonparametric statistical testing of EEG- and MEG-data. J. Neurosci. Methods, 164, 177–190.

Nichols, T.E. & Holmes, A.P. (2002) Nonparametric permutation tests for functional neuroimaging: A primer with examples. Hum. Brain Mapp., 15, 1–25.

Nichols, T. & Hayasaka, S. (2003) Controlling the familywise error rate in functional neuroimaging: a comparative review. Stat. Methods Med. Res., 12, 419–446.

Nieuwenhuis, S., Forstmann, B.U., & Wagenmakers, E.-J. (2011) Erroneous analyses of interactions in neuroscience: a problem of significance. Nat. Neurosci., 14, 1105–1107.

Pernet, C.R., Latinus, M., Nichols, T.E., & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. J. Neurosci. Methods, Cutting-edge EEG Methods, 250, 85–93.

R Core Team (2021) R: A Language and Environment for Statistical Computing.

Rosenblatt, J.D., Finos, L., Weeda, W.D., Solari, A., & Goeman, J.J. (2018) All-Resolutions Inference for brain imaging. NeuroImage, 181, 786–796.

Rousselet, G. (2012) Does Filtering Preclude Us from Studying ERP Time-Courses? Front. Psychol., 3.

Rousselet, G. (2019) Using simulations to explore sampling distributions: an antidote to hasty and extravagant inferences.

Rousselet, G.A., Ince, R.A.A., van Rijsbergen, N.J., & Schyns, P.G. (2014) Eye coding mechanisms in early human face event-related potentials. J. Vis., 14, 7.

Rousselet, G.A. & Wilcox, R.R. (2020) Reaction Times and other Skewed Distributions: Problems with the Mean and the Median. Meta-Psychol., 4.

Sassenhagen, J. & Draschkow, D. (2019) Cluster-based permutation tests of MEG/EEG data do not establish significance of effect latency or location. Psychophysiology, 56, e13335.

Smith, S.M. & Nichols, T.E. (2009) Threshold-free cluster enhancement: Addressing problems of smoothing, threshold dependence and localisation in cluster inference. NeuroImage, 44, 83–98.


van Driel, J., Olivers, C.N.L., & Fahrenfort, J.J. (2021) High-pass filtering artifacts in multivariate classification of neural time series data. J. Neurosci. Methods, 352, 109080.

Widmann, A. & Schröger, E. (2012) Filter Effects and Filter Artifacts in the Analysis of Electrophysiological Data. Front. Psychol., 3.

Widmann, A., Schröger, E., & Maess, B. (2015) Digital filter design for electrophysiological data – a practical approach. J. Neurosci. Methods, Cutting-edge EEG Methods, 250, 34–46.

Yeung, N., Bogacz, R., Holroyd, C.B., & Cohen, J.D. (2004) Detection of synchronized oscillations in the electroencephalogram: An evaluation of methods. Psychophysiology, 41, 822–832.

Zhao, K., Valle, D., Popescu, S., Zhang, X., & Mallick, B. (2013) Hyperspectral remote sensing of plant biochemistry using Bayesian model averaging with variable and band selection. Remote Sens. Environ., 132, 102–119.

Zhao, K., Wulder, M.A., Hu, T., Bright, R., Wu, Q., Qin, H., Li, Y., Toman, E., Mallick, B., Zhang, X., & Brown, M. (2019) Detecting change-point, trend, and seasonality in satellite time series data to track abrupt changes and nonlinear dynamics: A Bayesian ensemble algorithm. Remote Sens. Environ., 232, 111181.

Cluster correction for multiple dependent comparisons

In this post I explain the benefits of applying cluster-based statistics, developed for brain imaging applications, to other experimental designs in which tests are correlated. Here are some examples of such designs:

  • different groups of rats are tested with increasing concentrations of a molecule;
  • different groups of humans or the same humans are tested with stimulations of different intensities or durations (e.g. in neuro/psych it could be TMS, contrast, luminance, priming, masking, SOA);
  • pain thresholds are measured on contiguous patches of skin;
  • insects are sampled from neighbouring fields;
  • participants undergo a series of training sessions. 

In these examples, whatever is measured leads to statistical tests that are correlated along one or a combination of factors: time, space, stimulus parameters. In the frequentist framework, if the outcome of the family of tests is corrected for multiple comparisons using standard procedures (Bonferroni, Hochberg etc.), power will decrease with the number of tests. Cluster-based correction methods can keep false positives at the nominal level (say 0.05) without compromising power.

These types of dependencies can also be explicitly modelled using Gaussian processes (for a Bayesian example, see McElreath, 2018, chapter 13). Cluster-based statistics are much simpler to use, but they do not provide the important shrinkage afforded by hierarchical methods…  

Cluster-based statistics

To get started, let’s consider an example involving a huge number of correlated tests. In this example, measurements are made at contiguous points in space (y axis) and time (x axis). The meaning of the axes is actually irrelevant – what matters is that the measurements are contiguous. In the figure below, left panel, we define our signal, which is composed of 2 clusters of high intensities among a sea of points with no effect (dark blue = 0). Fake measurements are then simulated by adding white noise to each point. By doing that 100 times, we obtain 100 noisy maps. The mean of these noisy maps is shown in the right panel.


We also create 100 maps composed entirely of noise. Then we perform a t-test for independent groups at each point in the map (n=100 per group). 


What do we get? If we use a threshold of 0.05, we get two nice clusters of statistically significant tests where they are supposed to be. But we also get many false positives. If we try to get rid of the false positives by lowering the threshold, it works to some extent, but at the cost of removing true effects. Even with a threshold of 0.0005, there are still many false positives, and the clusters of true signal have been seriously truncated.


The problem is that lowering the alpha is a brute force technique that does not take into account information we have about the data: measurement points are correlated. There is a family of techniques that can correct for multiple comparisons by taking these dependencies into account: cluster-based statistics (for an introduction, see Maris & Oostenveld, 2007). These techniques control the family-wise error rate but maintain high power. The family-wise error rate (FWER) is the probability of obtaining at least one significant test among a family of tests when the null hypothesis is true.

When we use a frequentist approach and perform a family of tests, we increase the probability of reporting false positives. The multiple comparison problem is difficult to tackle in many situations because of the need to balance false positives and false negatives. Probably the best known and most widely applied correction for multiple comparisons is Bonferroni’s, in which the alpha threshold is divided by the number of comparisons. However, this procedure is notoriously conservative, as it comes at the cost of lower power. Many other techniques have been proposed (I don’t know of a good review paper on this topic – please add a comment if you do).

In the example below, two time-courses are compared point-by-point. Panel a shows the mean time-courses across participants. Panel b shows the time-course of the t-test for 2 dependent groups (the same participants were tested in each condition). Panel c shows time-points at which significant t-tests were observed. Without correction, a large cluster of significant points is observed, but also a collection of smaller clusters. We know from physiology that some of these clusters are too small to be true so they are very likely false positives.


Figure 1 from Maris & Oostenveld, 2007.

If we change the significance threshold using the Bonferroni correction for multiple comparisons, in this example we remove all significant clusters but the largest one. Good job?! The problem is that our large cluster has been truncated: it now looks like the effect starts later and ends sooner. Cluster-based inferences do not suffer from this problem.

Applied to our 2D example with two clusters embedded in noise, the clustering technique identifies 17,044 clusters of significant t-tests. After correction, only 2 clusters are significant!


So how do we compute cluster-based statistics? The next figure illustrates the different steps. At the top, we start with a time-course of F values, from a series of point-by-point ANOVAs. Based on some threshold, say the critical F value for alpha = 0.05, we identify 3 clusters. The clusters are formed based on contiguity. For each cluster we then compute a summary statistic: it could be its duration (length), its height (maximum), or its sum. Here we use the sum. Now we ask a different question: for each cluster, how likely is it to obtain that cluster sum by chance? To answer this question, we use non-parametric statistics to estimate the distribution expected by chance.


There are several ways to achieve this goal using permutation, percentile bootstrap or bootstrap-t methods (Pernet et al., 2015). Whatever technique we use, we simulate time-courses of F values expected by chance, given the data. For each of these simulated time-courses, we apply a threshold, identify clusters, take the sum of each cluster and save the maximum sum across clusters. If we do that 1,000 times, we end up with a collection of 1,000 cluster sums (shown in the top right corner of the figure). We then sort these values and identify a quantile of interest, say the 0.95 quantile. Finally, we use this quantile as our cluster-based threshold: each original cluster sum is then compared to that threshold. In our example, out of the 3 original clusters, the largest 2 are significant after cluster-based correction for multiple comparisons, whereas the smallest one is not. 
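The steps above can be sketched in code. This is a Python sketch for illustration; the null time-courses would come from a permutation, percentile bootstrap or bootstrap-t step, as described:

```python
def clusters_above(stats, threshold):
    """Group contiguous above-threshold points into clusters;
    return a list of (start, end, cluster_sum)."""
    out, start, csum = [], None, 0.0
    for i, s in enumerate(stats):
        if s > threshold:
            if start is None:
                start, csum = i, 0.0
            csum += s
        elif start is not None:
            out.append((start, i, csum))
            start = None
    if start is not None:
        out.append((start, len(stats), csum))
    return out

def cluster_sum_test(stats, null_stats, threshold, alpha=0.05):
    """stats: observed time-course of F (or squared t) values.
    null_stats: list of time-courses expected by chance.
    Returns the observed clusters, each flagged significant or not."""
    # Null distribution: max cluster sum in each null time-course
    null_max = []
    for ns in null_stats:
        sums = [c[2] for c in clusters_above(ns, threshold)]
        null_max.append(max(sums) if sums else 0.0)
    null_max.sort()
    cutoff = null_max[int((1 - alpha) * len(null_max)) - 1]
    return [(start, end, csum, csum > cutoff)
            for (start, end, csum) in clusters_above(stats, threshold)]
```

Each original cluster sum is compared to the 0.95 quantile of the null distribution of maximum cluster sums, exactly as in the description above.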

Simulations

From the description above, it is clear that using cluster-based statistics requires a few choices:

  • a method to estimate the null distribution;
  • a method to form clusters;
  • a choice of cluster statistics;
  • a choice of statistic to form the null distribution (max across clusters for instance);
  • a number of resamples…

Given a set of choices, we need to check that our method does what it’s supposed to do. So let’s run a few simulations…

5 dependent groups

First we consider the case of 5 dependent groups. The 5 measurements are correlated in time or space or some other factor, such that clusters can be formed by simple proximity: 2 significant tests are grouped in a cluster if they are next to each other. Data are normally distributed, the population SD is 1, and the correlation between repeated measures is 0.75. Here is the FWER after 10,000 simulations, in which we perform 5 one-sample t-tests on means.


Without correction for multiple comparisons, the probability to get at least one false positive is well above the nominal level (here 0.05). The grey area marks Bradley’s (1978) satisfactory range of false positives (between 0.025 and 0.075). Bonferroni’s and Hochberg’s corrections dramatically reduce the FWER, as expected. For n = 10, the FWER remains quite high, but drops within the acceptable range for larger sample sizes. These corrections tend to be conservative though, leading to FWER systematically under 0.05 from n = 30 onwards. Using a cluster-based correction, the FWER is near the nominal level at all sample sizes.

The cluster correction was done using a bootstrap-t procedure, in which the original data are first mean-centred, so that the null hypothesis is true, and the t distributions expected by chance are estimated by sampling the centred data with replacement 1,000 times, each time computing a series of t-tests. For each bootstrap sample, a max cluster sum statistic was saved, and the 95th quantile of this distribution was used to threshold the original clusters.
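In code, the mean-centring and bootstrap-t steps might look like this – a Python sketch under simplifying assumptions, not the exact implementation:

```python
import random

def bootstrap_null_t(data, n_boot=1000, seed=0):
    """Bootstrap-t null for one-sample t-tests on dependent conditions.
    data: list of participants, each a list of condition means.
    Each condition is first mean-centred so the null hypothesis is true,
    then participants are sampled with replacement and a t value is
    computed per condition for each bootstrap sample."""
    rng = random.Random(seed)
    n, n_cond = len(data), len(data[0])
    # Mean-centre each condition
    means = [sum(p[c] for p in data) / n for c in range(n_cond)]
    centred = [[p[c] - means[c] for c in range(n_cond)] for p in data]

    def one_sample_t(vals):
        m = sum(vals) / len(vals)
        v = sum((x - m) ** 2 for x in vals) / (len(vals) - 1)
        return m / (v / len(vals)) ** 0.5 if v > 0 else 0.0

    null_t = []
    for _ in range(n_boot):
        sample = [centred[rng.randrange(n)] for _ in range(n)]
        null_t.append([one_sample_t([p[c] for p in sample])
                       for c in range(n_cond)])
    return null_t  # feed these time-courses to the cluster / max-sum step
```

The returned null t time-courses are then thresholded, clustered, and reduced to max cluster sums, as in the previous section.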

Next we consider power. We sample from a population with 5 dependent conditions: there is no effect in conditions 1 and 5 (mean = 0), the mean is 1 for condition 3, and the mean is 0.5 for conditions 2 and 4. We could imagine a TMS experiment in which participants first receive a sham stimulation, then stimulation at half intensity, full intensity, half intensity, and sham again… Below is an illustration of a random sample of data from 30 participants.

fig8_5group_example

If we define power as the probability of observing a significant t-test simultaneously in conditions 2, 3 and 4 (the three conditions with non-zero effects), we get these results:

fig9_power_all

Maximum power is always obtained in the condition without correction, by definition. The cluster correction always reaches the maximum possible power, except for n = 10. In contrast, Bonferroni and Hochberg lead to lower power, with Bonferroni being the most conservative. For a desired long-run power value, we can use interpolation to find the matching sample size. To achieve at least 80% power, the minimum sample size is:

  • 39 observations for the cluster test;
  • 50 observations for Hochberg;
  • 57 observations for Bonferroni.
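The interpolation step is straightforward. Here is a sketch with a made-up power curve (the numbers are illustrative, not the simulation results from the figures):

```python
import numpy as np

# illustrative power estimates at each simulated sample size (made up)
ns = np.array([10, 20, 30, 40, 50, 60, 70])
power = np.array([0.18, 0.44, 0.66, 0.81, 0.90, 0.95, 0.98])

# linear interpolation of the sample size at which power crosses 0.80
n_min = np.interp(0.80, power, ns)
print(round(n_min))  # 39
```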

7 dependent groups

If we run the same simulation but with 7 dependent groups instead of 5, the pattern of results does not change, but the FWER increases if we do not apply any correction for multiple comparisons.

fig10_7_fwer

As for power, if we keep a cluster of effects with means 0.5, 1, 0.5 for conditions 3, 4 and 5, and zero effect for conditions 1, 2, 6 and 7, the power advantage of the cluster test increases. Now, to achieve at least 80% power, the minimum sample size is:

  • 39 observations for the cluster test;
  • 56 observations for Hochberg;
  • 59 observations for Bonferroni.

fig11_power_7_all

7 independent groups

Finally, we consider a situation with 7 independent groups. For instance, measurements were made in 7 contiguous fields. The measurements are independent (made at different times), but there is spatial dependence between fields, so we would expect that if a measurement is high in one field, it is likely to be high in the next field too. Here are the FWER results, showing a pattern similar to that in the previous examples:

fig12_7ind_fwer

The cluster correction does the best job at controlling the false positive rate, whereas Bonferroni and Hochberg are too liberal for sample sizes 10 and 20.

To look at power, I created a simulation with a linear pattern: there is no effect in positions 1 and 2, then a linear increase up to a maximum effect size of 2 at position 7. Here is the sequence of effect sizes:

c(0, 0, 0.4, 0.8, 1.2, 1.6, 2)

And here is an example of a random sample with n = 70 measurements per group:

fig13_7ind_group_example

In this situation, again the cluster correction dominates the other methods in terms of power. For instance, to achieve at least 80% power, the minimum sample size is:

  • 50 observations for the cluster test;
  • 67 observations for Hochberg;
  • 81 observations for Bonferroni.

fig14_power_7ind_all

Conclusion

I hope the examples above have convinced you that cluster-based statistics can dramatically increase your statistical power relative to standard techniques used to correct for multiple comparisons. Let me know if you use a different correction method and would like to see how it compares. Or you could re-use the simulation code and give it a go yourself. 

Limitations: cluster-based methods make inferences about clusters, not about individual tests. Also, these methods require a threshold to form clusters, which is arbitrary and inconvenient if you use non-parametric tests that do not come with p values. An alternative technique, threshold-free cluster enhancement (TFCE, Smith & Nichols, 2009; Pernet et al., 2015), eliminates this requirement by forming a statistic that integrates across many potential cluster thresholds. See a clear explanation in this blog post by Benedikt Ehinger. However, TFCE, like the cluster methods presented here, suffers from non-trivial inference issues. In the words of @ten_photos:

“I’m instead rather drawn to a pragmatic approach […] using a concrete interpretation of the conclusions drawn from rejecting one individual test:

– voxel-wise inference, reject null at a voxel, conclude signal present at that voxel;

– cluster-wise, reject null for that cluster, concluding that signal present at one or more voxels in the cluster;

– TFCE inference, reject the null at a voxel, conclude there exists a cluster containing that voxel for which we reject the null, concluding that signal present at one or more voxels in that contributing cluster.”

Code

Matlab code for ERP analyses is available on figshare and as part of the LIMO EEG toolbox. The code can be used for other purposes – just pretend you’re dealing with one EEG electrode and Bob’s your uncle.

R code to reproduce the simulations is available on github. I’m planning to develop an R package to cover different experimental designs, using t-tests on means and trimmed means. In the meantime, if you’d like to apply the method but can’t make sense of my code, don’t hesitate to get in touch and I’ll try to help.

References

Bradley, J.V. (1978) Robustness? British Journal of Mathematical and Statistical Psychology, 31, 144-152. doi: 10.1111/j.2044-8317.1978.tb00581.x

Maris, E. & Oostenveld, R. (2007) Nonparametric statistical testing of EEG- and MEG-data. Journal of Neuroscience Methods, 164, 177-190.

McElreath, R. (2018) Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.

Oostenveld, R., Fries, P., Maris, E. & Schoffelen, J.M. (2011) FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci, 2011, 156869.

Pernet, C.R., Chauveau, N., Gaspar, C. & Rousselet, G.A. (2011) LIMO EEG: a toolbox for hierarchical LInear MOdeling of ElectroEncephaloGraphic data. Comput Intell Neurosci, 2011, 831409.

Pernet, C.R., Latinus, M., Nichols, T.E. & Rousselet, G.A. (2015) Cluster-based computational methods for mass univariate analyses of event-related brain potentials/fields: A simulation study. Journal of Neuroscience Methods, 250, 85-93.

Rousselet, G.A. (2016) Introduction to robust estimation of ERP data. figshare. https://doi.org/10.6084/m9.figshare.3501728.v1

Smith, S.M. & Nichols, T.E. (2009) Threshold-free cluster enhancement: addressing problems of smoothing, threshold dependence and localisation in cluster inference. Neuroimage, 44, 83-98.

Trimmed means

The R code for this post is on github.

Trimmed means are robust estimators of central tendency. To compute a trimmed mean, we remove a predetermined amount of observations on each side of a distribution, and average the remaining observations. If you think you’re not familiar with trimmed means, you already know one famous member of this family: the median. Indeed, the median is an extreme trimmed mean, in which all observations are removed except one or two.

Using trimmed means confers two advantages:

  • trimmed means provide a better estimation of the location of the bulk of the observations than the mean when sampling from asymmetric distributions;
  • the standard error of the trimmed mean is less affected by outliers and asymmetry than the mean, so that tests using trimmed means can have more power than tests using the mean.

Important point: if we use a trimmed mean in an inferential test (see below), we make inferences about the population trimmed mean, not the population mean. The same is true for the median or any other measure of central tendency. So each robust estimator is a tool to answer a specific question, and this is why different estimators can return different answers…

Here is how we compute a 20% trimmed mean.

Let’s consider a sample of 20 observations:

39 92 75 61 45 87 59 51 87 12  8 93 74 16 32 39 87 12 47 50

First we sort them:

8 12 12 16 32 39 39 45 47 50 51 59 61 74 75 87 87 87 92 93

The number of observations to remove is floor(0.2 * 20) = 4. So we trim 4 observations from each end:

(8 12 12 16) 32 39 39 45 47 50 51 59 61 74 75 87 (87 87 92 93)

And we take the mean of the remaining observations, such that our 20% trimmed mean = mean(c(32,39,39,45,47,50,51,59,61,74,75,87)) = 54.92
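The same computation, as a direct translation of the steps above into Python:

```python
import numpy as np

x = np.array([39, 92, 75, 61, 45, 87, 59, 51, 87, 12,
              8, 93, 74, 16, 32, 39, 87, 12, 47, 50])

def trimmed_mean(x, tr=0.2):
    """Sort, drop floor(tr * n) observations from each tail, average the rest."""
    xs = np.sort(x)
    g = int(np.floor(tr * len(xs)))  # 4 observations per tail here
    return xs[g:len(xs) - g].mean()

print(round(trimmed_mean(x), 2))  # 54.92
```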

Let’s illustrate the trimming process with a normal distribution and 20% trimming:

normdist

We can see how trimming gets rid of the tails of the distribution, to focus on the bulk of the observations. This behaviour is particularly useful when dealing with skewed distributions, as shown here:

fdist

In this skewed distribution (it’s an F distribution), there is more variability on the right side, which appears stretched compared to the left side. Because we trim the same amount on each side, trimming removes a longer chunk of the distribution on the right side than on the left. As a consequence, the mean of the remaining points is more representative of the location of the bulk of the observations. This can be seen in the following examples.

figure_tm_demo

Panel A shows the kernel density estimate of 100 observations sampled from a standard normal distribution (MCT stands for measure of central tendency). By chance, the distribution is not perfectly symmetric, but the mean, 20% trimmed mean and median give very similar estimates, as expected. In panel B, however, the sample is from a lognormal distribution. Because of the asymmetry of the distribution, the mean is dragged towards the right side of the distribution, away from the bulk of the observations. The 20% trimmed mean is to the left of the mean, and the median further to the left, closer to the location of most observations. Thus, for asymmetric distributions, trimmed means provide more accurate information about central tendency than the mean.

**Q: “By trimming, don’t we lose information?”**

I have heard that question over and over. The answer depends on your goal: statistical methods are only tools to answer specific questions. I have never met anyone with a true interest in the mean: the mean is always used, implicitly or explicitly, as a tool to indicate the location of the bulk of the observations. Thus, if your goal is to estimate central tendency, then no, trimming doesn’t discard information; it actually increases the quality of the information about central tendency.

I have also heard this criticism: “I’m interested in the tails of the distributions and that’s why I use the mean; trimming gets rid of them”. Tails certainly have interesting stories to tell, but the mean is absolutely not the tool to study them, because it mingles all observations into one value, so we have no way to tell why means differ among samples. If you want to study entire distributions, there are fantastic graphical tools available (Rousselet, Pernet & Wilcox 2017).

Implementation

Base R has trimmed means built in: set the trim argument of mean to the desired amount of trimming, so that mean(x, trim = 0.2) gives a 20% trimmed mean.

In Matlab, try the tm function available here.

In Python, try the scipy.stats.trim_mean function (scipy.stats.tmean trims according to value limits rather than a proportion). More Python functions are listed here.
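For the sample of 20 observations from earlier, the proportion-based trimmed mean in scipy is trim_mean, which matches R’s mean(x, trim = 0.2):

```python
import numpy as np
from scipy import stats

x = np.array([39, 92, 75, 61, 45, 87, 59, 51, 87, 12,
              8, 93, 74, 16, 32, 39, 87, 12, 47, 50])

# proportion-based trimming: drops int(0.2 * 20) = 4 values from each tail
print(stats.trim_mean(x, 0.2))  # ~54.92
```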

Inferences

There are plenty of R functions using trimmed means on Rand Wilcox’s website.

We can use trimmed means instead of means in t-tests. However, the calculation of the standard error differs from the traditional t-test formula, because after trimming the remaining observations are no longer independent. The formula for the adjusted standard error was originally proposed by Karen Yuen in 1974, and it involves winsorization. To winsorize a sample, instead of removing the most extreme observations on each side, we replace them with the most extreme values that survive trimming. So in our example, the 20% winsorized sample is:

32 32 32 32 32 39 39 45 47 50 51 59 61 74 75 87 87 87 87 87

Taking the mean of the winsorized sample gives a winsorized mean; taking the variance gives a winsorized variance, and so on. I’ve never seen anyone use winsorized means, but the winsorized variance is used to compute the standard error of the trimmed mean (Yuen 1974). There is a full mathematical explanation in Wilcox (2012).
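To make the winsorization step concrete, here is a Python sketch using the same 20 observations, together with the standard error of the trimmed mean computed from the winsorized variance (following the scaling used in Wilcox’s trimse function):

```python
import numpy as np

x = np.array([39, 92, 75, 61, 45, 87, 59, 51, 87, 12,
              8, 93, 74, 16, 32, 39, 87, 12, 47, 50])

def winsorize(x, tr=0.2):
    """Replace the g = floor(tr * n) smallest and largest values with
    the most extreme values that survive trimming."""
    xs = np.sort(x)  # np.sort returns a copy, so x is untouched
    g = int(np.floor(tr * len(xs)))
    if g > 0:
        xs[:g] = xs[g]
        xs[-g:] = xs[-g - 1]
    return xs

def trimmed_mean_se(x, tr=0.2):
    """Yuen's (1974) standard error of the trimmed mean: the winsorized
    standard deviation, rescaled by the proportion of data retained."""
    sw = np.std(winsorize(x, tr), ddof=1)  # winsorized SD
    return sw / ((1 - 2 * tr) * np.sqrt(len(x)))

print(winsorize(x).tolist())
# [32, 32, 32, 32, 32, 39, 39, 45, 47, 50, 51, 59, 61, 74, 75, 87, 87, 87, 87, 87]
```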

You can use all the functions below to make inferences about means too, by setting tr=0. How much trimming to use is an empirical question, depending on the type of distributions you deal with. By default, all functions set tr=0.2 (20% trimming), which has been studied a lot and seems to provide a good compromise. Most functions will return an error, with a suggestion for an alternative function, if you set tr=0.5: the standard error calculation is inaccurate for the median, and often the only satisfactory solution is to use a percentile bootstrap.

**Q: “With trimmed means, isn’t there a danger of users trying different amounts of trimming and reporting the one that give them significant results?”**

This is indeed a possibility, but dishonesty is a property of the user, not a property of the tool. In fact, trying different amounts of trimming could be very informative about the nature of the effects. Reporting the different results, along with graphical representations, could help provide a more detailed description of the effects.

The Yuen t-test performs better than the t-test on means in many situations. For even better results, Wilcox recommends using trimmed means with a percentile-t bootstrap or a percentile bootstrap. With small amounts of trimming, the percentile-t bootstrap performs better; with at least 20% trimming, the percentile bootstrap is preferable. Details about these choices are available, for instance, in Wilcox (2012) and Wilcox & Rousselet (2017).

Yuen’s approach

1-alpha confidence interval for the trimmed mean: trimci(x,tr=.2,alpha=0.05)

Yuen t-test for 2 independent groups: yuen(x,y,tr=.2)

Yuen t-test for 2 dependent groups: yuend(x,y,tr=.2)

Bootstrap percentile-t method

One group: trimcibt(x,tr=.2,alpha=.05,nboot=599)

Two independent groups: yuenbt(x,y,tr=.2,alpha=.05,nboot=599)

Two dependent groups: ydbt(x,y,tr=.2,alpha=.05,nboot=599)

Percentile bootstrap approach

One group: trimpb(x,tr=.2,alpha=.05,nboot=2000)

Two independent groups: trimpb2(x,y,tr=.2,alpha=.05,nboot=2000)

Two dependent groups: dtrimpb(x,y=NULL,alpha=.05,con=0,est=mean)
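If you prefer Python, the logic of the one-group percentile bootstrap is easy to sketch. This mirrors what trimpb does, but it is not Wilcox’s implementation:

```python
import numpy as np
from scipy import stats

def trimpb(x, tr=0.2, alpha=0.05, nboot=2000, seed=1):
    """Percentile bootstrap confidence interval for the population
    trimmed mean: resample with replacement, compute the trimmed mean
    of each bootstrap sample, take the alpha/2 and 1-alpha/2 quantiles."""
    rng = np.random.default_rng(seed)
    boot = np.array([stats.trim_mean(rng.choice(x, len(x), replace=True), tr)
                     for _ in range(nboot)])
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])
```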

Matlab

There are some Matlab functions here:

tm – trimmed mean

yuen – t-test for 2 independent groups

yuend – t-test for 2 dependent groups

winvar – winsorized variance

winsample – winsorized sample

wincov – winsorized covariance

These functions can be used with several estimators, including trimmed means:

pb2dg – percentile bootstrap for 2 dependent groups

pb2ig – percentile bootstrap for 2 independent groups

pbci – percentile bootstrap for 1 group

Several functions for trimming large arrays and computing confidence intervals are available in the LIMO EEG toolbox.

References

Rousselet, G.A., Pernet, C.R. & Wilcox, R.R. (2017) Beyond differences in means: robust graphical methods to compare two groups in neuroscience. figshare. https://doi.org/10.6084/m9.figshare.4055970.v7

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press, San Diego, CA.

Wilcox, R.R. & Rousselet, G.A. A guide to robust statistical methods in neuroscience. bioRxiv 151811. https://doi.org/10.1101/151811

Yuen, K.K. (1974) The two-sample trimmed t for unequal population variances. Biometrika, 61, 165-170. https://doi.org/10.1093/biomet/61.1.165

Matlab code for the shift function: a powerful tool to compare two entire marginal distributions

Recently, I presented R code for the shift function, a powerful tool to compare two entire marginal distributions.

The Matlab code is now available on github.

shifthd has the same name as its R version, which was originally programmed by Rand Wilcox and first documented in 1995 (see details). It computes a shift function for independent groups, using a percentile bootstrap estimation of the SE of the quantiles to compute confidence intervals.

shiftdhd is the version for dependent groups.

More recently, Wilcox introduced a new version of the shift function, in which a straightforward percentile bootstrap is used to compute the confidence intervals, without estimation of the SE of the quantiles. It is implemented in Matlab as shifthd_pbci for independent groups (equivalent to qcomhd in R) and as shiftdhd_pbci for dependent groups (equivalent to Dqcomhd in R).
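To make the logic concrete, here is a bare-bones Python sketch of the percentile bootstrap version for independent groups. It uses plain quantile estimates instead of the Harrell-Davis estimator that Wilcox’s functions rely on, and it does not adjust the confidence intervals for multiple quantiles, so it is only an illustration:

```python
import numpy as np

def shift_function(x, y, quantiles=np.arange(0.1, 1.0, 0.1),
                   nboot=1000, alpha=0.05, seed=1):
    """Differences between the deciles of two independent groups,
    with percentile bootstrap confidence intervals."""
    rng = np.random.default_rng(seed)
    delta = np.quantile(y, quantiles) - np.quantile(x, quantiles)
    boot = np.empty((nboot, len(quantiles)))
    for b in range(nboot):
        xb = rng.choice(x, len(x), replace=True)
        yb = rng.choice(y, len(y), replace=True)
        boot[b] = np.quantile(yb, quantiles) - np.quantile(xb, quantiles)
    ci = np.quantile(boot, [alpha / 2, 1 - alpha / 2], axis=0)
    return delta, ci
```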

A demo file shift_function_demo is available here, along with the function shift_fig and dependencies cmu and UnivarScatter.

For instance, if we use the ozone data covered in the previous shift function post, a call to shifthd looks like this:

[xd, yd, delta, deltaCI] = shifthd(control,ozone,200,1);

producing this figure:

figure1

The output of shifthd, or of any of the other three shift functions, can be used as input to shift_fig:

shift_fig(xd, yd, delta, deltaCI,control,ozone,1,5);

producing this figure:

figure2

This is obviously work in progress, and shift_fig is meant as a starting point.

Have fun exploring how your distributions differ!

And if you have any question, don’t hesitate to get in touch.