Tag Archives: processing speed

The Harrell-Davis quantile estimator

Quantiles are robust and useful descriptive statistics. They belong to the family of L-estimators, which is to say that they are based on the linear combination of order statistics. They are several ways to compute quantiles. For instance, in R, the function quantile has 9 options. In Matlab, the quantile & prctile functions offer only 1 option. Here I’d like to introduce briefly yet another option: the Harrell-Davis quantile estimator (Harrell & Davis, 1982). It is the weighted average of all the order statistics (Figure 2). And, in combination with the percentile bootstrap, it is a useful tool to derive confidence intervals of quantiles (Wilcox 2012), as we will see quickly in this post. It is also a useful tool to derive confidence intervals of the difference between quantiles of two groups, as we will see in another post. As discussed previously in the percentile bootstrap post, to make accurate confidence intervals, we need to combine an estimator with a particular confidence interval building procedure, and the right combo is not obvious depending on the data at hand.

Before we motor on, a quick google search suggests that there is recent work to try to improve the Harrell-Davis estimator, so this not to say that this estimator is the best in all situations. But according to Rand Wilcox it works well in many situations, and we do use it a lot in the lab…

Let’s look at data from a paper on visual processing speed estimation (Bieniek et al. 2015). We consider ERP onsets from 120 participants aged 18 to 81.

The sorted ages are:

18 18 19 19 19 19 20 20 21 21 21 21 21 21 21 22 22 22 22 22 22 22 22 23 23 23 24 24 24 25 26 28 28 29 29 30 30 31 31 32 32 32 33 34 34 35 35 36 37 38 40 40 41 41 42 42 43 43 44 45 45 45 45 48 49 49 50 51 54 54 55 56 58 59 59 60 60 61 62 62 62 63 63 63 64 64 64 64 65 65 66 66 66 66 66 66 67 67 67 67 68 68 68 68 68 69 70 70 70 71 72 72 72 75 76 77 78 79 81 81

Fig1-age distribution

Figure 1. Age distribution.

The Matlab code to reproduce all the figures in this post is available on github. There is also a list of R functions from Rand Wilcox’s toolbox.

How do we compute Harrell-Davis quantiles of the age distribution? Figure 2 shows the Harrell-Davis weights for the deciles of the age distribution.

Fig2-weights

Figure 2. Decile weights.

The deciles are obtained by multiplying the sorted ages by the weights in Figure 2, which gives us:

21.1, 23.3, 29.7, 37.0, 45.3, 56.1, 63.3, 66.6, 70.4

For comparison, the age deciles from Matlab’s prctile function are:

21, 23, 30, 36, 45, 57, 64, 66, 70

Now, we can update the scatterplot in Figure 1 with the deciles:

Fig3-age deciles

Figure 3. Scatterplot + age deciles. The thick vertical black line marks the 50th quantiles.

We can also compute a confidence interval for a Harrell-Davis quantile. There are two ways to do that:

  • using a percentile bootstrap of the quantile (pbci approach);
  • using a percentile bootstrap estimate of the standard error of the quantile, which is then plugged into a confidence interval formula (pbse approach).

Using the code available with this post, we can try the two approaches on the median:

  • pbci approach gives 45.31 [35.89, 54.73]
  • pbse approach gives 45.31 [38.49, 54.40]

The two methods return similar upper bounds, but quite different lower bounds. Because they are both based on random resampling with replacement, running the same analysis several times will each time also give slightly different results. Actually, this is one important criterion to select a good bootstrap confidence interval technique: despite random sampling, using the same technique many times should provide overall similar results. Another important criterion is the probability coverage: if we build a 95% confidence interval, we want that confidence interval to contain the population value we’re trying to estimate 95% of the time. That’s right, the probability attached to a confidence interval is a long run coverage: assuming a population with a certain median, if we perform the same experiment over and over, every time drawing a sample of n observations and computing an (1-alpha)% confidence interval using the same technique, (1-alpha)% of these confidence intervals will contain the population median. So, if everything is fine (n is large enough, the number of bootstrap samples is large enough, the combination of bootstrap technique and estimator is appropriate), alpha% of the time (usually 5%), a confidence interval WILL NOT include the population parameter of interest. This implies that given the 1,000s of neuroscience & psychology experiments performed every year, 100s of paper report the wrong confidence intervals – but this possibility is never considered in the articles’ conclusions…

In many situations, the long run probability coverage can be actually much lower or much higher than (1-alpha). So can we check that we’re building accurate confidence intervals, at least in the long run? For that, we’ve got to run simulations. Here is an example. First, we create a fake population, for instance with a skewed distribution, which could reflect our belief of the nature of the population we’re studying:

Fig4-sim population

Figure 4. Population of 1,000,000 values with a 10 degrees of freedom chi2 distribution.

Second, we compute benchmark values, e.g. median, mean…

Third, we run simulations in which we perform fake experiments with a given sample size, and then compute confidence intervals of certain quantities. Finally, we check how often the different confidence intervals actually contain the population parameters (probability coverage):

  • pbse(hd) = 0.9530
  • pbci(hd) = 0.9473
  • pbci(median) = 0.9452
  • pbci(mean) = 0.9394

They’re all very close to 95%. However, the confidence intervals of hd created using the pbse approach tended to be larger than those created using the pbci approach. The confidence intervals for the mean missed the population mean 1% of the time compared to the expected 95% – that’s because they tended to be shorter than the other 3. The bootstrap estimates of the sampling distribution of hd, the median and the mean, as well as the width of the confidence intervals can be explored using the code on github.

Of course, no one is ever going to run 10,000 times the same experiment! And these results assume a certain population, a certain number of observations per experiment, and a certain number of bootstrap samples. We would need a more systematic exploration of the different combinations of options to be sure the present results are not special cases.

To be clear: there is absolutely no guarantee that any particular confidence interval contains the population parameter you’re trying to estimate. So be humble, and don’t make such a big deal about your confidence intervals, especially if you have small sample sizes.

Personally, more and more I use confidence intervals to try to describe the variability in the sample at hand. For that purpose, and to avoid potential inferential problems associated with confidence intervals, I think it is more satisfactory to use highest density intervals HDI. I will post R & Matlab functions to compute the HDI of the bootstrap quantiles on github at some stage. By reporting HDI, there are no associated p values and we minimise the temptation to cross proton streams (i.e. dichotomise a continuous variable to make a binary decision – MacCallum et al. 2002).

Finally, we consider something a bit more interesting than the age of our participants: the distribution of ERP onsets.

Here are the onsets in milliseconds:

Fig7-onset distribution

Figure 5. Onsets.

And the deciles with their confidence intervals, which provide a very nice summary of the distribution:

Fig8-onset deciles

Figure 6. Onset deciles with confidence intervals.

If you’re interested, I’ve also attempted a Bayesian estimation of the onset data using R and JAGS. See also this later post on using Bayesian quantile estimation and model-based inference.

Conclusion

Now you’ve got the tools to describe a distribution in detail. There is no particular reason why we should be obsessed with the mean, especially when robust and more informative statistics are available. Next, I will show you how to compare all the deciles of two distributions using a mighty tool: the shift function. This will, of course, rely on the Harrell-Davis estimator and the bootstrap.

References

Bieniek, M.M., Bennett, P.J., Sekuler, A.B. & Rousselet, G.A. (2015) A robust and representative lower bound on object processing speed in humans. The European journal of neuroscience.

Harrell, F.E. & Davis, C.E. (1982) A new distribution-free quantile estimator. Biometrika, 69, 635-640.

MacCallum RC, Zhang S, Preacher KJ, Rucker DD. 2002. On the practice of dichotomization of quantitative variables. Psychological Methods 7: 19-40

Wilcox, R.R. (2012) Introduction to robust estimation and hypothesis testing. Academic Press.

Priors for Bayesian estimation of visual object processing speed in humans

In Bieniek et al. 2015 we reported EEG estimates of visual object processing onsets from a sample of 120 human adult participants aged 18-81. Among these participants, 74 were tested a second time for test-retest assessment. All the details are described in the paper, which is open access. The main outcomes of that paper were onset estimation in every participant and robust descriptive group statistics with various frequentist confidence intervals. Although I still think we could improve on the onset detection technique (notably using shape matching instead of amplitude boundary crossing), we can leave that issue aside for the moment and assume that we’re dealing with a decent sample of EEG onsets. Here I want to revisit the main group analysis using a Bayesian approach. In doing so, I will cover a very Bayesian problem: how to integrate information across experiments. I will focus on a very simple one-sample case, because that’s the nature of the data, and more importantly because this is my first attempt at Bayesian statistics!

Code and data

The R & JAGS code and the data to reproduce all the analyses and the figures are available in the R script onset_priors.R located in this stand-alone folder.

Analyses were performed in R version 3.2.2, using JAGS version 4.1.0, and R functions and examples from the BEST package version 2 by John Kruschke. The R script used to generate the results and the figures assumes that you have read Kruschke’s BEST paper and installed his code and dependencies⁠. Much more can be found in his excellent book.

For the onset results, a more comprehensive dataset + Matlab code is available here:

Rousselet, Guillaume; Bieniek, Magdalena; Bennett, Patrick; Sekuler, Allison (2015): Face-noise ERP onsets from 194 recording sessions. figshare. https://dx.doi.org/10.6084/m9.figshare.1588513.v2.

The EEG data used to compute the onsets are available here:

Bieniek MM, Bennett PJ, Sekuler AB, Rousselet GA (2015) Data from: A robust and representative lower bound on object processing speed in humans. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.46786

Fake data

Before we get started on our experimental data, we should explore a toy example for which we know exactly what to expect. Indeed, it can be a very bad idea to apply completely new tools to your own data, because there are potentially too many unknowns, and no easy way to check that the tools are doing what they’re supposed to. So first we create fake data to check that the procedure works. We make a sample of 20 observations with a mean of 100 and standard deviation of 5.

Run Bayesian analysis using BEST’s defaults

We use BEST’s defaults with broad priors, 3 chains, 500 adaptation steps, 1,000 burn-in steps and 10,000 monitoring steps. Here are the results:

fig1_test

Figure 1. Parameter estimation given BEST’s defaults broad priors. Panels A-D show histograms of credible values from the posterior distribution for the mean, standard deviation, effect size and normality parameters. In panel A, the expected value of 100 is indicated in green under the mode of the marginal mean distribution. This value appears between 2 values: the percentage of the posterior distribution below and above the expected value. HDI stands for highest density interval. Panel E shows examples of representative posterior predictive t distributions. These examples appear in blue and are combinations of value parameters illustrated in panels A, B and D. In red is a histogram of the data. The distributions in blue are not data, they are posterior predictive distributions given our priors and the data.

The results are as expected given the fake data we have generated: the MCMC (Markov Chain Monte Carlo) Gibbs algorithm seems to do a good job at sampling regions of the posterior distribution that are compatible with the data. But how do we assess the quality of the analysis? This is a huge topic covered in part in Kruschke’s book, Doing Bayesian data analysis, 2nd edition. He offers a diagnostic plotting function diagMCMC to investigate the behaviour of the chains.

Run diagnostic tools

We run diagMCMC on the mu parameter and obtain the figure below. Although Kruschke suggests to use this figure to determine if the chains have converged, it is not a very appropriate description, because strictly speaking Markov chains do not converge. So what does this figure actually tells us?

fig2_test_diag_mu.jpg

Figure 2. MCMC diagnostics. A Trace plots of iterations, or steps, of the 3 chains for parameter mu (the mean). The first 1,500 steps were discarded because of the default 500 adaptation steps and 1000 burn in steps. The strong overlap among the 3 chains suggest that they are representative of the posterior distribution. B Autocorrelation functions for lags 0 to 35 for the 3 chains. This gives an indication of the number of steps necessary for the chains to provide independent samples from the posterior distribution. Here it looks as if 5 steps between samples would be sufficient to obtain a more representative estimation of the posterior distribution. ESS is the effective sample size: it estimates the length of a chain without autocorrelation that would give us the same information as the current one. Here we used 10,000 steps, so an ESS of ~6,000 suggests redundancy among samples which are taken from very similar parts of the distribution. For an accurate description of the 95% HDI, an ESS of at least 10,000 is recommended, according to Kruschke; for particular applications it would be wise to validate that number. C Shrink factor over iterations (steps) in the chains. The shrink factor is a measure of between chain variance relative to within chain variance. A value close to 1 again suggests that the chains are sampling similar representative regions of the posterior distribution. A shrink factor \<1.1 is a recommended cut-off according to Kruschke. D Density plots of the parameter value for the 3 chains, and the 95% HDI for each chain. The 3 density plots and HDIs overlap well, suggesting again that the parameter values have been sampled from similar representative regions of the posterior distribution. MCSE is the Monte Carlo standard error – the lower the better. I do not understand yet how MCSE, ESS and shrink factors are calculated.

The 3 chains appear to be well mixed and to be sampling from a high-probability region. However there is some autocorrelation, and the ESS is ~6000. Autocorrelation can be associated with poor, unrepresentative, sampling of the posterior distribution. To see if this is a concern, we could run the MCMC analysis again, and compare the 95% HDIs. If they are very similar between analyses, then we have probably sampled sufficiently from similar representative portions of the posterior distribution, and there is nothing to worry about. An alternative is to compare the results from the 3 chains we ran independently: in panel D of the figure above, the 3 marginal distributions of the mean and their associated 95% HDIs are very similar, so the sampling was probably sufficient.

To decrease the autocorrelation, we could also thin the chains, or increase the number of steps. See the very useful discussion in Kruschke’s blog suggesting that thinning is a poor strategy against auto-correlation. It seems that introducing burn-in steps is also unnecessary.

Results with 5 thinning steps

Let’s try thinning anyway. I’ll report on better strategies once I have performed some simulations. In this example, using 5 thinning steps gets rid of the auto-correlation and might be sufficient to provide stable estimation of the mean and its 95% HDI:

fig3_test_with_thinning_diag_mu.jpg

Figure 3. MCMC diagnostics for analyses with 5 thinning steps.

The results show improved HDI consistency across the 3 chains, from already very consistent results. In this case, 5 thinning steps were sufficient to get an ESS around the expected 10,000, although running the same analysis again produced inconsistent results with ESS ranging from ~8000 to ~10,000. If our main interest is the mode and the 95% HDI of the mean of the samples from the posterior distribution, then thinning does not make much difference compared to the previous results. The required precision of the estimation depends on the application of course.

Results are robust to outliers

In the lab, we only adopt tools that are robust against outliers. Certainly, t-tests and ANOVAs on means are not robust. What about our new Bayesian tool? To check, we add outliers to our sample of fake data, and run the analysis again. The results are very convincing:

fig4_test_with_outliers.jpg

Figure 4. Parameter estimation given data with outliers.

Although the estimation of SD has changed, here our main interest is to estimate the mean of probable underlying distributions compatible with the data. And for the mean the results with outliers are almost identical to those without outliers. Thus, MCMC with a t distribution is robust: the estimation of the mean is barely different from the original, and within random fluctuations expected from running the same MCMC several times. You can try different combinations of outliers to convince yourselves: personally, I’m sold.

Impact of priors

How are the results affected by our choice of priors? Instead of BEST’s default settings, we can for instance run the analysis using a uniform prior on mean and a gamma prior on SD. We strongly expect onsets to be contained in the broad interval 0-200 ms, so we use a uniform distribution spanning that interval. This gives results very similar to the original ones, which is expected because by default BEST uses very broad priors.

fig5_test_with_unif_prior.jpg

Figure 5. Parameter estimation given uniform prior on the mean.

But what happens if we run the analysis with strong priors on the mean? Let’s assume that we expect onsets to have a mean around 80 ms, with little dispersion around that value (mean prior SD=3). In that case the posterior predictive samples for the mean look very different from those obtained using broad priors (Figures 1 & 5):

fig6_test_with_strong_priors

 

Figure 6. Posterior distribution of the mean given a strong 80 ms prior. The mode is used as central tendency of the distribution – although it can be more unstable than the median, it tends to be more representative. The expected value of 100 is indicated in green under the mode and marked by a green vertical dotted line. This value appears between 2 values: the percentage of the posterior distribution below and above the expected value. Under that, in red, is a [95, 105] ROPE, marked by vertical doted lines. The 3 percentage values indicate the percentage of the MCMC distribution below, inside, and above the ROPE. Finally, the 95% HDI is indicated by a thick horizontal black line at the bottom of the histogram.

That’s reassuring, because if we have strong prior knowledge about the underlying distribution from which the data might be sampled, a new experiment should not be able to override that knowledge. Here, in particular, the 95% HDI does not include 100, the sample mean. So should we conclude that 100 is not a credible value for the underlying distribution, given our prior knowledge and the current experiment? If we were dealing with processing speed estimates from EEG recordings, our decision should be guided by other physiological considerations. For instance, if we think that our measurements originate from one particular cortical area, and after considering conduction times  between areas & other factors (Salin & Bullier, 1995; Nowak & Bullier, 1997), we could use for instance a narrow margin of error of 5 ms on either side of 100 ms. We would then define a region of practical equivalence or ROPE of [95, 105]. Because the HDI does overlap with the proposed ROPE, we should probably reserve judgement about whether 100 is credible given the data.

Bayesian analysis of EEG onsets

Now that we’ve gained some confidence in the tools, let’s do a Bayesian analysis of our real dataset of EEG onsets. The variable ses1 contains onsets from 120 participants. ses2 contains onsets from 74 participants who also provided ses1 onsets. Although this is a paired design, and the results can be used to investigate test-retest reliability (see our EJN 2015 paper), here we treat the two sessions as independent.

First, let’s plot kernel density estimates for the two sessions:

fig7_kde_2sessions

Figure 7. Kernel density estimates for session 1 and session 2 observations.

Bayesian analysis on session 1 using BEST’s defaults

To start, we use BEST’s defaults. Before we consider the results, we call the diagnostic plotting function diagMCMC for mu, our main parameter of interest.

fig8_ses1_diag_mu

Figure 8. MCMC diagnostics for session 1 onset analysis using BEST’s defaults.

The diagnostic tools suggests the chains are well mixed and seem to be sampling from a high-probability region. However the ESS is ~6000, which calls for some thinning or more steps. So let’s try again with 20,000 steps:

fig9_ses1_with_more_steps_diag_mu

Figure 9. MCMC diagnostics for session 1 onset analysis using 20,000 steps.

ESS is now ~10000, and all the diagnostic indicators look good.  It seems we’ve got a good sampling of the posterior distribution. So let’s display the results, with a 100 ms onset for comparison:

fig10_ses1

Figure 10. Parameter estimation for session 1 onset data using 20,000 iterations.

Here we get a mode of 92.6 and 95% HDI [88.4, 96.7]. If we had the hypothesis of a mean onset of 100 ms, the results suggest a credible mean lower than 100 ms. Let’s plot only mean posterior samples:

fig11_ses1_mean_posterior

Figure 11. Session 1 onset posterior distribution of the mean.

A 95-105 ms ROPE overlaps with the 95% HDI, so we conclude that 100 ms is still a credible value.

Bayesian analysis on session 1 using informed priors

What happens if we use more informed priors? For instance, the literature suggests that face onsets are around 50-150 ms. Here are the results using a uniform distribution on that interval:

fig12_ses1_unifprior

Figure 12. Parameter estimation for session 1 data using a uniform prior.

And the mean posterior only, with a [95, 105] ROPE:

fig13_ses1_unifprior_mean_posterior

Figure 13. Session 1 onset posterior distribution of the mean given a uniform prior.

As expected with uniform priors, the results changed very little compared to using BEST’s default options.

Bayesian analysis of session 2 results

With our new posterior estimates, we can now study results from session 2 using more informed priors. We plugin estimates of the posterior samples from session 1 as priors for session 2. This makes a lot of sense given that session 1 has a large sample size (for this type of research) and is therefore our best description of the population we’re trying to estimate. Here are the results:

fig14_ses2_ses1prior

Figure 14. Parameter estimation for session 2 onset data given session 1 priors.

fig15_ses2_ses1prior_mean_posterior.jpg

Figure 15. Session 2 onset posterior distribution of the mean given session 1 priors.

Compared to session 1 estimation, we’ve gained information: the ROPE is now completely outside the 95% HDI, which is [87.3, 93.4], and the mode of the mean is 90.3. So 100 ms is no longer a credible onset value.

What do we get if we pretend that session 1 does not exist?

  • session 2 results with session 1 priors: 95% HDI = [87.3, 93.4], mode of the mean = 90.3 (Figure 15)

  • session 2 results with broad priors: 95% HDI = [83.2, 92.2], mode of the mean = 88.1

Because the results are very similar across the two sessions, using broad priors instead of session 1 priors give very similar results. However, session 2 estimates using broad priors are shifted to the left, because the session 2 data distribution suggests a slightly lower mean compared to session 1.

In our EJN paper, we reported these median onsets with 95% percentile bootstrap confidence intervals:
– session 1 = 92 ms [85, 99]
– session 2 = 85 ms [79, 92]
Our Bayesian estimation across sessions suggests that the underlying distribution’s most credible mean is 90 ms. The 95% HDI is also about twice narrower than the 95% confidence interval, so we’ve reduced the uncertainty about the mean.

The same procedure employed in this section can be used to analyse a new distribution of onsets from similar experiments. We have collected a smaller number of observations in a series of new experiments, and we will be analysing the results using as priors session 2 posteriors, themselves obtained given session 1 posteriors.

Comparison to a new point estimate

Let’s assume someone finds an onset at 50 ms, probably using group statistics, such that there is only a single point estimate available. Before writing a paper declaring “face processing takes 50 ms”, is 50 ms credible given our estimation from 2 large samples? Let’s assume a ROPE of 10 ms on each side of the point estimate, so that we would consider 50 ms not credible only if our previous 95% HDI does not contain the interval 40-60 ms at all. There are other ways to estimate the ROPE, for instance by sampling participants with replacement many times and compute the group estimate for every iteration. Given that most MEEG papers I have ever read report only a single value, it’s very likely that future studies will focus on point estimates. Here are the results with a [40, 60] ROPE:

fig16_ses2_comp_to_point_estimate

Figure 16. Comparison between a 50 ms point estimate + ROPE & session 2 onset posterior distribution of the mean.

Well, as you’ve guessed, the answer is no, 50 ms is not a statistically credible onset! It is also not physiologically credible! And as incredible as such a 50 ms processing speed claim might seem, it has been made before, and I’m received recently a paper making such a claim to review… Now I can point the authors to this blog. Go Bayes!