Why normality tests are great…

…as a teaching example and should be avoided in research.

These statements are common in the psychology and neuroscience literature:

“In order to assess the normal distribution of the population in terms of age, BV% and CSF%, the Lilliefors-corrected Kolmogorov–Smirnov test was performed” (Porcu et al. 2019)

“The Kolmogorov–Smirnov-Test revealed a normal distribution (p = 0.82).” (Knolle et al. 2019)

“The distribution was not normal (P < 0.01 with the Shapiro–Wilk test).” (Beaudu-Lange et al. 2001)

“Assumptions of the one-way anova for normality were also confirmed with the Shapiro–Wilk test.” (Holloway et al. 2015)

“The Shapiro-Wilk-W-test (P < 0.05) revealed that all distributions could be assumed to be Gaussian as a prerequisite for the application of a t-test.” (Dicke et al. 2008)

“Given the non-normal distribution of such data (Shapiro–Wilk’s p < .05), we applied a nonparametric one-sample t test (the one-sample Wilcoxon signed rank test).” (Zapparoli et al. 2019)

A common recipe goes like this:

  • apply a normality test;
  • if p>0.05, conclude that the data are normally distributed and proceed with a parametric test;
  • if p<0.05, conclude that the data are not normally distributed and proceed with a non-parametric test (or transform the data to try to achieve normality).

It is a useful exercise or class activity to consider the statements above with the goal of identifying all the underlying issues. It could take several hours of teaching to do justice to the rich topics we need to cover to properly understand these issues.

Here is a succinct and non-exhaustive list of issues, with references for follow-up readings:

[1] In the general context of linear regression, the normality assumption applies to the residuals, not the marginal distributions. The main solution involves graphical checks of the residuals (Ernst & Albers, 2017; Vanhove, 2018).

Resources for graphical checks:

Visualization of Regression Models Using visreg

Visualizing regression model predictions

Extracting and visualizing tidy residuals from Bayesian models

Other solutions involve model comparison, to contrast models making different assumptions, and using models robust to assumption violations (Bürkner, 2017; Kruschke, 2013; Wilcox & Rousselet, 2018).

[2] The p value from standard frequentist tests, such as normality tests, cannot be used to accept the null (Rouder et al., 2016; Kruschke, 2018). Because the p value is computed assuming that the null is true, it cannot in turn be used to support the null: that would be circular. To find support for the null, we need an alternative hypothesis (to compute a Bayes Factor; Rouder et al., 2016; Wagenmakers et al., 2020) or a Region of Practical Equivalence (ROPE, to compute a test of equivalence; Freedman et al., 1984; Kruschke, 2018; Lakens, 2017; Campbell & Gustafson, 2021). Setting an alternative hypothesis is also crucial to get a consistent test (Rouder et al., 2016; Kruschke & Liddell, 2018). Tests of normality, like all Point Null Hypothesis Significance Tests (PNHST), are inconsistent: given alpha = 0.05, even if normality holds, 5% of tests will be positive no matter how large the sample size is.
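The inconsistency can be checked with a small simulation. The sketch below is in Python (standard library only) and is not a normality test: as a stand-in for any point null test, it assumes a two-sided z-test of mean = 0 with known standard deviation 1. The function name `false_positive_rate` and all settings are illustrative choices.

```python
import random
from statistics import NormalDist, fmean

def false_positive_rate(n, n_sim=2000, alpha=0.05, seed=1):
    """Proportion of significant two-sided z-tests when the null is true.

    Assumption: data are N(0, 1), and we test mean = 0 with known sd = 1.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sim):
        sample = [rng.gauss(0, 1) for _ in range(n)]
        z = fmean(sample) * n ** 0.5  # z statistic with known sd = 1
        if abs(z) > z_crit:
            hits += 1
    return hits / n_sim

# The null is true in every simulated experiment, whatever the sample size.
rates = {n: false_positive_rate(n) for n in (10, 100, 1000)}
for n, rate in rates.items():
    print(f"n = {n:4d}: proportion of positive tests = {rate:.3f}")
```

Under these assumptions the proportion of positive tests should hover around alpha at every sample size, instead of shrinking towards zero as evidence accumulates.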

[3] Failure to reject (p>0.05) doesn’t mean data were sampled from a normal distribution. Another function could fit the data equally well (for instance a shifted lognormal distribution). This point follows directly from [2]. Since our alternative hypothesis is extremely vague, the possibility of another distribution being a plausible data generation process is ignored: the typical test considers only a point null hypothesis versus “anything else”. So when we ask a very vague question, we can only get a very vague answer (there is no free lunch in inference – Rouder et al., 2016).

[4] Failure to reject (p>0.05) could be due to low power. This is well known but usually ignored. Here are the results of simulations to illustrate this point. The code is available on GitHub. We sample from g-and-h distributions (Yan & Genton, 2019), which let us vary asymmetry (parameter g) and tail-thickness (parameter h, which also affects how peaky the distribution is). We start by varying g, keeping a constant h=0.
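For readers who want to experiment, here is a minimal Python sketch of how g-and-h samples can be generated by transforming standard normal deviates. It is not the simulation code from the GitHub repository; the function name `ghdist` and the example settings are assumptions made for illustration.

```python
import math
import random
from statistics import fmean, median

def ghdist(n, g=0.0, h=0.0, rng=None):
    """Draw n values from a g-and-h distribution.

    Each standard normal deviate z is transformed as
        x = (exp(g * z) - 1) / g * exp(h * z**2 / 2)   if g != 0
        x = z * exp(h * z**2 / 2)                      if g == 0
    so g = h = 0 gives back the standard normal distribution.
    """
    rng = rng or random.Random()
    out = []
    for _ in range(n):
        z = rng.gauss(0, 1)
        x = z if g == 0 else (math.exp(g * z) - 1) / g
        out.append(x * math.exp(h * z * z / 2))
    return out

rng = random.Random(21)
skewed = ghdist(20000, g=1.0, h=0.0, rng=rng)
# With g > 0 the right tail is stretched: the mean moves away from the median.
print(f"median = {median(skewed):.3f}, mean = {fmean(skewed):.3f}")
```

Increasing g increases asymmetry, whereas increasing h thickens both tails while keeping the distribution symmetric when g = 0.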

g-and-h populations used in the simulation in which we vary parameter g

Here are results for the Shapiro-Wilk test, based on a simulation with 10,000 iterations.

The Shapiro-Wilk test has low power unless the departure from normality is pronounced, or sample sizes are large. With small departures from normality (say g=0.1, g=0.2), achieving high power won’t be possible with typical sample sizes in psychology and neuroscience. For g=0, the population is normal, so every positive test is a false positive; the proportion of positive tests is at the expected 5% level.

The Kolmogorov-Smirnov test is dramatically less powerful than the Shapiro-Wilk test (Yap & Sim, 2011).

What happens if we sample from symmetric distributions that are more prone to outliers than the normal distribution? By varying the h parameter, keeping a constant g=0, we can consider distributions that are progressively more kurtotic than the normal distribution.

g-and-h populations used in the simulation in which we vary parameter h

Are the tests considered previously able to detect such deviations from normality? Here is how the Shapiro-Wilk test behaves.

And here are the catastrophic results for the Kolmogorov-Smirnov test.

[5] As the sample size increases, progressively smaller deviations from normality can be detected, eventually reaching absurd levels of precision, such that tiny differences of no practical relevance will be flagged. This point applies to all PNHST and again follows from [2]: because in PNHST no alternative is considered, tests are biased against the null (Rouder et al., 2016; Wagenmakers et al., 2020). Even when p<0.05, contrasting two hypotheses could reveal that a normal distribution and a non-normal distribution are equally plausible, given our data. Also, because PNHST is not consistent, even when the null is true, 5% of tests will be positive.
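A worked example makes the point concrete. The Python sketch below does not use a normality test; it assumes the simplest point null test, a two-sided z-test of mean = 0 with known standard deviation 1, because its power has a closed form. The function name `z_test_power` and the effect size of 0.01 standard deviations are illustrative.

```python
from statistics import NormalDist

def z_test_power(delta, n, alpha=0.05):
    """Power of a two-sided z-test of mean = 0 (known sd = 1)
    when the true mean is delta, for sample size n."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    shift = delta * n ** 0.5  # expected value of the z statistic
    return (1 - nd.cdf(z_crit - shift)) + nd.cdf(-z_crit - shift)

# A deviation of 0.01 sd is of no practical relevance in most applications,
# yet it is flagged almost every time once n is large enough.
sample_sizes = [100, 10_000, 100_000, 1_000_000]
powers = [z_test_power(0.01, n) for n in sample_sizes]
for n, p in zip(sample_sizes, powers):
    print(f"n = {n:>9,d}: power = {p:.3f}")
```

With delta = 0 the function returns alpha for every n, which is the inconsistency described in [2].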

[6] Choosing a model conditional on the outcome of a preliminary check affects sampling distributions and thus p values and confidence intervals. The same problem arises when doing balance tests. If a t-test is conditional on a normality test, its p value will differ, by an unknown amount, from the one obtained if the t-test is performed without a preliminary check. That’s because p values depend on sampling distributions of imaginary experiments, which in turn depend on sampling and testing intentions (Wagenmakers, 2007; Kruschke & Liddell, 2018). This dependence can make p values difficult to interpret, because unless we simulate the world of possibilities that led to our p value, the sampling distribution for our statistic (say t statistic) is unknown.

[7] When non-normality is detected or suspected, a classic alternative to the two-sample t-test is the Wilcoxon-Mann-Whitney (WMW) test. However, in general different tests or models address different hypotheses: they are not interchangeable. For instance, the WMW’s U statistic is related to the distribution of all pairwise differences between two independent groups; unlike the t-test, it doesn’t involve a comparison of the marginal means. Similarly, if instead of the mean we use a trimmed mean, a robust measure of central tendency, our inferences are about the population trimmed mean, not the population mean.
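The link between U and pairwise comparisons can be made explicit with a few lines of Python. This is a sketch on made-up data; the function names `prob_superiority` and `mann_whitney_u` are illustrative and do not come from a statistics library.

```python
from statistics import fmean

def prob_superiority(x, y):
    """Estimate P(X > Y) + 0.5 * P(X = Y) from all pairwise comparisons."""
    wins = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    return wins / (len(x) * len(y))

def mann_whitney_u(x, y):
    """U statistic for x: number of (x, y) pairs with x > y, ties count 0.5."""
    return prob_superiority(x, y) * len(x) * len(y)

# Made-up data: one extreme value drives the mean of x.
x = [1, 2, 3, 100]
y = [4, 5, 6, 7]

print(fmean(x), fmean(y))      # 26.5 5.5  -> mean of x is larger
print(prob_superiority(x, y))  # 0.25      -> x exceeds y in only 4 of 16 pairs
print(mann_whitney_u(x, y))    # 4.0
```

In this example the comparison of means and the pairwise comparisons point in opposite directions, which is why the two approaches cannot be treated as interchangeable.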

[8] In most cases, researchers know the answer to the normality question before conducting the experiment. For instance, we know that reaction times, accuracy and questionnaire data are not normally distributed. Testing for normality when we already know the answer is unnecessary and falls into the category of tautological tests. Since we know the answer in most situations, it is better practice to use appropriate models and drop the checks altogether. For instance, accuracy data follow beta-binomial distributions (Jaeger, 2008; Kruschke, 2014); questionnaire data can be modelled using ordinal regression (Liddell & Kruschke, 2018; Bürkner & Vuorre, 2019; Taylor et al., 2022); reaction time data can be modelled using several families of skewed distributions (Lindeløv, 2019).

References

Bürkner, P.-C. (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1–28. https://doi.org/10.18637/jss.v080.i01

Bürkner, P.-C. & Vuorre, M. (2019) Ordinal Regression Models in Psychology: A Tutorial. Advances in Methods and Practices in Psychological Science, 2, 77–101. https://journals.sagepub.com/doi/full/10.1177/2515245918823199

Campbell, H. & Gustafson, P. (2021) re:Linde et al. (2021): The Bayes factor, HDI-ROPE and frequentist equivalence tests can all be reverse engineered – almost exactly – from one another. https://arxiv.org/abs/2104.07834

Ernst AF, Albers CJ. 2017. Regression assumptions in clinical psychology research practice—a systematic review of common misconceptions. PeerJ 5:e3323 https://doi.org/10.7717/peerj.3323

Freedman, L.S., Lowe, D., & Macaskill, P. (1984) Stopping rules for clinical trials incorporating clinical opinion. Biometrics, 40, 575–586.

Jaeger, T.F. (2008) Categorical Data Analysis: Away from ANOVAs (transformation or not) and towards Logit Mixed Models. J Mem Lang, 59, 434–446.

Kruschke, J.K. (2013) Bayesian estimation supersedes the t test. J Exp Psychol Gen, 142, 573–603.

Kruschke, J.K. (2014) Doing Bayesian Data Analysis, 2nd edn. Academic Press.

Kruschke, J.K. (2018) Rejecting or Accepting Parameter Values in Bayesian Estimation. Advances in Methods and Practices in Psychological Science, 1, 270–280.

Kruschke, J.K. & Liddell, T.M. (2018) The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychon Bull Rev, 25, 178–206.

Lakens, D. (2017). Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Social Psychological and Personality Science, 8(4), 355–362. https://doi.org/10.1177/1948550617697177

Liddell, T.M. & Kruschke, J.K. (2018) Analyzing ordinal data with metric models: What could possibly go wrong? Journal of Experimental Social Psychology, 79, 328–348.

Lindeløv, J.K. (2019) Reaction time distributions: an interactive overview
https://lindeloev.github.io/shiny-rt/

Rouder, J.N., Morey, R.D., Verhagen, J., Province, J.M. and Wagenmakers, E.-J. (2016), Is There a Free Lunch in Inference?. Top Cogn Sci, 8: 520-547. https://doi.org/10.1111/tops.12214

Taylor, J.E., Rousselet, G.A., Scheepers, C. et al. (2022) Rating norms should be calculated from cumulative link mixed effects models. Behav Res. https://doi.org/10.3758/s13428-022-01814-7

Vanhove, J. (2018) Checking model assumptions without getting paranoid. https://janhove.github.io/analysis/2018/04/25/graphical-model-checking

Wagenmakers, E.-J. (2007) A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.

Wagenmakers, E.-J., Lee, M.D., Rouder, J.N., & Morey, R.D. (2020) The Principle of Predictive Irrelevance or Why Intervals Should Not be Used for Model Comparison Featuring a Point Null Hypothesis. In Gruber, C.W. (ed), The Theory of Statistics in Psychology: Applications, Use, and Misunderstandings. Springer International Publishing, Cham, pp. 111–129.

Wilcox, R.R. & Rousselet, G.A. (2018) A Guide to Robust Statistical Methods in Neuroscience. Curr Protoc Neurosci, 82, 8.42.1–8.42.30. https://doi.org/10.1002/cpns.41

Yan, Y. & Genton, M.G. (2019) The Tukey g-and-h distribution. Significance, 16, 12–13. https://doi.org/10.1111/j.1740-9713.2019.01273.x

Yap, B.W. & Sim, C.H. (2011) Comparisons of various types of normality tests. Journal of Statistical Computation and Simulation, 81, 2141–2155. DOI: 10.1080/00949655.2010.520163

Analog teaching activities about sampling and resampling

This year I’m teaching a new undergraduate course on the bootstrap for 4th year psychology students. Class examples, take-home exercises and the exam use R. I will also use a few analog activities in class. Here I’d like to share some of these activities. (This is also the opportunity to start a new category of posts on teaching.) The course is short, with only 5 sessions of 2 hours, but I think it is important to spend some of that time to try to get key concepts across by providing engaging activities. I’ll report back on how it went.

The 3 main activities involve dice, poker chips and wooden fish, to explore different types of sampling, sampling distributions, the distinction between sample and population, resampling…

Activity 1: dice (hierarchical sampling)

We use dice to simulate sampling with replacement from an infinite population of participants and trials.

This exercise provides an opportunity to learn about:

  • the distinction between population and sample;
  • sampling with replacement;
  • hierarchical sampling;
  • running simulations;
  • estimation;
  • the distinction between finite and infinite populations.

Material:

  • 3 bags of dice
  • 3 trays (optional)

Each bag contains a selection of dice with 4 to 20 faces, forming 3 independent populations. I used a lot of dice in each bag, but that’s not necessary: it just makes it harder to guess the content of the bags. I got the dice from TheDiceShopOnline.

Many exercises can be proposed, involving different sampling strategies, with the aim of making some sort of inference about the populations. Here is the setup we will use in class:

  • 3 participants or groups of participants are involved, each working independently with their own bag/population;
  • a die is randomly picked from a bag (without looking inside the bag!), which is similar to randomly sampling a participant from the population;
  • the die is rolled 5 times, and the results written down, which is similar to randomly sampling trials from the participant;
  • perform the two previous steps 10 times, for a total of 10 participants x 5 trials = 50 trials.

These values are then entered into a text file and shared with the rest of the class. The text files are opened in R, and the main question is: is there evidence that our 3 samples of 10 participants x 5 trials were drawn from different populations? To simplify the problem, a first step could involve averaging over trials, so we are left with 10 values from each group. The second step is to produce some graphical representation of the data. Then we can try various inferential statistics.
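The analog procedure can also be mimicked on a computer. The class uses R; the sketch below is a Python equivalent, in which the bag contents, the seed and the function name `run_experiment` are all hypothetical (the real bags are only revealed in the last class).

```python
import random
from statistics import fmean

def run_experiment(bag, n_participants=10, n_trials=5, rng=None):
    """Simulate one group's experiment: hierarchical sampling from a bag of dice.

    bag is a list giving the number of faces of each die in the bag.
    """
    rng = rng or random.Random()
    data = []
    for _ in range(n_participants):
        faces = rng.choice(bag)  # level 1: sample a die (participant); it stays in the bag
        rolls = [rng.randint(1, faces) for _ in range(n_trials)]  # level 2: sample trials
        data.append(rolls)
    return data

# Hypothetical bag: 5 d4, 10 d6, 5 d8, 3 d12 and 2 d20.
bag = [4] * 5 + [6] * 10 + [8] * 5 + [12] * 3 + [20] * 2

rng = random.Random(42)
data = run_experiment(bag, rng=rng)

# First step suggested above: average over trials, leaving one value per participant.
participant_means = [fmean(rolls) for rolls in data]
print(participant_means)
```

Running `run_experiment` with three different bags gives three groups of 10 means that can then be plotted and compared.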

The exercise can be repeated on different days, to see how much variability we get between simulated experiments. During the last class, the populations and the sampling distributions are revealed.

Also, in this exercise, because the dice are sampled with replacement, the population has an infinite size. The content of each bag defines the probability of sampling each type of die, but it is not the entire population, unlike in the faux fish activity (see below).

Here is an example of samples after averaging the 5 trials per die/participant:

Activity 2: poker chips (bootstrap sampling with replacement)

We use poker chips to demonstrate sampling with replacement, as done in the bootstrap.

This exercise provides an opportunity to learn about:

  • sampling with replacement;
  • bootstrap sampling;
  • running simulations.

A bag contains 8 poker chips, representing the outcome of an experiment. Each chip is like an observation.

First, we demonstrate sampling with replacement by getting a random chip from the bag, writing down its value, and replacing the chip in the bag. Second, we demonstrate bootstrap sampling by performing sampling with replacement 8 times, each time writing down the value of the random chip selected from the bag. This should help make bootstrap sampling intuitive.

After this analog exercise, we switch to R to demonstrate the implementation of sampling with replacement using the sample function.
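In R, sampling with replacement from a vector `chips` is done with `sample(chips, replace = TRUE)`. For comparison, here is a Python sketch of the same idea using `random.choices`; the chip values and the seed are made up for illustration.

```python
import random
from statistics import fmean

# Hypothetical values written on the 8 chips (the observed sample).
chips = [1, 5, 5, 10, 10, 25, 50, 100]

rng = random.Random(1)

# One bootstrap sample: draw 8 chips, putting each chip back before the next draw.
boot_sample = rng.choices(chips, k=len(chips))
print(boot_sample)

# Repeating the process many times gives a bootstrap distribution of the mean.
boot_means = [fmean(rng.choices(chips, k=len(chips))) for _ in range(1000)]
print(min(boot_means), max(boot_means))
```

Because chips are replaced after each draw, a bootstrap sample typically contains some chips several times and others not at all.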

Activity 3: faux fish (sampling distributions)

We sample with replacement from a finite population of faux fish to demonstrate the effect of sample size on the shape of sampling distributions.

The faux fish activity is mentioned in Steel, Liermann & Guttorp (2019), with pictures of class results. The activity is described in detail in Kelsey & Steel (2001).

Steel, Liermann & Guttorp (2019)

This exercise provides an opportunity to learn about:

  • the distinction between population and sample;
  • sampling with replacement;
  • running simulations;
  • estimation;
  • sampling distributions.

Material:

  • two sets of 97 faux fish = fish-shaped bits of paper or other material
  • two containers = ponds
  • two large blank sheets of paper, each prepared with:
      • x axis = ‘Mean weight (g)’
      • y axis = ‘Number of experiments’
      • title = ‘n=3 replicates’ on one sheet, ‘n=10 replicates’ on the other

I got faux fish made of wood from Wood Craft Shapes.

Each faux fish has a weight in grams written on it.

The frequencies of the weights are given in Kelsey & Steel (2001).

The fish population is stored in a box. I made 2 identical populations, so that two groups can work in parallel.

The first goal of the exercise is to produce sampling distributions by sampling with replacement from a population. The second goal is to evaluate the effect of the sample size on the shape of the sampling distribution. The third goal is to experiment with a digital version of the analog task, to gain familiarity with simulations.

Unlike the dice activity, this activity involves a finite size population: each box contains the full population under study.

Setup:

  • two groups of participants;
  • each group is assigned a box;
  • participants from each group take turns sampling from the box n=3 or n=10 faux fish (depending on the group), without looking inside the box;
  • each participant averages the numbers, writes down the answer and marks it on the large sheet of paper assigned to each group;
  • this is repeated until a sufficient number of simulated experiments have been carried out to assess the shape of the resulting sampling distribution.

To speed up the exercise, a participant picks n fish, writes down the weights, puts all the fish back in the box, and passes the box to the next participant. While the next participant is sampling fish from the box, the previous participant computes the mean and marks the result on the group graph.

Once done, the class discusses the results:

  • the sampling distributions are compared;
  • the population mean is revealed;
  • the population is revealed by showing the handout from the book and opening the boxes.

Then we do the same in R, but much quicker!
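The class version is in R; a Python sketch of the digital task could look like the code below. The 97 weights are randomly generated placeholders, not the frequencies from Kelsey & Steel (2001), and the function name `sampling_distribution` is hypothetical. Following the setup described above, each simulated participant draws n fish together and returns all of them before the next draw.

```python
import random
from statistics import fmean, stdev

def sampling_distribution(population, n, n_experiments=1000, rng=None):
    """Means of repeated samples of size n from a finite population.

    Each experiment draws n distinct fish, records their mean weight,
    and returns all fish to the pond before the next experiment.
    """
    rng = rng or random.Random()
    return [fmean(rng.sample(population, n)) for _ in range(n_experiments)]

rng = random.Random(7)

# Placeholder population: 97 skewed weights in grams (NOT the published values).
population = [round(rng.lognormvariate(3, 0.5)) for _ in range(97)]

means_n3 = sampling_distribution(population, 3, rng=rng)
means_n10 = sampling_distribution(population, 10, rng=rng)

print(f"population mean: {fmean(population):.1f}")
print(f"n=3:  mean of means = {fmean(means_n3):.1f}, sd = {stdev(means_n3):.1f}")
print(f"n=10: mean of means = {fmean(means_n10):.1f}, sd = {stdev(means_n10):.1f}")
```

Both sampling distributions should be centred near the population mean, with the n=10 distribution noticeably narrower than the n=3 one.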

Here is an example of simulated results for n=3 (the vertical line marks the population mean):

References

Kelsey, Kathryn, and Ashley Steel. The Truth about Science: A Curriculum for Developing Young Scientists. NSTA Press, 2001.

Steel, E. Ashley, Martin Liermann, and Peter Guttorp. Beyond Calculations: A Course in Statistical Thinking. The American Statistician 73, no. sup1 (29 March 2019): 392–401. https://doi.org/10.1080/00031305.2018.1505657.