I’ve watched entire analyses go sideways because someone assumed the data looked “kind of bell-shaped.” That assumption seems harmless until you run a t‑test on clearly skewed data or fit a linear model whose residuals are anything but normal. Normality testing is my guardrail: it tells me when classic methods are safe, when I should switch to robust or non‑parametric tools, and when the data itself needs a rethink. If you’re working with experiments, product metrics, ML features, or finance, you’ll hit this decision point constantly.
In this post, I’ll explain how I reason about normality, what the major tests actually measure, and how I pair them with visual checks. I’ll show complete, runnable code in Python and JavaScript, and I’ll share the practical thresholds I use in production. You’ll also see where tests lie, how sample size changes the interpretation, and how I handle edge cases like heavy tails or zero‑inflation. My goal is simple: give you a workflow you can reuse tomorrow, not just theory.
Why I check normality before choosing a model
I treat normality as a contract between the data and the method. Many statistical tools assume that either the data itself or the residuals of a model are normally distributed. If that contract is broken, the results can be unreliable. So before I choose a method, I ask two questions:
1) What part of this analysis assumes normality?
2) How costly is it if that assumption is wrong?
If I’m comparing two means with a small sample and tight business risk, I’ll test normality before I trust a t‑test. If I’m modeling a continuous response and the residuals look skewed, I’ll switch to a robust regression or a transformation before I interpret coefficients. If I’m building a model where prediction matters more than inference, normality still matters because it affects calibration, confidence intervals, and error estimates.
I also use normality checks as a diagnostic. When a dataset is supposed to represent many small independent effects (like test scores or manufacturing measurements), a normal distribution is a reasonable expectation. If I find strong deviations, I treat it as a clue: maybe there’s a hidden subgroup, a data pipeline bug, or a real‑world mechanism that isn’t symmetric.
Mental model: what normality means in real data
Normality isn’t a mystical property; it’s a shape. A normal distribution is symmetric, centered around the mean, and has tails that taper smoothly. It’s the distribution you get when many small, independent effects add up. That’s why it appears in measurement error, standardized test scores, and averaged behavior metrics.
But real data often breaks those conditions. You get skew from processes that can’t go below zero (like latency), heavy tails from rare events (like outages), and multi‑modal distributions when you mix user segments (like new vs. returning customers). So when I think about normality, I’m not asking “Is the data perfect?” I’m asking “Is the data close enough for the method I plan to use?”
A simple analogy I use with teams: imagine a bell curve as a road with guardrails. If your data stays on the road, the method behaves predictably. If it keeps drifting into the gravel, the method may still drive, but you’ll have to slow down or change vehicles.
Tests that actually work: Shapiro–Wilk, K‑S, Anderson–Darling
There are many normality tests, but I mainly rely on three. Each has a sweet spot, and I choose based on sample size and sensitivity.
Shapiro–Wilk
This is my default for small to medium datasets (up to about 2,000 samples). It compares the order statistics of your data to what you would expect from a normal distribution. In practice, it’s very sensitive to both skew and kurtosis.
- If p > 0.05: I treat normality as plausible.
- If p < 0.05: I treat normality as unlikely.
The nuance is that p‑values depend on sample size. A tiny deviation can look “significant” in large samples, while a real deviation can slip through in small ones. That’s why I never use this test alone.
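That sensitivity is easy to see in a quick simulation. This sketch (synthetic data, NumPy/SciPy) draws from the same mildly right‑skewed population at several sample sizes and prints the Shapiro–Wilk p‑value for each; the deviation typically slips through at small n and is flagged decisively by n=2,000.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def shapiro_p(n):
    # The same mildly right-skewed population at every n:
    # a standard normal plus a small exponential component
    x = rng.normal(size=n) + 0.75 * rng.exponential(size=n)
    return stats.shapiro(x).pvalue

for n in (30, 300, 2000):
    print(f"n={n:5d}  Shapiro-Wilk p={shapiro_p(n):.4g}")
```

The population never changes between runs; only the sample size does. That's the core reason a p‑value alone can't tell you whether the deviation is practically important.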
Kolmogorov–Smirnov (K‑S)
K‑S compares the empirical distribution function to a specified theoretical distribution. For normality, that means you compare to a normal CDF with the same mean and variance. It’s more common in large‑sample settings, but it can be overly sensitive when you have thousands of points.
- If p > 0.05: the sample is consistent with normality.
- If p < 0.05: there’s a significant difference from normality.
I only use K‑S if I’m also visualizing the data, because it can flag tiny, irrelevant deviations in huge datasets.
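One subtlety worth showing in code: `scipy.stats.kstest` compares against the standard normal N(0, 1) by default, so you have to pass the fitted mean and standard deviation yourself. A minimal sketch with synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=5, scale=2, size=500)

# Wrong: compares against N(0, 1), so even perfectly normal data "fails"
bad = stats.kstest(x, "norm")
# Better: compare against a normal fitted to the sample
mu, sigma = x.mean(), x.std(ddof=1)
good = stats.kstest(x, "norm", args=(mu, sigma))
print(f"unfitted p={bad.pvalue:.3g}, fitted p={good.pvalue:.3g}")
```

One caveat: estimating the mean and variance from the same sample makes the naive K‑S p‑value too large (the test under‑rejects); the Lilliefors correction, available as `statsmodels.stats.diagnostic.lilliefors`, accounts for this.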
Anderson–Darling
This test is like K‑S but pays extra attention to the tails. That makes it ideal when extreme values matter, such as in risk analysis, latency distributions, or fraud detection. It doesn’t return a p‑value directly; instead, you compare the statistic to critical values at different significance levels.
If the statistic exceeds the critical value at 5%, I reject normality. If it only exceeds at 1%, I treat the evidence as weaker. This nuance is one reason I like the test in production workflows.
Visual checks that keep you honest
Statistical tests are useful, but I don’t trust them without a visual check. In my workflow, I always create at least two visuals: a histogram with a fitted normal curve and a Q‑Q plot.
- Histogram + normal curve shows the overall shape.
- Q‑Q plot shows whether the quantiles line up; bends at the tails are easy to spot.
If the histogram is symmetric and the Q‑Q plot is roughly a straight line, I’m comfortable proceeding with normality‑sensitive methods. If the tails bend away or the histogram is clearly skewed, I treat it as non‑normal regardless of p‑values.
One of my favorite real‑world examples is standardized test scores for a large cohort of students. You often see a bell curve centered near the mean after normalization. A kernel density estimate overlays a smooth line that mirrors the normal curve, with tails that taper evenly. That’s usually a green light for many parametric analyses. If the KDE shows a heavy tail or a second bump, I pause and check for hidden segments.
Implementation patterns in Python and JavaScript
I build normality checks into my data pipelines so they’re repeatable. Here are complete examples in Python and JavaScript. Each one includes a visual check and at least one statistical test.
Python (NumPy, SciPy, Matplotlib)
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Sample data: replace with your dataset
scores = np.array([1.2, -0.3, 0.7, -1.1, 0.5, 0.2, -0.6, 1.1, 0.0, -0.4])

# Shapiro–Wilk test
shapiro_stat, shapiro_p = stats.shapiro(scores)
print(f"Shapiro–Wilk stat={shapiro_stat:.4f}, p={shapiro_p:.4f}")

# Anderson–Darling test
ad_result = stats.anderson(scores, dist="norm")
print(f"Anderson–Darling stat={ad_result.statistic:.4f}")
for level, crit in zip(ad_result.significance_level, ad_result.critical_values):
    print(f"  {level:.1f}% critical value: {crit:.4f}")

# Histogram with fitted normal curve
mu, sigma = scores.mean(), scores.std(ddof=1)
counts, bins, _ = plt.hist(scores, bins=8, density=True, alpha=0.6, color="steelblue")
x = np.linspace(bins.min(), bins.max(), 200)
plt.plot(x, stats.norm.pdf(x, mu, sigma), color="darkred", lw=2)
plt.title("Histogram with Normal Curve")
plt.xlabel("Score")
plt.ylabel("Density")
plt.show()

# Q-Q plot
stats.probplot(scores, dist="norm", plot=plt)
plt.title("Q-Q Plot")
plt.show()
```
Notes I care about:
- I use `ddof=1` for the sample standard deviation.
- I print Anderson–Darling critical values so I can compare at multiple thresholds.
- I always plot, even if p‑values look clear.
JavaScript (Node.js with simple stats)
If you’re in JavaScript, you can still do solid normality checks. I use a small stats library for convenience. Here’s a minimal approach using simple-statistics for mean and std dev, plus a manual Q‑Q plot setup for browser dashboards.
```javascript
import { mean, standardDeviation } from "simple-statistics";

// Sample data
const scores = [1.2, -0.3, 0.7, -1.1, 0.5, 0.2, -0.6, 1.1, 0.0, -0.4];
const mu = mean(scores);
const sigma = standardDeviation(scores);
console.log(`mean=${mu.toFixed(3)}, std=${sigma.toFixed(3)}`);

// For a production check, I call a Python service or WASM module
// for Shapiro–Wilk or Anderson–Darling. The browser view focuses on visuals.

// Q-Q plot data (basic)
const sorted = [...scores].sort((a, b) => a - b);
const n = sorted.length;
const qqData = sorted.map((x, i) => {
  const p = (i + 0.5) / n;
  // Approximate normal quantile using the inverse error function
  const z = Math.sqrt(2) * erfinv(2 * p - 1);
  return { sample: x, theoretical: mu + sigma * z };
});
console.log("Q-Q data", qqData);

function erfinv(x) {
  // Approximation for the inverse error function
  const a = 0.147;
  const ln = Math.log(1 - x * x);
  const term = 2 / (Math.PI * a) + ln / 2;
  return Math.sign(x) * Math.sqrt(Math.sqrt(term * term - ln / a) - term);
}
```
In 2026, I often wire this into an AI‑assisted analysis notebook: I let a model generate an initial report, then I verify it with deterministic tests. That combo saves time while keeping the result trustworthy.
Traditional vs modern workflows
The older habit was to run a single test and obey the p‑value. Here’s what I do instead when normality matters:

Modern workflow (2026):
- Combine tests, visuals, and domain context
- Validate normality on raw data and residuals
- Switch to robust or non‑parametric methods when needed
- Automated checks in pipelines with human review

Decision rules I actually use
I don’t blindly follow p‑values. I use a layered decision process that I can explain to stakeholders:
1) If sample size < 50: I trust visuals more than p‑values. I still run Shapiro–Wilk, but I know it can miss deviations.
2) If 50 ≤ sample size ≤ 2,000: Shapiro–Wilk is my primary test, with Q‑Q plot confirmation.
3) If sample size > 2,000: I use Anderson–Darling or K‑S, but I weight visual checks heavily because tiny deviations trigger low p‑values.
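These thresholds are easy to encode. Here's a tiny helper (my own naming, not a standard API) that makes the rule auditable when it runs inside a pipeline:

```python
def choose_normality_check(n: int) -> str:
    """Map sample size to the primary check, per the layered rules above."""
    if n < 50:
        return "visuals first; Shapiro-Wilk as a weak secondary check"
    if n <= 2000:
        return "Shapiro-Wilk + Q-Q plot"
    return "Anderson-Darling (or K-S) + Q-Q plot, interpreted cautiously"

print(choose_normality_check(30))
print(choose_normality_check(500))
print(choose_normality_check(100_000))
```

Logging the chosen rule alongside the test result makes the decision reproducible when someone audits the analysis later.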
I also check the use case. If I’m doing inference with strict error control, I’m more conservative. If I’m doing exploratory modeling where effect size matters more than p‑values, I might proceed with a mild deviation but keep notes.
Finally, I look at residuals. You can have non‑normal raw data and still get normal residuals after a transformation or a well‑fit model. That’s often the best case for parametric methods.
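Here's a concrete sketch of that best case, using synthetic data with an assumed log‑linear relationship: the raw response fails Shapiro–Wilk badly, but residuals from a linear fit on the log scale look fine.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, size=200)
# Multiplicative noise makes the raw response right-skewed...
y = np.exp(0.2 * x + rng.normal(scale=0.1, size=200))

# ...but a linear fit on the log scale has well-behaved residuals
slope, intercept = np.polyfit(x, np.log(y), 1)
resid = np.log(y) - (slope * x + intercept)

print("Shapiro p, raw y:       ", stats.shapiro(y).pvalue)
print("Shapiro p, log residuals:", stats.shapiro(resid).pvalue)
```

The raw data was never the right thing to test here; the model's residuals were.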
Common mistakes, edge cases, and performance notes
Here are the problems I see most often and how I avoid them:
- Treating “p > 0.05” as proof of normality. It’s just lack of evidence against normality. I still verify visually.
- Running K‑S without fitting mean and variance. If you don’t use the sample mean and variance, the test can mislead.
- Ignoring tails. If extreme values matter (risk, latency, finance), I favor Anderson–Darling and Q‑Q plots.
- Testing raw data but using residual‑based methods. If the method assumes residual normality, test residuals.
- Assuming normality for ratios or rates. These often skew; I check for log transforms or non‑parametric tests.
Edge cases I call out in code reviews:
- Zero‑inflated data: Many zeros plus a continuous tail. Normality tests will fail, and that’s correct. I use hurdle models or zero‑inflated distributions instead.
- Mixtures: A blended dataset can look non‑normal because it is. Segment by user type or process step before testing.
- Outliers from data errors: A single parsing bug can break normality. I validate data quality before I interpret tests.
Performance considerations are usually fine. On modern hardware, Shapiro–Wilk on a few thousand points is fast (typically in the 10–20ms range in Python). Anderson–Darling is similar. The heavier cost is usually visualization or data prep, not the test itself. For very large datasets, I often sample 5,000–20,000 points for testing and keep the full data for modeling.
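The sampling step is a one‑liner. A sketch with a synthetic one‑million‑row column standing in for production data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
full = rng.normal(loc=100, scale=15, size=1_000_000)  # stand-in for a production column

# Test on a sample; keep the full column for modeling
sample = rng.choice(full, size=10_000, replace=False)
ad = stats.anderson(sample, dist="norm")
print("Anderson-Darling statistic on 10k sample:", round(float(ad.statistic), 3))
print("5% critical value:", ad.critical_values[2])
```

Ten thousand points is plenty to reveal meaningful skew or heavy tails, without the over‑sensitivity you'd get from testing the full million rows.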
Where I go after the test
Once I’ve decided whether normality holds, I move quickly. If the data or residuals look normal enough, I proceed with parametric tools and build confidence intervals as usual. If normality fails, I shift to tools that don’t rely on that assumption: Mann‑Whitney U instead of t‑tests, Kruskal‑Wallis instead of ANOVA, or robust regression instead of OLS. In some cases, a simple transformation like log or Box‑Cox brings the data into a usable shape, and I retest to confirm.
The key is to tie the choice back to the decision I need to make. If the outcome is high‑stakes, I keep the method conservative and explain the reasoning in plain language: “The distribution is skewed, so I used a rank‑based test that is less sensitive to that skew.” If the outcome is exploratory, I still document the normality check so the team knows how far they can trust the result.
If you want a practical next step, I recommend building a reusable normality check into your analysis pipeline: one function that runs Shapiro–Wilk and Anderson–Darling, saves a histogram and Q‑Q plot, and prints a short verdict. Run it on raw data and residuals. That habit will save you from subtle errors and make your analyses easier to defend.
Most of all, treat normality tests as a conversation with your data. They are not a gatekeeper that says “yes” or “no.” They’re a signal about how to proceed, what to double‑check, and which tools you should trust. When you approach them that way, they become one of the most practical checks you can run.
When normality tests are the wrong tool
Here’s a blunt truth: normality tests can be a distraction when the analysis doesn’t depend on normality. I’ve seen teams spend hours debating p‑values for a metric that they later feed into a model that doesn’t care about distribution shape. So I ask, “Is normality actually part of the contract?” If not, I skip the test and move on.
Examples where I usually skip normality tests:
- Tree‑based models: Gradient boosting and random forests don’t require normality. I might check for outliers, but not for normality.
- Large‑scale A/B tests: With big samples, the central limit effect often makes mean estimates behave normally even if raw data is skewed. I focus on variance and robustness instead.
- Purely descriptive dashboards: If I’m only reporting medians, percentiles, or counts, normality is irrelevant.
Examples where I do test, even if it feels optional:
- Small‑sample experiments: Here, normality affects p‑values and confidence intervals directly.
- Regression inference: If I care about the meaning of coefficients, I check residuals.
- Quality control: Process deviations can show up as non‑normality long before they become obvious to the naked eye.
The key is to connect the test to a decision. If there’s no decision, there’s no reason to test.
Normality vs. central limit intuition (and where that intuition fails)
A lot of confusion comes from mixing up data normality with normality of the sampling distribution. People say, “The mean is normal because of the central limit effect,” and then they assume the raw data must be normal. That’s not the same thing.
Here’s how I explain it:
- Raw data normality: The data itself looks like a bell curve.
- Sampling distribution normality: The distribution of sample means looks like a bell curve, even if the raw data is skewed.
The central limit effect gives you some safety in large samples, but it’s not magic. It depends on sample size, tail heaviness, and independence. If you have strong dependence or heavy tails, the mean can converge slowly. And if the decision relies on the distribution of the data itself (like modeling residuals or predicting individual outcomes), the central limit effect doesn’t rescue you.
So I keep the distinction clear in my notes: “Raw distribution is skewed; sample mean is approximately normal.” That clarity prevents misinterpretation and keeps stakeholders aligned.
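A quick simulation makes the distinction tangible: raw exponential data is heavily skewed, while means of n=50 samples drawn from it are already close to symmetric.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
raw = rng.exponential(size=100_000)  # heavily right-skewed raw data
# Means of 10,000 samples of n=50 from the same distribution
means = rng.exponential(size=(10_000, 50)).mean(axis=1)

print("skewness of raw data:         ", round(float(stats.skew(raw)), 2))
print("skewness of n=50 sample means:", round(float(stats.skew(means)), 2))
```

The exponential distribution has a population skewness of 2; averaging over n observations shrinks it by roughly a factor of sqrt(n), which is exactly the partial safety the central limit effect buys you.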
A deeper look at the three tests (what they’re sensitive to)
Normality tests aren’t all measuring the same thing. Understanding what they’re sensitive to helps me pick the right one and interpret it correctly.
- Shapiro–Wilk is sensitive to both skewness and kurtosis. It tends to catch subtle deviations in the center of the distribution, which is why it works well for small and mid‑size samples.
- K‑S focuses on the maximum distance between the empirical CDF and a reference CDF. This makes it sensitive to any deviation, but it can miss tail issues if the maximum difference occurs in the middle.
- Anderson–Darling weights tails more. If your data has long tails or sharp tail bends, this test will detect it more reliably than K‑S.
I also remind myself that all three are hypothesis tests against a specific normal distribution. If I estimate mean and variance from the data, I’m implicitly fitting a normal distribution. That means the tests are not evaluating “generic normality”; they’re evaluating “normality with the fitted parameters.” That’s usually what we want, but it’s worth remembering.
Practical scenarios: how I choose the method
Here are scenarios I see all the time and how I handle them.
Scenario 1: Product metric with a long tail
You’re analyzing time‑to‑complete for a feature. The histogram is heavily right‑skewed with long tails.
- I run Anderson–Darling to confirm the tails are heavy.
- I try a log transformation and re‑test normality on the transformed data.
- If the log transform works, I use parametric methods on the transformed scale and convert back for interpretation.
- If not, I use a non‑parametric test or a robust model that down‑weights outliers.
Scenario 2: Manufacturing measurement with tight specs
You’re measuring part thickness and need to detect drift. The process is designed to be normal.
- I test normality regularly and treat deviations as a quality signal.
- I use a Q‑Q plot as a quick diagnostic; small tail bends can indicate the process is drifting before the mean shifts.
- If normality fails, I suspect a machine issue or measurement error.
Scenario 3: Model residuals in regression
You’re building a linear model and want to interpret coefficients.
- I check normality on residuals, not on raw data.
- If residuals are skewed, I try a transformation of the response or add a missing variable.
- If residuals remain non‑normal, I switch to robust regression or a generalized linear model.
Scenario 4: Two‑sample comparison in a small pilot
You have 20 samples per group and you’re about to run a t‑test.
- I run Shapiro–Wilk on each group and examine Q‑Q plots.
- If either group fails, I use Mann‑Whitney U and report medians.
- If the groups pass but I’m still uneasy, I use both and compare conclusions.
These decisions keep my process transparent and defensible, even when the data is messy.
Deeper code example: a reusable Python normality report
I often wrap normality checks in a function that returns a structured report I can log or render in a notebook. This makes it easy to compare across datasets and track over time.
```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from dataclasses import dataclass

@dataclass
class NormalityReport:
    n: int
    mean: float
    std: float
    shapiro_stat: float
    shapiro_p: float
    ad_stat: float
    ad_crit_5: float
    ad_crit_1: float
    verdict: str

def normality_report(data, label="sample", bins=30, show_plots=True):
    x = np.asarray(data, dtype=float)
    x = x[np.isfinite(x)]  # drop NaN/inf
    n = len(x)
    if n < 3:
        raise ValueError("Need at least 3 finite values for normality checks.")
    mu = x.mean()
    sigma = x.std(ddof=1)
    shapiro_stat, shapiro_p = stats.shapiro(x)
    ad = stats.anderson(x, dist="norm")
    # Find the 5% and 1% critical values for convenience
    ad_crit_5 = ad.critical_values[list(ad.significance_level).index(5.0)]
    ad_crit_1 = ad.critical_values[list(ad.significance_level).index(1.0)]
    # A simple, transparent verdict rule
    if shapiro_p >= 0.05 and ad.statistic < ad_crit_5:
        verdict = "normalish"
    elif shapiro_p < 0.05 and ad.statistic > ad_crit_1:
        verdict = "non-normal"
    else:
        verdict = "borderline"
    if show_plots:
        # Histogram + normal curve
        counts, bins_hist, _ = plt.hist(x, bins=bins, density=True, alpha=0.6, color="steelblue")
        grid = np.linspace(bins_hist.min(), bins_hist.max(), 200)
        plt.plot(grid, stats.norm.pdf(grid, mu, sigma), color="darkred", lw=2)
        plt.title(f"Histogram + Normal Curve ({label})")
        plt.xlabel(label)
        plt.ylabel("Density")
        plt.show()
        # Q-Q plot
        stats.probplot(x, dist="norm", plot=plt)
        plt.title(f"Q-Q Plot ({label})")
        plt.show()
    return NormalityReport(
        n=n,
        mean=mu,
        std=sigma,
        shapiro_stat=shapiro_stat,
        shapiro_p=shapiro_p,
        ad_stat=ad.statistic,
        ad_crit_5=ad_crit_5,
        ad_crit_1=ad_crit_1,
        verdict=verdict,
    )

# Example usage
scores = [1.2, -0.3, 0.7, -1.1, 0.5, 0.2, -0.6, 1.1, 0.0, -0.4]
report = normality_report(scores, label="scores")
print(report)
```
Why I like this pattern:
- It’s deterministic and easy to automate.
- It logs both numeric tests and human‑readable visuals.
- The “verdict” is explicit and auditable.
I’m careful not to oversell the verdict; it’s a decision shortcut, not a mathematical truth. But for product teams, having “normalish / borderline / non‑normal” in logs helps them move quickly while still respecting statistical reality.
Deeper code example: JavaScript with a hybrid approach
JavaScript is great for dashboards, but I still lean on a numerical backend for exact tests. Here’s a realistic pattern I use: a Node service that calls a Python worker for Shapiro–Wilk and Anderson–Darling, then returns a JSON report that the front‑end can render.
```javascript
// Node.js pseudo-implementation
import { mean, standardDeviation } from "simple-statistics";
import { spawn } from "node:child_process";

function runPythonNormality(data) {
  return new Promise((resolve, reject) => {
    const py = spawn("python", ["normality_worker.py"]);
    let output = "";
    let err = "";
    py.stdout.on("data", (d) => (output += d.toString()));
    py.stderr.on("data", (d) => (err += d.toString()));
    py.on("close", (code) => {
      if (code !== 0) return reject(new Error(err || "Python worker failed"));
      try {
        resolve(JSON.parse(output));
      } catch (e) {
        reject(e);
      }
    });
    py.stdin.write(JSON.stringify({ data }));
    py.stdin.end();
  });
}

async function normalityReport(data) {
  const mu = mean(data);
  const sigma = standardDeviation(data);
  // Basic JS stats for quick UI feedback
  const basic = { n: data.length, mean: mu, std: sigma };
  // Exact tests from Python worker
  const tests = await runPythonNormality(data);
  return { basic, tests };
}

// Example usage
const scores = [1.2, -0.3, 0.7, -1.1, 0.5, 0.2, -0.6, 1.1, 0.0, -0.4];
normalityReport(scores).then(console.log).catch(console.error);
```
This setup gives me the best of both worlds: fast UI‑level checks in JS, and reliable statistical tests in Python. It also scales well in production because I can centralize the stats logic and audit it.
Edge cases and how I handle them in practice
Normality tests work best on clean, continuous data. But real data rarely fits that description. Here’s how I handle some stubborn edge cases.
Zero‑inflation
Many metrics (like daily purchases per user or session counts) have a pile of zeros plus a tail of positive values.
- Normality tests will reject this distribution, and they’re right.
- I use a two‑part model: a binary component for zero vs non‑zero, and a continuous distribution for positive values.
- If I must summarize, I report both the zero rate and the distribution of non‑zeros.
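Here's what that two‑part summary looks like in practice, on synthetic zero‑inflated data (the mixture parameters are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Synthetic zero-inflated metric: ~60% zeros, lognormal positives
is_active = rng.random(1000) < 0.4
purchases = np.where(is_active, rng.lognormal(1.0, 0.5, size=1000), 0.0)

zero_rate = (purchases == 0).mean()
positives = purchases[purchases > 0]
print(f"zero rate: {zero_rate:.2f}, non-zero count: {len(positives)}")
# Test only the continuous part, on the log scale
print("Shapiro p on log(positives):", stats.shapiro(np.log(positives)).pvalue)
```

Running a normality test on `purchases` directly would just confirm the obvious; splitting out the zero rate and the positive part gives you two quantities you can actually model.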
Discrete or bounded values
If values are integers or bounded (like ratings from 1–5), normality tests are not appropriate.
- I prefer ordinal models or non‑parametric tests.
- If I do a normality test anyway, I treat it as exploratory, not definitive.
Strong autocorrelation
Time series data can appear non‑normal because of autocorrelation, not because the underlying innovation is non‑normal.
- I check residuals after fitting an appropriate time series model.
- If residuals look normal, I proceed with parametric assumptions on the innovations.
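A minimal sketch of that workflow, using a hand‑rolled AR(1) fit (least squares on the lagged series) so there's no extra dependency:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
n, phi = 2000, 0.8
eps = rng.normal(size=n)  # normal innovations
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + eps[t]  # AR(1) series

# Fit the AR(1) coefficient by least squares, then test the residuals
phi_hat = np.dot(y[:-1], y[1:]) / np.dot(y[:-1], y[:-1])
resid = y[1:] - phi_hat * y[:-1]
print("estimated phi:", round(float(phi_hat), 3))
print("Shapiro p on residuals:", stats.shapiro(resid).pvalue)
```

The point is the order of operations: remove the serial dependence first, then assess normality on the innovations that remain.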
Censored data
Latency data often has timeouts, meaning some values are truncated or censored.
- I use survival analysis or censored regression instead of normality tests.
- If I must test, I use the uncensored subset but note the bias.
These cases remind me that normality tests aren’t a universal filter; they’re a tool for a specific data regime.
Common pitfalls I’ve seen in code reviews
This is the short list I repeat in internal reviews, because it saves teams from subtle mistakes.
1) Testing the wrong thing: Checking raw data when the model assumes residual normality.
2) Over‑reacting to p‑values in huge samples: With 100k rows, everything fails. Use visuals and effect sizes.
3) Ignoring independence: Even perfectly normal marginal distributions can break assumptions if data points are correlated.
4) Confusing “normal” with “good”: Some processes are naturally skewed, and that’s not a problem.
5) Skipping domain context: A small deviation might be fine in marketing, but unacceptable in medical trials.
I keep a mental checklist: distribution shape, independence, data quality, and model assumptions. If those are aligned, the normality test is just confirmation.
Alternative approaches when normality fails
If normality fails, I don’t panic. I switch tools. Here are the options I reach for first.
Transformations
- Log transform: Great for right‑skewed positive values.
- Square root transform: Useful for count‑like data.
- Box‑Cox or Yeo‑Johnson: More flexible, can be tuned to fit.
After transformation, I retest. If the transformed data looks normal, I proceed with parametric tests on the transformed scale.
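For instance, here's a right‑skewed latency‑like sample (synthetic lognormal data) before and after Box‑Cox, with the retest built in:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
latency = rng.lognormal(mean=3, sigma=0.6, size=500)  # right-skewed, strictly positive

print("Shapiro p, raw:", stats.shapiro(latency).pvalue)
# Box-Cox picks the transformation parameter lambda by maximum likelihood
transformed, lam = stats.boxcox(latency)
print(f"Box-Cox lambda={lam:.3f}")
print("Shapiro p, transformed:", stats.shapiro(transformed).pvalue)
```

For lognormal‑ish data the fitted lambda lands near zero, which is the log transform; that's a useful sanity check that the simpler, more interpretable transformation would do.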
Robust methods
- Robust regression: Reduces influence of outliers.
- Winsorization: Caps extreme values (used sparingly and documented clearly).
- Bootstrapping: Avoids normality assumptions by resampling.
Non‑parametric methods
- Mann‑Whitney U: Alternative to t‑test for two groups.
- Kruskal‑Wallis: Alternative to ANOVA for multiple groups.
- Spearman correlation: For monotonic relationships without normality.
The best choice depends on what you’re trying to preserve: interpretability, power, or simplicity. I make that tradeoff explicit in my write‑ups.
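As a concrete comparison, this sketch runs a t‑test and Mann‑Whitney U side by side on two synthetic exponential groups, where the t‑test's normality assumption is shaky:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
# Skewed synthetic groups, small samples
control = rng.exponential(scale=1.0, size=40)
variant = rng.exponential(scale=1.6, size=40)

t = stats.ttest_ind(control, variant)
u = stats.mannwhitneyu(control, variant, alternative="two-sided")
print(f"t-test p={t.pvalue:.3g}, Mann-Whitney p={u.pvalue:.3g}")
print(f"medians: control={np.median(control):.2f}, variant={np.median(variant):.2f}")
```

When the two tests agree, I report the rank‑based result with medians; when they disagree, that disagreement itself is worth investigating before anyone acts on the number.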
A production‑ready workflow I actually use
Here’s the workflow I recommend to teams who want normality checks without drama:
1) Pre‑check data quality: Remove NaNs, handle obvious errors, confirm units.
2) Run a visual check: Histogram + Q‑Q plot.
3) Run one primary test: Shapiro–Wilk for small/mid samples, Anderson–Darling for large samples.
4) Interpret in context: Does the method require normality? If yes, how sensitive is it?
5) Decide and document: Choose parametric, robust, or non‑parametric. Save the plots and a short verdict.
This workflow is fast, repeatable, and defensible. It also scales: you can run it in automated pipelines and still have humans review edge cases.
Performance and scale: what matters in real pipelines
Normality tests are rarely the bottleneck. The heavy cost is usually reading data or generating visuals. But there are a few performance choices I make deliberately:
- Sampling for large datasets: For 1M rows, I sample 5k–20k for testing and keep the full data for modeling. This is usually enough to detect non‑normality without overwhelming the test.
- Batch processing: If I’m checking dozens of metrics, I parallelize tests and cache visual outputs.
- Avoiding unnecessary tests: I only test metrics that feed into normality‑sensitive methods.
In terms of runtime, I’ve seen Shapiro–Wilk and Anderson–Darling execute in the low tens of milliseconds for a few thousand samples. Visuals are typically slower, but still manageable if you’re not generating thousands of plots.
A practical decision tree I share with teams
When people get stuck debating normality, I show them this decision tree in words:
1) Does your method assume normality?
– If no: skip the test.
– If yes: continue.
2) Is the sample size small (< 50)?
– Use visuals + Shapiro–Wilk, but trust visuals more.
3) Is the sample size medium (50–2,000)?
– Use Shapiro–Wilk + Q‑Q plot.
4) Is the sample size large (> 2,000)?
– Use Anderson–Darling or K‑S + Q‑Q plot, interpret cautiously.
5) If normality fails, do you need inference or prediction?
– For inference: robust or non‑parametric.
– For prediction: consider transformations or models that don’t require normality.
This tree keeps discussions short and action‑oriented.
Why I still use normality tests in 2026
With modern machine learning and robust methods, it’s easy to dismiss normality as “old school.” But in production analytics, normality checks still matter because they prevent subtle errors and clarify assumptions. They’re fast, easy, and provide a common language for discussing uncertainty.
I also see them as a quality gate: if data that should be normal suddenly isn’t, something changed. That can be a signal of product behavior shifts, pipeline bugs, or data collection issues. Normality tests become part of monitoring, not just modeling.
Final takeaways I keep in my own notes
If I had to reduce all of this to a few lines, it would be:
- Normality tests are decision tools, not truth machines.
- Use tests plus visuals, and always interpret in context.
- Sample size changes everything; large data makes p‑values brutal.
- When normality fails, switch methods instead of forcing assumptions.
- Test residuals when the model assumes residual normality.
That’s the workflow I’ve found reliable across experiments, product metrics, and ML pipelines. It’s not about chasing perfect normality; it’s about choosing methods that match the shape of your data and the risk of your decision.
If you want a practical next step, build a small, reusable normality module that logs tests, plots, and a verdict. Run it on raw data and residuals, and make it part of your pipeline. That habit will save you from fragile analyses and make your results easier to trust.
Most of all, treat normality as a conversation with your data. The tests are just one voice in that conversation. Your job is to listen, interpret, and choose the method that keeps you honest.