A few months ago I was reviewing a model that looked perfect on paper and terrible in production. The culprit wasn’t the algorithm — it was the data. We assumed the input features were normally distributed and applied methods that depend on that assumption. They weren’t even close. That mistake cost the team a week of rework and a painful rollback.
If you build analytics, ML pipelines, or even simple dashboards, you’ll run into the normality assumption. Many statistical tools still assume bell-shaped data, and when that assumption is wrong your p-values, confidence intervals, and error bars can quietly lie to you. I’ve learned to treat normality tests as a guardrail: quick, evidence-based checks that tell me whether I should proceed with parametric methods or take a different path.
I’ll walk you through how I test for normality, how I interpret results, and how I decide what to do next. You’ll get practical guidance, runnable code, and common mistakes I see in real projects — along with modern 2026 workflows that keep the process fast and repeatable.
What a normality test actually tells you
A normality test checks whether a dataset is consistent with a normal distribution. That sounds simple, but the nuance matters. You’re not proving the data is normal. You’re testing a null hypothesis: “These data could come from a normal distribution.” If the test says “reject,” you’ve got evidence that the data deviates in a meaningful way. If the test says “fail to reject,” it means the test didn’t detect a strong deviation — not that the data is perfectly normal.
I keep three truths in mind:
- Normality is an assumption, not a guarantee. The real world is messy.
- Sample size influences sensitivity. With large samples, tiny deviations look significant.
- Distribution shape matters for what you plan to do next. A slight skew might be fine for some methods and disastrous for others.
From a developer’s perspective, the value of normality testing is decision support. It’s not a trophy to hang on the wall. You run the test to decide whether to use tools that assume normality or to switch to methods that don’t care.
The three tests I trust most
I focus on three core tests because they’re widely implemented, well understood, and reliable when used correctly.
Shapiro–Wilk
I reach for Shapiro–Wilk when the sample size is small to medium (roughly under 2,000). It compares the ordered data with what you would expect from a normal distribution.
- If p-value > 0.05, I usually treat the data as “normal enough.”
- If p-value < 0.05, I assume a meaningful deviation from normality.
In practice, I like it because it’s sensitive and well behaved for modest samples. In many codebases, this is the default normality test — and with good reason.
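A minimal sketch of this check with SciPy's `stats.shapiro` (the dataset here is synthetic; swap in your own array):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=300)  # small/medium sample

stat, p = stats.shapiro(sample)
print(f"W={stat:.4f}, p={p:.4f}")

# Decision at the conventional 5% level
if p < 0.05:
    print("Evidence of deviation from normality.")
else:
    print("No strong evidence against normality.")
```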
Kolmogorov–Smirnov (K–S)
K–S compares the empirical distribution function of the sample to the theoretical CDF of the normal distribution. It’s a good fit for larger datasets, but it has a catch: it becomes too sensitive with very large samples. That means minor, irrelevant deviations can look “statistically significant.”
I use K–S when I have large N and want a broad, distribution-level check, but I never rely on it alone. I pair it with visual inspection and effect size reasoning.
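To illustrate the sensitivity point, here's a sketch: the same mild deviation (a hypothetical mixture with 5% wider-variance contamination, invented for this demo) is tested at two sample sizes. Exact p-values depend on the draw, but the large sample will typically flag a deviation the small one misses:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def mixture(n):
    # 95% standard normal + 5% wider component: a mild, realistic deviation
    wide = rng.random(n) < 0.05
    return np.where(wide, rng.normal(0, 3, n), rng.normal(0, 1, n))

def ks_pvalue(data):
    # Fit mean/std from the sample, then run K-S against that normal
    mean, std = data.mean(), data.std(ddof=1)
    return stats.kstest(data, 'norm', args=(mean, std)).pvalue

p_small = ks_pvalue(mixture(300))
p_large = ks_pvalue(mixture(100_000))
print(f"n=300:    p={p_small:.4f}")
print(f"n=100000: p={p_large:.4f}")
```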
Anderson–Darling
Anderson–Darling adds weight to the tails of the distribution. If you care about extreme values — which is often the case in finance, reliability engineering, or anomaly detection — this test is one of the most practical.
Unlike Shapiro–Wilk and K–S, Anderson–Darling typically compares a test statistic to critical values rather than giving a direct p-value. It’s still easy to interpret: a statistic above the critical value means normality is rejected.
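A quick sketch of that critical-value comparison using SciPy's `stats.anderson` (synthetic data; the set of significance levels can vary by implementation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.normal(size=500)

result = stats.anderson(data, dist='norm')
print(f"A^2 statistic: {result.statistic:.4f}")

# Compare the statistic against the critical value at each level:
# statistic above the critical value -> reject normality at that level
for level, crit in zip(result.significance_level, result.critical_values):
    verdict = "reject" if result.statistic > crit else "fail to reject"
    print(f"{level:>5}% level: critical={crit:.3f} -> {verdict}")
```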
When to test, and when not to
I don’t run normality tests blindly. I do it when the result will change my next step. Here’s how I decide.
You should test for normality when:
- You plan to use methods that assume normality (t-tests, ANOVA, linear regression residuals).
- You need to interpret confidence intervals or prediction intervals based on normality.
- You’re comparing group means and want to avoid Type I errors.
You should skip or de-emphasize normality tests when:
- You’re using nonparametric methods (Mann–Whitney, Kruskal–Wallis).
- You’re working with very large samples where statistical significance becomes inevitable.
- The target variable is categorical, ordinal, or bounded in a way that can’t be normal by design.
A practical rule I use: if the analysis step is robust to deviations, I won’t spend time proving normality. If a method is fragile, I verify early.
Visual checks are not optional
Statistical tests are necessary, but they’re not enough. I always pair them with visuals. It takes minutes and saves hours of confusion.
My minimal visual checklist:
- Histogram with density overlay (do I see a bell shape?)
- Q–Q plot (do points follow a straight line?)
- Box plot (are tails extreme, is there skew?)
If the visuals scream “skewed” but the test says “fail to reject,” I trust the visuals and adjust. That can happen with small samples where tests lack power. If the visuals look great but the test rejects for a huge sample, I consider whether the deviation matters for my specific decision.
Simple analogy I use with teams: a normality test is like a smoke alarm. It can trigger because of burnt toast or a real fire. The test is the alarm; your visual check is you walking into the kitchen to see what's actually going on.
Hands-on workflow in Python (with repeatable checks)
Here’s a complete, runnable Python example that I use in projects. It runs all three tests, produces a histogram, and prints a clear decision summary. You can drop this into a notebook or a script.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Sample data: replace this with your real dataset
rng = np.random.default_rng(42)
data = rng.normal(loc=0, scale=1, size=1000)

# Shapiro–Wilk
shapiro_stat, shapiro_p = stats.shapiro(data)

# Kolmogorov–Smirnov (normalize using sample mean/std)
mean = np.mean(data)
std = np.std(data, ddof=1)
ks_stat, ks_p = stats.kstest(data, 'norm', args=(mean, std))

# Anderson–Darling
ad_result = stats.anderson(data, dist='norm')

print(f"Shapiro–Wilk: stat={shapiro_stat:.4f}, p={shapiro_p:.4g}")
print(f"K–S: stat={ks_stat:.4f}, p={ks_p:.4g}")
print(f"Anderson–Darling: stat={ad_result.statistic:.4f}")
print("Critical values:")
for sig, crit in zip(ad_result.significance_level, ad_result.critical_values):
    print(f"  {sig}% -> {crit:.4f}")

# Visual checks
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ax[0].hist(data, bins=30, density=True, alpha=0.6, color='steelblue')

# Overlay normal curve
x = np.linspace(data.min(), data.max(), 200)
ax[0].plot(x, stats.norm.pdf(x, mean, std), color='darkred')
ax[0].set_title("Histogram with normal curve")

# Q–Q plot
stats.probplot(data, dist='norm', plot=ax[1])
ax[1].set_title("Q–Q plot")
plt.tight_layout()
plt.show()

# Decision summary
if shapiro_p < 0.05:
    print("Shapiro–Wilk suggests non-normal data.")
else:
    print("Shapiro–Wilk suggests data is normal enough.")

if ks_p < 0.05:
    print("K–S suggests non-normal data.")
else:
    print("K–S suggests data is normal enough.")

# Anderson–Darling decision at 5% level
# Use the 5% critical value (index varies by implementation; here we match by value)
try:
    idx5 = list(ad_result.significance_level).index(5.0)
    crit5 = ad_result.critical_values[idx5]
    if ad_result.statistic > crit5:
        print("Anderson–Darling suggests non-normal data at 5%.")
    else:
        print("Anderson–Darling suggests normal data at 5%.")
except ValueError:
    print("5% significance level not found in results.")
The key design choice here is the K–S test’s normalization. K–S expects a fully specified distribution, so I estimate mean and standard deviation from the sample and pass those in. It’s common, but it means the test is not a perfect K–S with known parameters. If you want the strict Lilliefors correction, you should use a dedicated implementation (many stats libraries provide it).
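If statsmodels is available, the Lilliefors variant is a one-liner; this is a sketch assuming `statsmodels.stats.diagnostic.lilliefors`:

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(3)
data = rng.normal(loc=10, scale=2, size=500)

# Lilliefors corrects the K-S test for the fact that mean and std
# were estimated from the same sample being tested
lf_stat, lf_p = lilliefors(data, dist='norm')
print(f"Lilliefors: stat={lf_stat:.4f}, p={lf_p:.4g}")
```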
JavaScript example for browser and Node
When you’re shipping analytics in a product, normality checks often happen in Node or even in the browser for lightweight diagnostics. Here’s a minimal example using the jstat library. This gives you Shapiro–Wilk and a Q–Q plot check. You can run it in Node or bundle it for the web.
import { jStat } from 'jstat';

// Example data
const data = Array.from({ length: 200 }, () => jStat.normal.sample(0, 1));

// Shapiro–Wilk (jStat has a basic implementation)
const shapiro = jStat.shapiroWilk(data);
console.log(`Shapiro–Wilk W=${shapiro[0].toFixed(4)} p=${shapiro[1].toFixed(4)}`);

// Quick Q–Q diagnostic: correlation between sorted data and normal quantiles
const sorted = [...data].sort((a, b) => a - b);
const quantiles = sorted.map((_, i) => jStat.normal.inv((i + 0.5) / sorted.length, 0, 1));
const corr = jStat.corrcoeff(sorted, quantiles);
console.log(`Q–Q correlation ≈ ${corr.toFixed(4)}`);

if (shapiro[1] < 0.05) {
  console.log('Shapiro–Wilk suggests non-normal data.');
} else {
  console.log('Shapiro–Wilk suggests normal enough.');
}
I like the Q–Q correlation as a quick, lightweight check — not a formal test, but useful for dashboards and developer tooling. A correlation close to 1 indicates the data align well with a normal distribution.
Practical interpretation: p-values, effect sizes, and context
The fastest path to a bad decision is to treat a p-value as a binary truth. I always interpret p-values alongside three practical checks:
1) Effect size of deviation
If the sample is huge, even trivial deviations trigger rejection. I ask: does this deviation change the choice of model or method? If not, I proceed with the parametric method and note the deviation in documentation.
2) Downstream sensitivity
If I’m estimating a mean with a large sample, central limit behavior often makes the method robust. If I’m estimating tail risk or extreme quantiles, I take deviations seriously.
3) Residuals matter more than raw values
In regression, I care most about the normality of residuals, not the raw target variable. If residuals look normal enough, the model can be acceptable even if raw data are skewed.
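A small illustration of that distinction, using synthetic data where the raw target is skewed but the model's residuals are not (the simulation setup here is invented for the demo, not from any real project):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Skewed predictor with a clean linear relationship and normal noise:
# the raw target y inherits the skew, but the residuals stay normal
x = rng.lognormal(mean=0.0, sigma=0.7, size=400)
y = 2.0 * x + rng.normal(0, 1, size=400)

# Fit y = a*x + b by least squares and test the residuals
slope, intercept, *_ = stats.linregress(x, y)
residuals = y - (slope * x + intercept)

p_raw = stats.shapiro(y).pvalue
p_resid = stats.shapiro(residuals).pvalue
print(f"Shapiro–Wilk on raw y:     p={p_raw:.4g}")
print(f"Shapiro–Wilk on residuals: p={p_resid:.4g}")
```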
Example decision rule I use in real projects
- If Shapiro–Wilk p < 0.05 and Q–Q plot shows clear curvature, I switch to nonparametric methods or transform the data.
- If p < 0.05 but the Q–Q plot is close to linear and N is large, I proceed with parametric methods and document the deviation.
- If p > 0.05 but visuals show strong skew, I treat it as non-normal and investigate data quality.
This rule of thumb is not a math proof — it’s a pragmatic approach I’ve found reliable.
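If you want the rule in code form, here's a sketch; `normality_verdict` and its thresholds are hypothetical names I chose, not a standard API, and the Q–Q correlation stands in for eyeballing the plot:

```python
import numpy as np
from scipy import stats

def normality_verdict(data, alpha=0.05, qq_corr_threshold=0.99, large_n=5000):
    """Hypothetical helper encoding the rule of thumb above.

    Combines the Shapiro-Wilk p-value with the correlation of the
    Q-Q plot points as a rough 'how linear is the plot' proxy.
    """
    data = np.asarray(data)
    p = stats.shapiro(data[:5000]).pvalue  # Shapiro-Wilk degrades past ~5000
    # Correlation between sorted data and theoretical normal quantiles
    qq_corr = np.corrcoef(
        np.sort(data),
        stats.norm.ppf((np.arange(len(data)) + 0.5) / len(data)),
    )[0, 1]
    if p < alpha and qq_corr < qq_corr_threshold:
        return "non-normal: use nonparametric methods or transform"
    if p < alpha and len(data) >= large_n:
        return "proceed parametric, document the deviation"
    if p >= alpha and qq_corr < qq_corr_threshold:
        return "visuals disagree: investigate data quality"
    return "normal enough"

rng = np.random.default_rng(5)
verdict_normal = normality_verdict(rng.normal(size=800))
verdict_skewed = normality_verdict(rng.exponential(size=800))
print(verdict_normal)
print(verdict_skewed)
```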
Common mistakes I see (and how to avoid them)
Here are the mistakes I still see in 2026 codebases and notebooks:
1) Testing normality on the full dataset after heavy transformations
If you standardize or normalize aggressively before testing, you can hide real structure. Test before you apply transformations, then test again if you plan to use methods that depend on normality after the transformation.
2) Treating “not rejected” as “confirmed normal”
Failing to reject is not proof. It means the test did not detect a strong deviation. If your sample is small, it might simply lack power.
3) Ignoring the tails
Many decisions are sensitive to tail behavior. Anderson–Darling exists for a reason. If your domain cares about extremes — fraud detection, reliability, medical outcomes — you should not ignore tail-sensitive tests.
4) Mixing up normality of data and normality of residuals
In regression, check residuals. In time series, check the residuals after model fitting. For raw data, normality often matters only if you’re using methods that depend on it.
5) Forgetting multiple testing corrections
If you run normality tests on dozens of columns, you’re going to see false positives. I apply a simple correction or prioritize variables based on impact.
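A sketch of the simple correction across many columns (Bonferroni here, chosen for brevity; the column names and data are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# 30 columns of genuinely normal data: at alpha=0.05 we expect
# roughly one or two false positives without any correction
columns = {f"col_{i}": rng.normal(size=500) for i in range(30)}

alpha = 0.05
pvals = {name: stats.shapiro(col).pvalue for name, col in columns.items()}

# Uncorrected vs Bonferroni-corrected threshold (alpha / number of tests)
raw_flags = [n for n, p in pvals.items() if p < alpha]
bonf_flags = [n for n, p in pvals.items() if p < alpha / len(pvals)]

print(f"Flagged without correction: {raw_flags}")
print(f"Flagged with Bonferroni:    {bonf_flags}")
```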
Traditional vs modern workflows (and why it matters now)
Modern dev teams have 2026 tooling that makes normality testing faster, more observable, and less error-prone. Here’s how I think about it.
Modern approach:

- Automated checks in data pipelines
- P-value + visual diagnostics + effect size
- Tests run per batch with alerts
- Interactive dashboards with Q–Q and distribution overlays
- Decision rules encoded in code reviews

In my current workflow, I run normality checks as part of data validation. If a new batch fails, I don't block the pipeline, but I flag it in the data quality report and route it to the model owner. This keeps the pipeline moving while still surfacing potential issues.
AI-assisted workflows are also useful here. I use LLM-based agents to generate summary reports (“Column A failed Shapiro–Wilk with p=0.002 and strong right skew; recommend log transform”). It saves time without letting the model decide the outcome for me.
Real-world scenario: standardized test scores
Imagine you’re analyzing standardized test scores from 1,000 students. You normalized them so the mean is 0. The histogram shows a classic bell curve with slight tapering tails. You run Shapiro–Wilk and get p=0.12.
That’s a clean outcome: no strong evidence against normality, and the visual check agrees. In this case I’d proceed with parametric methods and note that normality is reasonable.
Now imagine the same dataset but with a few dozen extreme outliers because of scoring errors. Your histogram shows a bell curve with spikes at the far tails. The Shapiro–Wilk p drops below 0.01, and the Q–Q plot bends away from the line near the ends. Here I would investigate the outliers first — they might be data quality errors. If they’re real, I’d use robust or nonparametric methods and note the tail risks explicitly.
That’s the core idea: the test tells you something is off, and your job is to figure out why and what to do next.
Performance considerations and scaling
Normality tests are cheap, but when you’re running them at scale on wide datasets, you need to be intentional. In production pipelines, I typically see these patterns:
- Shapiro–Wilk: typically 1–5 ms per column for a few thousand rows in Python. It scales poorly with huge N, so I sample.
- K–S: 2–10 ms per column for large samples; fine for big datasets but sensitive.
- Anderson–Darling: 2–8 ms per column; good tradeoff when you care about tails.
Two tactics keep it fast:
- Sample smartly. I use stratified sampling if the data is segmented (time windows, regions).
- Cache summaries. If you’ve already computed mean, standard deviation, and quantiles for monitoring, reuse them.
If you need to test dozens of columns nightly, run these tests in batch and store results in your data quality table. You’ll get trend lines over time and catch drift early.
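Here's a sketch of that batch pattern: sample each column down to a cap before testing so Shapiro–Wilk stays in its reliable range (the table and column names are invented for the demo):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Simulated wide table: one skewed column among normal ones
table = {
    "latency_ms": rng.lognormal(3, 0.5, size=200_000),
    "cpu_temp": rng.normal(60, 5, size=200_000),
    "throughput": rng.normal(1000, 50, size=200_000),
}

MAX_N = 2_000  # cap the sample so Shapiro-Wilk stays in its sweet spot

results = {}
for name, col in table.items():
    sample = rng.choice(col, size=MAX_N, replace=False)
    p = stats.shapiro(sample).pvalue
    results[name] = p
    print(f"{name:>12}: p={p:.4g} -> {'flag' if p < 0.05 else 'ok'}")
```

In a real pipeline you'd write `results` to your data quality table instead of printing, so you can trend it over time.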
What to do when normality fails
If a test rejects normality, I don’t panic. I pick one of three paths depending on the goal:
1) Transform the data
Log, square root, or Box–Cox transforms can reduce skew. I use them when the variable is strictly positive and the interpretation still makes sense. I always re-check normality after transformation.
2) Switch to robust or nonparametric methods
For group comparisons, I shift to Mann–Whitney or Kruskal–Wallis. For regression, I consider robust regression or quantile regression. This often saves me from forcing a transformation that makes results harder to interpret.
3) Model the distribution directly
If the data are inherently non-normal (counts, rates, bounded scores), I use models that match their natural distribution — Poisson, negative binomial, beta, or logistic. This is cleaner than trying to coerce normality.
Here’s a small Python snippet showing a transform option with a re-check:
import numpy as np
import scipy.stats as stats

# Original skewed data
x = np.random.lognormal(mean=0.0, sigma=0.8, size=1000)

# Box–Cox transform
x_transformed, lambda_ = stats.boxcox(x)

print(f"Box–Cox lambda: {lambda_:.3f}")
print("Shapiro–Wilk before:", stats.shapiro(x)[1])
print("Shapiro–Wilk after:", stats.shapiro(x_transformed)[1])
When the transformed data becomes normal enough, you can proceed with parametric methods while still keeping interpretability in mind.
My decision checklist (use this in code reviews)
I’ve turned my own normality decisions into a checklist I use during code reviews and pipeline reviews:
- Did you test normality only when it changes the method choice?
- Did you pair statistical tests with visual checks?
- Are you testing residuals instead of raw values when appropriate?
- Are you sampling in large datasets to avoid over-sensitivity?
- Did you document the decision: parametric vs nonparametric vs transformation?
This checklist keeps teams consistent and avoids the trap of running tests just because the library makes it easy.
Closing: the move that saves you the most time
The single best habit I’ve built is running normality tests early in the project, before choosing modeling tools. That one step stops me from spending hours tuning a model that never had a chance to satisfy its assumptions.
If you’re short on time, here’s what I recommend: run Shapiro–Wilk for small to medium datasets, pair it with a Q–Q plot, and make a decision about method choice. If the data are large, add K–S or Anderson–Darling, but treat rejection as a prompt to examine effect size, not as a hard stop. You’ll avoid unnecessary rewrites and you’ll build stronger intuition about your data.
When normality fails, don’t force it. Pick the method that fits the data instead of bending the data to match the method. That mindset saves you from fragile pipelines and brittle analysis. You’ll also end up with results you can explain confidently to stakeholders.
If you apply these steps consistently, you won’t just check a box — you’ll make better decisions, ship more reliable analytics, and build models that survive real-world messiness. That’s the real payoff of normality testing.