Data bugs rarely look like bugs at first. They show up as a model that misses its targets, a dashboard that swings wildly week to week, or a metric that changes after a tiny code tweak. I’ve learned to treat those moments as a signal that my statistical foundations need to be tighter, not that I need more model complexity. A good cheat sheet acts like a checklist: it keeps me honest about definitions, assumptions, and what each number really means.
What follows is the condensed set of concepts I keep on a whiteboard when I’m building data pipelines, running experiments, or reviewing analytics. I’ll walk through the core building blocks, why they matter, and how I use them in real work. You’ll get quick formulas, decision points, common mistakes, and a couple of small, runnable code examples you can lift into your own tooling. The goal is not to memorize every formula, but to build a mental map you can reach for under pressure, whether you’re debugging a KPI shift or preparing a product experiment.
Mental model and notation I rely on
The fastest way to get lost is to mix up parameters and statistics. I keep the following distinctions crisp because they drive every later choice:
- Population vs sample: The population is the full set you care about; the sample is what you have. Your sample is a window, not the building.
- Parameter vs statistic: A parameter describes the population (like the true mean μ). A statistic describes the sample (like the sample mean x̄).
- Variable types: Categorical (labels), ordinal (ranked labels), interval (numeric with no true zero), ratio (numeric with true zero).
- Random variable: A variable whose value depends on chance. I think of it as a mapping from outcomes to numbers, not just a column.
I also keep a minimal notation set that keeps formulas readable:
- μ, σ: population mean and standard deviation
- x̄, s: sample mean and sample standard deviation
- n: sample size
- p: population proportion, p̂: sample proportion
- H0, H1: null and alternative hypotheses
Analogy that sticks with me: think of the population as an entire orchestra and the sample as a few microphones on stage. You can infer the overall sound, but you can’t claim absolute certainty about every instrument. Your statistics are the mic readings; your parameters are the true mix.
A nuance I always add: sampling is a data‑collection procedure, not just a number. If your sample has selection bias, you can compute beautiful statistics and still be wrong. This is why I keep a note in the margin that reads, “No formula beats a flawed sample.”
Descriptive statistics that actually tell the story
When people say “just compute the average,” I hear “ignore the shape of the data.” I never stop at one number. A useful cheat sheet covers center, spread, and shape.
Center:
- Mean: x̄ = (1/n) Σ xi. Best when data is symmetric with no heavy outliers.
- Median: 50th percentile. Best when data is skewed or outlier‑heavy.
- Mode: Most frequent value. Useful for categorical data or discrete spikes.
Spread:
- Range: max – min. Fast, but outlier sensitive.
- Variance: s^2 = Σ (xi – x̄)^2 / (n – 1) for sample variance.
- Standard deviation: s = √s^2. In original units, easier to explain.
- Interquartile range (IQR): Q3 – Q1. Strong against outliers.
Shape:
- Skewness: asymmetry of distribution.
- Kurtosis: tail heaviness; I treat it as a warning sign for rare but large deviations.
When I need robust summaries, I use the median and IQR. When I need interpretability for an engineering team, I use mean and standard deviation, but I always show a histogram or at least a box plot.
Practical guidance:
- Use mean for metrics like average latency per request only after trimming extreme outliers.
- Use median for user session length and revenue per user if you have heavy tails.
- Report mean ± standard deviation when you need comparability across teams.
Common mistake I see: treating standard deviation like an error margin. Standard deviation measures spread of the data, not uncertainty in the mean. If you need uncertainty, look at standard error or a confidence interval.
A simple, runnable example:
import math
values = [120, 130, 128, 200, 125, 127, 123]
mean = sum(values) / len(values)
median = sorted(values)[len(values) // 2]  # odd-length list; average the middle two for even lengths
variance = sum((x - mean) ** 2 for x in values) / (len(values) - 1)
std_dev = math.sqrt(variance)
print({"mean": mean, "median": median, "stddev": std_dev})
In practice I log both mean and median for production metrics. When they diverge, I look for a tail event or a rollout artifact.
A small enhancement I use in dashboards: I add a “robust band” column (median ± 1.5 * IQR). If the mean falls outside that band, I flag it as likely outlier‑driven. It’s a quick visual that reduces false alarms without hiding real shifts.
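The robust band is easy to sketch. The helper names below (quartiles, mean_outside_robust_band) are mine, and the quartile rule is the simple split-the-halves version, so expect small differences from a library's quantile method:

```python
def quartiles(xs):
    # Medians of the lower half, the full list, and the upper half.
    s = sorted(xs)
    mid = len(s) // 2
    def med(v):
        m = len(v) // 2
        return v[m] if len(v) % 2 else (v[m - 1] + v[m]) / 2
    return med(s[:mid]), med(s), med(s[mid + len(s) % 2:])

def mean_outside_robust_band(xs, k=1.5):
    # Flag the mean when it falls outside median +/- k * IQR.
    q1, median, q3 = quartiles(xs)
    iqr = q3 - q1
    mean = sum(xs) / len(xs)
    return not (median - k * iqr <= mean <= median + k * iqr)
```

A single large outlier drags the mean out of the band while the median and IQR barely move, which is exactly the signal I want the dashboard to surface.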
Distribution shape diagnostics I actually use
Most teams stare at a histogram once and move on. I add a few lightweight checks so I can explain why I chose a particular statistic or model.
Quick checks:
- Compare mean vs median: mean much larger suggests right‑skew.
- Plot log scale if values span orders of magnitude: log‑normal patterns pop out.
- QQ plot: if points bend away from a line, normality is suspect.
- Tail ratio: (P99 – P50) / (P90 – P50) to detect tail heaviness.
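A minimal version of these checks, using a nearest-rank percentile; the helper names are mine, and a stats library's percentile method will differ slightly at small sample sizes:

```python
def percentile(xs, q):
    # Nearest-rank percentile; good enough for a quick diagnostic, not for reporting.
    s = sorted(xs)
    idx = min(len(s) - 1, int(q / 100 * len(s)))
    return s[idx]

def shape_note(xs):
    # One-line shape summary: skew hint from mean vs median, tail ratio from quantiles.
    mean = sum(xs) / len(xs)
    p50, p90, p99 = (percentile(xs, q) for q in (50, 90, 99))
    skew_hint = "right-skew" if mean > p50 else "left-skew or symmetric"
    tail_ratio = (p99 - p50) / (p90 - p50) if p90 != p50 else float("inf")
    return skew_hint, tail_ratio
```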
I keep a “shape note” in my analysis: a single sentence like “Right‑skewed with heavy upper tail; use median/IQR and log scale for models.” That sentence justifies downstream choices and helps reviewers follow the chain of reasoning.
Common pitfall: assuming a distribution is normal because a t‑test is easy. The Central Limit Theorem helps the mean, not the raw data. If you’re modeling the raw data, you still need to consider shape.
Probability and distributions I keep on speed dial
Probability is the engine behind inference. My cheat sheet keeps the core functions and a small list of distributions I actually use.
Core functions:
- PMF (discrete): P(X = x)
- PDF (continuous): f(x), not a probability by itself
- CDF: P(X ≤ x), works for both
- Expected value: E[X]
- Variance: Var(X) = E[(X – E[X])^2]
Distributions and when I reach for them:
- Bernoulli: one yes/no trial. Use for single conversion outcomes.
- Binomial: count of successes in n trials. Use for conversion counts.
- Poisson: count of rare events in a fixed interval. Use for incidents per hour.
- Normal: many small additive effects. Use for metrics that look symmetric.
- Exponential: time between independent events. Use for time‑to‑failure.
- Uniform: any value in a range equally likely. Use in simulations.
Rules of thumb I keep:
- Binomial approximates Normal when n is large and p is not extreme.
- Poisson approximates Binomial when n is large and p is small.
- Log‑normal often fits data like income, session length, or latency tails.
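A quick way to convince yourself of the Poisson-to-Binomial rule of thumb is to compare the two PMFs directly for large n and small p. This sketch uses only the standard library:

```python
import math

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Binomial(n, p)
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    # P(X = k) for X ~ Poisson(lam)
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Rare events: n large, p small, so lambda = n * p
n, p = 10_000, 0.0003
for k in range(5):
    print(k, binom_pmf(k, n, p), poisson_pmf(k, n * p))
```

With these values the two distributions agree to about four decimal places, which is why the approximation is safe for incident counts and similar rare-event data.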
Simple analogy for CDF: think of filling a bathtub. The CDF tells you how full it is at each point. The PDF tells you how fast the water level is rising at each point.
Mistake to avoid: treating a PDF value as a probability. For continuous distributions, probabilities come from areas under the curve, not single points.
Another rule I use in practice: if I can’t justify a distribution with a data‑generating story, I don’t use it. For example, Poisson is not just “counts.” It’s “counts from independent events with a constant rate.” That story matters.
Sampling, estimators, and confidence you can explain
Sampling is where I see teams lose rigor. A clean sample is better than a big sample. The cheat sheet here is about bias, variance, and the mechanics of estimation.
Key ideas:
- Unbiased estimator: expected value equals the true parameter.
- Consistent estimator: converges to the true value as n grows.
- Standard error: SE = s / √n. Measures uncertainty in the sample mean.
- Central Limit Theorem: sample mean tends toward Normal for large n, even if data isn’t Normal.
Confidence intervals I trust:
- Mean (Normal or large n): x̄ ± z * (s / √n)
- Proportion: p̂ ± z * √(p̂(1 – p̂)/n)
I pick 95% when I need a balanced tradeoff; 90% for quick iteration; 99% for safety‑critical decisions. I do not treat a 95% interval as a 95% chance that the true mean is inside it. It’s about long‑run frequency across repeated samples.
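Both interval formulas are short enough to keep inline. This is a sketch with my own function names, using z = 1.96 for 95%; for small n you would swap the z for a t critical value:

```python
import math

def mean_ci(xs, z=1.96):
    # x_bar +/- z * s / sqrt(n); assumes n is large enough for the CLT to kick in.
    n = len(xs)
    x_bar = sum(xs) / n
    s = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    se = s / math.sqrt(n)
    return x_bar - z * se, x_bar + z * se

def proportion_ci(successes, n, z=1.96):
    # p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se
```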
Bootstrap approach I use when assumptions are shaky:
- Resample with replacement many times
- Compute the statistic each time
- Use quantiles of the bootstrap distribution as the interval
Edge case: small sample sizes with heavy tails. I prefer bootstrap or robust estimators over textbook formulas.
Common mistake: using a confidence interval and then narrating it like a probability statement for the parameter. I frame it as, “If we repeated this experiment many times, 95% of the intervals would contain the true value.”
A practical add‑on: I include the “minimum detectable effect” (MDE) in experiment plans. It forces a conversation about what effect size actually matters and keeps teams from celebrating tiny, meaningless wins.
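One way to make the MDE concrete is the standard two-proportion sample-size formula. This sketch assumes a two-sided alpha of 0.05 and roughly 80% power (z_beta ≈ 0.84); the function name and defaults are mine:

```python
import math

def sample_size_per_arm(p_base, mde_abs, z_alpha=1.96, z_beta=0.84):
    # Per-arm n needed to detect an absolute lift of mde_abs over baseline p_base.
    # z_beta = 0.84 corresponds to ~80% power.
    p_new = p_base + mde_abs
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = (z_alpha + z_beta) ** 2 * var / mde_abs ** 2
    return math.ceil(n)
```

Running this with a 5% baseline and a 1.5-point MDE lands in the high thousands per arm, which is often the moment a team realizes the effect they hoped to detect is smaller than their traffic can support.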
Practical sampling pitfalls and fixes
This is the part I wish more teams learned early. Many issues aren’t statistical—they’re operational.
- Survivorship bias: if the dataset only includes users who stayed, you’re missing early churn. Fix: include drop‑off events and define a cohort entry rule.
- Selection bias: if you sample only from power users or certain geos, you overfit to the loudest segment. Fix: stratified sampling or post‑stratification weights.
- Non‑response bias: in surveys, people who respond are often different. Fix: compare responders vs non‑responders on available variables.
- Time‑window bias: if you compare a busy week to a quiet week, you build a false narrative. Fix: align on same weekday patterns or use a longer baseline.
I keep a small checklist: “Who is missing? Why are they missing? How does that change the story?” It’s the simplest way I know to avoid overconfidence.
Hypothesis tests that answer real product questions
A hypothesis test is a decision framework, not a truth oracle. I keep a few practical steps in front of me:
1) State H0 and H1 in plain language.
2) Pick a test that matches the data and design.
3) Choose α (significance level) before seeing the results.
4) Compute test statistic and p‑value.
5) Decide and explain in business terms.
Core terms:
- Type I error: false positive, controlled by α.
- Type II error: false negative, controlled by power.
- Power: probability of detecting a true effect.
- Effect size: magnitude of difference, not just whether it’s non‑zero.
Common tests and when I use them:
- t‑test: compare means with continuous data. Use Welch when variances differ.
- Paired t‑test: same subjects before and after.
- Chi‑square: categorical counts across groups.
- Fisher’s exact: small sample categorical tests.
- Mann–Whitney U: nonparametric alternative to t‑test.
- ANOVA: compare more than two group means.
Multiple comparisons: if you test 20 metrics, you will find a “significant” one by chance. I use a correction (like Benjamini–Hochberg) when I’m scanning many metrics.
Recommendation I give teams: default to effect size plus confidence intervals, then use p‑values as a check, not a headline. A tiny effect with a tiny p‑value is still tiny.
When not to use tests:
- When the data is not representative (selection bias).
- When you’ve already made the decision and just want validation.
- When the sample size is so large that any trivial effect looks “significant.”
A phrasing I use in readouts: “The effect is positive but small; we’re confident it exists, not that it’s meaningful.” That line usually prevents premature launches.
Effect sizes that keep decisions grounded
Effect size is the difference between statistical significance and practical significance. I keep a few compact measures on my sheet:
- Absolute difference: Δ = x̄A – x̄B. Easy to explain in business units.
- Relative difference: (x̄A – x̄B) / x̄B. Useful for growth narratives.
- Cohen’s d: (x̄A – x̄B) / s_pooled. Normalized for scale comparison.
- Odds ratio: for binary outcomes in logistic settings.
I always tie effect size to a threshold: “We need at least +1.5% conversion to justify the cost.” Without this, you don’t have a decision, you have a number.
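Cohen's d is a one-liner once you have the pooled standard deviation. A sketch with my own naming, assuming both groups have roughly similar spread (the usual pooled-SD caveat):

```python
import math

def cohens_d(a, b):
    # (mean_a - mean_b) / pooled standard deviation.
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    s_pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / s_pooled
```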
Power and sample size without mysticism
Power is often treated as advanced math, but I see it as a budgeting tool. You’re trading time, exposure, and confidence.
Rules I use:
- Power increases with n, effect size, and lower noise.
- If you can’t change n, try to reduce variance (better instrumentation, cleaner cohorts).
- If you can’t change variance, define a larger meaningful effect (or accept lower power).
Quick mental check: if your observed effect is tiny and variance is high, you’re not underpowered—you’re asking for the impossible. I’ll either change the metric or the product decision.
Correlation and regression without the usual traps
Correlation is a quick signal. Regression is a model of relationships. I use them differently.
Correlation:
- Pearson: linear relationships, numeric data, sensitive to outliers.
- Spearman: monotonic relationships, ranked data, robust to outliers.
Rule I keep: correlation is about association, not causation. If you need causation, you need a design like randomized experiments or strong causal inference methods.
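The difference between the two correlations shows up clearly in code: Spearman is just Pearson on ranks. This sketch skips tie handling, so it assumes distinct values:

```python
def pearson(xs, ys):
    # Sample Pearson correlation: cov(x, y) / (sd_x * sd_y).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    # Rank-transform (no tie handling), then Pearson on the ranks.
    def ranks(v):
        pos = {x: i for i, x in enumerate(sorted(v))}
        return [pos[x] for x in v]
    return pearson(ranks(xs), ranks(ys))
```

On a monotonic but highly non-linear series, Spearman stays at 1.0 while Pearson drops, which is the practical reason I reach for Spearman on skewed metrics.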
Regression basics:
- Linear regression: y = β0 + β1x + ε
- Multiple regression: includes more predictors
- Logistic regression: binary outcome, uses log‑odds
Key diagnostics:
- R^2: how much variance is explained. I treat it as a fit signal, not a performance metric.
- Residual plots: check for non‑linearity and heteroscedasticity.
- Multicollinearity: high correlation among predictors inflates variance.
Simple JavaScript example for linear regression with one feature:
const points = [
{x: 1, y: 2},
{x: 2, y: 3},
{x: 3, y: 5},
{x: 4, y: 4}
];
const n = points.length;
const meanX = points.reduce((s, p) => s + p.x, 0) / n;
const meanY = points.reduce((s, p) => s + p.y, 0) / n;
let num = 0;
let den = 0;
for (const p of points) {
num += (p.x - meanX) * (p.y - meanY);
den += (p.x - meanX) ** 2;
}
const slope = num / den;
const intercept = meanY - slope * meanX;
console.log({slope, intercept});
When not to use linear regression:
- When your outcome is a count or a probability (use Poisson or logistic).
- When the relationship is clearly non‑linear (use transformations or non‑linear models).
Performance considerations: simple linear regression is typically fast, often 10–30ms for 100k rows in a vectorized library. But feature engineering can dominate runtime; I budget 100–300ms for full preprocessing on mid‑size datasets.
Regression edge cases and fixes
Regression models fail quietly. These are the pitfalls I routinely guard against:
- Non‑independence: repeated measures for the same user can bias estimates. Fix: use mixed‑effects models or cluster‑robust standard errors.
- Heteroscedasticity: variance changes with predictor values. Fix: transform variables or use robust errors.
- Omitted variable bias: missing a confounder makes coefficients misleading. Fix: add covariates or reconsider the question.
- Leakage: using features that include future information. Fix: enforce strict temporal splits and audit feature definitions.
I keep a short “regression sanity” routine: check residual plot, check leverage/outliers, check variance inflation (VIF), then explain coefficients in plain language.
Time series basics I keep for production metrics
Many real metrics are time series. A static average hides dynamics.
Key ideas:
- Trend: long‑term direction (growth or decline).
- Seasonality: repeating patterns (day of week, month).
- Noise: short‑term random variation.
Useful tools:
- Rolling mean/median: smooth out noise.
- Year‑over‑year comparison: controls for seasonality.
- Differencing: remove trend to analyze change rather than level.
Pitfall: comparing two points in time without context. I always compare a point to a baseline window, not another single point.
A practical approach: if a metric shifts, I compute three deltas—day‑over‑day, week‑over‑week, and same‑weekday‑last‑month. If they all agree, I trust the signal more.
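The three-delta check is trivial to automate. A sketch assuming a list of daily values with the most recent last and at least 29 days of history (the indices and names are mine):

```python
def deltas(series):
    # series: daily metric values, oldest first, most recent last.
    latest = series[-1]
    return {
        "dod": latest - series[-2],                       # day over day
        "wow": latest - series[-8],                       # same weekday last week
        "same_weekday_last_month": latest - series[-29],  # four weeks back
    }
```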
A/B testing in the real world
A/B testing is where stats meets product pressure. My cheat sheet is a decision guardrail.
I always define:
- Primary metric: one number that drives the decision.
- Guardrail metrics: ensure we don’t damage other key behaviors.
- Stopping rule: time‑based or sample‑based, fixed in advance.
Common design choices:
- Randomization unit: user, session, device. Must match exposure.
- Holdout percentage: smaller for low‑risk changes, larger for critical ones.
- Experiment duration: long enough to capture natural cycles.
Frequent mistakes:
- Peeking and stopping early.
- Changing metrics mid‑experiment.
- Not excluding bots or test accounts.
I keep one sentence in every readout: “This decision assumes randomization is clean and exposure is stable.” If either is untrue, I pause the rollout.
When Bayesian thinking helps me move faster
I’m not doctrinaire about frequentist vs Bayesian. I use Bayesian thinking when I need iterative decision‑making and interpretability.
Why it helps:
- It gives you direct probability statements about parameters.
- It integrates prior knowledge explicitly.
- It updates naturally with new data.
Practical use cases:
- Small data scenarios where frequentist intervals are unstable.
- Sequential testing where peeking is inevitable.
- When I need to combine multiple data sources into one belief.
I keep it simple: a prior, a likelihood, a posterior. If a strong prior dominates the data, I flag it and explain why. Transparency is the point.
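For a yes/no metric, the prior-likelihood-posterior loop is closed-form with a Beta prior. A minimal sketch (my own naming); Beta(1, 1) is a flat prior, and a strong prior visibly shifts the posterior mean, which is exactly the thing I want to be able to flag:

```python
def beta_posterior(prior_a, prior_b, successes, failures):
    # Beta prior + binomial likelihood -> Beta posterior; conjugacy keeps it closed-form.
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    # Posterior mean of a Beta(a, b) distribution.
    return a / (a + b)
```

Example: a flat Beta(1, 1) prior updated with 120 conversions out of 2000 gives a Beta(121, 1881) posterior, with a mean close to the raw rate; a sharp prior would pull it away, and that gap is what I call out in readouts.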
Practical workflow, modern tooling, and a quick comparison
In 2026, my workflow is a mix of local notebooks, reproducible pipelines, and AI‑assisted review. The math is the same; the way I keep myself from making mistakes is different.
What I do in practice:
- Keep a short stats checklist in the repo with assumptions and tests used.
- Use notebooks for exploration, then port logic to scripts with fixed seeds.
- Log intermediate summaries (mean, median, IQR, counts) for every dataset.
- Use AI tools to spot‑check reasoning, but I verify outputs with code.
Traditional vs Modern approach:
- Manual summary in a spreadsheet → summaries (mean, median, IQR, counts) logged by the pipeline
- Hand-picked tests → a checklist of assumptions and tests versioned in the repo
- Manual scripts → reproducible scripts with fixed seeds
- Peer review only → peer review plus AI-assisted spot checks, verified with code
I recommend keeping everything reproducible. If I can’t re‑run a result in a clean environment, I don’t trust it.
Common mistakes I still see in modern stacks:
- Mixing different data windows for treatment and control.
- Stopping an experiment the moment a p‑value crosses a line.
- Changing the success metric mid‑flight.
Edge cases to watch:
- Seasonality masking effects. I compare with a same‑period baseline.
- Simpson’s paradox. I segment and look for reversals.
- Censored data (e.g., user sessions truncated). I account for it explicitly.
Production considerations: instrumentation, monitoring, and rollback
This isn’t just theory. Stats gets operationalized fast.
Instrumentation habits:
- Define metrics once in a shared layer (avoid duplicate logic).
- Log raw events, not just aggregates, so you can re‑compute later.
- Version metric definitions with code changes.
Monitoring practices:
- Track mean and median together.
- Watch distribution shifts (e.g., P50, P90, P99).
- Add “definition drift” alerts for changes in event semantics.
Rollback strategy:
- If the metric changes and you can’t explain it within 24 hours, roll back.
- Favor reversibility: avoid changes that you can’t undo quickly.
- Keep a control holdout where possible.
I include this because I’ve seen the same bug appear as “statistical noise” until someone found the instrumentation change. A simple rollback rule saves you from arguing with a broken dataset.
Data quality checklist I keep on the wall
Before I trust any statistic, I ask these questions:
- Are there missing values? If so, are they random or systematic?
- Is the dataset complete for the time window?
- Are event definitions stable across the period?
- Are there duplicates or late‑arriving events?
- Do the distributions look plausible compared to history?
The biggest stat mistake I see is not statistical at all: it’s trusting bad data. A quick checklist has saved me more times than any formula.
A compact cheat sheet I reach for under pressure
Here’s the condensed list I keep nearby:
- Mean: x̄ = Σxi / n
- Sample variance: s^2 = Σ (xi – x̄)^2 / (n – 1)
- Standard error of mean: SE = s / √n
- Confidence interval: statistic ± z * SE
- Bernoulli mean: E[X] = p, Var(X) = p(1 – p)
- Binomial mean: np, variance: np(1 – p)
- Poisson mean and variance: λ
- Normal standardization: z = (x – μ) / σ
- Correlation: r = Cov(x, y) / (σx σy)
Decisions I make with this sheet:
- Use median and IQR for heavy‑tailed data.
- Use Welch’s t‑test when variances look different.
- Use bootstrap intervals when assumptions feel shaky.
- Report effect size and confidence interval before p‑value.
- Treat correlation as a hint, not a reason.
If you keep only one thing from this section, let it be this: a formula without a decision rule is just algebra. I always attach “when I use it” to every statistic.
Expanded examples you can lift into your workflow
I like to keep practical snippets that can be dropped into a notebook or script.
Example 1: Bootstrap confidence interval in Python
import random
values = [120, 130, 128, 200, 125, 127, 123]
def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_ci(xs, iters=5000, alpha=0.05):
    stats = []
    for _ in range(iters):
        sample = [random.choice(xs) for _ in xs]
        stats.append(mean(sample))
    stats.sort()
    lower = stats[int((alpha / 2) * iters)]
    upper = stats[int((1 - alpha / 2) * iters)]
    return lower, upper
print(bootstrap_ci(values))
I use this when the data is noisy or skewed. It’s slow for huge datasets, but it gives a more honest interval than a normal assumption.
Example 2: Simple A/B test for proportions
import math
# Conversion counts
conv_a, total_a = 120, 2000
conv_b, total_b = 138, 2100
p_a = conv_a / total_a
p_b = conv_b / total_b
p_pool = (conv_a + conv_b) / (total_a + total_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
z = (p_b - p_a) / se
print({"p_a": p_a, "p_b": p_b, "z": z})
I still verify with a stats library, but I keep this as a sanity check when I don’t trust black‑box tools.
Example 3: Robust summary in JavaScript
function median(xs) {
const s = [...xs].sort((a, b) => a - b);
const mid = Math.floor(s.length / 2);
return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}
function iqr(xs) {
const s = [...xs].sort((a, b) => a - b);
const q1 = s[Math.floor(s.length * 0.25)];
const q3 = s[Math.floor(s.length * 0.75)];
return q3 - q1;
}
const values = [12, 14, 15, 18, 21, 40, 45];
console.log({median: median(values), iqr: iqr(values)});
This gives me a fast check for tail‑heavy data in a data‑pipeline script.
Common pitfalls and how I avoid them
I keep this list right next to my formulas:
- “Significant” does not mean “important.” Always check effect size.
- Clean math cannot fix biased samples. Fix the sample first.
- One metric doesn’t tell the story. Use at least one robust measure.
- Peeking breaks frequentist guarantees. If you peek, use a sequential method.
- Correlation without a causal design is just a hypothesis.
I also remind myself: if I can’t explain the result in two sentences to a product manager, I probably don’t understand it myself.
Alternative approaches when the standard tools fail
Sometimes the classic toolbox doesn’t fit the problem. Here are the alternatives I reach for:
- Nonparametric tests when distributions are weird.
- Permutation tests for custom metrics.
- Bayesian methods for sequential decisions.
- Quantile regression for effects across the distribution.
- Time‑series models (like ARIMA) for autocorrelated data.
I prefer to state why I changed methods: “Metric is heavily skewed, so I used a rank‑based test.” That keeps the analysis defensible.
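A permutation test needs nothing beyond shuffling. This sketch (my own function name) computes a two-sided p-value for a difference in means; any custom statistic can be swapped in:

```python
import random

def permutation_test(a, b, iters=10_000, seed=42):
    # Two-sided p-value: how often does random relabeling produce a
    # difference in means at least as large as the observed one?
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / iters
```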
Closing thoughts and next steps you can take
I’ve built this cheat sheet because the math is easy to forget when the pressure is high. What keeps me steady is a repeatable pattern: define the question, check the data type, choose the right statistic, and then tell the story in a way a product team can act on. The real win is not the perfect p‑value, it’s the decision that survives scrutiny.
If you want to make this practical right away, I suggest three concrete steps. First, pick one metric you report weekly and compute mean, median, IQR, and a confidence interval side by side. That single exercise will show you whether you’re relying on fragile averages. Second, build a one‑page stats checklist in your repo and link it to every analysis. You’ll catch mistakes before they ship. Third, run a small experiment with a clear effect size target and a fixed stopping rule. You’ll start to feel how hypothesis tests support decisions rather than replace them.
I’m not chasing math for math’s sake. I’m chasing clarity: which numbers are stable, which are noisy, and which actions are actually justified. If you build that clarity into your daily workflow, the cheat sheet does its job.


