AVG() Function in SQL: Practical Guide, Edge Cases, and Modern Patterns

If you have ever stared at a dashboard and wondered why two reports disagree on the same metric, AVG() is usually at the center of the story. I see it constantly in production queries because it answers a simple question fast: what is the central value of a numeric column? Yet that simplicity hides a lot of edge cases: NULLs, grouping rules, data types, and even the definition of fairness when you average across uneven categories. When I mentor engineers, I start with one promise: if you truly understand AVG(), you will debug half of your analytics bugs in minutes instead of hours. You will also write cleaner SQL that scales, because average is often the first aggregate you wire into an ETL job or a BI model. In this guide I will show you the mental model that keeps you safe, a set of runnable examples you can copy into a scratch database, and the modern patterns I use in 2026 projects. You will leave with practical heuristics for when AVG() is correct, when it is misleading, and how to make it dependable.

A mental model that prevents mistakes

AVG() is nothing more than sum divided by count, but the count is only for non-NULL values. That one sentence explains most surprises. I keep this simple picture in my head: imagine a row of cups, some filled with marbles, some empty, and some missing entirely. AVG() counts only the cups that actually have a value, not the missing ones. If you have 10 rows and 4 NULLs, AVG() uses 6 rows in the denominator. This is correct behavior, but it can make the average look higher than expected if missing data is common.
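You can verify the denominator rule in any scratch engine. Here is a quick sketch using Python's built-in sqlite3 as the scratch database (the table and values are made up for illustration): with 10 rows and 4 NULLs, AVG() divides by 6.

```python
import sqlite3

# Hypothetical scratch table: 10 rows, 4 of them NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany(
    "INSERT INTO t (v) VALUES (?)",
    [(10,), (20,), (30,), (40,), (50,), (60,),
     (None,), (None,), (None,), (None,)],
)

# AVG(v) sums the six non-NULL values (210) and divides by 6, not 10.
avg_v, non_null, total = conn.execute(
    "SELECT AVG(v), COUNT(v), COUNT(*) FROM t"
).fetchone()
print(avg_v, non_null, total)  # 35.0 6 10
```

The COUNT(v) vs COUNT(*) pair makes the shrunken denominator visible, which is the habit this guide keeps returning to.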

AVG() also only accepts numeric input. If you hand it a text column, most engines throw an error or attempt an implicit cast that can fail. I always make my intent explicit: either cast text to numeric or fix the schema. I also remind people that AVG() does not care about duplicates or business meaning. It is blind to semantics; if you have multiple rows for the same entity, AVG() treats each row equally. That is why a customer with 50 orders can drown out one with 1 order unless you are careful.
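As a quick sanity check of the casting rule, here is a sketch (again with sqlite3 as the scratch engine; the readings table is hypothetical). The explicit CAST states the intent instead of relying on whatever implicit coercion the engine applies:

```python
import sqlite3

# Hypothetical table where numbers were stored as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (val TEXT)")
conn.executemany("INSERT INTO readings (val) VALUES (?)",
                 [("10",), ("20",), ("30",)])

# Cast text to a numeric type explicitly before averaging.
explicit = conn.execute(
    "SELECT AVG(CAST(val AS REAL)) FROM readings"
).fetchone()[0]
print(explicit)  # 20.0
```

SQLite happens to coerce numeric-looking text, but other engines error out, which is why the explicit cast is the portable habit.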

When I explain this to junior engineers, I use a quick analogy: averaging is like spreading 30 cookies across 5 kids. If you do not count one kid because you forgot they were there (NULL), everyone else gets more cookies. That is what AVG() does by design. Understanding that is more valuable than memorizing syntax. If you learn this model, you will spot mistakes like averaging averages, missing weights, or misusing HAVING long before they hit production.

Core syntax and a runnable starter example

The basic syntax is short, but a runnable example makes it stick. Below is a complete script you can run in most SQL engines with minimal changes. It creates a demo table, inserts data, and calculates a single average.

-- Create a demo table
CREATE TABLE student (
  student_id INTEGER PRIMARY KEY,
  student_name VARCHAR(50),
  marks INTEGER
);

-- Insert sample data
INSERT INTO student (student_id, student_name, marks) VALUES
  (1, 'Ari', 88),
  (2, 'Bela', 91),
  (3, 'Chen', 76),
  (4, 'Dana', NULL),
  (5, 'Eli', 95);

-- Basic AVG
SELECT AVG(marks) AS average_marks
FROM student;

Notice the NULL for Dana. The result is the sum of 88, 91, 76, and 95 divided by 4, not 5, which gives 87.5. That is exactly what you want in many cases, but it must be intentional. If missing marks should count as zero, you must say so, which I cover later.

I also recommend aliasing the result. You will thank yourself when the query is reused in a BI tool or a data warehouse model, because a clean column name makes downstream work easier. Another habit I keep is using lower_snake_case for alias names so they do not require quotes.

Filtering rows vs filtering groups: WHERE, GROUP BY, HAVING

AVG() is most powerful when you control the scope. I think of scope in two layers: row filtering (WHERE) and group filtering (HAVING). This distinction matters a lot.

Here is a student_scores table example that mirrors what you will see in actual analytics: multiple rows per subject, one score per row.

CREATE TABLE student_scores (
  student_id INTEGER,
  subject VARCHAR(30),
  score INTEGER
);

INSERT INTO student_scores (student_id, subject, score) VALUES
  (1, 'Math', 92),
  (1, 'Science', 88),
  (2, 'Math', 81),
  (2, 'Science', 90),
  (3, 'Math', 79),
  (3, 'Science', 85),
  (4, 'Math', 95),
  (4, 'Science', NULL);

Overall average across all rows:

SELECT AVG(score) AS overall_average_score
FROM student_scores;

Average per subject with GROUP BY:

SELECT subject, AVG(score) AS average_score
FROM student_scores
GROUP BY subject;

Average for a specific subject with WHERE:

SELECT AVG(score) AS average_science_score
FROM student_scores
WHERE subject = 'Science';

Filter groups with HAVING:

SELECT subject, AVG(score) AS average_score
FROM student_scores
GROUP BY subject
HAVING AVG(score) > 85;

Here is the rule I teach: WHERE reduces rows before aggregation, HAVING reduces groups after aggregation. If you want only Science rows, use WHERE. If you want only subjects whose average exceeds a threshold, use HAVING. Mixing them up is a common cause of off-by-one dashboards.

Precision, NULLs, and type control

AVG() often returns a decimal, but you should not assume the exact type. Some engines return a floating type for integers, others return a numeric with scale. When precision matters, I make it explicit. The safest pattern is to cast to a high-precision numeric, then round for display.

SELECT ROUND(AVG(CAST(score AS NUMERIC(10,2))), 2) AS avg_score_rounded
FROM student_scores;

When NULLs represent missing data, AVG() ignores them. But what if NULL should mean zero, like a missed quiz that counts as zero? Then you must make it zero:

SELECT AVG(COALESCE(score, 0)) AS avg_with_missing_as_zero
FROM student_scores;

Be careful: that changes the meaning of the metric. I only do this when the business rule is explicit, and I always name the column to reflect the rule so no one reads it as a standard average.

Another nuance is integer division in related formulas. If you compute the average manually as SUM(score) / COUNT(score) in some engines, you might get integer division if both operands are integers. AVG() protects you from that most of the time, but if you build custom formulas, cast one side to a decimal type:

SELECT SUM(score) / CAST(COUNT(score) AS NUMERIC(10,2)) AS avg_manual
FROM student_scores;

I also advise you to keep units clear. If you are averaging durations, store them in a single base unit like seconds, and convert at the end. Mixing minutes and seconds is an easy way to create a misleading average that still looks plausible.
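A minimal sketch of the base-unit rule, using sqlite3 as a scratch engine (the calls table is hypothetical): average in seconds, convert to minutes only for display.

```python
import sqlite3

# Hypothetical durations stored in one base unit: seconds.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (duration_seconds INTEGER)")
conn.executemany("INSERT INTO calls (duration_seconds) VALUES (?)",
                 [(60,), (120,), (180,)])

# Average in the base unit, then convert once at the end.
avg_minutes = conn.execute(
    "SELECT AVG(duration_seconds) / 60.0 FROM calls"
).fetchone()[0]
print(avg_minutes)  # 2.0
```

Converting per-row before averaging gives the same result here, but doing it once at the end keeps the query honest when rows arrive in mixed units.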

Weighted averages and when plain AVG lies

Plain AVG assumes every row has equal weight. That is not always true. Suppose you are averaging assignment grades but some assignments are worth more. Or you are averaging product ratings but some products have far more reviews. In those cases, a weighted average is the honest answer.

Weighted average formula: sum(value * weight) / sum(weight). In SQL:

-- Example: assignments with different weights
CREATE TABLE assignment_scores (
  student_id INTEGER,
  assignment_name VARCHAR(40),
  score INTEGER,
  weight INTEGER
);

INSERT INTO assignment_scores (student_id, assignment_name, score, weight) VALUES
  (1, 'Quiz 1', 90, 1),
  (1, 'Project', 85, 4),
  (1, 'Final', 88, 5);

SELECT
  SUM(score * weight) / CAST(SUM(weight) AS NUMERIC(10,2)) AS weighted_average
FROM assignment_scores
WHERE student_id = 1;

Another common mistake is averaging averages. If you first compute average score per class and then average those averages, you are implicitly giving every class equal weight, regardless of size. That might be fine if you are comparing class performance, but it is wrong if you want the overall student average. In that case, you must compute the average from the raw rows or use a weighted average with class size.

If your data has heavy outliers, AVG can be misleading even with correct weighting. A single extreme value can pull the mean far away from the center. In those cases, I consider median or percentile, depending on the business question. AVG is great when you want total magnitude spread evenly, but it is not the best when you need a typical value that resists spikes.

Windowed averages for trends and comparisons

In modern analytics, the average you care about is often a moving target: last 7 days, last 30 rows, or the average per group while still keeping row-level detail. Window functions make this clean.

Here is a rolling average of daily sales per product. This gives you a trend without losing daily granularity.

CREATE TABLE daily_sales (
  sale_date DATE,
  product_id INTEGER,
  amount NUMERIC(10,2)
);

-- Rolling 7-day average per product
SELECT
  sale_date,
  product_id,
  amount,
  AVG(amount) OVER (
    PARTITION BY product_id
    ORDER BY sale_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS avg_7day
FROM daily_sales;

Windowed AVG is also a nice way to compare an item to its category average without collapsing rows. Example: average score per subject alongside each row.

SELECT
  student_id,
  subject,
  score,
  AVG(score) OVER (PARTITION BY subject) AS subject_avg
FROM student_scores;

I use this pattern in feature engineering for ML and in BI modeling because it preserves detail while giving context. It also reduces the need for subqueries or joins. Just remember: window functions do not reduce rows, they add columns.

Performance and a modern workflow

AVG() is an aggregate, so the engine must read the rows you target. On a small table this is instant. On a multi-billion-row table, it is a scan. The way you scope your query is the biggest performance lever you have: filter early, partition data, or pre-aggregate when necessary. In data warehouses, partition pruning can cut a full scan to a fraction of the data if you filter by date. In OLTP databases, a covering index can help if the engine can read only the index rather than the full table.

Typical performance ranges I see in production are in the 10-50ms range for small to medium partitioned aggregates, and 500ms to several seconds for large, unpartitioned scans. These are not guarantees, just a sense of scale. When latency matters, I prefer materialized views or summary tables, especially for dashboards that refresh frequently.

In 2026, the workflow around AVG() is also different. I often pair SQL with dbt models and AI assistants to generate checks. The query still matters, but the testing and lineage are just as important. Here is a quick comparison of a traditional and modern approach I see on teams.

Approach     | Query authoring                    | Validation                           | Delivery               | Failure mode
Traditional  | Handwritten SQL in a BI tool       | Manual spot checks                   | Dashboard refresh      | Silent drift in results
Modern       | SQL in version control with tests  | Automated data tests and row counts  | CI to warehouse models | Alert on variance or null spikes

This is not about tools for their own sake. It is about avoiding silent failure. If AVG() feeds a KPI, I want a test that flags sudden jumps or drops beyond a fixed threshold. I also add a test to catch a sudden rise in NULL share, because that directly changes the denominator of AVG().

Common mistakes and quick fixes

Here are the issues I see most often, along with the fix I recommend.

1) Averaging averages by accident. Fix: compute AVG from raw rows or use weighted averages based on group size.

2) Ignoring NULL meaning. Fix: decide whether NULL means missing or zero, then use COALESCE if needed.

3) Mixing types without a cast. Fix: cast to NUMERIC and round for display.

4) Using HAVING when you need WHERE. Fix: use WHERE to filter rows, HAVING to filter aggregate groups.

5) Forgetting units. Fix: convert to base units first, average, then convert for display.

A quick habit that saves time is to always add a companion COUNT. It tells you how many rows went into your AVG and can surface missing data or unexpected filters.

SELECT
  AVG(score) AS avg_score,
  COUNT(score) AS count_non_null_scores
FROM student_scores;

If you need to include NULL rows in the count, use COUNT(*) as a separate check. When both counts diverge, I know the result might be affected by missing data, and I log it or surface it in the UI. In practice, that one extra column prevents many false assumptions.

A deeper look at NULL semantics and data quality

I want to go deeper on NULL because it is the single most important nuance in AVG(). In most datasets I touch, NULL means one of three things: missing measurement, not applicable, or a failed ingest. Those three meanings should not be treated the same, and AVG() will not differentiate them for you.

If NULL means "missing measurement" (for example, a student absent on test day), you might exclude it from the denominator as AVG() does by default. If NULL means "not applicable" (a subject the student never took), you probably want to exclude it as well. But if NULL means "failed ingest" (a bug that dropped the value), you may want to fail the pipeline or backfill, because a partial denominator is hiding the system issue.

A practical pattern I use is to track NULL rate alongside AVG in every report that matters. Here is a lightweight example that makes it explicit:

SELECT
  AVG(score) AS avg_score,
  COUNT(*) AS row_count,
  COUNT(score) AS non_null_count,
  1.0 - (COUNT(score) / CAST(COUNT(*) AS NUMERIC(10,4))) AS null_rate
FROM student_scores;

If your engine doesn’t allow implicit float literals (like 1.0), you can cast the numerator instead. The point is the same: I want the NULL rate to be visible. If the average moves by 3 points but the NULL rate moved by 15%, I treat the average as suspect. That single column gives you immediate context.

Another trick: some teams store a sentinel value like -1 or 0 instead of NULL. That can be convenient for storage but is dangerous for AVG() because it changes the math silently. If you inherit such a schema, I strongly recommend converting those sentinels back to NULL at query time so AVG() doesn’t treat them as actual values:

SELECT
  AVG(NULLIF(score, -1)) AS avg_score
FROM student_scores;

That pattern is simple, but it saves hours of debugging on legacy datasets.

Real-world scenario: average order value (AOV)

If you work in ecommerce or subscription businesses, you will run into average order value. It looks simple, but it is full of traps. Consider this small order table:

CREATE TABLE orders (
  order_id INTEGER,
  customer_id INTEGER,
  order_total NUMERIC(10,2),
  order_status VARCHAR(20)
);

INSERT INTO orders (order_id, customer_id, order_total, order_status) VALUES
  (1001, 10, 120.00, 'paid'),
  (1002, 10, 80.00, 'paid'),
  (1003, 11, 50.00, 'cancelled'),
  (1004, 12, NULL, 'paid'),
  (1005, 13, 300.00, 'paid');

The naive AOV is:

SELECT AVG(order_total) AS aov
FROM orders;

That result is already wrong in two ways: it includes a cancelled order and it ignores a NULL which might be a data quality issue, not a legitimate zero. The better version is:

SELECT AVG(order_total) AS aov_paid
FROM orders
WHERE order_status = 'paid'
  AND order_total IS NOT NULL;

But there is still a subtle question: do you want AOV per order or per customer? If your business metric is per customer, then a customer with 10 small orders can overwhelm a single customer with one large order. That is not wrong, but it is a different metric. If you want average spend per customer, you need a two-step approach:

SELECT AVG(customer_total) AS avg_customer_spend
FROM (
  SELECT customer_id, SUM(order_total) AS customer_total
  FROM orders
  WHERE order_status = 'paid'
  GROUP BY customer_id
) t;

This is not the same as AOV. This is average customer spend (for the period) and it is useful for retention and cohort analysis. I always clarify which one I am using, because executives often assume they are interchangeable. They are not.

Real-world scenario: average session duration

Another common metric is average session duration. It sounds straightforward until you look at data quality. Sessions often have missing end times, extremely long durations due to app crashes, or negative durations due to clock drift. If you compute AVG(duration_seconds) without cleaning, you can get nonsense.

A pattern I use is to cap outliers and discard impossible values before averaging:

SELECT
  AVG(duration_seconds) AS avg_session_seconds
FROM sessions
WHERE duration_seconds BETWEEN 1 AND 14400; -- 1 sec to 4 hours

That range is not universal, but you should choose something reasonable for your product. The point is to make your assumptions explicit and documented. If you do not, outliers will distort the average and you will end up optimizing the wrong user experience.

For some teams, median session duration is more meaningful. But even if you stick with AVG, the cleaning rule belongs in the query, not in a hidden ETL job. That way, when someone reads the query later, they understand the metric’s scope.

Real-world scenario: customer satisfaction scores

Average satisfaction score (like CSAT on a 1–5 scale) is often quoted, but it is easy to misuse. For example, some product areas might have a low response rate. If you average scores without seeing response rate, you could be comparing a noisy sample to a robust one.

Here is a pattern that keeps me honest:

SELECT
  product_area,
  AVG(score) AS avg_csat,
  COUNT(score) AS responses
FROM csat_responses
GROUP BY product_area
HAVING COUNT(score) >= 30;

The HAVING clause here is not about filtering by average; it is about statistical stability. If a product area has only 2 responses, I do not treat that average as meaningful. This is a good example of using HAVING for group-level filters that are not about AVG directly but still affect its credibility.

Averaging across uneven categories: fairness and interpretation

This is one of the most important conceptual issues I see. Suppose you want the average salary across departments, but each department has a different size. A naive average of department averages treats each department equally, which might be fair if you are comparing departments, but it is wrong if you want the overall company average.

Here is the trap:

SELECT AVG(dept_avg) AS company_avg
FROM (
  SELECT department, AVG(salary) AS dept_avg
  FROM employees
  GROUP BY department
) d;

This gives you the average of department averages. If one department has 5 people and another has 500, this result heavily overweights the small department. The correct overall average is the average of all employees:

SELECT AVG(salary) AS company_avg
FROM employees;

Or, if you only have department-level data, you need a weighted average using department size:

SELECT
  SUM(dept_avg * dept_size) / CAST(SUM(dept_size) AS NUMERIC(12,2)) AS company_avg
FROM (
  SELECT department, AVG(salary) AS dept_avg, COUNT(*) AS dept_size
  FROM employees
  GROUP BY department
) d;

The key lesson: when you change the unit of analysis, you change the question you are answering. AVG() is simple, but the semantics are not. That is why I always name the metric with its unit, like avg_salary_per_employee or avg_salary_per_department. The name becomes a guardrail.

AVG() with DISTINCT: when duplicates are intentional or harmful

Sometimes duplicate values are meaningful, and sometimes they are accidental. AVG(DISTINCT col) gives you the average of unique values only. That can be useful, but it is often misunderstood.

Imagine a table of product reviews where each row is a review and the rating is 1–5. If you compute AVG(rating), you get the average across all reviews, which is typically what you want. If you compute AVG(DISTINCT rating), you will get the average of the unique rating values (1,2,3,4,5), which is almost never meaningful.

The only time I use AVG(DISTINCT) is when duplicates represent a data artifact, such as duplicate sensor readings or duplicated rows from a bad join. Even then, I prefer to remove duplicates explicitly with a proper key rather than using DISTINCT inside AVG. That way, the logic is clear and auditable.
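The ratings example above is easy to demonstrate. In this sketch (sqlite3 as the scratch engine, hypothetical reviews table), four 5-star reviews and one 1-star review give a meaningful AVG(rating), while AVG(DISTINCT rating) collapses the duplicates first and answers a different question:

```python
import sqlite3

# Hypothetical reviews: four 5-star ratings and one 1-star rating.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (rating INTEGER)")
conn.executemany("INSERT INTO reviews (rating) VALUES (?)",
                 [(5,), (5,), (5,), (5,), (1,)])

# Plain AVG weights every review; DISTINCT averages the unique values {5, 1}.
plain, distinct = conn.execute(
    "SELECT AVG(rating), AVG(DISTINCT rating) FROM reviews"
).fetchone()
print(plain, distinct)  # 4.2 3.0
```

The DISTINCT result says nothing about how customers actually rated the product, which is exactly why it is almost never the metric you want.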

Using AVG() in joins: duplicate explosion

AVG() becomes dangerous when you join tables incorrectly. The classic bug is a join that multiplies rows. Suppose you join orders to order_items and then average order_total. Each order now appears multiple times, and your average is inflated by the number of items. This is subtle because the query still runs and the number looks plausible.

Here is the wrong approach:

SELECT AVG(o.order_total) AS avg_order_total
FROM orders o
JOIN order_items i ON o.order_id = i.order_id;

The fix is to aggregate at the order level first, then join, or avoid the join entirely if you do not need item-level data. I often write it like this:

SELECT AVG(order_total) AS avg_order_total
FROM orders;

If I must include a join, I isolate it:

SELECT AVG(order_total) AS avg_order_total
FROM (
  SELECT DISTINCT o.order_id, o.order_total
  FROM orders o
  JOIN order_items i ON o.order_id = i.order_id
) t;

This is still a workaround. A better solution is to understand why the join is necessary and whether the analysis should be at the order-item level instead of the order level. Once again, AVG() forces you to be explicit about the unit of analysis.

AVG() and time zones: subtle errors in time-based averages

If you average time-based values like daily revenue or session duration, time zones can skew your results. For example, if a table stores timestamps in UTC but your business day is in Pacific time, a daily average by UTC date will slice the day differently than your business reports.

The fix is to normalize timestamps to the correct time zone before grouping. Here is a generic pattern:

SELECT
  DATE(CONVERT_TZ(event_ts, 'UTC', 'America/Los_Angeles')) AS local_day,
  AVG(value) AS avg_value
FROM events
GROUP BY DATE(CONVERT_TZ(event_ts, 'UTC', 'America/Los_Angeles'));

Your database’s time zone function will vary. The point is to avoid mixing a business interpretation of “day” with a raw UTC date. Averages are sensitive to grouping, and time zone is a hidden grouping dimension.

How AVG() behaves across SQL engines

While AVG() is standard, behavior differences still appear in practice. I keep three things in mind when I move between engines:

1) Return type: some engines return floating types for integer inputs, others return decimal with scale. If you rely on exact scale, cast explicitly.

2) NULL handling: standard AVG ignores NULL, but some engines have optional syntax or settings that treat NULL differently. Never rely on defaults if the rule matters.

3) Integer division: AVG usually protects you, but if you compute manual averages, you must cast. This is more common in older engines or when you build a weighted average using integer weights.
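You can see the integer-division hazard directly in SQLite, which (like several engines) truncates when both operands are integers. A quick sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (v INTEGER)")
conn.executemany("INSERT INTO t (v) VALUES (?)", [(1,), (2,)])

# Manual SUM/COUNT truncates (3 / 2 = 1); AVG() returns the real mean.
truncated, correct = conn.execute(
    "SELECT SUM(v) / COUNT(v), AVG(v) FROM t"
).fetchone()
print(truncated, correct)  # 1 1.5
```

This is the two-minute small-dataset test mentioned above: run it in the target engine before you trust a ported manual average.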

I also test a small dataset in the target engine whenever I port queries. It takes two minutes and prevents mismatched dashboards across systems.

Defensive SQL patterns I use with AVG()

These are the patterns that keep me safe in production. I use them often enough that they feel like muscle memory:

1) Always label the metric with its unit or rule, like avg_score_non_null or avg_salary_per_employee.

2) Add COUNT and NULL rate alongside AVG in exploratory analysis.

3) For averages that feed business KPIs, pair them with a minimum sample size (via HAVING or a filter).

4) For time-series averages, validate that your time grain is correct and consistent with business reporting.

5) For averages in joins, verify the row count before and after joining.

Here is a compact “safe average” template I use in analytics queries:

SELECT
  category,
  AVG(value) AS avg_value,
  COUNT(*) AS row_count,
  COUNT(value) AS non_null_count
FROM facts
WHERE event_date >= DATE '2026-01-01'
GROUP BY category
HAVING COUNT(value) >= 30;

This template already captures scope (WHERE), definition (AVG on non-null), data quality (non_null_count), and stability (HAVING). It is not perfect, but it has saved me from many bad decisions.

Alternative approaches: median, percentile, trimmed mean

AVG is not always the best summary statistic. When distributions are skewed, outliers can dominate. In those cases, I consider alternatives:

  • Median: shows the typical value; robust to outliers.
  • Percentiles: show distribution spread (p50, p90, p99).
  • Trimmed mean: excludes top and bottom X% to reduce noise.

If you do not have built-in percentile functions, some engines allow window functions or approximate percentiles. Here is a conceptual example for median (the exact syntax varies):

SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) AS median_value
FROM facts;

I bring this up because teams sometimes treat AVG as a proxy for “typical,” which is incorrect when data is skewed. The right approach is to match the metric to the question. AVG is excellent when you need to understand total magnitude spread evenly across the population. It is weak when you want a typical value in a heavy-tailed distribution.
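If your engine has window functions, a trimmed mean is also a short query. This is a sketch, not a standard idiom: it ranks rows with PERCENT_RANK, drops the top and bottom 10%, and averages the rest (sqlite3 scratch engine; the facts table and the 10% cutoffs are hypothetical).

```python
import sqlite3

# Hypothetical values with one extreme outlier (1000).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (value REAL)")
conn.executemany("INSERT INTO facts (value) VALUES (?)",
                 [(v,) for v in [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]])

# Trimmed mean: keep rows whose percentile rank falls within [0.1, 0.9].
trimmed = conn.execute("""
    SELECT AVG(value) FROM (
        SELECT value,
               PERCENT_RANK() OVER (ORDER BY value) AS pr
        FROM facts
    ) WHERE pr >= 0.1 AND pr <= 0.9
""").fetchone()[0]
print(trimmed)  # 5.5 -- the 1000 outlier no longer dominates
```

A plain AVG over the same data would be pulled above 100 by the single outlier, so the trim is doing real work here.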

Practical scenarios: when AVG is correct vs misleading

Here is a quick decision list I use:

Use AVG when:

  • You want the overall magnitude per unit (average revenue per order).
  • The distribution is reasonably symmetric or not heavily skewed.
  • Each row should have equal weight.
  • You need to compare means across groups with similar sizes.

Avoid or qualify AVG when:

  • A few extreme values dominate the metric (latency, revenue, session duration).
  • You have a low response rate and want “typical” behavior.
  • Group sizes are wildly uneven and you are comparing group averages.
  • Your business question is about fairness or typical experience rather than total magnitude.

This list is a heuristic, not a law, but it is a fast sanity check. If AVG feels wrong, it usually is.

Performance considerations in more detail

I want to go deeper on performance because AVG is often used in interactive dashboards. The cost of AVG is almost always tied to how many rows you scan and whether the engine can use a covering index or partition pruning.

Here are the levers I use most:

1) Filter early by time or category to reduce scanned data. This is the largest win.

2) Partition big tables by date so queries only scan relevant partitions.

3) Use materialized views for frequently-used averages.

4) Pre-aggregate at the grain you need for dashboards.

5) If you must do ad hoc averages, ensure the target column is in an index so the engine can do an index-only scan.

Before/after performance can be dramatic. On a 1-year table, a full scan might take seconds. If you filter down to one week and the table is partitioned by date, the same AVG might return in tens of milliseconds. I avoid exact claims because environments differ, but the pattern is consistent: scoping the dataset is the difference between fast and slow.

I also watch out for approximate averages in distributed systems. Some engines use approximate algorithms by default for speed. That can be fine for exploratory analysis, but for finance or compliance metrics I demand exact averages. If you are unsure, check the query plan or the engine documentation and make it explicit.

Debugging unexpected averages: a checklist

When an average looks wrong, I use this checklist. It catches almost every issue:

1) Check the row count and the non-null count. Are they what you expect?

2) Verify the filters. Are they applied at the right stage (WHERE vs HAVING)?

3) Confirm the join logic. Is there a one-to-many join inflating rows?

4) Inspect the data distribution. Are there outliers or sentinel values?

5) Verify the unit of analysis. Are you averaging per order, per customer, or per day?

6) Check time boundaries and time zones.

Here is a “debug view” query that I run to see basic stats and detect outliers:

SELECT
  COUNT(*) AS rows_total,
  COUNT(value) AS rows_with_value,
  MIN(value) AS min_value,
  MAX(value) AS max_value,
  AVG(value) AS avg_value
FROM facts
WHERE event_date BETWEEN DATE '2026-01-01' AND DATE '2026-01-31';

Even a quick min/max can highlight impossible values that are skewing the average. This is especially useful for latency or duration metrics where a bug can generate a 7-day session or a 0 ms event.

Using AVG in analytics pipelines and data tests

In production analytics, I try to avoid silent failure. AVG is sensitive to missing data, so I build tests around it. Two tests I add frequently:

1) Null share test: if null_share jumps beyond a threshold, alert.

2) Drift test: if avg_value changes by more than X% day-over-day or week-over-week, alert.

These tests are not perfect, but they are cheap and effective. Here is a conceptual example of a null share test in SQL:

SELECT
  event_date,
  1.0 - (COUNT(value) / CAST(COUNT(*) AS NUMERIC(10,4))) AS null_share
FROM facts
GROUP BY event_date
HAVING 1.0 - (COUNT(value) / CAST(COUNT(*) AS NUMERIC(10,4))) > 0.05;

This is not a replacement for proper monitoring, but it is a practical guardrail. In modern workflows, I run these checks daily and treat failures as data incidents.
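The drift test mentioned above can be sketched with a LAG window over a daily aggregate. This is illustrative only (sqlite3 scratch engine; the daily_avg table and the 50% threshold are hypothetical): it flags any day whose average moves more than 50% versus the prior day.

```python
import sqlite3

# Hypothetical daily aggregates; the third day jumps sharply.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_avg (event_date TEXT, avg_value REAL)")
conn.executemany("INSERT INTO daily_avg VALUES (?, ?)",
                 [("2026-01-01", 100.0),
                  ("2026-01-02", 104.0),
                  ("2026-01-03", 210.0)])

# Compare each day to the previous day and alert beyond a 50% move.
alerts = conn.execute("""
    SELECT event_date FROM (
        SELECT event_date, avg_value,
               LAG(avg_value) OVER (ORDER BY event_date) AS prev_avg
        FROM daily_avg
    ) WHERE prev_avg IS NOT NULL
      AND ABS(avg_value - prev_avg) / prev_avg > 0.5
""").fetchall()
print(alerts)  # [('2026-01-03',)]
```

In practice the threshold belongs in configuration, not the query, so analysts can tune it per metric without editing SQL.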

Advanced pattern: AVG with CASE expressions

CASE expressions are a clean way to compute conditional averages without multiple queries. For example, average score by subject and also by whether the score is passing:

SELECT
  subject,
  AVG(score) AS avg_score,
  AVG(CASE WHEN score >= 85 THEN score END) AS avg_passing_score
FROM student_scores
GROUP BY subject;

Remember that CASE without ELSE returns NULL, which AVG ignores. That makes it a natural way to compute conditional averages. This pattern is compact and avoids extra subqueries. I use it when I want to compare segments side by side.

Advanced pattern: average of ratios vs ratio of averages

Another subtle mistake: averaging ratios vs dividing totals. These two metrics are not equivalent. Suppose you want average conversion rate per page. If you compute AVG(conversions / views), you are giving each page equal weight regardless of traffic. If you compute SUM(conversions) / SUM(views), you get the overall conversion rate weighted by traffic.

Both can be right, but they answer different questions. I always label them clearly:

SELECT
  AVG(conversions / NULLIF(views, 0)) AS avg_page_conversion_rate,
  SUM(conversions) / NULLIF(SUM(views), 0) AS overall_conversion_rate
FROM page_stats;

This is a powerful example because it shows how AVG can be the wrong tool if you do not clarify the unit of analysis.

Edge cases: empty sets and all NULLs

What happens if your WHERE filter yields no rows? AVG returns NULL, not 0. If your BI tool or application interprets NULL as zero, you can end up with a misleading trend. This is especially common in dashboards that display 0 when a metric is NULL.

If you need a default, handle it explicitly:

SELECT COALESCE(AVG(value), 0) AS avg_value
FROM facts
WHERE event_date = DATE '2026-02-01';

Be careful with this. I only use COALESCE to 0 when it makes business sense (for example, zero sales on a day with no orders). If I am measuring a quality score or user feedback, I prefer NULL because it indicates “no data.”

Another edge case is when all rows are NULL. AVG returns NULL again. This is correct, but if you are chaining calculations, NULL can propagate. In these cases, I often keep NULL and let the dashboard indicate “no data,” rather than filling with 0 and implying a meaning that is not there.

Short, practical exercises

If you want to internalize AVG, I recommend three quick exercises. They take 15 minutes and make the ideas stick:

1) Create a table with some NULLs and compute AVG with and without COALESCE. Observe how the denominator changes.

2) Build a small dataset with uneven group sizes. Compute average of averages and compare to the true overall average.

3) Join two tables to intentionally create duplicate rows. See how AVG changes, then fix it with a subquery.

These exercises create a mental model that is far more durable than memorizing syntax.

Closing thoughts and next steps

AVG() looks simple, but it carries decisions about data quality, fairness, and scope. When I build metrics, I treat AVG() like a small contract: it tells you what rows you count, how you handle missing values, and whether each row has equal weight. If you make those choices explicit in SQL, your results become stable and easier to defend.

The next time you write an average, take 30 seconds to answer three questions: What is the population? What does NULL mean? Should each row count equally? If you can answer those, you can trust your number. If you cannot, I recommend you pause and define the rule before you ship the query. That is how you avoid late-night dashboard fixes.

If you want to go further, try these practical steps: add a COUNT next to every AVG in your analytics models, build a simple data test that checks for spikes in NULLs, and experiment with a rolling average window to smooth noisy metrics. Those three actions pay off quickly in most production systems. When AVG() is used with intent, it becomes one of the most reliable tools in your SQL toolbox, and your analytics work becomes easier to explain, easier to audit, and far more resilient.
