PostgreSQL GROUP BY: Practical Patterns, Pitfalls, and Performance

Most teams I work with hit the same wall sooner or later: raw event data is abundant, but decisions require summaries. You might have millions of purchase records, login events, or sensor readings, yet the question you actually need to answer is something like “Which regions are trending upward this quarter?” or “How many active subscriptions did we have per plan yesterday?” That’s where the PostgreSQL GROUP BY clause pays for itself. It’s the workhorse that turns granular rows into useful, comparable aggregates.

I’ve used GROUP BY in analytics pipelines, SaaS billing systems, and operational dashboards. It’s simple in syntax but surprisingly rich in practice. If you’ve ever shipped a query that looked correct but silently overcounted or grouped by the wrong dimension, you know the pain. I’m going to walk you through the patterns I rely on, the edge cases that matter in production, and the performance choices that keep queries snappy at scale. You’ll see runnable examples, realistic datasets, and practical guidance on when to use GROUP BY, when not to, and how to avoid mistakes I’ve seen teams repeat.

The core idea: group rows so aggregates mean something

GROUP BY tells PostgreSQL to collapse rows that share the same values in one or more columns, then compute aggregates per group. Conceptually, I explain it as “bucket the rows by a key, then compute metrics within each bucket.” It’s like organizing a pile of receipts by store name and then adding totals per store.

Here’s a concrete dataset we can use throughout. It models a simple e-commerce sales table with a real-world shape.

CREATE TABLE sales (
  sale_id BIGSERIAL PRIMARY KEY,
  sale_ts TIMESTAMPTZ NOT NULL,
  customer_id BIGINT NOT NULL,
  region TEXT NOT NULL,
  product_id BIGINT NOT NULL,
  unit_price NUMERIC(10,2) NOT NULL,
  quantity INT NOT NULL CHECK (quantity > 0),
  discount_pct NUMERIC(5,2) NOT NULL DEFAULT 0
);

INSERT INTO sales (sale_ts, customer_id, region, product_id, unit_price, quantity, discount_pct)
VALUES
  ('2025-12-01 10:05+00', 101, 'North', 501, 49.00, 2, 0),
  ('2025-12-01 10:15+00', 102, 'South', 502, 19.00, 1, 10),
  ('2025-12-01 11:03+00', 101, 'North', 501, 49.00, 1, 0),
  ('2025-12-02 08:40+00', 103, 'West', 503, 99.00, 1, 5),
  ('2025-12-02 12:20+00', 104, 'South', 504, 15.00, 5, 0),
  ('2025-12-02 13:10+00', 105, 'North', 505, 29.00, 3, 0);

A basic GROUP BY per region looks like this:

SELECT
  region,
  COUNT(*) AS order_count,
  SUM(unit_price * quantity * (1 - discount_pct / 100.0)) AS revenue
FROM sales
GROUP BY region
ORDER BY revenue DESC;

That query gives you two important signals by region: count of orders and total revenue, after discount. Note the formula inside SUM(). I prefer writing the calculation explicitly so it’s easy to validate and reuse.

How PostgreSQL decides what you can select

One of the most common early mistakes is mixing aggregated and non-aggregated columns. PostgreSQL enforces a simple rule: every column in the SELECT list must be either aggregated or included in the GROUP BY list (with one exception: columns functionally dependent on a grouped primary key are allowed). It feels strict at first, but it prevents misleading results.

Bad example (this will fail):

SELECT region, product_id, COUNT(*)
FROM sales
GROUP BY region;

product_id is not part of the grouping, so PostgreSQL can’t know which value to show. The fix is to decide the granularity you actually want:

SELECT region, product_id, COUNT(*)
FROM sales
GROUP BY region, product_id;

If you only need a representative value per group, don’t be tempted to use MIN() or MAX() as a hack unless that’s truly meaningful. If you need the “most recent product_id per region,” that’s a different query using window functions or DISTINCT ON.

Multi-column grouping: modeling real dimensions

In real systems, you rarely group by a single column. You group by region and month, by customer and plan, or by day and channel. The key is to ensure each grouping column matches a meaningful dimension in your reporting model.

Let’s compute monthly revenue per region:

SELECT
  date_trunc('month', sale_ts) AS month_start,
  region,
  SUM(unit_price * quantity * (1 - discount_pct / 100.0)) AS revenue
FROM sales
GROUP BY date_trunc('month', sale_ts), region
ORDER BY month_start, region;

A couple of lessons embedded here:

  • I group by the same expression I select. PostgreSQL does accept the output alias month_start in GROUP BY, but many dialects don't, so repeating the full expression keeps the query portable and unambiguous.
  • Using date_trunc produces a consistent bucket boundary. If you need local time, convert sale_ts to the desired zone before truncating.

I like to think of the GROUP BY list as defining a “result table grain.” If it doesn’t align with your business question, your answers will be skewed.

Aggregates beyond COUNT and SUM

PostgreSQL supports rich aggregation, and the right aggregate can answer a question directly rather than requiring a post-processing step. Here are a few I use often:

Average and weighted average

Simple average:

SELECT region, AVG(unit_price) AS avg_unit_price
FROM sales
GROUP BY region;

If you want a weighted average based on quantity, don’t use AVG(unit_price)—it will treat every order equally. Weighted average is:

SELECT
  region,
  SUM(unit_price * quantity) / NULLIF(SUM(quantity), 0) AS avg_price_weighted
FROM sales
GROUP BY region;

Distinct counts

Counting unique customers per region is a classic need:

SELECT region, COUNT(DISTINCT customer_id) AS unique_customers
FROM sales
GROUP BY region;

COUNT(DISTINCT ...) can be expensive on huge tables, so I often pair it with a time filter or consider approximate counts using extensions if I need speed over precision.

Min/Max for range signals

Useful for finding first/last activity per group:

SELECT
  customer_id,
  MIN(sale_ts) AS first_purchase,
  MAX(sale_ts) AS last_purchase
FROM sales
GROUP BY customer_id;

This is a reliable pattern for customer lifecycle analytics.

String aggregation and arrays

Sometimes you need a human-readable list of items per group, or you want to build an array of IDs. PostgreSQL makes this straightforward:

SELECT
  customer_id,
  STRING_AGG(DISTINCT region, ', ' ORDER BY region) AS regions_seen
FROM sales
GROUP BY customer_id;

Or, if you want a structured array:

SELECT
  region,
  ARRAY_AGG(DISTINCT product_id ORDER BY product_id) AS products_sold
FROM sales
GROUP BY region;

These are great for exports, debugging, or quick sanity checks when you need to understand the composition of a group.

Filtering with WHERE vs HAVING

A consistent source of bugs is misplacing filters. WHERE filters rows before grouping. HAVING filters groups after aggregation. I explain it this way: WHERE decides who gets into the party; HAVING decides which groups stay after you tally the results.

If you want to focus on a specific time range, use WHERE:

SELECT region, COUNT(*) AS order_count
FROM sales
WHERE sale_ts >= '2025-12-01' AND sale_ts < '2026-01-01'
GROUP BY region;

If you want only regions with substantial revenue, use HAVING:

SELECT region, SUM(unit_price * quantity) AS revenue
FROM sales
GROUP BY region
HAVING SUM(unit_price * quantity) >= 1000;

If you put the revenue filter in WHERE, PostgreSQL will reject the query, because the aggregate hasn't been computed yet at that stage. This distinction is non-negotiable in SQL, so I recommend memorizing it early.

Real-world patterns I use in production

Daily active customers per region

SELECT
  date_trunc('day', sale_ts) AS day_start,
  region,
  COUNT(DISTINCT customer_id) AS active_customers
FROM sales
GROUP BY date_trunc('day', sale_ts), region
ORDER BY day_start, region;

This is a classic dashboard metric. If the table is huge, create a daily materialized view that refreshes incrementally to keep results fast.

Revenue per product category (with join)

When you need aggregation across tables, join first, then group. Suppose there’s a products table:

CREATE TABLE products (
  product_id BIGINT PRIMARY KEY,
  category TEXT NOT NULL,
  name TEXT NOT NULL
);

Now aggregate by category:

SELECT
  p.category,
  SUM(s.unit_price * s.quantity) AS revenue
FROM sales s
JOIN products p ON p.product_id = s.product_id
GROUP BY p.category
ORDER BY revenue DESC;

The join is part of the row set, then GROUP BY collapses it. This pattern scales well if you index the join keys.

Revenue by hour for anomaly detection

SELECT
  date_trunc('hour', sale_ts) AS hour_start,
  SUM(unit_price * quantity) AS revenue
FROM sales
WHERE sale_ts >= now() - INTERVAL '7 days'
GROUP BY date_trunc('hour', sale_ts)
ORDER BY hour_start;

When I build alerting for traffic anomalies, I start with hourly aggregates like this, then apply thresholds. It’s a clean base layer for time-series monitoring.

Cohort-style grouping

Grouping by a “cohort month” is a common approach in retention analysis. You can calculate the first purchase month per customer and then aggregate per cohort:

WITH first_purchase AS (
  SELECT
    customer_id,
    date_trunc('month', MIN(sale_ts)) AS cohort_month
  FROM sales
  GROUP BY customer_id
)
SELECT
  fp.cohort_month,
  COUNT(*) AS customers_in_cohort
FROM first_purchase fp
GROUP BY fp.cohort_month
ORDER BY fp.cohort_month;

This is simple, but it highlights an important pattern: GROUP BY isn’t only for the final query. It can be a building block in intermediate CTEs that make the final aggregation clearer and safer.

Common mistakes I keep seeing (and how to avoid them)

Mistake 1: Grouping by a derived column without defining it

If you group by date_trunc('month', sale_ts) in the SELECT, the safest habit is to repeat the full expression in GROUP BY. PostgreSQL does let you group by the output alias (GROUP BY month_start works), but many other dialects don't, and relying on it can surprise readers. I stay consistent and repeat the expression to avoid ambiguity.

Mistake 2: Using COUNT(*) when you need COUNT(column)

COUNT(*) counts all rows, including those with nulls in your target column. If you need to count only rows with a non-null value, use COUNT(column). For example:

SELECT region, COUNT(discount_pct) AS discounts_set
FROM sales
GROUP BY region;

This counts only rows where discount_pct is not null. (In our sample schema discount_pct happens to be NOT NULL, so imagine a nullable column here.) It's an easy way to misreport if you default to COUNT(*).

Mistake 3: Grouping by high-cardinality fields accidentally

Grouping by customer_id or sale_id can explode your result size. If you’re trying to aggregate, be deliberate with grain. I’ve seen dashboards slow to a crawl because someone grouped by an ID column out of habit.

Mistake 4: Mixing time zones in date grouping

If you group by date_trunc('day', sale_ts) on a timestamptz, PostgreSQL uses your session time zone. That might not match your business logic. I always set the time zone in the query or in the connection.

SET TIME ZONE 'UTC';

-- or
SELECT date_trunc('day', sale_ts AT TIME ZONE 'America/Los_Angeles') AS local_day
FROM sales
GROUP BY date_trunc('day', sale_ts AT TIME ZONE 'America/Los_Angeles');

Mistake 5: Forgetting to handle division by zero

When computing averages or ratios, use NULLIF to avoid errors:

SELECT region,
  SUM(unit_price * quantity) / NULLIF(SUM(quantity), 0) AS avg_price
FROM sales
GROUP BY region;

Mistake 6: Unexpected duplication after joins

When you join a detail table to a dimension table, you can multiply rows without realizing it. If a product has multiple tags and you join tags before grouping, you’ll inflate revenue. The fix is to either group at the right level before joining, or aggregate the dimension first.

-- Safer: aggregate sales first, then join
WITH sales_by_product AS (
  SELECT product_id, SUM(unit_price * quantity) AS revenue
  FROM sales
  GROUP BY product_id
)
SELECT t.tag, SUM(sbp.revenue) AS revenue
FROM sales_by_product sbp
JOIN product_tags t ON t.product_id = sbp.product_id
GROUP BY t.tag;

This pattern keeps the measures from double-counting.

When to use GROUP BY — and when not to

I use GROUP BY when I need per-dimension aggregates: revenue per plan, users per country, or errors per endpoint. That’s the obvious case. But there are also times when GROUP BY is the wrong tool:

  • If you need the full row plus a per-group metric, consider window functions. A window query can keep row-level detail and include an aggregate column. Example:
SELECT
  sale_id,
  region,
  unit_price,
  quantity,
  SUM(unit_price * quantity) OVER (PARTITION BY region) AS regional_revenue
FROM sales;

  • If you need “top N per group,” you’re often better off with ROW_NUMBER() in a window and filtering.
  • If the data is already aggregated (like a daily rollup table), adding another GROUP BY might over-aggregate and hide spikes.

The key decision is whether you want a collapsed result set or just a computed value alongside each row. If you want detail rows, reach for window functions first.

Performance considerations that actually matter

Performance advice is usually vague, so I keep it practical. On large datasets, the main cost is scanning and sorting/grouping. Here’s what I check:

Indexes help only when they filter or join

GROUP BY itself doesn’t always benefit from indexes. Indexes help when you reduce the number of rows before grouping (via WHERE) or when you join. I focus on indexing filter columns like sale_ts and join keys like product_id.
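To make that concrete, here is the baseline indexing I’d reach for on the sales table above. Treat it as a sketch; the right set depends on your dominant filters, and the index names here are illustrative.

-- Supports time-range filters that shrink the row set before grouping
CREATE INDEX sales_sale_ts_idx ON sales (sale_ts);

-- Supports joins to the products dimension
CREATE INDEX sales_product_id_idx ON sales (product_id);

A composite index like (sale_ts, region) can help further if most queries filter by time and group by region, but verify with EXPLAIN before adding more indexes than you need.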

Use incremental aggregates for dashboards

If you’re computing daily or hourly aggregates repeatedly, create a materialized view and refresh it. For large systems, I schedule incremental refreshes or use a dedicated rollup table.

CREATE MATERIALIZED VIEW sales_daily AS
SELECT
  date_trunc('day', sale_ts) AS day_start,
  region,
  SUM(unit_price * quantity) AS revenue,
  COUNT(*) AS order_count
FROM sales
GROUP BY date_trunc('day', sale_ts), region;
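Once the view exists, keeping it fresh is a one-line statement you can schedule. REFRESH ... CONCURRENTLY avoids blocking readers, but it requires a unique index on the view; the index name below is illustrative.

-- Plain refresh: simple, but readers wait while it rebuilds
REFRESH MATERIALIZED VIEW sales_daily;

-- Non-blocking refresh: requires a unique index covering every row
CREATE UNIQUE INDEX sales_daily_key ON sales_daily (day_start, region);
REFRESH MATERIALIZED VIEW CONCURRENTLY sales_daily;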

Beware of high-cardinality group keys

Grouping by customer_id across billions of rows can be very expensive. If you need counts by user, consider batching by time or using approximate count tools to keep query times in a practical range.

Hash aggregation vs sort aggregation

PostgreSQL chooses between hash and sort aggregation. Hash aggregation is often faster for large groups but can spill to disk if memory is tight. You can influence memory with work_mem (carefully). I typically test both on representative datasets.
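The quickest way to see which strategy the planner chose is EXPLAIN. Look for HashAggregate versus GroupAggregate in the plan; with ANALYZE, recent PostgreSQL versions also report when a hash aggregate spilled to disk.

EXPLAIN (ANALYZE, BUFFERS)
SELECT region, SUM(unit_price * quantity) AS revenue
FROM sales
GROUP BY region;

-- Raising work_mem for the session (not globally) can keep a large
-- aggregation in memory; test on representative data first
SET work_mem = '256MB';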

Expect ranges, not exact timings

On modern hardware, a well-filtered group query might land in the 10–50ms range. The same query over a massive range without filters can jump to hundreds of milliseconds or a few seconds. The difference is almost always about row counts and IO, not SQL syntax.

Grouping sets and rollups: advanced but powerful

If you need multiple aggregation levels in one query—like per region and global totals—grouping sets can save you separate queries. I use this in reporting exports where I want subtotal rows without another round-trip.

SELECT
  region,
  date_trunc('month', sale_ts) AS month_start,
  SUM(unit_price * quantity) AS revenue
FROM sales
GROUP BY GROUPING SETS (
  (region, date_trunc('month', sale_ts)),
  (region),
  ()
)
ORDER BY region NULLS LAST, month_start NULLS LAST;

The empty () set gives you the grand total. This is a useful technique for CSV reports or BI exports where totals and subtotals are expected in the same dataset.

Distinguishing subtotal rows

When you use grouping sets, you’ll often want to label rows so downstream systems can tell totals from subtotals. PostgreSQL provides GROUPING() for that.

SELECT
  region,
  date_trunc('month', sale_ts) AS month_start,
  SUM(unit_price * quantity) AS revenue,
  GROUPING(region) AS is_region_total,
  GROUPING(date_trunc('month', sale_ts)) AS is_month_total
FROM sales
GROUP BY GROUPING SETS (
  (region, date_trunc('month', sale_ts)),
  (region),
  ()
);

If is_region_total is 1, that row is a grand total or a subtotal without a region. This gives you explicit control over presentation.

Practical edge cases you should handle

Null grouping values

If a grouping column is null, those rows are still grouped together. That may or may not be what you want. If you want a placeholder, use COALESCE:

SELECT
  COALESCE(region, 'Unknown') AS region_label,
  COUNT(*) AS order_count
FROM sales
GROUP BY COALESCE(region, 'Unknown');

Sparse data and missing buckets

GROUP BY won’t generate rows for missing groups. If you need a full time series with zeroes for missing days, you’ll need a calendar table or generate_series:

WITH days AS (
  SELECT generate_series('2025-12-01'::date, '2025-12-07'::date, '1 day')::date AS day
)
SELECT
  d.day,
  COALESCE(SUM(s.unit_price * s.quantity), 0) AS revenue
FROM days d
LEFT JOIN sales s
  ON date_trunc('day', s.sale_ts)::date = d.day
GROUP BY d.day
ORDER BY d.day;

This is an easy way to avoid gaps in charts.

Floating point surprises

When you aggregate monetary values, use NUMERIC and be consistent. Avoid float types for money if you need precise totals. I keep unit_price as NUMERIC and cast only when necessary.

Grouping by expression vs storing derived columns

If you group by date_trunc('day', sale_ts) all the time, consider adding a generated column or storing a date column. It can simplify queries and make indexing more effective. Just be careful to keep it consistent with the time zone you care about.

ALTER TABLE sales
  ADD COLUMN sale_date date
  GENERATED ALWAYS AS ((sale_ts AT TIME ZONE 'UTC')::date) STORED;

Then you can group by sale_date directly.

Boolean grouping

If you have a boolean column, grouping by it is a quick way to split counts. It’s easy but also easy to misread. I like to label the values explicitly:

SELECT
  CASE WHEN discount_pct > 0 THEN 'discounted' ELSE 'full_price' END AS price_type,
  COUNT(*) AS order_count
FROM sales
GROUP BY price_type;

This avoids “t/f” ambiguity in dashboards.

Alternative approaches to the same problem

Window functions for per-group metrics without collapsing

When you want detail rows plus group-level metrics, use window functions instead of GROUP BY. You can even calculate multiple aggregates without changing the result grain.

SELECT
  sale_id,
  region,
  unit_price,
  quantity,
  COUNT(*) OVER (PARTITION BY region) AS orders_in_region,
  SUM(unit_price * quantity) OVER (PARTITION BY region) AS revenue_in_region
FROM sales;

You can still filter these results later, but you keep detail rows for drill-downs.

Subqueries to control join explosion

Sometimes I’ll aggregate first, then join to avoid double-counting. This is especially useful when joins multiply rows.

WITH sales_per_customer AS (
  SELECT customer_id, SUM(unit_price * quantity) AS revenue
  FROM sales
  GROUP BY customer_id
)
SELECT c.customer_id, c.region, spc.revenue
FROM customers c
JOIN sales_per_customer spc ON spc.customer_id = c.customer_id;

This keeps the metric stable regardless of how many rows the customer has in other joined tables.

Materialized views and rollup tables

If you can trade a little freshness for speed, store aggregates. This is especially true for dashboards that hit the same query every minute.

CREATE MATERIALIZED VIEW sales_monthly AS
SELECT
  date_trunc('month', sale_ts) AS month_start,
  region,
  SUM(unit_price * quantity) AS revenue,
  COUNT(*) AS order_count
FROM sales
GROUP BY date_trunc('month', sale_ts), region;

Refresh on a schedule or incrementally, depending on your load. The end result is a query that runs in milliseconds rather than seconds.

Postgres-specific nuances worth knowing

GROUP BY position numbers

PostgreSQL lets you group by the position of a select expression, like GROUP BY 1, 2. It works, but I avoid it in production. It’s brittle when the SELECT list changes. Explicit expressions are safer and easier to read.

GROUP BY with collations and text

Text grouping can be impacted by collation rules (case sensitivity, locale). If you need case-insensitive grouping, use lower() or citext.

SELECT lower(region) AS region_norm, COUNT(*)
FROM sales
GROUP BY lower(region);

Be consistent in how you normalize, or you’ll get subtle splits in your groups.

GROUP BY with JSON

If you store attributes in JSON, you can group by a JSON key, but be mindful of missing or inconsistent keys.

SELECT
  (payload->>'source') AS source,
  COUNT(*)
FROM events
GROUP BY (payload->>'source');

If some rows are missing source, they’ll fall into a null group. You might want COALESCE(payload->>'source', 'unknown') instead.

Testing and validating grouped results

I’m cautious about GROUP BY queries because they can look correct while hiding subtle issues. My quick validation loop usually includes:

1) Sanity check totals: The sum of grouped counts should match the base row count for the time window.

WITH grouped AS (
  SELECT region, COUNT(*) AS cnt
  FROM sales
  WHERE sale_ts >= '2025-12-01' AND sale_ts < '2026-01-01'
  GROUP BY region
)
SELECT SUM(cnt) FROM grouped;

If that doesn’t match the raw count, I likely have a join explosion or a filter issue.

2) Cross-check a small sample: Pull a small set of regions or customers and compute the metrics by hand or with a filtered query.

3) Validate distinct counts: COUNT(DISTINCT ...) can be tricky with joins. I’ll run the distinct count on the base table if possible, or isolate the join.

4) Compare to a raw slice: If I’m reporting “revenue per region,” I’ll check a single region with a raw filter and ensure the sums match.

These steps are fast and save me from shipping wrong metrics.

Handling “top N per group” with GROUP BY and windows

GROUP BY alone doesn’t handle “top N per group” because it collapses rows. The common pattern is to first calculate per-group metrics, then rank within each group using a window.

WITH sales_by_product AS (
  SELECT
    region,
    product_id,
    SUM(unit_price * quantity) AS revenue
  FROM sales
  GROUP BY region, product_id
), ranked AS (
  SELECT
    region,
    product_id,
    revenue,
    ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue DESC) AS rn
  FROM sales_by_product
)
SELECT region, product_id, revenue
FROM ranked
WHERE rn <= 3
ORDER BY region, revenue DESC;

This is a pattern I use constantly in product analytics and merchandising reports.

Deeper performance tuning: what I actually do

If a group query is slow in production, I follow a simple ladder:

1) Add or tighten filters: Ask if the query can use a smaller time window or a narrower set of regions.

2) Check indexes: Ensure filter and join columns are indexed (sale_ts, product_id, customer_id).

3) Pre-aggregate: Use materialized views or summary tables for repeated queries.

4) Review join order: Aggregate first, join later to reduce row counts.

5) Inspect query plans: Look for large sequential scans and hash aggregation spills.

I try not to jump to exotic tuning until I’ve confirmed the basics are right. Most slow group queries are slow because they’re scanning too much data, not because GROUP BY is inherently expensive.

Comparing traditional vs modern approaches

Here’s how I think about “traditional” (ad hoc queries) vs “modern” (pre-aggregated and assisted) workflows for GROUP BY.

Traditional approach:

  • Write a GROUP BY query for each dashboard.
  • Add indexes as needed.
  • Re-run queries on every dashboard load.

Modern approach:

  • Use GROUP BY to build rollup tables.
  • Refresh those tables on a schedule (or incrementally).
  • Query the rollups for dashboards and exports.
  • Use AI tools to draft queries, then validate with tests.

The modern approach shifts the heavy work to a scheduled job, which keeps user-facing queries fast. It also creates a stable layer for analytics and reporting teams.

Production considerations: deployment, monitoring, scaling

If you’re running GROUP BY queries in production, I recommend:

  • Observability for heavy queries: Track query timings and row counts. Group queries can balloon unexpectedly as data grows.
  • Data retention: If raw data grows indefinitely, group queries will slow down. Archive older data or partition by date.
  • Partitioning: Partition by time for large event tables. PostgreSQL can prune partitions, which dramatically reduces the rows scanned.
  • Incremental loads: For rollups, update only the latest time range instead of refreshing everything.
  • Backfills: When schema changes happen, backfill aggregates carefully to avoid mixing old and new logic.
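The partitioning point deserves a sketch, because it pairs naturally with time-bucketed GROUP BY. Assuming a declaratively partitioned variant of the sales table (table and partition names here are illustrative):

CREATE TABLE sales_part (
  sale_ts TIMESTAMPTZ NOT NULL,
  region TEXT NOT NULL,
  unit_price NUMERIC(10,2) NOT NULL,
  quantity INT NOT NULL
) PARTITION BY RANGE (sale_ts);

CREATE TABLE sales_part_2025_12 PARTITION OF sales_part
  FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');

With this layout, a WHERE clause on sale_ts lets the planner prune to the matching partitions before any grouping happens, so the aggregate scans only the months you asked about.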

These practices aren’t specific to GROUP BY, but they matter more because aggregation touches lots of data and can make performance issues visible quickly.

Modern workflows in 2026: AI-assisted SQL, but still verified

In 2026, it’s normal to use AI tooling to draft SQL quickly. I use copilots to sketch queries and then validate with tests or data checks. GROUP BY is a prime candidate for verification because it’s easy to write a syntactically valid query that returns the wrong totals.

My workflow looks like this:

  • Ask the tool to draft the query.
  • Add explicit comments or CTEs so intent is clear.
  • Validate totals with a smaller slice.
  • Compare against a trusted baseline.

AI is a speed boost, not a substitute for verification.

A compact checklist I keep in my head

When I’m writing a GROUP BY query under pressure, I mentally check:

  • What is the exact grain of the result? (Which dimensions are in GROUP BY?)
  • Are all selected columns either aggregated or grouped?
  • Do filters belong in WHERE or HAVING?
  • Will any joins multiply rows?
  • Do I need distinct counts or weighted averages?
  • Is the time zone correct for time bucketing?
  • Are any groups missing because I need a calendar table?

If I answer those quickly, I usually avoid the pitfalls.

Extended examples for deeper practice

Example 1: Plan-level metrics for subscriptions

Imagine a subscriptions table where each row is a subscription event. You want active subscriptions per plan, excluding cancelled ones.

CREATE TABLE subscriptions (
  subscription_id BIGSERIAL PRIMARY KEY,
  customer_id BIGINT NOT NULL,
  plan_id BIGINT NOT NULL,
  status TEXT NOT NULL, -- 'active', 'cancelled'
  start_ts TIMESTAMPTZ NOT NULL,
  end_ts TIMESTAMPTZ
);

SELECT
  plan_id,
  COUNT(*) AS active_subscriptions
FROM subscriptions
WHERE status = 'active'
GROUP BY plan_id
ORDER BY active_subscriptions DESC;

If you need the active count “as of” a date, it becomes a range filter:

SELECT
  plan_id,
  COUNT(*) AS active_as_of
FROM subscriptions
WHERE start_ts <= '2025-12-31'
  AND (end_ts IS NULL OR end_ts > '2025-12-31')
GROUP BY plan_id;

This is a typical case where GROUP BY returns a business metric you can put straight into a report.

Example 2: Error rates per endpoint

If you log API requests and want error rates per endpoint, you can group by endpoint and status.

SELECT
  endpoint,
  COUNT(*) AS total,
  SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END) AS errors,
  SUM(CASE WHEN status_code >= 500 THEN 1 ELSE 0 END)::numeric
    / NULLIF(COUNT(*), 0) AS error_rate
FROM api_logs
WHERE log_ts >= now() - interval '24 hours'
GROUP BY endpoint
ORDER BY error_rate DESC;

This example shows how a derived aggregate in SUM(CASE...) can give you a direct rate in the same query. It’s a powerful pattern when you need multiple metrics per group.

Example 3: Funnel stages grouped by day

Suppose you have event logs with stage names (signup, onboarding, purchase) and you want counts per day per stage.

SELECT
  date_trunc('day', event_ts) AS day_start,
  stage,
  COUNT(*) AS events
FROM event_log
WHERE event_ts >= now() - interval '30 days'
GROUP BY date_trunc('day', event_ts), stage
ORDER BY day_start, stage;

This is a typical analytics query. If your dashboard needs zeroes for missing stages, pair with generate_series and a stage reference table.

Dealing with skew and uneven groups

Sometimes one group dominates the data (like a single region or a “default” tenant). This can skew aggregates and make performance unpredictable.

  • If a single group is huge, aggregate that group separately and union the results.
  • Consider partitioning by the grouping column if it’s a major dimension.
  • Use WHERE to limit the scope for interactive dashboards.

Skew isn’t always a query problem; it’s sometimes a data distribution problem. But acknowledging it helps avoid surprises.

Final thoughts

GROUP BY is one of those SQL features that feels basic until you’ve lived with it in production. The syntax is small, but the practice is deep. The difference between a correct, fast, trustworthy aggregation and a misleading, slow one often comes down to tiny details: grouping at the right grain, filtering at the right stage, or preventing double-counting after joins.

If you take one thing away, it’s this: be explicit. Be explicit about grain, explicit about filters, explicit about time zones, and explicit about which columns are aggregated. The more explicit you are, the less likely you are to ship a query that quietly lies.

When in doubt, I run a small validation slice, compare totals, and trust the data more than my intuition. GROUP BY is simple, but it’s also the foundation of how teams turn raw data into decisions. Getting it right makes everything downstream—dashboards, reports, alerts, and models—more reliable.
