A lot of SQL mistakes I review aren’t “syntax” mistakes—they’re shape mistakes. Someone needed one list, but accidentally produced a wide table. Or they needed to combine related attributes, but ended up stacking unrelated rows. Those errors can look reasonable in a quick spot-check, especially with small test data, and then quietly corrupt dashboards, exports, and downstream services.
The fastest way I’ve found to prevent this class of bug is to internalize one simple question before you write anything: Am I trying to combine columns or combine rows? If you’re combining columns across tables based on a relationship, you’re in JOIN territory. If you’re stacking result sets from multiple queries into one longer result, you’re in UNION territory.
I’ll walk you through that mental model with runnable examples, show where people slip (especially around duplicates, NULLs, and mismatched schemas), and give you concrete rules I use in production: when I reach for JOIN, when I reach for UNION, when I refuse to use either, and how I validate the output shape before it ships.
The “Shape” Mental Model: Columns vs Rows
When you JOIN, you’re “widening” your result. You take rows from a primary table and attach related columns from another table (or the same table) based on a match condition.
When you UNION, you’re “lengthening” your result. You take the output of one SELECT and stack it on top of the output of another SELECT, producing more rows.
Here’s a simple analogy I use when explaining this to new engineers:
- JOIN is like merging two spreadsheets side-by-side using a shared key column (like Order ID).
- UNION is like taking two spreadsheets with the same columns and pasting one under the other.
This is not just conceptual—it’s literal. JOIN increases the number of columns (unless you select fewer), and UNION increases the number of rows.
I also keep two “sanity checks” in my head:
1) With JOIN, row counts can change (sometimes dramatically) if your relationship isn’t truly 1-to-1.
2) With UNION, column definitions must align (count and compatible types), or you’re either going to get an error or you’ll force an implicit cast that bites later.
To make this concrete, I’ll use small tables you can run in PostgreSQL, MySQL 8+, SQL Server (minor tweaks), or SQLite (minor tweaks). I’ll write standard SQL where possible.
One extra shape rule I’ve learned to say out loud (because it catches mistakes early):
- If I can draw an arrow between tables (customer → orders), I probably want a JOIN.
- If I can describe it as “two sources of the same thing” (events from system A + events from system B), I probably want a UNION ALL.
Also: you can absolutely combine the two. A lot of “real” queries are: shape two streams into the same schema (UNION ALL), then enrich them (JOIN), then summarize (GROUP BY). The important part is that each step has a clear shape intention.
JOIN: Combining Related Data Into New Columns
A JOIN connects rows across tables using a predicate (most often equality on key columns). The output is a single rowset where columns from both sides appear together.
The reason JOIN exists is simple: normalized databases store different attributes in different tables. JOIN is the mechanism that lets you reconstruct a useful view without denormalizing everything.
A runnable example
Let’s model two tables: one holding students, another holding addresses. (This mirrors a common pattern: “entity” table plus “detail” table.)
-- Works in PostgreSQL. For other databases, the idea is the same.
WITH boys AS (
SELECT * FROM (VALUES
(10, 'Ram', 14),
(15, 'Ritik', 15),
(16, 'Sanjay', 16),
(17, 'Prakhar', 17)
) AS t(rollno, name, age)
),
girls AS (
SELECT * FROM (VALUES
(9, 'Seema', 'Pune'),
(10, 'Rimi', 'Delhi'),
(15, 'Mona', 'Delhi'),
(16, 'Nisha', 'Goa'),
(17, 'Anaya', 'Bhopal')
) AS t(rollno, name, address)
)
SELECT
b.name AS boy_name,
b.age AS boy_age,
g.address
FROM boys b
INNER JOIN girls g
ON b.rollno = g.rollno
ORDER BY b.rollno;
What this produces (conceptually):
- It returns only roll numbers that exist in both sets (because INNER JOIN).
- It returns a row where the boy’s name/age and the matching address appear in the same row.
That “same row” property is the whole point of JOIN.
One thing I always emphasize: JOIN doesn’t “merge tables,” it “matches rows.” That sounds pedantic, but it changes how you debug. If something looks wrong, you don’t stare at columns—you ask “which rows matched, and why?”
JOIN types you actually use
I’m not going to enumerate every JOIN variant as a vocabulary list. Instead, here’s how I decide:
- INNER JOIN: I require a match on both sides. This is my default when missing relationships should exclude the row.
- LEFT JOIN: I keep everything from the left side even if there's no match on the right. This is my default for enrichment (adding optional details).
- FULL OUTER JOIN: I want everything from both sides, matched where possible. I mainly use this for audits and reconciliation.
Two practical notes that save time:
1) RIGHT JOIN is just LEFT JOIN with tables swapped. I almost never use RIGHT JOIN in production code because it makes scanning harder.
2) CROSS JOIN (Cartesian product) has valid uses (generating date grids, building all combinations for testing), but if you didn’t intentionally choose it, it’s usually a bug.
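To show what the intentional case looks like, here's a small sketch of CROSS JOIN used to build all combinations for testing. The `sizes` and `colors` tables are hypothetical, chosen just to illustrate the pattern:

```sql
-- Intentional CROSS JOIN: enumerate every size/color combination.
WITH sizes AS (
  SELECT * FROM (VALUES ('S'), ('M'), ('L')) AS t(size)
),
colors AS (
  SELECT * FROM (VALUES ('red'), ('blue')) AS t(color)
)
SELECT s.size, c.color
FROM sizes s
CROSS JOIN colors c;  -- 3 sizes x 2 colors = 6 rows
```

If you see a Cartesian product like this and you didn't write CROSS JOIN on purpose, check for a missing ON clause.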
The most common JOIN bug: accidental row multiplication
JOIN is not “safe” by default. If you join one row to many rows, you’ll multiply rows.
Example: one customer has three orders. If you join customers to orders, the customer row appears three times.
That’s correct behavior, but it’s a frequent source of incorrect totals.
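Here's a minimal, runnable sketch of how that multiplication inflates a total. The `order_items` table is hypothetical; the point is that the order row repeats once per matching item, so summing after the join overcounts:

```sql
-- One order worth 30.00, with two line items.
WITH orders AS (
  SELECT * FROM (VALUES (1, 30.00)) AS t(order_id, total)
),
order_items AS (
  SELECT * FROM (VALUES (1, 'book'), (1, 'pen')) AS t(order_id, item)
)
SELECT SUM(o.total) AS revenue  -- returns 60.00, not 30.00:
FROM orders o                   -- the single order row matched twice
JOIN order_items oi ON o.order_id = oi.order_id;
```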
I use this quick checklist:
- If I’m joining for display (show customer + order info), row multiplication is fine.
- If I’m joining then aggregating (SUM, COUNT), I stop and confirm cardinality.
A practical guardrail is to aggregate before you join:
WITH orders AS (
SELECT * FROM (VALUES
(1, 101, 29.99),
(2, 101, 14.50),
(3, 102, 9.00)
) AS t(order_id, customer_id, amount)
),
customers AS (
SELECT * FROM (VALUES
(101, 'Ava'),
(102, 'Noah')
) AS t(customer_id, customer_name)
),
order_totals AS (
SELECT customer_id, SUM(amount) AS total_amount
FROM orders
GROUP BY customer_id
)
SELECT c.customer_name, ot.total_amount
FROM customers c
LEFT JOIN order_totals ot
ON c.customer_id = ot.customer_id
ORDER BY c.customer_id;
Notice what I did: I produced one row per customer in order_totals before joining. This keeps the final result stable.
If you take nothing else from this section, take this: aggregation “collapses” multiplicity; JOIN can “expand” it. If you do them in the wrong order, you can end up counting the same money twice.
Cardinality: the question you should answer before every JOIN
When I’m about to write a JOIN, I try to classify it as one of these:
- 1-to-1 (rare, but ideal): each row matches at most one row on the other side.
- 1-to-many (common): one customer → many orders.
- many-to-many (dangerous unless intentional): tags, enrollments, memberships.
If it’s many-to-many, I don’t proceed until I can explain why the multiplication is correct. Many-to-many joins are valid, but they are exactly where “looks fine on small data” turns into “my totals doubled in production.”
A simple technique I use during development is to run a quick multiplicity check.
-- For a given join key, how many matches exist on each side?
SELECT customer_id, COUNT(*) AS orders_per_customer
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1;
If I expect 1-to-1 and I see counts above 1, I know the join will multiply rows.
NULLs and JOINs: what actually happens
People often say “NULLs don’t match.” That’s generally true for equality joins (a.col = b.col). Two NULLs are not equal in SQL’s three-valued logic.
So if you have NULL join keys, these rows typically won’t match in an INNER JOIN, and in a LEFT JOIN they’ll show up with NULLs for the right-side columns.
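A tiny runnable sketch makes this concrete (hypothetical two-row tables, each with one NULL key):

```sql
-- The value 1 matches across tables; the NULL keys never match each other.
WITH a AS (
  SELECT * FROM (VALUES (1), (NULL)) AS t(k)
),
b AS (
  SELECT * FROM (VALUES (1), (NULL)) AS t(k)
)
SELECT a.k AS a_key, b.k AS b_key
FROM a
LEFT JOIN b ON a.k = b.k;
-- Returns (1, 1) plus one row where a_key is NULL and b_key is NULL
-- because the NULL key found no match. An INNER JOIN would return only (1, 1).
```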
This matters when your “key” isn’t truly a key. If someone tries to join on email (which may be NULL, may change, may differ by case/collation), you can get silent mismatches.
My rule: join on stable identifiers (surrogate keys, natural keys with enforced constraints). If you can’t enforce it, treat it as an approximate match and expect data quality issues.
JOINs for existence vs JOINs for enrichment (EXISTS can be better)
Sometimes people use JOIN when they only want to filter based on whether a related row exists.
Example: “show customers who placed at least one order.”
A JOIN works, but it can multiply customers if they have multiple orders. You can fix it with DISTINCT, but that’s often a smell.
I reach for EXISTS because it expresses intent and avoids accidental multiplication:
SELECT c.customer_id, c.customer_name
FROM customers c
WHERE EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
);
If your goal is “keep these rows if a match exists,” EXISTS is often safer and faster.
UNION: Stacking Result Sets Into New Rows
UNION combines the output of two (or more) SELECT statements into a single result set by stacking rows.
The key constraints:
- Each SELECT must return the same number of columns.
- Corresponding columns should have compatible data types.
- UNION (as opposed to UNION ALL) removes duplicates (distinct rows).
A runnable example
Let’s create two lists of names and stack them.
WITH boys AS (
SELECT * FROM (VALUES
(10, 'Ram'),
(15, 'Ritik'),
(16, 'Sanjay'),
(17, 'Prakhar')
) AS t(rollno, name)
),
girls AS (
SELECT * FROM (VALUES
(9, 'Seema'),
(10, 'Rimi'),
(15, 'Mona'),
(16, 'Nisha'),
(17, 'Anaya')
) AS t(rollno, name)
)
SELECT name
FROM boys
WHERE rollno < 16
UNION
SELECT name
FROM girls
WHERE rollno > 9
ORDER BY name;
This produces a single-column list of names.
What matters here is not that the underlying data came from two sources; what matters is that the output schema matches: both SELECTs return one column, name, of a compatible type.
UNION vs UNION ALL (this is where teams get burned)
- UNION = distinct rows only (duplicate elimination)
- UNION ALL = keep duplicates (no duplicate elimination)
In production, I default to UNION ALL unless I have a specific reason to remove duplicates.
Why? Because duplicate elimination has a cost (sort/hash), and because “duplicates” might be meaningful (for example, two different systems emitted the same event and you want to investigate that, not hide it).
Here’s a small demo:
WITH a AS (
SELECT * FROM (VALUES
('Mona'),
('Seema')
) AS t(name)
),
b AS (
SELECT * FROM (VALUES
('Mona'),
('Rimi')
) AS t(name)
)
SELECT name FROM a
UNION
SELECT name FROM b
ORDER BY name;
This returns Mona once.
Now:
WITH a AS (
SELECT * FROM (VALUES
('Mona'),
('Seema')
) AS t(name)
),
b AS (
SELECT * FROM (VALUES
('Mona'),
('Rimi')
) AS t(name)
)
SELECT name FROM a
UNION ALL
SELECT name FROM b
ORDER BY name;
This returns Mona twice.
My rule: if your downstream logic is doing counts, you should be explicit about whether duplicates are allowed. “Accidentally distinct” is a classic analytics bug.
Schema alignment: the hidden work in most UNIONs
UNION feels simple until you do it with real tables. In real systems, different sources rarely line up perfectly:
- column names differ (user_id vs account_id)
- types differ (uuid vs text)
- meaning differs (one table stores UTC timestamps, the other stores local time)
A UNION is only as good as the contract you define for the output.
When I build a unioned dataset, I explicitly choose:
- a stable column order
- consistent aliases
- explicit casts
- a discriminator column (source, event_type, system)
That discriminator sounds optional, but it’s the difference between “useful unified feed” and “mystery meat dataset.”
Ordering and limiting UNION results (common gotcha)
One subtle point: ORDER BY applies to the final combined result, not individual branches.
So if you want “top 10 from each branch” and then combine, you generally need to apply LIMIT inside each subquery/CTE, then union those results.
WITH top_a AS (
SELECT user_id, occurred_at
FROM events_a
ORDER BY occurred_at DESC
LIMIT 10
),
top_b AS (
SELECT user_id, occurred_at
FROM events_b
ORDER BY occurred_at DESC
LIMIT 10
)
SELECT * FROM top_a
UNION ALL
SELECT * FROM top_b
ORDER BY occurred_at DESC;
Without that structure, you might accidentally take “top 10 overall,” not “top 10 per source.”
A Direct Comparison (and the Rules I Actually Follow)
Here’s the crisp difference I teach:
- JOIN combines related tables into a wider result (more columns).
- UNION combines compatible SELECT outputs into a longer result (more rows).
A practical comparison:
- Purpose: JOIN attaches related columns across tables; UNION stacks rows from compatible SELECTs.
- Shape: JOIN typically increases columns; UNION increases rows.
- Requires: JOIN needs a join predicate (match condition); UNION needs the same column count and compatible types.
- Duplicates: JOIN can create duplicates via many-to-many matches; UNION removes duplicates, while UNION ALL keeps them.
- Common bug: with JOIN, row multiplication and incorrect aggregates; with UNION, accidental deduplication or silent type coercion.
And here are my operational rules:
1) If I need attributes from another table to describe the same entity row, I JOIN.
2) If I need one output list that comes from multiple sources with the same schema, I UNION ALL.
3) If I catch myself wanting UNION to “combine tables,” I pause—most likely I actually need JOIN (or I need to remodel the query).
4) If I catch myself joining two tables without knowing the cardinality, I pause—most likely I’m about to inflate counts.
5) If I’m using UNION (distinct) to “clean up” duplicates, I pause—most likely I’m hiding a data quality problem or I’m deduplicating in the wrong place.
6) If I’m joining just to filter for existence, I try EXISTS first.
A quick self-test I do before I run a query
I literally ask myself which of these I’m trying to produce:
- “One row per X, enriched with details” → JOIN (and/or pre-aggregate)
- “A combined stream of X from multiple sources” → UNION ALL
- “Only X where a relationship exists” → EXISTS
If I can’t name the grain (“one row per what?”), I stop. Not because I’m being academic, but because that’s where shape bugs originate.
Picking the Right Tool in Real Work (Scenarios You’ll Recognize)
This is where it becomes straightforward.
Scenario A: “Show orders with customer emails” (JOIN)
You have orders(order_id, customer_id, total) and customers(customer_id, email).
You want each order row plus the email.
That’s a JOIN:
SELECT
o.order_id,
o.total,
c.email
FROM orders o
INNER JOIN customers c
ON o.customer_id = c.customer_id;
If a customer might be missing (data quality issues, soft-deleted accounts), I use LEFT JOIN and explicitly handle NULLs.
Also, when I do LEFT JOIN, I’ll often add a quick “missing rate” check in development:
SELECT
COUNT(*) AS total_orders,
SUM(CASE WHEN c.customer_id IS NULL THEN 1 ELSE 0 END) AS missing_customers
FROM orders o
LEFT JOIN customers c
ON o.customer_id = c.customer_id;
If missing_customers isn’t ~0 and I didn’t expect it, I investigate before shipping.
Scenario B: “Build a single activity feed from two event tables” (UNION ALL)
You store purchases and logins separately:
- purchase_events(user_id, occurred_at, amount)
- login_events(user_id, occurred_at, ip_address)
For a unified feed, you project them into the same schema and UNION ALL:
SELECT
user_id,
occurred_at,
'purchase' AS event_type,
CAST(amount AS text) AS event_detail
FROM purchase_events
UNION ALL
SELECT
user_id,
occurred_at,
'login' AS event_type,
ip_address AS event_detail
FROM login_events
ORDER BY occurred_at DESC;
Note the deliberate normalization:
- I add a discriminator column (event_type).
- I align types (event_detail as text).
I do this intentionally so the meaning is stable.
If you don’t like collapsing into event_detail, you can also build a wider schema with multiple nullable columns:
SELECT user_id, occurred_at, 'purchase' AS event_type, amount, NULL AS ip_address
FROM purchase_events
UNION ALL
SELECT user_id, occurred_at, 'login' AS event_type, NULL AS amount, ip_address
FROM login_events;
That output is still “one schema,” but it keeps typed columns. I pick this approach when downstream users want to filter/aggregate without parsing text.
Scenario C: “Combine two tables with different columns” (neither, until you align)
If one SELECT returns (id, name) and the other returns (id, name, region), you cannot UNION them without shaping.
You can:
- add a NULL placeholder
- or project both down to a shared subset
Example:
SELECT id, name, region
FROM customers
UNION ALL
SELECT id, name, NULL AS region
FROM leads;
I’m explicit about the placeholder. Otherwise, someone will read the output later and assume region is “real” for both.
One more thing: if customers.region is not nullable, your unioned output will be nullable. That’s an output-contract change. Sometimes that’s fine, but I like to acknowledge it.
Scenario D: “I want to match rows and also stack categories” (JOIN + UNION)
Sometimes you genuinely need both.
Example: you have current customers and archived customers in separate tables, and you also need to attach subscription data.
I typically UNION ALL the customer sources into one CTE (same schema), then JOIN once:
WITH all_customers AS (
SELECT customer_id, email, created_at
FROM customers_current
UNION ALL
SELECT customer_id, email, created_at
FROM customers_archived
)
SELECT ac.customer_id, ac.email, s.plan
FROM all_customers ac
LEFT JOIN subscriptions s
ON ac.customer_id = s.customer_id;
This keeps the query readable: “stack first, then enrich.”
If there’s a risk of overlap (a customer appears in both current and archived), I’ll either:
- enforce it upstream (best)
- or deduplicate with a deterministic rule (next best)
For example, pick current over archived:
WITH all_customers AS (
SELECT customer_id, email, created_at, 1 AS source_priority
FROM customers_current
UNION ALL
SELECT customer_id, email, created_at, 2 AS source_priority
FROM customers_archived
),
chosen AS (
SELECT *
FROM (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY source_priority) AS rn
FROM all_customers
) x
WHERE rn = 1
)
SELECT c.customer_id, c.email, s.plan
FROM chosen c
LEFT JOIN subscriptions s
ON c.customer_id = s.customer_id;
That’s a “real production” pattern: union multiple sources, then choose one record per key.
Common Mistakes I See (and How You Avoid Them)
These show up in code reviews constantly.
Mistake 1: Using UNION when you meant JOIN
Symptom: you want name, age, address but you write separate SELECTs and UNION them.
That can only produce more rows, not attach address to the matching person.
Fix: decide the relationship key (like rollno), then JOIN.
A quick diagnostic question I ask: “If I print the output, do I expect one line per person, or two lines per person?” If it’s one line per person with multiple columns, UNION can’t be correct.
Mistake 2: Using JOIN when you meant UNION
Symptom: you want “all names from both tables,” but you JOIN on roll number. You’ll only get names where roll numbers match, and you’ll lose non-matching rows.
Fix: project the same schema and UNION (usually UNION ALL).
When someone says “combine” in English, I ask them to clarify: “combine side-by-side or stack?” That one clarification prevents half the mistakes.
Mistake 3: Relying on UNION to remove “bad duplicates”
Sometimes duplicates are a data problem, not a query problem.
If you use UNION to hide duplicates, you can mask:
- a broken ingestion job
- a missing uniqueness constraint
- an accidental fan-out earlier in the pipeline
Fix: start with UNION ALL, measure duplicates, and only then decide whether to deduplicate—and where. Often the right place is upstream with constraints or a cleanup job.
A practical technique: surface duplicates explicitly.
WITH combined AS (
SELECT id FROM source_a
UNION ALL
SELECT id FROM source_b
)
SELECT id, COUNT(*) AS occurrences
FROM combined
GROUP BY id
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;
Now you’re not hiding duplicates—you’re learning from them.
Mistake 4: Silent type coercion across UNION branches
Different databases resolve types differently. Even when a UNION works, you might get:
- numeric coerced to text
- timestamps coerced to date
- different collations or encodings causing comparison surprises
Fix: cast deliberately in each branch so the output contract is clear.
When I want long-lived maintainability, I’ll often add a “schema CTE” pattern:
WITH typed AS (
SELECT
CAST(user_id AS bigint) AS user_id,
CAST(occurred_at AS timestamp) AS occurred_at,
CAST(event_type AS text) AS event_type,
CAST(event_detail AS text) AS event_detail
FROM (
SELECT user_id, occurred_at, 'purchase' AS event_type, CAST(amount AS text) AS event_detail
FROM purchase_events
UNION ALL
SELECT user_id, occurred_at, 'login' AS event_type, ip_address AS event_detail
FROM login_events
) u
)
SELECT *
FROM typed;
It’s a little verbose, but it makes the contract explicit and keeps surprises from creeping in when schemas evolve.
Mistake 5: LEFT JOIN filters placed in the wrong clause
This one is subtle: a predicate in WHERE can turn a LEFT JOIN into “effectively inner.”
If you write:
SELECT c.customer_id, s.plan
FROM customers c
LEFT JOIN subscriptions s
ON c.customer_id = s.customer_id
WHERE s.plan = 'pro';
You’ve excluded customers without subscriptions because s.plan is NULL for them.
If you truly want to keep customers and only filter matched rows, push the filter into the JOIN condition:
SELECT c.customer_id, s.plan
FROM customers c
LEFT JOIN subscriptions s
ON c.customer_id = s.customer_id
AND s.plan = 'pro';
That’s not a UNION topic, but it’s part of the same “shape discipline”: you should know what rows you’re keeping before you add conditions.
Mistake 6: Joining on non-unique keys and then “fixing” it with DISTINCT
This is a classic:
- someone joins on a column that isn't unique (like email)
- the result multiplies
- they add SELECT DISTINCT to "clean it up"
That might hide the symptom, but it doesn’t fix the cause. You can still end up with the wrong associated values (which email matched which record?), and you can still overcount when you aggregate later.
Fix: enforce uniqueness in data (constraints) or change the join key to something truly unique.
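As a sketch of the constraint-based fix (assuming a hypothetical customers table where email should be unique; syntax shown is PostgreSQL-style):

```sql
-- Enforce uniqueness at the source so joins on email can't fan out.
ALTER TABLE customers
ADD CONSTRAINT customers_email_unique UNIQUE (email);

-- If adding the constraint fails, find the offending duplicates first:
SELECT email, COUNT(*) AS occurrences
FROM customers
GROUP BY email
HAVING COUNT(*) > 1;
```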
Mistake 7: UNION-ing incompatible meaning
Even if two branches have the same schema, the semantics might not match.
Example: one table stores amount as cents (integer), another stores amount as dollars (decimal). A UNION can succeed and still be wrong.
Fix: normalize units and meaning in each branch before unioning. I treat that normalization as part of the query, not an afterthought.
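A sketch of what that normalization looks like, using the cents-vs-dollars example (payments_a and payments_b are hypothetical table names):

```sql
-- payments_a stores amount_cents (integer); payments_b stores amount_dollars (decimal).
-- Normalize both branches to decimal dollars before stacking.
SELECT payment_id, CAST(amount_cents AS decimal) / 100 AS amount_dollars
FROM payments_a
UNION ALL
SELECT payment_id, amount_dollars
FROM payments_b;
```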
Performance and Maintainability: What I Watch in Production
I avoid promising exact timings because hardware, data size, and engine choices vary. But you can still reason about typical behavior.
JOIN performance considerations
JOIN cost is driven by:
- cardinality (how many rows match how many rows)
- indexes on join keys
- join strategy (hash join, merge join, nested loop)
If you join a large table to another large table on non-indexed columns, it can jump from “tens of milliseconds” to “seconds” quickly. The fix is rarely “add random indexes everywhere.” I start with:
- confirm the join predicate uses the real key
- reduce rows early (filter in a CTE/subquery)
- aggregate early if you only need summary rows
I also keep an eye out for accidental many-to-many joins in analytics pipelines. Even if the query “works,” it can explode intermediate row counts and thrash memory.
UNION / UNION ALL performance considerations
- UNION ALL is usually cheap: concatenate outputs.
- UNION requires duplicate elimination, which usually needs extra work (sort/hash). On large result sets, that can be noticeable.
If you need uniqueness, consider whether you can deduplicate using a key and a deterministic rule instead of full-row distinctness. For example, “latest event per user” is not the same thing as UNION.
A pattern I like is: UNION ALL → then deduplicate with a window function.
WITH combined AS (
SELECT user_id, occurred_at, payload, 'a' AS source
FROM events_a
UNION ALL
SELECT user_id, occurred_at, payload, 'b' AS source
FROM events_b
),
ranked AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY user_id, occurred_at, payload
ORDER BY source
) AS rn
FROM combined
)
SELECT user_id, occurred_at, payload
FROM ranked
WHERE rn = 1;
That approach makes dedup logic explicit. I can explain it, test it, and modify it later.
Maintainability: make the output contract obvious
For UNION queries, I like to make the schema visible:
- use consistent column names via AS aliases
- add event_source / event_type columns
- cast types explicitly so later schema changes don't silently alter results
For JOIN queries, I like to make the grain obvious:
- name CTEs by grain (orders_per_customer, latest_status_per_order)
- avoid selecting * across multiple tables (it hides collisions and makes diffs noisy)
- prefer JOIN ... ON predicates that read like keys, not like ad hoc filters
And for both: I put “shape assertions” in my head (and sometimes in tests).
How I validate output shape before it ships
When correctness matters (dashboards, billing, exports), I validate in three layers:
1) Row count expectations
- If I'm enriching a fact table with optional dimensions (LEFT JOIN), I expect row count to stay the same.
- If row count changes, I investigate.
2) Key uniqueness expectations
- If the result should be one row per order_id, I check that.
SELECT order_id, COUNT(*)
FROM (
-- your query here
) q
GROUP BY order_id
HAVING COUNT(*) > 1;
3) Null-rate expectations
- If customer_email is expected to be present 99%+ of the time, I compute the null rate. A spike often means a join mismatch or a key change.
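The null-rate computation itself is short. A sketch, assuming your query's output is wrapped as a subquery (your_query_output is a placeholder):

```sql
-- What fraction of rows are missing customer_email?
SELECT
  COUNT(*) AS total_rows,
  SUM(CASE WHEN customer_email IS NULL THEN 1 ELSE 0 END) AS null_emails,
  AVG(CASE WHEN customer_email IS NULL THEN 1.0 ELSE 0.0 END) AS null_rate
FROM your_query_output;
```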
These checks take minutes and save hours of “why did revenue jump?” later.
A Practice Loop for Choosing Between JOIN and UNION
If you’re trying to get better at choosing between JOIN and UNION (and not just memorize definitions), here’s the practice loop I recommend.
Step 1: State the grain in one sentence
Before writing SQL, I write one sentence like:
- “One row per order, enriched with customer email.”
- “One row per user event across all sources.”
If you can’t say it, you can’t validate it.
Step 2: Decide whether you’re widening or lengthening
Then I choose:
- Widen (columns): JOIN
- Lengthen (rows): UNION ALL
If I find myself wanting “both,” I structure it as multiple CTEs so each step is obvious.
Step 3: Build a minimal runnable example
I don’t start with the full production query. I start with:
- 2–5 rows that include edge cases: missing matches, duplicates, many-to-many, NULLs
- a query that demonstrates the intended shape
This is where the earlier “VALUES” examples shine. Tiny test data reveals whether you’re widening vs lengthening.
Step 4: Add guardrails
When the query matters, I add guardrails:
- pre-aggregation before joins
- explicit casts in unions
- discriminators (event_type, source)
- uniqueness checks on expected keys
My goal is that future me (or a teammate) can modify the query without re-learning the pitfalls.
SQL in Modern Workflows: Reviews, Tests, and Tooling
In modern teams, SQL rarely lives alone. It gets generated, templated, linted, and tested. A few workflow notes that tie directly to JOIN vs UNION correctness:
SQL review checklist (fast and practical)
When I review SQL that contains JOINs or UNIONs, I scan for:
- “What is the grain?” (Is it stated implicitly by grouping keys, or explicitly in comments/CTE names?)
- Are JOIN keys truly keys (unique on at least one side), or are we joining on messy attributes?
- Are LEFT JOIN filters accidentally placed in WHERE?
- Is the query using UNION when it should be UNION ALL?
- Are types aligned explicitly across UNION branches?
Testing and monitoring for shape regressions
If the query feeds something important (billing, KPIs), I like lightweight automated checks:
- unit-style checks: row counts by day, null rates, uniqueness on primary keys
- anomaly alerts: sudden changes in duplicate rate after a union, or sudden growth in row count after a join
You don’t need heavy infrastructure to do this; even a scheduled job that logs these metrics can catch regressions.
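A sketch of what that scheduled metrics query could look like, assuming a hypothetical events table with event_id and occurred_at columns (adapt names to your schema):

```sql
-- Daily shape metrics worth logging: row count and duplicate rate.
SELECT
  CAST(occurred_at AS date) AS day,
  COUNT(*) AS row_count,
  COUNT(DISTINCT event_id) AS distinct_events,
  COUNT(*) - COUNT(DISTINCT event_id) AS duplicate_rows
FROM events
GROUP BY CAST(occurred_at AS date)
ORDER BY day DESC;
```

Logging these once a day gives you a baseline, so a sudden jump in row_count or duplicate_rows after a query change is immediately visible.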
Using AI-assisted workflows safely (optional)
If you use tools that help draft SQL (including AI assistants), I’ve found one constraint dramatically improves safety:
- Require the assistant to state the intended grain (“one row per X”) and whether the query widens or lengthens.
That forces the shape decision to become explicit, which is exactly what prevents JOIN/UNION confusion in the first place.
---
If you want a quick rule to take into your next query:
- JOIN is for related attributes on the same row.
- UNION ALL is for the same kind of row from multiple places.
And if you want the even quicker version:
- JOIN: combine columns.
- UNION: combine rows.


