Difference Between JOIN and UNION in SQL (With Practical Examples)

A lot of SQL mistakes I review aren’t “syntax” mistakes—they’re shape mistakes. Someone needed one list, but accidentally produced a wide table. Or they needed to combine related attributes, but ended up stacking unrelated rows. Those errors can look reasonable in a quick spot-check, especially with small test data, and then quietly corrupt dashboards, exports, and downstream services.

The fastest way I’ve found to prevent this class of bug is to internalize one simple question before you write anything: Am I trying to combine columns or combine rows? If you’re combining columns across tables based on a relationship, you’re in JOIN territory. If you’re stacking result sets from multiple queries into one longer result, you’re in UNION territory.

I’ll walk you through that mental model with runnable examples, show where people slip (especially around duplicates, NULLs, and mismatched schemas), and give you concrete rules I use in production: when I reach for JOIN, when I reach for UNION, when I refuse to use either, and how I validate the output shape before it ships.

The “Shape” Mental Model: Columns vs Rows

When you JOIN, you’re “widening” your result. You take rows from a primary table and attach related columns from another table (or the same table) based on a match condition.

When you UNION, you’re “lengthening” your result. You take the output of one SELECT and stack it on top of the output of another SELECT, producing more rows.

Here’s a simple analogy I use when explaining this to new engineers:

  • JOIN is like merging two spreadsheets side-by-side using a shared key column (like Order ID).
  • UNION is like taking two spreadsheets with the same columns and pasting one under the other.

This is not just conceptual—it’s literal. JOIN increases the number of columns (unless you select fewer), and UNION increases the number of rows.

I also keep two “sanity checks” in my head:

1) With JOIN, row counts can change (sometimes dramatically) if your relationship isn’t truly 1-to-1.

2) With UNION, column definitions must align (count and compatible types), or you’re either going to get an error or you’ll force an implicit cast that bites later.
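Both sanity checks are easy to verify mechanically. Here's a minimal sketch using Python's built-in sqlite3 module; the tiny tables and their names are invented purely for illustration:

```python
import sqlite3

# In-memory database with a 1-to-many relationship: one customer, two orders.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ava');
INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.0);
""")

# Check 1: joining across a 1-to-many relationship changes the row count.
customer_rows = con.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
joined_rows = con.execute("""
    SELECT COUNT(*) FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
""").fetchone()[0]
print(customer_rows, joined_rows)  # 1 customer row becomes 2 joined rows

# Check 2: UNION branches with different column counts fail outright.
union_failed = False
try:
    con.execute("SELECT customer_id, name FROM customers "
                "UNION SELECT order_id FROM orders")
except sqlite3.OperationalError:
    union_failed = True
print(union_failed)  # True
```

The same two checks apply in any engine; only the exact error message differs.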

To make this concrete, I’ll use small tables you can run in PostgreSQL, MySQL 8+, SQL Server (minor tweaks), or SQLite (minor tweaks). I’ll write standard SQL where possible.

One extra shape rule I’ve learned to say out loud (because it catches mistakes early):

  • If I can draw an arrow between tables (customer → orders), I probably want a JOIN.
  • If I can describe it as “two sources of the same thing” (events from system A + events from system B), I probably want a UNION ALL.

Also: you can absolutely combine the two. A lot of “real” queries are: shape two streams into the same schema (UNION ALL), then enrich them (JOIN), then summarize (GROUP BY). The important part is that each step has a clear shape intention.

JOIN: Combining Related Data Into New Columns

A JOIN connects rows across tables using a predicate (most often equality on key columns). The output is a single rowset where columns from both sides appear together.

The reason JOIN exists is simple: normalized databases store different attributes in different tables. JOIN is the mechanism that lets you reconstruct a useful view without denormalizing everything.

A runnable example

Let’s model two tables: one holding students, another holding addresses. (This mirrors a common pattern: “entity” table plus “detail” table.)

-- Works in PostgreSQL. For other databases, the idea is the same.
WITH boys AS (
  SELECT * FROM (VALUES
    (10, 'Ram', 14),
    (15, 'Ritik', 15),
    (16, 'Sanjay', 16),
    (17, 'Prakhar', 17)
  ) AS t(rollno, name, age)
),
girls AS (
  SELECT * FROM (VALUES
    (9, 'Seema', 'Pune'),
    (10, 'Rimi', 'Delhi'),
    (15, 'Mona', 'Delhi'),
    (16, 'Nisha', 'Goa'),
    (17, 'Anaya', 'Bhopal')
  ) AS t(rollno, name, address)
)
SELECT
  b.name AS boy_name,
  b.age  AS boy_age,
  g.address
FROM boys b
INNER JOIN girls g
  ON b.rollno = g.rollno
ORDER BY b.rollno;

What this produces (conceptually):

  • It returns only roll numbers that exist in both sets (because INNER JOIN).
  • It returns a row where the boy’s name/age and the matching address appear in the same row.

That “same row” property is the whole point of JOIN.

One thing I always emphasize: JOIN doesn’t “merge tables,” it “matches rows.” That sounds pedantic, but it changes how you debug. If something looks wrong, you don’t stare at columns—you ask “which rows matched, and why?”

JOIN types you actually use

I’m not going to enumerate every JOIN variant as a vocabulary list. Instead, here’s how I decide:

  • INNER JOIN: I require a match on both sides. This is my default when missing relationships should exclude the row.
  • LEFT JOIN: I keep everything from the left side even if there’s no match on the right. This is my default for enrichment (adding optional details).
  • FULL OUTER JOIN: I want everything from both sides, matched where possible. I mainly use this for audits and reconciliation.

Two practical notes that save time:

1) RIGHT JOIN is just LEFT JOIN with tables swapped. I almost never use RIGHT JOIN in production code because it makes scanning harder.

2) CROSS JOIN (Cartesian product) has valid uses (generating date grids, building all combinations for testing), but if you didn’t intentionally choose it, it’s usually a bug.
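To show what "all combinations" means concretely, here's a quick sketch via Python's sqlite3 module; the sizes/colors tables are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Two small unrelated tables; CROSS JOIN pairs every row with every row.
con.executescript("""
CREATE TABLE sizes (size TEXT);
CREATE TABLE colors (color TEXT);
INSERT INTO sizes VALUES ('S'), ('M');
INSERT INTO colors VALUES ('red'), ('blue'), ('green');
""")
combos = con.execute(
    "SELECT size, color FROM sizes CROSS JOIN colors ORDER BY size, color"
).fetchall()
print(len(combos))  # 2 sizes x 3 colors = 6 combinations
```

If you see a result set whose size is suspiciously close to `rows_a * rows_b`, check whether a join predicate went missing.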

The most common JOIN bug: accidental row multiplication

JOIN is not “safe” by default. If you join one row to many rows, you’ll multiply rows.

Example: one customer has three orders. If you join customers to orders, the customer row appears three times.

That’s correct behavior, but it’s a frequent source of incorrect totals.

I use this quick checklist:

  • If I’m joining for display (show customer + order info), row multiplication is fine.
  • If I’m joining then aggregating (SUM, COUNT), I stop and confirm cardinality.

A practical guardrail is to aggregate before you join:

WITH orders AS (
  SELECT * FROM (VALUES
    (1, 101, 29.99),
    (2, 101, 14.50),
    (3, 102, 9.00)
  ) AS t(order_id, customer_id, amount)
),
customers AS (
  SELECT * FROM (VALUES
    (101, 'Ava'),
    (102, 'Noah')
  ) AS t(customer_id, customer_name)
),
order_totals AS (
  SELECT customer_id, SUM(amount) AS total_amount
  FROM orders
  GROUP BY customer_id
)
SELECT c.customer_name, ot.total_amount
FROM customers c
LEFT JOIN order_totals ot
  ON c.customer_id = ot.customer_id
ORDER BY c.customer_id;

Notice what I did: I produced one row per customer in order_totals before joining. This keeps the final result stable.

If you take nothing else from this section, take this: aggregation “collapses” multiplicity; JOIN can “expand” it. If you do them in the wrong order, you can end up counting the same money twice.

Cardinality: the question you should answer before every JOIN

When I’m about to write a JOIN, I try to classify it as one of these:

  • 1-to-1 (rare, but ideal): each row matches at most one row on the other side.
  • 1-to-many (common): one customer → many orders.
  • many-to-many (dangerous unless intentional): tags, enrollments, memberships.

If it’s many-to-many, I don’t proceed until I can explain why the multiplication is correct. Many-to-many joins are valid, but they are exactly where “looks fine on small data” turns into “my totals doubled in production.”

A simple technique I use during development is to run a quick multiplicity check.

-- For a given join key, how many matches exist on each side?
SELECT customer_id, COUNT(*) AS orders_per_customer
FROM orders
GROUP BY customer_id
HAVING COUNT(*) > 1;

If I expect 1-to-1 and I see counts above 1, I know the join will multiply rows.

NULLs and JOINs: what actually happens

People often say “NULLs don’t match.” That’s generally true for equality joins (a.col = b.col). Two NULLs are not equal in SQL’s three-valued logic.

So if you have NULL join keys, these rows typically won’t match in an INNER JOIN, and in a LEFT JOIN they’ll show up with NULLs for the right-side columns.

This matters when your “key” isn’t truly a key. If someone tries to join on email (which may be NULL, may change, may differ by case/collation), you can get silent mismatches.

My rule: join on stable identifiers (surrogate keys, natural keys with enforced constraints). If you can’t enforce it, treat it as an approximate match and expect data quality issues.
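You can watch this behavior directly. The sketch below (Python's sqlite3 module; tables invented for illustration) joins on a nullable email column: the NULL keys never match in an INNER JOIN, and a LEFT JOIN keeps the unmatched row with NULLs on the right:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One person and one account each have a NULL email "key".
con.executescript("""
CREATE TABLE people (email TEXT, name TEXT);
CREATE TABLE accounts (email TEXT, plan TEXT);
INSERT INTO people VALUES ('a@x.com', 'Ava'), (NULL, 'Noah');
INSERT INTO accounts VALUES ('a@x.com', 'pro'), (NULL, 'free');
""")
inner = con.execute("""
    SELECT p.name, a.plan FROM people p
    JOIN accounts a ON p.email = a.email
""").fetchall()
left = con.execute("""
    SELECT p.name, a.plan FROM people p
    LEFT JOIN accounts a ON p.email = a.email
""").fetchall()
print(inner)  # only the non-NULL key matches: [('Ava', 'pro')]
print(left)   # Noah survives, but with a NULL plan
```

Note that NULL = NULL evaluates to unknown, not true, which is why the two NULL emails don't pair up.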

JOINs for existence vs JOINs for enrichment (EXISTS can be better)

Sometimes people use JOIN when they only want to filter based on whether a related row exists.

Example: “show customers who placed at least one order.”

A JOIN works, but it can multiply customers if they have multiple orders. You can fix it with DISTINCT, but that’s often a smell.

I reach for EXISTS because it expresses intent and avoids accidental multiplication:

SELECT c.customer_id, c.customer_name
FROM customers c
WHERE EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
);

If your goal is “keep these rows if a match exists,” EXISTS is often safer and faster.

UNION: Stacking Result Sets Into New Rows

UNION combines the output of two (or more) SELECT statements into a single result set by stacking rows.

The key constraints:

  • Each SELECT must return the same number of columns.
  • Corresponding columns should have compatible data types.
  • UNION removes duplicates (distinct rows).

A runnable example

Let’s create two lists of names and stack them.

WITH boys AS (
  SELECT * FROM (VALUES
    (10, 'Ram'),
    (15, 'Ritik'),
    (16, 'Sanjay'),
    (17, 'Prakhar')
  ) AS t(rollno, name)
),
girls AS (
  SELECT * FROM (VALUES
    (9, 'Seema'),
    (10, 'Rimi'),
    (15, 'Mona'),
    (16, 'Nisha'),
    (17, 'Anaya')
  ) AS t(rollno, name)
)
SELECT name
FROM boys
WHERE rollno < 16
UNION
SELECT name
FROM girls
WHERE rollno > 9
ORDER BY name;

This produces a single-column list of names.

What matters here is not that the underlying data came from two sources; what matters is that the output schema matches: both SELECTs return one column, name, of a compatible type.

UNION vs UNION ALL (this is where teams get burned)

  • UNION = distinct rows only (duplicate elimination)
  • UNION ALL = keep duplicates (no duplicate elimination)

In production, I default to UNION ALL unless I have a specific reason to remove duplicates.

Why? Because duplicate elimination has a cost (sort/hash), and because “duplicates” might be meaningful (for example, two different systems emitted the same event and you want to investigate that, not hide it).

Here’s a small demo:

WITH a AS (
  SELECT * FROM (VALUES ('Mona'), ('Seema')) AS t(name)
),
b AS (
  SELECT * FROM (VALUES ('Mona'), ('Rimi')) AS t(name)
)
SELECT name FROM a
UNION
SELECT name FROM b
ORDER BY name;

This returns Mona once.

Now:

WITH a AS (
  SELECT * FROM (VALUES ('Mona'), ('Seema')) AS t(name)
),
b AS (
  SELECT * FROM (VALUES ('Mona'), ('Rimi')) AS t(name)
)
SELECT name FROM a
UNION ALL
SELECT name FROM b
ORDER BY name;

This returns Mona twice.

My rule: if your downstream logic is doing counts, you should be explicit about whether duplicates are allowed. “Accidentally distinct” is a classic analytics bug.

Schema alignment: the hidden work in most UNIONs

UNION feels simple until you do it with real tables. In real systems, different sources rarely line up perfectly:

  • column names differ (userid vs accountid)
  • types differ (uuid vs text)
  • meaning differs (one table stores UTC timestamps, the other stores local time)

A UNION is only as good as the contract you define for the output.

When I build a unioned dataset, I explicitly choose:

  • a stable column order
  • consistent aliases
  • explicit casts
  • a discriminator column (source, event_type, system)

That discriminator sounds optional, but it’s the difference between “useful unified feed” and “mystery meat dataset.”

Ordering and limiting UNION results (common gotcha)

One subtle point: ORDER BY applies to the final combined result, not individual branches.

So if you want “top 10 from each branch” and then combine, you generally need to apply LIMIT inside each subquery/CTE, then union those results.

WITH top_a AS (
  SELECT user_id, occurred_at
  FROM events_a
  ORDER BY occurred_at DESC
  LIMIT 10
),
top_b AS (
  SELECT user_id, occurred_at
  FROM events_b
  ORDER BY occurred_at DESC
  LIMIT 10
)
SELECT * FROM top_a
UNION ALL
SELECT * FROM top_b
ORDER BY occurred_at DESC;

Without that structure, you might accidentally take “top 10 overall,” not “top 10 per source.”

A Direct Comparison (and the Rules I Actually Follow)

Here’s the crisp difference I teach:

  • JOIN combines related tables into a wider result (more columns).
  • UNION combines compatible SELECT outputs into a longer result (more rows).

A practical comparison table:

| Topic              | JOIN                                           | UNION                                            |
| ------------------ | ---------------------------------------------- | ------------------------------------------------ |
| Primary purpose    | Attach related columns across tables           | Stack result sets from multiple queries          |
| Shape change       | Typically increases columns                    | Increases rows                                   |
| Key requirement    | A join predicate (match condition)             | Same column count + compatible column types      |
| Duplicate handling | Can create duplicates via many-to-many matches | UNION removes duplicates; UNION ALL keeps them   |
| Typical risk       | Row multiplication and incorrect aggregates    | Silent type coercion or unintended de-duplication |

And here are my operational rules:

1) If I need attributes from another table to describe the same entity row, I JOIN.

2) If I need one output list that comes from multiple sources with the same schema, I UNION ALL.

3) If I catch myself wanting UNION to “combine tables,” I pause—most likely I actually need JOIN (or I need to remodel the query).

4) If I catch myself joining two tables without knowing the cardinality, I pause—most likely I’m about to inflate counts.

5) If I’m using UNION (distinct) to “clean up” duplicates, I pause—most likely I’m hiding a data quality problem or I’m deduplicating in the wrong place.

6) If I’m joining just to filter for existence, I try EXISTS first.

A quick self-test I do before I run a query

I literally ask myself which of these I’m trying to produce:

  • “One row per X, enriched with details” → JOIN (and/or pre-aggregate)
  • “A combined stream of X from multiple sources” → UNION ALL
  • “Only X where a relationship exists” → EXISTS

If I can’t name the grain (“one row per what?”), I stop. Not because I’m being academic, but because that’s where shape bugs originate.

Picking the Right Tool in Real Work (Scenarios You’ll Recognize)

This is where it becomes straightforward.

Scenario A: “Show orders with customer emails” (JOIN)

You have orders(order_id, customer_id, total) and customers(customer_id, email).

You want each order row plus the email.

That’s a JOIN:

SELECT
  o.order_id,
  o.total,
  c.email
FROM orders o
INNER JOIN customers c
  ON o.customer_id = c.customer_id;

If a customer might be missing (data quality issues, soft-deleted accounts), I use LEFT JOIN and explicitly handle NULLs.

Also, when I do LEFT JOIN, I’ll often add a quick “missing rate” check in development:

SELECT
  COUNT(*) AS total_orders,
  SUM(CASE WHEN c.customer_id IS NULL THEN 1 ELSE 0 END) AS missing_customers
FROM orders o
LEFT JOIN customers c
  ON o.customer_id = c.customer_id;

If missing_customers isn’t ~0 and I didn’t expect it, I investigate before shipping.

Scenario B: “Build a single activity feed from two event tables” (UNION ALL)

You store purchases and logins separately:

  • purchase_events(user_id, occurred_at, amount)
  • login_events(user_id, occurred_at, ip_address)

For a unified feed, you project them into the same schema and UNION ALL:

SELECT
  user_id,
  occurred_at,
  'purchase' AS event_type,
  CAST(amount AS text) AS event_detail
FROM purchase_events
UNION ALL
SELECT
  user_id,
  occurred_at,
  'login' AS event_type,
  ip_address AS event_detail
FROM login_events
ORDER BY occurred_at DESC;

Note the deliberate normalization:

  • I add a discriminator column (event_type).
  • I align types (event_detail as text). I do this intentionally so the meaning is stable.

If you don’t like collapsing into event_detail, you can also build a wider schema with multiple nullable columns:

SELECT user_id, occurred_at, 'purchase' AS event_type, amount, NULL AS ip_address
FROM purchase_events
UNION ALL
SELECT user_id, occurred_at, 'login' AS event_type, NULL AS amount, ip_address
FROM login_events;

That output is still “one schema,” but it keeps typed columns. I pick this approach when downstream users want to filter/aggregate without parsing text.

Scenario C: “Combine two tables with different columns” (neither, until you align)

If one SELECT returns (id, name) and the other returns (id, name, region), you cannot UNION them without shaping.

You can:

  • add a NULL placeholder
  • or project both down to a shared subset

Example:

SELECT id, name, region
FROM customers
UNION ALL
SELECT id, name, NULL AS region
FROM leads;

I’m explicit about the placeholder. Otherwise, someone will read the output later and assume region is “real” for both.

One more thing: if customers.region is not nullable, your unioned output will be nullable. That’s an output-contract change. Sometimes that’s fine, but I like to acknowledge it.

Scenario D: “I want to match rows and also stack categories” (JOIN + UNION)

Sometimes you genuinely need both.

Example: you have current customers and archived customers in separate tables, and you also need to attach subscription data.

I typically UNION ALL the customer sources into one CTE (same schema), then JOIN once:

WITH all_customers AS (
  SELECT customer_id, email, created_at
  FROM customers_current
  UNION ALL
  SELECT customer_id, email, created_at
  FROM customers_archived
)
SELECT ac.customer_id, ac.email, s.plan
FROM all_customers ac
LEFT JOIN subscriptions s
  ON ac.customer_id = s.customer_id;

This keeps the query readable: “stack first, then enrich.”

If there’s a risk of overlap (a customer appears in both current and archived), I’ll either:

  • enforce it upstream (best)
  • or deduplicate with a deterministic rule (next best)

For example, pick current over archived:

WITH all_customers AS (
  SELECT customer_id, email, created_at, 1 AS source_priority
  FROM customers_current
  UNION ALL
  SELECT customer_id, email, created_at, 2 AS source_priority
  FROM customers_archived
),
chosen AS (
  SELECT *
  FROM (
    SELECT
      *,
      ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY source_priority) AS rn
    FROM all_customers
  ) x
  WHERE rn = 1
)
SELECT c.customer_id, c.email, s.plan
FROM chosen c
LEFT JOIN subscriptions s
  ON c.customer_id = s.customer_id;

That’s a “real production” pattern: union multiple sources, then choose one record per key.

Common Mistakes I See (and How You Avoid Them)

These show up in code reviews constantly.

Mistake 1: Using UNION when you meant JOIN

Symptom: you want name, age, address but you write separate SELECTs and UNION them.

That can only produce more rows, not attach address to the matching person.

Fix: decide the relationship key (like rollno), then JOIN.

A quick diagnostic question I ask: “If I print the output, do I expect one line per person, or two lines per person?” If it’s one line per person with multiple columns, UNION can’t be correct.

Mistake 2: Using JOIN when you meant UNION

Symptom: you want “all names from both tables,” but you JOIN on roll number. You’ll only get names where roll numbers match, and you’ll lose non-matching rows.

Fix: project the same schema and UNION (usually UNION ALL).

When someone says “combine” in English, I ask them to clarify: “combine side-by-side or stack?” That one clarification prevents half the mistakes.

Mistake 3: Relying on UNION to remove “bad duplicates”

Sometimes duplicates are a data problem, not a query problem.

If you use UNION to hide duplicates, you can mask:

  • a broken ingestion job
  • a missing uniqueness constraint
  • an accidental fan-out earlier in the pipeline

Fix: start with UNION ALL, measure duplicates, and only then decide whether to deduplicate—and where. Often the right place is upstream with constraints or a cleanup job.

A practical technique: surface duplicates explicitly.

WITH combined AS (
  SELECT id FROM source_a
  UNION ALL
  SELECT id FROM source_b
)
SELECT id, COUNT(*) AS occurrences
FROM combined
GROUP BY id
HAVING COUNT(*) > 1
ORDER BY occurrences DESC;

Now you’re not hiding duplicates—you’re learning from them.

Mistake 4: Silent type coercion across UNION branches

Different databases resolve types differently. Even when a UNION works, you might get:

  • numeric coerced to text
  • timestamps coerced to date
  • different collations or encodings causing comparison surprises

Fix: cast deliberately in each branch so the output contract is clear.

When I want long-lived maintainability, I’ll often add a “schema CTE” pattern:

WITH typed AS (
  SELECT
    CAST(user_id AS bigint) AS user_id,
    CAST(occurred_at AS timestamp) AS occurred_at,
    CAST(event_type AS text) AS event_type,
    CAST(event_detail AS text) AS event_detail
  FROM (
    SELECT user_id, occurred_at, 'purchase' AS event_type, CAST(amount AS text) AS event_detail
    FROM purchase_events
    UNION ALL
    SELECT user_id, occurred_at, 'login' AS event_type, ip_address AS event_detail
    FROM login_events
  ) u
)
SELECT *
FROM typed;

It’s a little verbose, but it makes the contract explicit and keeps surprises from creeping in when schemas evolve.

Mistake 5: LEFT JOIN filters placed in the wrong clause

This one is subtle: a predicate in WHERE can turn a LEFT JOIN into “effectively inner.”

If you write:

SELECT c.customer_id, s.plan
FROM customers c
LEFT JOIN subscriptions s
  ON c.customer_id = s.customer_id
WHERE s.plan = 'pro';

You’ve excluded customers without subscriptions because s.plan is NULL for them.

If you truly want to keep customers and only filter matched rows, push the filter into the JOIN condition:

SELECT c.customer_id, s.plan
FROM customers c
LEFT JOIN subscriptions s
  ON c.customer_id = s.customer_id
  AND s.plan = 'pro';

That’s not a UNION topic, but it’s part of the same “shape discipline”: you should know what rows you’re keeping before you add conditions.

Mistake 6: Joining on non-unique keys and then “fixing” it with DISTINCT

This is a classic:

  • someone joins on a column that isn’t unique (like email)
  • the result multiplies
  • they add SELECT DISTINCT to “clean it up”

That might hide the symptom, but it doesn’t fix the cause. You can still end up with the wrong associated values (which email matched which record?), and you can still overcount when you aggregate later.

Fix: enforce uniqueness in data (constraints) or change the join key to something truly unique.
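To see why DISTINCT can't rescue a non-unique join key, here's a minimal sketch (Python's sqlite3 module; tables invented for illustration) where a duplicated email silently doubles a payment total:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# 'email' is not unique in profiles: two profile rows share the same address.
con.executescript("""
CREATE TABLE payments (payment_id INTEGER, email TEXT, amount REAL);
CREATE TABLE profiles (profile_id INTEGER, email TEXT);
INSERT INTO payments VALUES (1, 'a@x.com', 100.0);
INSERT INTO profiles VALUES (10, 'a@x.com'), (11, 'a@x.com');
""")
true_total = con.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
joined_total = con.execute("""
    SELECT SUM(p.amount)
    FROM payments p JOIN profiles pr ON p.email = pr.email
""").fetchone()[0]
print(true_total, joined_total)  # 100.0 vs 200.0: the payment was counted twice
```

DISTINCT applied to the joined rows couldn't have saved the SUM here, because the two joined rows differ (different profile_id) and are both "legitimate" matches.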

Mistake 7: UNION-ing incompatible meaning

Even if two branches have the same schema, the semantics might not match.

Example: one table stores amount as cents (integer), another stores amount as dollars (decimal). A UNION can succeed and still be wrong.

Fix: normalize units and meaning in each branch before unioning. I treat that normalization as part of the query, not an afterthought.

Performance and Maintainability: What I Watch in Production

I avoid promising exact timings because hardware, data size, and engine choices vary. But you can still reason about typical behavior.

JOIN performance considerations

JOIN cost is driven by:

  • cardinality (how many rows match how many rows)
  • indexes on join keys
  • join strategy (hash join, merge join, nested loop)

If you join a large table to another large table on non-indexed columns, it can jump from “tens of milliseconds” to “seconds” quickly. The fix is rarely “add random indexes everywhere.” I start with:

  • confirm the join predicate uses the real key
  • reduce rows early (filter in a CTE/subquery)
  • aggregate early if you only need summary rows

I also keep an eye out for accidental many-to-many joins in analytics pipelines. Even if the query “works,” it can explode intermediate row counts and thrash memory.

UNION / UNION ALL performance considerations

  • UNION ALL is usually cheap: concatenate outputs.
  • UNION requires duplicate elimination, which usually needs extra work (sort/hash). On large result sets, that can be noticeable.

If you need uniqueness, consider whether you can deduplicate using a key and a deterministic rule instead of full-row distinctness. For example, “latest event per user” is not the same thing as UNION.
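The difference is easy to demonstrate. In this sketch (Python's sqlite3 module; tables invented for illustration), UNION keeps both events for a user because the rows aren't identical, while key-based "latest per user" dedup collapses them to one:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE events_a (user_id INTEGER, occurred_at TEXT);
CREATE TABLE events_b (user_id INTEGER, occurred_at TEXT);
INSERT INTO events_a VALUES (1, '2024-01-01');
INSERT INTO events_b VALUES (1, '2024-01-02');
""")
# UNION removes only exact duplicate rows; these rows differ, so both survive.
unioned = con.execute("""
    SELECT user_id, occurred_at FROM events_a
    UNION
    SELECT user_id, occurred_at FROM events_b
""").fetchall()
# "Latest event per user" is key-based dedup: one row per user_id.
latest = con.execute("""
    SELECT user_id, MAX(occurred_at) AS occurred_at
    FROM (SELECT * FROM events_a UNION ALL SELECT * FROM events_b)
    GROUP BY user_id
""").fetchall()
print(len(unioned), len(latest))  # 2 rows vs 1 row
```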

A pattern I like is: UNION ALL → then deduplicate with a window function.

WITH combined AS (
  SELECT user_id, occurred_at, payload, 'a' AS source
  FROM events_a
  UNION ALL
  SELECT user_id, occurred_at, payload, 'b' AS source
  FROM events_b
),
ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id, occurred_at, payload
      ORDER BY source
    ) AS rn
  FROM combined
)
SELECT user_id, occurred_at, payload
FROM ranked
WHERE rn = 1;

That approach makes dedup logic explicit. I can explain it, test it, and modify it later.

Maintainability: make the output contract obvious

For UNION queries, I like to make the schema visible:

  • use consistent column names via AS aliases
  • add eventsource / eventtype columns
  • cast types explicitly so later schema changes don’t silently alter results

For JOIN queries, I like to make the grain obvious:

  • name CTEs by grain (orders_per_customer, latest_status_per_order)
  • avoid selecting * across multiple tables (it hides collisions and makes diffs noisy)
  • prefer JOIN ... ON predicates that read like keys, not like ad hoc filters

And for both: I put “shape assertions” in my head (and sometimes in tests).

How I validate output shape before it ships

When correctness matters (dashboards, billing, exports), I validate in three layers:

1) Row count expectations

  • If I’m enriching a fact table with optional dimensions (LEFT JOIN), I expect row count to stay the same.
  • If row count changes, I investigate.

2) Key uniqueness expectations

  • If the result should be one row per order_id, I check that.

SELECT order_id, COUNT(*)
FROM (
  -- your query here
) q
GROUP BY order_id
HAVING COUNT(*) > 1;

3) Null-rate expectations

  • If customer_email is expected to be present 99%+ of the time, I compute the null rate. A spike often means a join mismatch or a key change.
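The null-rate check is one short aggregate. A sketch via Python's sqlite3 module (the enriched_orders table is invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical enriched result: one of three orders is missing its email.
con.executescript("""
CREATE TABLE enriched_orders (order_id INTEGER, customer_email TEXT);
INSERT INTO enriched_orders VALUES (1, 'a@x.com'), (2, 'b@x.com'), (3, NULL);
""")
null_rate = con.execute("""
    SELECT AVG(CASE WHEN customer_email IS NULL THEN 1.0 ELSE 0.0 END)
    FROM enriched_orders
""").fetchone()[0]
print(round(null_rate, 3))  # 1 of 3 rows is NULL -> 0.333
```

Track that number over time; a sudden jump usually means a join key changed or an upstream source regressed.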

These checks take minutes and save hours of “why did revenue jump?” later.

A Practice Loop for Choosing Between JOIN and UNION

If you’re trying to get better at choosing between JOIN and UNION (and not just memorize definitions), here’s the practice loop I recommend.

Step 1: State the grain in one sentence

Before writing SQL, I write one sentence like:

  • “One row per order, enriched with customer email.”
  • “One row per user event across all sources.”

If you can’t say it, you can’t validate it.

Step 2: Decide whether you’re widening or lengthening

Then I choose:

  • Widen (columns): JOIN
  • Lengthen (rows): UNION ALL

If I find myself wanting “both,” I structure it as multiple CTEs so each step is obvious.

Step 3: Build a minimal runnable example

I don’t start with the full production query. I start with:

  • 2–5 rows that include edge cases: missing matches, duplicates, many-to-many, NULLs
  • a query that demonstrates the intended shape

This is where the earlier “VALUES” examples shine. Tiny test data reveals whether you’re widening vs lengthening.

Step 4: Add guardrails

When the query matters, I add guardrails:

  • pre-aggregation before joins
  • explicit casts in unions
  • discriminators (event_type, source)
  • uniqueness checks on expected keys

My goal is that future me (or a teammate) can modify the query without re-learning the pitfalls.

Workflow Notes: Reviews, Tests, and Tooling

In modern teams, SQL rarely lives alone. It gets generated, templated, linted, and tested. A few workflow notes that tie directly to JOIN vs UNION correctness:

SQL review checklist (fast and practical)

When I review SQL that contains JOINs or UNIONs, I scan for:

  • “What is the grain?” (Is it stated implicitly by grouping keys, or explicitly in comments/CTE names?)
  • Are JOIN keys truly keys (unique on at least one side), or are we joining on messy attributes?
  • Are LEFT JOIN filters accidentally placed in WHERE?
  • Is the query using UNION when it should be UNION ALL?
  • Are types aligned explicitly across UNION branches?

Testing and monitoring for shape regressions

If the query feeds something important (billing, KPIs), I like lightweight automated checks:

  • unit-style checks: row counts by day, null rates, uniqueness on primary keys
  • anomaly alerts: sudden changes in duplicate rate after a union, or sudden growth in row count after a join

You don’t need heavy infrastructure to do this; even a scheduled job that logs these metrics can catch regressions.

Using AI-assisted workflows safely (optional)

If you use tools that help draft SQL (including AI assistants), I’ve found one constraint dramatically improves safety:

  • Require the assistant to state the intended grain (“one row per X”) and whether the query widens or lengthens.

That forces the shape decision to become explicit, which is exactly what prevents JOIN/UNION confusion in the first place.

If you want a quick rule to take into your next query:

  • JOIN is for related attributes on the same row.
  • UNION ALL is for the same kind of row from multiple places.

And if you want the even quicker version:

  • JOIN: combine columns.
  • UNION: combine rows.