When I’m debugging a production incident, I often end up drawing the same picture: boxes connected by lines. Services calling services. Users linked to devices. Warehouses connected by roads. Even “simple” UI state can form a web of transitions. That picture is a graph, and once you see your system as vertices and edges, a lot of problems stop feeling mysterious.
A tree is great when your data is strictly hierarchical. Real systems rarely are. A user can belong to many groups, a router can connect to many routers, and a city map has cycles everywhere. Graphs give you the vocabulary to represent those connections, and graph algorithms give you the tools to answer questions like:
- “Can I reach this node at all?”
- “What’s the minimum number of steps?”
- “Is there a cycle that will cause infinite work?”
- “What’s the cheapest way to connect everything?”
I’ll walk through how I think about graphs in modern development: choosing a representation, using traversal patterns that actually ship, detecting cycles safely, computing shortest paths under real constraints, and building minimum spanning trees when “connect all the things” is the job.
Graphs as a Model (and When I Avoid Them)
A graph is a set of vertices \(V\) and edges \(E\). Vertices represent entities (people, servers, locations). Edges represent relationships (friendship, network link, road). The relationship can be directed (A → B) or undirected (A — B). It can also have weight (latency, distance, cost).
I reach for graphs when relationships are “many-to-many” and not forced into a strict parent/child shape. Common examples:
- Social connections: follows, likes, shared memberships
- Infrastructure: services calling services, dependency graphs
- Logistics: routes between hubs, delivery constraints
- Knowledge systems: entities linked by references
- State machines: transitions between states (directed graph)
When I do NOT use a graph:
- The data is truly hierarchical and won’t get cross-links later (use a tree)
- You only need simple grouping and there’s no path-like question (use sets / maps)
- The graph is enormous and dynamic, and you only need “neighbors” locally (use local indexes and avoid global algorithms)
A simple analogy I use when teaching: a tree is a company org chart; a graph is the whole company’s communication patterns. The org chart is neat, but it won’t tell you how information actually flows.
One more rule of thumb I’ve learned the hard way: graphs aren’t just a data structure; they’re also a commitment. Once you model something as a graph, people will ask graph questions. They’ll want reachability, they’ll want “what changed?”, they’ll want the cheapest path, they’ll want the blast radius of an incident. If you’re not ready to answer those questions at the scale your product might grow to, you either need to constrain the problem (limit node count, window queries, precompute) or pick a simpler model.
Representation Choices That Don’t Hurt Later
The first decision that decides performance and code clarity is how you store the graph.
Adjacency list vs adjacency matrix vs edge list
Here’s the decision table I keep in my head:

| Representation | Best for | Typical operations |
| --- | --- | --- |
| Adjacency list (`neighbors[u] = [...]`) | Sparse graphs, most app graphs | Iterate neighbors fast |
| Adjacency matrix (`adj[u][v]`) | Dense graphs, tiny V | Edge existence in O(1) |
| Edge list (`(u, v, w)` records) | Kruskal, bulk processing | Sorting edges, scans |

My default in 2026 is still an adjacency list. Most real graphs are sparse: each node has a small number of edges compared to \(V\). In service dependency graphs, most services call a handful of other services, not thousands.
But I also ask a few “production reality” questions before committing:
- Do I need to look up “does edge (u, v) exist?” constantly? If yes, adjacency sets (or a matrix for tiny graphs) can beat lists.
- Do I need to store parallel edges (multiple links between the same two nodes)? If yes, list is fine; a set might accidentally deduplicate.
- Do I need to store self-loops (u → u)? Some algorithms assume they can exist; some treat them as cycles immediately.
- Do I need to mutate the graph as I traverse it? Usually I try not to. Mutating during traversal is a bug factory.
Directed vs undirected: the “double edge” cost
Undirected graphs in adjacency-list form typically store each edge twice: u lists v, and v lists u. That’s fine, but it matters at scale. If you’re modeling a million undirected edges, you’re actually storing two million neighbor records.
When memory is tight, I’ll sometimes store a single canonical edge list and build adjacency lists on-demand for queries, or I’ll store neighbors in a more compact format (arrays / typed arrays / CSR-like structures). In Python, simple lists are convenient, but they have a lot of per-object overhead. In lower-level languages, you can pack edges tightly.
Indexing and IDs
In interviews, nodes are often 0..n-1. In production, your nodes are UUIDs, database IDs, or strings. My approach:
- Map external IDs to dense integer indices for algorithm runs
- Keep a reverse map for reporting results
This gives you tight arrays (fast) without losing real identifiers.
A subtle benefit: dense indices also make your algorithms more predictable. In Python/JS, a dict keyed by strings is fine for correctness, but arrays are faster, smaller, and easier to reason about for visited/dist/parent.
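To make the mapping concrete, here’s a minimal sketch. The helper name `build_index` and the service names are mine, invented for illustration:

```python
from typing import Dict, List

def build_index(node_ids: List[str]) -> Dict[str, int]:
    # Map external IDs (UUIDs, names, DB keys) to dense 0..n-1 indices.
    return {node_id: i for i, node_id in enumerate(node_ids)}

ids = ["auth-service", "billing-service", "email-service"]
to_index = build_index(ids)                        # external ID -> dense index
from_index = {i: s for s, i in to_index.items()}   # reverse map for reporting

# Algorithms now run over tight arrays instead of string-keyed dicts:
visited = [False] * len(ids)
visited[to_index["billing-service"]] = True
print(from_index[1])  # billing-service
```

The reverse map is what turns an algorithm result like “node 1 is unreachable” back into a report a human can read.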
Weighted edges and what “weight” really means
I force myself to name weights as what they are. Not w, but cost_ms, distance_km, risk_score, capacity, penalty. Because weights are where assumptions hide.
- If weights are time, they should be non-negative.
- If weights are probabilities, combining them is not “add”; it’s often multiplication or log-space.
- If weights are “risk,” the right path might be minimizing the maximum edge risk (a minimax path), not the sum.
A lot of graph bugs come from picking a well-known algorithm and quietly applying it to a weight meaning it wasn’t designed for.
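To make the probability case concrete, here’s a hedged sketch of a “most reliable path” computation: treat each edge weight as a success probability and minimize the sum of -log(p), so the product-maximizing objective becomes an ordinary non-negative shortest-path problem that Dijkstra handles. The function name and example graph are mine:

```python
import heapq
import math
from typing import Dict, List, Tuple

def most_reliable_path(neighbors: Dict[int, List[Tuple[int, float]]], start: int) -> Dict[int, float]:
    """Edge weights are success probabilities in (0, 1]; maximize their product.
    In -log(p) space the product becomes a sum and all weights are non-negative,
    so plain Dijkstra is safe."""
    best: Dict[int, float] = {start: 0.0}  # summed -log(p)
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if d != best.get(u):
            continue  # stale heap entry
        for v, p in neighbors.get(u, []):
            nd = d - math.log(p)
            if nd < best.get(v, float("inf")):
                best[v] = nd
                heapq.heappush(pq, (nd, v))
    return {u: math.exp(-d) for u, d in best.items()}  # back to probabilities

links = {0: [(1, 0.9), (2, 0.5)], 1: [(2, 0.9)], 2: []}
print(most_reliable_path(links, 0))  # node 2 via 1: 0.9 * 0.9 = 0.81 beats direct 0.5
```

The point isn’t this particular transform; it’s that the algorithm follows from what the weight means, not the other way around.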
A practical Python graph skeleton
This is a small, runnable base that’s easy to extend:
```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Graph:
    # neighbors[u] = list of (v, weight)
    neighbors: Dict[int, List[Tuple[int, int]]]
    directed: bool = False

    def add_edge(self, u: int, v: int, w: int = 1) -> None:
        self.neighbors.setdefault(u, []).append((v, w))
        self.neighbors.setdefault(v, [])  # ensure vertex exists
        if not self.directed:
            self.neighbors[v].append((u, w))

if __name__ == "__main__":
    g = Graph(neighbors={}, directed=False)
    g.add_edge(0, 1)
    g.add_edge(1, 2)
    g.add_edge(0, 2)
    print(g.neighbors)
```
In larger codebases, I’ll often keep the algorithm functions separate from the Graph type. That separation makes testing easier and avoids mixing “data model” with “algorithm utilities.”
I also like to keep one “truth” for directionality. If the graph can be directed or undirected, I pass directed=True/False once and I don’t let algorithms guess. Confusion about direction is another common production failure mode.
Traversals That Ship: BFS and DFS as Patterns
Breadth-first search (BFS) and depth-first search (DFS) aren’t just academic. They’re reusable patterns.
BFS: shortest steps in unweighted graphs
If every edge has the same cost (or you only care about number of hops), BFS gives you shortest path length. I use BFS for:
- Minimum hops between services
- Shortest number of moves in grid puzzles
- “Spread” simulations (multi-source BFS)
- Checking connectivity quickly
A solid BFS template with parent reconstruction:
```python
from collections import deque
from typing import Dict, List, Optional

def bfs_shortest_path(neighbors: Dict[int, List[int]], start: int, target: int) -> Optional[List[int]]:
    """Returns the shortest path (by hop count) from start to target, or None."""
    q = deque([start])
    parent: Dict[int, Optional[int]] = {start: None}
    while q:
        u = q.popleft()
        if u == target:
            break
        for v in neighbors.get(u, []):
            if v in parent:
                continue
            parent[v] = u
            q.append(v)
    if target not in parent:
        return None
    # Reconstruct path backwards
    path: List[int] = []
    cur: Optional[int] = target
    while cur is not None:
        path.append(cur)
        cur = parent[cur]
    path.reverse()
    return path

if __name__ == "__main__":
    neighbors = {
        0: [1, 2],
        1: [0, 2, 3],
        2: [0, 1, 3],
        3: [1, 2, 4],
        4: [3],
    }
    print(bfs_shortest_path(neighbors, 0, 4))
```
Common mistake: people mark “visited” when popping from the queue. Mark it when pushing, otherwise you can enqueue the same node multiple times and waste memory.
A second mistake I see in production code: BFS without a clear termination condition. If you only need “can I reach target?” don’t compute distances to everything. Stop early when you pop the target (or even when you first discover it, depending on what you need).
Multi-source BFS: “spreading” problems
If multiple sources start spreading at time 0 (think infection, signal propagation, cache warmup propagation, or “rotting” grid problems), you enqueue all sources first with distance 0. The queue naturally processes time layers.
Practical rule: multi-source BFS is the right tool when:
- Every step costs 1
- You want the earliest arrival time to each node
- There are multiple starting points
In system terms, I use multi-source BFS to answer questions like:
- If these services are already degraded, what’s the minimum hop distance to every other service in the call graph?
- If these warehouses are currently stocked, what’s the minimum number of legs to deliver to each region?
You can return a full distance map, or just compute “time to reach all nodes.” If some nodes remain unreachable, that’s a signal in itself.
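A minimal multi-source BFS along those lines, returning the full distance map (the example call graph is made up):

```python
from collections import deque
from typing import Dict, List

def multi_source_bfs(neighbors: Dict[int, List[int]], sources: List[int]) -> Dict[int, int]:
    """Earliest arrival time (hop count) from the nearest source to each reachable node."""
    dist = {s: 0 for s in sources}  # every source starts at time 0
    q = deque(sources)
    while q:
        u = q.popleft()
        for v in neighbors.get(u, []):
            if v not in dist:       # mark on discovery, not on pop
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

call_graph = {0: [1], 1: [2], 2: [3], 4: [3], 3: []}
print(multi_source_bfs(call_graph, [0, 4]))  # {0: 0, 4: 0, 1: 1, 3: 1, 2: 2}
```

Nodes missing from the returned map are exactly the unreachable ones, which is the “signal in itself” mentioned above.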
DFS: exploring structure, components, and ordering
DFS is my go-to when I need:
- Connected components in undirected graphs
- Topological ordering in DAGs (directed acyclic graphs)
- Cycle detection via recursion stack / colors
- Flood fill in images (grid as graph)
In 2026, I’m careful with recursive DFS in Python and JavaScript because recursion depth can crash. For large graphs, I prefer iterative DFS with an explicit stack.
Iterative DFS for connectivity:
```python
from typing import Dict, List, Set

def dfs_component(neighbors: Dict[int, List[int]], start: int) -> Set[int]:
    stack = [start]
    seen: Set[int] = set([start])
    while stack:
        u = stack.pop()
        for v in neighbors.get(u, []):
            if v in seen:
                continue
            seen.add(v)
            stack.append(v)
    return seen

if __name__ == "__main__":
    neighbors = {
        10: [11],
        11: [10, 12],
        12: [11],
        20: [21],
        21: [20],
    }
    print(dfs_component(neighbors, 10))
```
A detail that matters when you do more than connectivity: sometimes you want “enter/exit” events (preorder/postorder). For that, iterative DFS uses a stack of (node, state) where state indicates whether you’re entering or exiting. That pattern becomes useful for topological sorting, SCC algorithms, and cycle detection without recursion.
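A small sketch of that enter/exit pattern, emitting nodes on exit (i.e., a postorder); the function name is mine:

```python
from typing import Dict, List

def postorder(neighbors: Dict[int, List[int]], start: int) -> List[int]:
    """Iterative DFS with explicit (node, entering) frames; emits nodes on exit."""
    out: List[int] = []
    seen = {start}
    stack = [(start, True)]  # True = entering, False = exiting
    while stack:
        u, entering = stack.pop()
        if not entering:
            out.append(u)             # exit event: all descendants are done
            continue
        stack.append((u, False))      # schedule the exit event under the children
        for v in neighbors.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append((v, True))
    return out

tree = {0: [1, 2], 1: [3], 2: [], 3: []}
print(postorder(tree, 0))  # [2, 3, 1, 0] — children always precede their parent
```

Reversing a postorder of a DAG gives a topological order, which is why this pattern keeps reappearing.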
Bipartite checking (a DFS/BFS “coloring” staple)
If you can color nodes with two colors so that every edge connects different colors, the graph is bipartite. This shows up in scheduling, “two-team” constraints, matching problems, and detecting odd cycles.
I usually implement it with BFS coloring because it’s iterative and clean, and because it naturally handles disconnected graphs if you loop over all nodes.
What I keep in mind conceptually:
- If a graph contains an odd-length cycle, it cannot be bipartite.
- If it’s bipartite, you can interpret the two colors as two “sides.”
- In real systems, bipartite checks often represent “mutual exclusivity” constraints.
A practical bipartite check for possibly disconnected graphs:
```python
from collections import deque
from typing import Dict, List, Optional, Tuple

def is_bipartite(neighbors: Dict[int, List[int]]) -> Tuple[bool, Optional[Dict[int, int]]]:
    """Returns (ok, color_map). Colors are 0/1. If not bipartite, returns (False, None)."""
    color: Dict[int, int] = {}
    for start in neighbors.keys():
        if start in color:
            continue
        color[start] = 0
        q = deque([start])
        while q:
            u = q.popleft()
            for v in neighbors.get(u, []):
                if v not in color:
                    color[v] = 1 - color[u]
                    q.append(v)
                elif color[v] == color[u]:
                    return False, None
    return True, color

if __name__ == "__main__":
    g1 = {0: [1], 1: [0, 2], 2: [1]}
    g2 = {0: [1, 2], 1: [0, 2], 2: [0, 1]}  # triangle (odd cycle)
    print(is_bipartite(g1))
    print(is_bipartite(g2))
```
In product terms, if g2 is “people who can’t be scheduled together,” the triangle means there’s no way to split them into two teams without conflict. The algorithm doesn’t just say “no”—it tells you there’s an odd cycle somewhere. If you want to be extra helpful, you can extend it to return a witness cycle for debugging.
Cycles, Safety Checks, and Disjoint Sets
Cycles are not “bad” in general (road maps have cycles everywhere), but they’re critical in two places:
- Directed dependency graphs (a cycle often means deadlock or impossible ordering)
- Shortest path with negative weights (negative cycles make “shortest” undefined)
Cycle detection in directed graphs (color method)
I use 3 states:
- 0 = unvisited
- 1 = visiting (in current recursion/stack)
- 2 = done
If you see an edge to a “visiting” node, you found a cycle.
In production, I care about two extra details:
- I want the actual cycle (not just a boolean) to print in logs.
- I want an iterative version when graphs can get deep.
Here’s an iterative “color + stack” cycle detection that also captures a cycle path. It’s a little more code, but it’s the sort of thing I actually ship:
```python
from typing import Dict, List, Optional, Tuple

def find_directed_cycle(neighbors: Dict[int, List[int]]) -> Optional[List[int]]:
    """Returns a list of nodes forming a directed cycle, or None if acyclic."""
    UNVISITED, VISITING, DONE = 0, 1, 2
    state: Dict[int, int] = {u: UNVISITED for u in neighbors}
    parent: Dict[int, Optional[int]] = {u: None for u in neighbors}
    for start in neighbors:
        if state[start] != UNVISITED:
            continue
        # stack holds (node, next_neighbor_index)
        stack: List[Tuple[int, int]] = [(start, 0)]
        state[start] = VISITING
        while stack:
            u, i = stack[-1]
            nbrs = neighbors.get(u, [])
            if i >= len(nbrs):
                state[u] = DONE
                stack.pop()
                continue
            v = nbrs[i]
            stack[-1] = (u, i + 1)
            if state.get(v, UNVISITED) == UNVISITED:
                state[v] = VISITING
                parent[v] = u
                stack.append((v, 0))
            elif state.get(v) == VISITING:
                # reconstruct cycle: v -> ... -> u and then back to v
                cycle = [v]
                cur = u
                while cur is not None and cur != v:
                    cycle.append(cur)
                    cur = parent[cur]
                cycle.append(v)
                cycle.reverse()
                return cycle
    return None
```
This is the kind of thing I’ll run in CI for dependency graphs: “fail the build if we introduced a dependency cycle, and print the cycle.”
Cycle detection in undirected graphs
In undirected graphs, the parent edge always points back. The trick is: if you reach a visited neighbor that is not your parent, that’s a cycle.
Common pitfall: using “visited” without tracking parent. You’ll falsely report cycles everywhere because every undirected edge appears twice.
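A sketch of the parent-tracking check, iterative and assuming a simple graph (no parallel edges, which the parent skip would otherwise hide); the function name is mine:

```python
from typing import Dict, List

def has_undirected_cycle(neighbors: Dict[int, List[int]]) -> bool:
    """Undirected cycle check: a visited neighbor that isn't the parent we
    arrived from means a cycle. Handles disconnected graphs."""
    seen = set()
    for start in neighbors:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, -1)]  # (node, parent); -1 = no parent
        while stack:
            u, parent = stack.pop()
            for v in neighbors.get(u, []):
                if v == parent:
                    continue       # the edge we came in on, not a cycle
                if v in seen:
                    return True    # cross edge to an already-seen node
                seen.add(v)
                stack.append((v, u))
    return False

print(has_undirected_cycle({0: [1], 1: [0, 2], 2: [1]}))        # False (a path)
print(has_undirected_cycle({0: [1, 2], 1: [0, 2], 2: [0, 1]}))  # True (a triangle)
```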
Disjoint Set Union (Union-Find)
When I need fast “are these connected?” checks while adding edges over time, I use DSU.
It’s the backbone of Kruskal’s minimum spanning tree and also shows up in connectivity problems like “how many components remain after merges?”
Key ideas:
- find(x) returns the representative (root)
- union(a, b) merges sets
- Path compression + union by rank keeps operations near-constant in practice
If you’re building features that repeatedly merge groups (accounts, clusters, network segments), DSU is often simpler and faster than rerunning DFS each time.
One practical warning: DSU handles connectivity under edge additions, not arbitrary deletions. If your graph changes by removing edges (e.g., a service disables a downstream dependency), DSU won’t help unless you rebuild or use a more advanced dynamic connectivity structure.
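The ideas above can be sketched as a minimal DSU keyed by arbitrary IDs. To keep it short this version uses path compression (via halving) but skips union by rank; the account-merging scenario and all IDs are made up:

```python
from typing import Dict

class DSU:
    """Disjoint Set Union keyed by arbitrary hashable IDs (e.g. account IDs)."""
    def __init__(self) -> None:
        self.parent: Dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)   # lazily register unknown IDs
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same set
        self.parent[rb] = ra
        return True

dsu = DSU()
dsu.union("alice@a", "alice@b")      # same person, two accounts
dsu.union("alice@b", "alice@work")
print(dsu.find("alice@work") == dsu.find("alice@a"))  # True
print(dsu.find("bob@x") == dsu.find("alice@a"))       # False
```

Each `union` call is cheap, which is exactly why this beats rerunning DFS after every merge.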
Topological Sort: When Ordering Is the Product
Whenever someone says “these steps have prerequisites,” I immediately think “directed graph + topological sort.”
Examples that show up outside interviews:
- Build systems (compile A before B)
- Data pipelines (ingest → transform → aggregate)
- Deployment workflows (migrate DB before flipping traffic)
- Feature flag rollouts (enable dependency flags first)
Topological sorting gives you an ordering of nodes such that every directed edge u → v means u appears before v.
There are two common approaches:
- Kahn’s algorithm (BFS-style using indegrees)
- DFS postorder (push onto list when done)
I prefer Kahn’s algorithm in production because:
- It’s iterative.
- It naturally detects cycles (if you can’t process all nodes, there’s a cycle).
- It works nicely when you want to process nodes “as they become ready.”
A Kahn’s algorithm template:
```python
from collections import deque
from typing import Dict, List, Optional

def topo_sort(neighbors: Dict[int, List[int]]) -> Optional[List[int]]:
    """Returns a topological ordering, or None if a cycle exists."""
    indeg: Dict[int, int] = {u: 0 for u in neighbors}
    for u, vs in neighbors.items():
        for v in vs:
            indeg[v] = indeg.get(v, 0) + 1
    q = deque([u for u, d in indeg.items() if d == 0])
    order: List[int] = []
    while q:
        u = q.popleft()
        order.append(u)
        for v in neighbors.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    if len(order) != len(indeg):
        return None  # cycle detected
    return order
```
The useful production add-on is better diagnostics: if there’s a cycle, list nodes still having indegree > 0. Those are “stuck” behind the cycle. That’s often enough to point an engineer to the bad dependency.
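One way to sketch that diagnostic is a small variant of the Kahn template that returns the stuck nodes instead of None (the function name is mine):

```python
from collections import deque
from typing import Dict, List, Tuple

def topo_sort_with_diagnostics(neighbors: Dict[int, List[int]]) -> Tuple[List[int], List[int]]:
    """Returns (processed_order, stuck_nodes); stuck_nodes is empty iff acyclic."""
    indeg: Dict[int, int] = {u: 0 for u in neighbors}
    for u, vs in neighbors.items():
        for v in vs:
            indeg[v] = indeg.get(v, 0) + 1
    q = deque([u for u, d in indeg.items() if d == 0])
    order: List[int] = []
    while q:
        u = q.popleft()
        order.append(u)
        for v in neighbors.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    # Anything with leftover indegree is on a cycle or downstream of one.
    stuck = sorted(u for u, d in indeg.items() if d > 0)
    return order, stuck

g = {0: [1], 1: [2], 2: [1], 3: []}  # 1 <-> 2 form a cycle
print(topo_sort_with_diagnostics(g))  # ([0, 3], [1, 2])
```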
Strongly Connected Components: Finding “Mutual Dependencies”
In directed graphs, a strongly connected component (SCC) is a set of nodes where every node can reach every other node.
Why I care:
- In service graphs, SCCs often mean mutual dependencies. That’s a reliability smell.
- In build graphs, SCCs are cycles “collapsed into blobs.”
- In workflow graphs, SCCs can represent steps that must be treated as one unit (or refactored).
I think of SCC detection as a “cycle detector with structure.” Instead of saying “there is a cycle,” it tells you the exact groups that form mutually reachable sets.
Two standard algorithms:
- Kosaraju (two passes + reverse graph)
- Tarjan (one pass with a stack)
Kosaraju is simpler to implement and explain. You do:
1) DFS on the original graph to compute finishing order
2) DFS on the reversed graph in decreasing finishing order to extract components
A compact Kosaraju implementation:
```python
from typing import Dict, List

def reverse_graph(neighbors: Dict[int, List[int]]) -> Dict[int, List[int]]:
    rev: Dict[int, List[int]] = {u: [] for u in neighbors}
    for u, vs in neighbors.items():
        for v in vs:
            rev.setdefault(v, []).append(u)
    return rev

def kosaraju_scc(neighbors: Dict[int, List[int]]) -> List[List[int]]:
    rev = reverse_graph(neighbors)
    seen = set()
    order: List[int] = []

    def dfs1(start: int) -> None:
        stack = [(start, 0)]
        seen.add(start)
        while stack:
            u, i = stack[-1]
            vs = neighbors.get(u, [])
            if i >= len(vs):
                order.append(u)
                stack.pop()
                continue
            v = vs[i]
            stack[-1] = (u, i + 1)
            if v not in seen:
                seen.add(v)
                stack.append((v, 0))

    for u in list(neighbors.keys()):
        if u not in seen:
            dfs1(u)

    comps: List[List[int]] = []
    seen2 = set()

    def dfs2(start: int) -> List[int]:
        comp: List[int] = []
        stack = [start]
        seen2.add(start)
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in rev.get(u, []):
                if v not in seen2:
                    seen2.add(v)
                    stack.append(v)
        return comp

    for u in reversed(order):
        if u not in seen2:
            comps.append(dfs2(u))
    return comps
```
In practice, I’ll run SCC detection and then treat any SCC of size > 1 as “this is a cycle cluster.” Then I decide whether it’s acceptable (rare) or it needs refactoring.
Bridges and Articulation Points: “Single Points of Failure” in Graph Form
Some of the most useful graph algorithms aren’t about shortest paths—they’re about resilience.
- A bridge is an edge whose removal increases the number of connected components.
- An articulation point (cut vertex) is a node whose removal disconnects the graph.
In infrastructure terms:
- A bridge can be a single network link that, if severed, partitions your network.
- An articulation point can be a single service or region that everything routes through.
When I’m doing reliability analysis, I’ll often:
1) Build an undirected graph representation of “connectivity.”
2) Find articulation points / bridges.
3) Ask: are these acceptable chokepoints?
The classic approach uses DFS timestamps (Tarjan-style). It’s a bit heavier than BFS/DFS basics, but it pays off when you’re modeling fault domains.
Even if you don’t implement the full algorithm, just adopting the mindset helps: “What edges/nodes are load-bearing?” You can approximate it by simulating removals on small graphs, or by analyzing degrees and betweenness-like signals.
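If you do want the full algorithm, here’s a sketch of Tarjan-style bridge finding with DFS timestamps. It’s recursive for clarity (against my own iterative advice, so treat it as a reference for modest graphs), and it assumes a simple undirected graph with no parallel edges; the example graph is made up:

```python
import sys
from typing import Dict, List, Tuple

def find_bridges(neighbors: Dict[int, List[int]]) -> List[Tuple[int, int]]:
    """disc[u] = DFS discovery time; low[u] = earliest discovery time reachable
    from u's subtree. Edge (u, v) is a bridge iff low[v] > disc[u]."""
    sys.setrecursionlimit(100_000)
    disc: Dict[int, int] = {}
    low: Dict[int, int] = {}
    bridges: List[Tuple[int, int]] = []
    timer = [0]

    def dfs(u: int, parent: int) -> None:
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        for v in neighbors.get(u, []):
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])   # back edge climbs the tree
            else:
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:            # v's subtree can't reach above u
                    bridges.append((u, v))

    for u in list(neighbors):
        if u not in disc:
            dfs(u, -1)
    return bridges

# Two triangles joined by a single link 2-3: that link is the only bridge.
g = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(find_bridges(g))  # [(2, 3)]
```

Articulation points fall out of the same DFS with a slightly different condition on `low` and `disc`.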
Shortest Paths: Picking the Right Tool on Purpose
“Shortest path” is a family of problems. The correct algorithm depends on edge weights.
A selection rule I actually follow
- Unweighted edges: BFS
- Weights are 0 or 1: 0–1 BFS (deque)
- Non-negative weights: Dijkstra
- Negative weights allowed (no negative cycles): Bellman–Ford
- Many queries on a small dense graph: Floyd–Warshall
- DAG with weights: topological order + DP
If you pick Dijkstra when negative weights exist, you can get wrong answers without obvious errors.
Dijkstra (non-negative weights)
Dijkstra is my default for “distance/cost” problems: routing, time estimates, resource cost, and any graph where weights represent something that never goes below zero.
Runnable Dijkstra with a heap:
```python
import heapq
from typing import Dict, List, Tuple

def dijkstra(neighbors: Dict[int, List[Tuple[int, int]]], start: int) -> Dict[int, int]:
    """Returns min distance from start to every reachable node. Requires non-negative weights."""
    INF = 10**18
    dist: Dict[int, int] = {start: 0}
    pq: List[Tuple[int, int]] = [(0, start)]  # (distance, node)
    while pq:
        d, u = heapq.heappop(pq)
        if d != dist.get(u, INF):
            continue  # stale heap entry
        for v, w in neighbors.get(u, []):
            if w < 0:
                raise ValueError("Negative edge weight found; Dijkstra is unsafe")
            nd = d + w
            if nd < dist.get(v, INF):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

if __name__ == "__main__":
    neighbors = {
        0: [(1, 5), (2, 2)],
        1: [(3, 1)],
        2: [(1, 1), (3, 7)],
        3: [],
    }
    print(dijkstra(neighbors, 0))
```
Two production-grade details many snippets skip:
- Stale heap entries are normal; check d != dist[u] and skip
- Guard against negative weights explicitly if your inputs are not fully trusted
If you need the actual path, not just distances, store parent[v] = u when relaxing, similar to the BFS pattern.
0–1 BFS (weights are only 0 or 1)
This algorithm feels like a trick until you need it. If costs are only 0 or 1 (free vs paid, local vs remote, no wall vs wall-break), you can do shortest paths with a deque and get performance similar to BFS.
The mental model: edges of cost 0 go to the front of the deque; edges of cost 1 go to the back.
A runnable 0–1 BFS:
```python
from collections import deque
from typing import Dict, List, Tuple

def zero_one_bfs(neighbors: Dict[int, List[Tuple[int, int]]], start: int) -> Dict[int, int]:
    """Shortest paths where weights are 0 or 1."""
    INF = 10**18
    dist: Dict[int, int] = {start: 0}
    dq = deque([start])
    while dq:
        u = dq.popleft()
        du = dist[u]
        for v, w in neighbors.get(u, []):
            if w not in (0, 1):
                raise ValueError("0-1 BFS requires weights 0 or 1")
            nd = du + w
            if nd < dist.get(v, INF):
                dist[v] = nd
                if w == 0:
                    dq.appendleft(v)  # cost-0 edges: front of the deque
                else:
                    dq.append(v)      # cost-1 edges: back of the deque
    return dist
```
I’ve used this in routing-like problems where “cross-region call” costs 1 and “same-region call” costs 0, and we want a low-crossing path first.
Bellman–Ford (negative edges, and detecting negative cycles)
If negative edges exist, Bellman–Ford is the safe baseline. It’s slower (O(VE)), but it gives you a way to detect negative cycles: after \(V-1\) relaxations, any further improvement means a negative cycle is reachable.
I use it when correctness matters more than speed and the graph is modest in size.
I also use it as a validation tool. If someone insists negative edges can’t exist, I’ll still run a check in staging to prove it.
A Bellman–Ford template that detects reachable negative cycles:
```python
from typing import Dict, List, Optional, Tuple

def bellman_ford(nodes: List[int], edges: List[Tuple[int, int, int]], start: int) -> Optional[Dict[int, int]]:
    INF = 10**18
    dist: Dict[int, int] = {u: INF for u in nodes}
    dist[start] = 0
    # Relax edges V-1 times
    for _ in range(len(nodes) - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] == INF:
                continue
            nd = dist[u] + w
            if nd < dist[v]:
                dist[v] = nd
                changed = True
        if not changed:
            break
    # One more pass to detect negative cycles reachable from start
    for u, v, w in edges:
        if dist[u] == INF:
            continue
        if dist[u] + w < dist[v]:
            return None  # indicates negative cycle reachable
    return dist
```
In many systems, “negative cost” is a sign of a modeling bug. But in finance-like graphs or constraint graphs, negative weights can be legitimate. Either way, you want the algorithmic guardrail.
Floyd–Warshall (all pairs shortest paths)
For small graphs (hundreds of nodes), Floyd–Warshall is straightforward and often faster to implement than building complicated multi-query structures. It’s O(V^3), so it’s not for large V.
Where it shines in practice:
- You have a small fixed graph and lots of queries.
- You’re building a UI that needs fast “distance between any two nodes.”
- You want a simple reference implementation to test against.
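A reference-implementation sketch along those lines (the example graph is mine):

```python
from typing import List, Tuple

def floyd_warshall(n: int, edges: List[Tuple[int, int, int]]) -> List[List[float]]:
    """All-pairs shortest paths on a directed graph. O(V^3): small, fixed graphs only."""
    INF = float("inf")
    dist = [[INF] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)   # keep the cheapest parallel edge
    for k in range(n):                    # allow k as an intermediate node
        for i in range(n):
            dik = dist[i][k]
            if dik == INF:
                continue
            for j in range(n):
                nd = dik + dist[k][j]
                if nd < dist[i][j]:
                    dist[i][j] = nd
    return dist

d = floyd_warshall(4, [(0, 1, 5), (0, 2, 2), (2, 1, 1), (1, 3, 1)])
print(d[0][3])  # 4, via 0 -> 2 -> 1 -> 3
```

After the one-time \(O(V^3)\) cost, every query is an array lookup, which is the whole appeal.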
DAG shortest paths
If the graph is a DAG, you can compute shortest paths in O(V + E) using topological ordering.
This shows up more than people expect because many “workflow” graphs are intentionally acyclic. If your DAG has weights (durations, costs), you can compute:
- Shortest path (minimum total cost)
- Longest path (critical path) if you flip the comparison (and handle weights appropriately)
In deployment pipelines, “critical path” is often more important than “shortest path”: it tells you the minimum time the whole workflow can complete, assuming parallelism.
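A sketch of the DAG pattern: Kahn-style topological order first, then one relaxation pass per edge. The example pipeline and its weights are made up:

```python
from collections import deque
from typing import Dict, List, Tuple

def dag_shortest_paths(neighbors: Dict[int, List[Tuple[int, int]]], start: int) -> Dict[int, float]:
    """Shortest paths in a weighted DAG in O(V + E). Negative weights are fine here;
    for longest (critical) path, flip min to max or negate the weights."""
    indeg: Dict[int, int] = {u: 0 for u in neighbors}
    for u, vs in neighbors.items():
        for v, _ in vs:
            indeg[v] = indeg.get(v, 0) + 1
    q = deque([u for u, d in indeg.items() if d == 0])
    order: List[int] = []
    while q:
        u = q.popleft()
        order.append(u)
        for v, _ in neighbors.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    INF = float("inf")
    dist: Dict[int, float] = {u: INF for u in indeg}
    dist[start] = 0
    for u in order:                      # every edge is relaxed exactly once
        if dist[u] == INF:
            continue
        for v, w in neighbors.get(u, []):
            dist[v] = min(dist[v], dist[u] + w)
    return dist

pipeline = {0: [(1, 3), (2, 1)], 1: [(3, 1)], 2: [(1, 1), (3, 7)], 3: []}
print(dag_shortest_paths(pipeline, 0))  # {0: 0, 1: 2, 2: 1, 3: 3}
```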
A* (when you have a good heuristic)
I don’t reach for A* in most backend work, but it’s worth knowing when it’s the right tool: pathfinding on maps or grids where you have a heuristic like Manhattan distance.
My production take: A* is only as good as the heuristic and the implementation details. If you’re in a non-spatial domain (like service dependencies), you usually don’t have a meaningful heuristic, so A* becomes Dijkstra with extra complexity.
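For completeness, a small grid-pathfinding sketch with a Manhattan-distance heuristic (the maze is made up; 0 = open, 1 = wall). Because the heuristic never overestimates on a 4-connected grid, the result is optimal:

```python
import heapq
from typing import List, Optional, Tuple

def astar_grid(grid: List[List[int]], start: Tuple[int, int], goal: Tuple[int, int]) -> Optional[int]:
    """Minimum number of steps from start to goal on a 4-connected grid, or None."""
    rows, cols = len(grid), len(grid[0])

    def h(p: Tuple[int, int]) -> int:
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan distance

    best = {start: 0}
    pq = [(h(start), 0, start)]  # (g + h, g, cell)
    while pq:
        _, g, (r, c) = heapq.heappop(pq)
        if (r, c) == goal:
            return g
        if g > best.get((r, c), float("inf")):
            continue  # stale heap entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = ng
                    heapq.heappush(pq, (ng + h((nr, nc)), ng, (nr, nc)))
    return None  # unreachable

maze = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(astar_grid(maze, (0, 0), (2, 0)))  # 6 steps around the wall
```

Note that with h always 0 this degenerates to exactly Dijkstra, which is the point made above.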
Minimum Spanning Trees: Connecting Everything as Cheaply as Possible
A minimum spanning tree (MST) is about connecting all vertices in an undirected weighted graph with minimum total cost, without cycles.
Real uses:
- Laying cable between facilities
- Building a low-cost backbone network
- Connecting clusters with minimal total link cost
Two classic algorithms:
- Prim: grows one tree outward (good with adjacency lists and heaps)
- Kruskal: sorts edges and unions sets (great when edges are already available as a list)
Kruskal with DSU (runnable)
```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DSU:
    parent: List[int]
    rank: List[int]

    @classmethod
    def create(cls, n: int) -> "DSU":
        return cls(parent=list(range(n)), rank=[0] * n)

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True

def kruskal_mst(n: int, edges: List[Tuple[int, int, int]]) -> Tuple[int, List[Tuple[int, int, int]]]:
    """Returns (total_cost, mst_edges). Graph assumed undirected."""
    edges_sorted = sorted(edges, key=lambda e: e[2])
    dsu = DSU.create(n)
    total = 0
    mst: List[Tuple[int, int, int]] = []
    for u, v, w in edges_sorted:
        if dsu.union(u, v):
            mst.append((u, v, w))
            total += w
            if len(mst) == n - 1:
                break
    if len(mst) != n - 1:
        raise ValueError("Graph is disconnected; MST does not span all vertices")
    return total, mst

if __name__ == "__main__":
    n = 5
    edges = [
        (0, 1, 4),
        (0, 2, 1),
        (2, 1, 2),
        (1, 3, 1),
        (2, 3, 5),
        (3, 4, 3),
    ]
    print(kruskal_mst(n, edges))
```
When I recommend Prim over Kruskal:
- You already have an adjacency list and don’t want to materialize every edge twice
- The graph is large and sparse and you want to grow from a start node
When I recommend Kruskal:
- You already have the edges as records (common in data pipelines)
- You need a clear “picked edges” list and DSU fits nicely
Prim’s algorithm (heap-based, practical)
Prim grows a tree from an arbitrary start node by repeatedly adding the cheapest edge that connects the current tree to a new node.
A heap-based Prim template:
```python
import heapq
from typing import Dict, List, Tuple

def prim_mst(neighbors: Dict[int, List[Tuple[int, int]]], start: int = 0):
    """Returns (total_cost, mst_edges). Graph assumed undirected and connected."""
    seen = set([start])
    pq: List[Tuple[int, int, int]] = []  # (w, u, v)
    for v, w in neighbors.get(start, []):
        heapq.heappush(pq, (w, start, v))
    total = 0
    mst: List[Tuple[int, int, int]] = []
    while pq:
        w, u, v = heapq.heappop(pq)
        if v in seen:
            continue
        seen.add(v)
        mst.append((u, v, w))
        total += w
        for nxt, nw in neighbors.get(v, []):
            if nxt not in seen:
                heapq.heappush(pq, (nw, v, nxt))
    return total, mst
```
If the graph might be disconnected, I don’t call it “MST” anymore—I call it a “minimum spanning forest” and run Prim/Kruskal per component.
A 2026 Playbook: Mistakes I Still See (and How I Avoid Them)
These are the recurring mistakes that show up in interviews, code reviews, and incident retrospectives.
1) Picking an algorithm before checking assumptions
I see “Dijkstra” thrown at any “shortest path” question, even when:
- edges can be negative
- weights are not additive
- the objective is not a sum (minimax, lexicographic, constraints)
My fix: I write down the contract first.
- Are weights non-negative?
- Are weights integers?
- Is the graph directed?
- Do I need exact paths or just distances?
- One query or many queries?
2) Mixing directed/undirected semantics
This is a quiet killer. Someone builds an adjacency list for a directed graph but adds edges both ways “because it’s easier,” and suddenly cycle detection and topo sort make no sense.
My fix: make directionality explicit, and in tests, include a tiny directed graph where reversing edges changes reachability.
3) Forgetting disconnected graphs exist
In real data, disconnectedness is normal.
- A new service has no dependencies yet.
- A customer’s data is isolated.
- A region is temporarily offline.
My fix: treat “unreachable” as a first-class outcome. BFS returning None is not an exception; it’s information.
4) Recursion depth failures
Recursive DFS is elegant until it hits a deep chain and your process crashes.
My fix: default to iterative DFS in languages with small recursion limits, or wrap recursion in a controlled script with safe limits (and still prefer iterative for services).
5) Using visited incorrectly
Common variants:
- Marking visited too late in BFS (blows up queue)
- In undirected cycle detection, not tracking parent (false cycles)
- Reusing a seen set between multiple traversals (leaks state)
My fix: for each algorithm I keep a known-good template and I rarely “freestyle” the visitation logic.
6) Not returning evidence (paths, cycles, components)
A boolean answer is rarely enough in production.
- If there’s a cycle, I want the cycle.
- If there’s no path, I want the boundary: what was reachable?
- If there’s an SCC, I want the component nodes.
My fix: store parents (or predecessors) and return witness structures whenever possible.
7) Not budgeting time/memory
Graph algorithms can be deceptively expensive.
- BFS/DFS are O(V + E) but can still be huge if you materialize neighbors for millions of nodes.
- Bellman–Ford is O(VE) and will melt if you throw it at a big graph.
- Floyd–Warshall is O(V^3) and is only for small graphs.
My fix: I set explicit budgets (node/edge limits, timeouts), and for large systems I lean on:
- restricting the query to a subgraph
- caching expensive results
- precomputing summaries (like SCCs or reachability in a DAG)
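One way to make a budget explicit is to bake it into the traversal itself and report whether it fired; a sketch with an assumed node limit:

```python
from collections import deque

def bfs_with_budget(adj, start, max_nodes=10_000):
    """Traversal with an explicit node budget.
    Returns (visited, truncated) so callers can distinguish a complete
    result from one that hit the limit."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt in seen:
                continue
            if len(seen) >= max_nodes:
                return seen, True   # budget exhausted: partial result
            seen.add(nxt)
            queue.append(nxt)
    return seen, False
```

Returning the `truncated` flag matters: a partial reachability set silently treated as complete is a much worse bug than a visible limit.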
Performance and Scaling: The Stuff That Matters After the Whiteboard
Most of the time, the big-O story is correct but incomplete. Practical performance comes down to memory layout, constant factors, and how much of the graph you actually touch.
Sparse vs dense in the real world
Most application graphs are sparse, but “sparse” can still be huge:
- 50 million nodes with average degree 3 is still 150 million directed edges.
At that point:
- Python dict-of-lists might be too heavy.
- You may need a compact representation (arrays, CSR, on-disk adjacency).
- You may need to process the graph in batches.
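As a rough illustration of what "compact representation" means, here is a minimal CSR (compressed sparse row) packing sketch using typed arrays instead of a dict of lists (this is a toy, not a production layout):

```python
import array

def to_csr(n, edges):
    """Pack a directed edge list into CSR form:
    targets[offsets[u]:offsets[u+1]] are u's neighbors.
    Two flat integer arrays instead of millions of Python list objects."""
    counts = [0] * n
    for u, _ in edges:
        counts[u] += 1
    offsets = array.array('q', [0] * (n + 1))
    for u in range(n):
        offsets[u + 1] = offsets[u] + counts[u]
    targets = array.array('q', [0] * len(edges))
    fill = list(offsets[:n])          # next write position per source node
    for u, v in edges:
        targets[fill[u]] = v
        fill[u] += 1
    return offsets, targets

def neighbors(offsets, targets, u):
    return targets[offsets[u]:offsets[u + 1]]
```

The win is memory locality and overhead: each edge costs one machine integer rather than a boxed Python int inside a list inside a dict.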
Don’t load the whole graph if you only need a neighborhood
A ton of production graph queries are localized:
- “What are the dependencies within 2 hops of this service?”
- “What devices are connected to this user within 3 edges?”
That’s BFS with a depth limit, and it saves you from scanning the universe.
Depth-limited BFS is a pattern I use constantly:
- Keep `(node, depth)` in the queue.
- Stop expanding when depth == limit.
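A sketch of that pattern; here I track depth in a dict rather than tupling it into the queue, which gives you the per-node depths for free:

```python
from collections import deque

def bfs_depth_limited(adj, start, limit):
    """Explore only the neighborhood within `limit` hops of start.
    Returns {node: depth} for the visited subgraph."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == limit:
            continue  # at the hop limit: do not expand further
        for nxt in adj.get(node, []):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth
```

On a 50-million-node graph, a 2-hop query like this touches hundreds of nodes instead of millions.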
Immutability and snapshots
If you run algorithms while edges are changing (services being deployed, relationships being updated), your results can become inconsistent.
My fix:
- For online systems, I run algorithms on a snapshot (versioned graph, read transaction, or a frozen adjacency export).
- If “eventual consistency” is okay, I make that explicit in the API contract.
Parallelism: it helps less than you think (until it helps a lot)
Some graph problems parallelize nicely (processing components independently, running multiple BFS queries). Others don’t (Dijkstra is hard to parallelize efficiently in general).
My practical approach:
- Parallelize across queries, not within one query, unless you really know the workload.
- Cache and reuse results: if you constantly ask shortest paths from the same source, store the distance map.
Testing and Debugging Graph Code
Graph code can be correct on the happy path and still fail on edge cases. I keep a small “graph test kit” mindset.
The tiny cases I always test
- Empty graph
- Single node, no edges
- Self-loop (u → u)
- Two nodes with one edge
- Disconnected graph with two components
- Triangle (odd cycle) for bipartite checks
- Directed cycle of length 2 or 3 for topo sort failure
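These cases are cheap to encode once. A sketch against a simple reachability helper (the helper is illustrative, not from a library):

```python
from collections import deque

def reachable(adj, start):
    """Plain BFS reachability: the kind of function these tiny cases stress."""
    seen = {start}
    queue = deque([start])
    while queue:
        for nxt in adj.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# A few of the tiny cases from the list above:
assert reachable({}, 'a') == {'a'}                             # no edges
assert reachable({'u': ['u']}, 'u') == {'u'}                   # self-loop
assert reachable({'a': ['b']}, 'a') == {'a', 'b'}              # one edge
assert reachable({'a': ['b'], 'c': ['d']}, 'a') == {'a', 'b'}  # disconnected
```

Each of these has caught a real bug for me at least once; the self-loop and disconnected cases especially.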
Cross-check with a slower reference
When I’m not 100% confident, I’ll keep a slow but simple implementation around for small graphs and cross-check outputs. For example:
- For shortest paths, run BFS (unweighted) or Floyd–Warshall (small graphs) to validate Dijkstra results on random tiny graphs.
- For MST, validate the chosen edges are acyclic and connect all nodes, and compare total cost against a brute-force approach for n <= 8.
This is the kind of testing that catches “one-line” bugs early.
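As a concrete instance of the cross-check idea: on random tiny graphs with unit weights, Dijkstra distances must equal BFS hop counts. A sketch:

```python
import heapq
import random
from collections import deque

def bfs_dist(adj, src):
    """Reference: hop distances via BFS (correct for unit weights)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def dijkstra_unit(adj, src):
    """Implementation under test, run with every edge weight = 1."""
    dist = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in dist:
            continue
        dist[u] = d
        for v in adj.get(u, []):
            if v not in dist:
                heapq.heappush(heap, (d + 1, v))
    return dist

# Cross-check on random tiny graphs.
random.seed(42)
for _ in range(50):
    n = random.randint(1, 8)
    adj = {u: [v for v in range(n) if v != u and random.random() < 0.3]
           for u in range(n)}
    assert bfs_dist(adj, 0) == dijkstra_unit(adj, 0)
```

Fifty graphs of up to eight nodes run in milliseconds and shake out off-by-one and visitation bugs surprisingly well.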
Logging that helps (not noise)
When graph logic is involved in an incident, I want logs that answer:
- What was the start node?
- How many nodes/edges were in the traversed subgraph?
- Did we hit a budget/time limit?
- If we found a cycle/path, what is the witness sequence?
I’m a big fan of returning structured debug info behind a feature flag, especially for internal tooling.
Production Patterns: Graph Thinking That Actually Ships
Algorithms are great, but systems are messy. Here are the patterns I keep coming back to.
Blast radius analysis
In incident response, I’ll model service calls as a directed graph and ask:
- From the failing node, what downstream nodes are reachable within k hops?
- What upstream nodes can reach the failing node?
That’s two BFS/DFS runs (forward and reverse graph), often with hop limits.
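A sketch of both directions, assuming the call graph is a dict of callee lists (names are mine):

```python
from collections import deque

def within_k_hops(adj, start, k):
    """Nodes reachable from start within k hops (forward blast radius)."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        if depth[u] == k:
            continue
        for v in adj.get(u, []):
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return set(depth)

def reverse_graph(adj):
    """Flip edge direction so 'who can reach me' becomes a forward BFS."""
    rev = {}
    for u, nbrs in adj.items():
        for v in nbrs:
            rev.setdefault(v, []).append(u)
    return rev

# calls: a -> b -> c, and d -> b
calls = {'a': ['b'], 'b': ['c'], 'd': ['b']}
downstream = within_k_hops(calls, 'b', 2)               # what b can break
upstream = within_k_hops(reverse_graph(calls), 'b', 2)  # what can break b
```

Materializing the reverse graph once per incident is usually cheap; for very large graphs you'd precompute and maintain it.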
Dependency hygiene
For dependency graphs, I’ll run:
- cycle detection (reject or warn)
- SCC detection (report clusters)
- topo sort (provide ordering)
This becomes “graph linting.” It’s one of the highest ROI uses of graph algorithms in engineering orgs.
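Cycle detection and topo sort come bundled together in Kahn's algorithm, which is the core of most of my "graph linting" checks; a sketch that also returns evidence on failure:

```python
from collections import deque

def topo_sort(adj):
    """Kahn's algorithm: (order, None) for a DAG, or (None, leftover)
    when a cycle blocks the sort. `leftover` contains the cycle's nodes
    plus anything reachable only through them."""
    nodes = set(adj)
    for nbrs in adj.values():
        nodes.update(nbrs)
    indeg = {u: 0 for u in nodes}
    for nbrs in adj.values():
        for v in nbrs:
            indeg[v] += 1
    queue = deque(u for u in nodes if indeg[u] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if len(order) < len(nodes):
        return None, nodes - set(order)  # witness: the un-sortable nodes
    return order, None
```

The leftover set is the linting report: it names exactly which dependencies are tangled, which is far more actionable than "cycle detected: true".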
“Cheapest connection” vs “best redundancy”
MST gives you the cheapest connectivity, but it can be fragile: it intentionally removes redundancy.
If you’re designing networks for resilience, you might need:
- k-edge-connected or k-vertex-connected considerations
- backup paths
- constraints like “must include at least one link per region”
I treat MST as a baseline, not a final answer, when the domain cares about fault tolerance.
Closing Thought
Once you start looking for graphs, you see them everywhere: dependencies, routes, state transitions, relationships, and failure propagation. The biggest shift for me wasn’t memorizing algorithms—it was learning to ask the right question:
- Is this reachability, ordering, connectivity, or optimization?
- What are the constraints (weights, direction, cycles, scale)?
- What evidence do I need to return to make the result actionable?
When those questions are clear, the algorithm choice stops being a guessing game. It becomes an engineering decision with a contract, a cost, and a failure mode you understand.


