The Comfortable Lie: Why We Don’t Actually Learn From Our Mistakes
We love a good comeback story. The entrepreneur who failed three times before striking it rich. The developer who learnt from a catastrophic production incident and never made ‘that mistake’ again. We tell these stories because they’re comforting—they suggest that failure has a purpose, that our pain is an investment in wisdom.
But what if this narrative is mostly fiction? What if, in the contexts where we most desperately want to learn from our mistakes—complex, adaptive systems like software development—it’s not just difficult to learn from failure, but actually impossible in any meaningful way?
The Illusion of Causality
Consider a typical software development post-mortem. A service went down at 2 AM. After hours of investigation, the team identifies the culprit: an innocuous configuration change made three days earlier, combined with a gradual memory leak, triggered by an unusual traffic pattern, exacerbated by a caching strategy that seemed fine in testing. The conclusion? ‘We learnt that we need better monitoring for memory issues and more rigorous review of configuration changes.’
But did they really learn anything useful?
The problem is that this wasn’t a simple cause-and-effect situation. It was the intersection of dozens of factors, most of which were present for months or years without issue. The memory leak existed in production for six months. The caching strategy had been in place for two years. The configuration change was reviewed by three senior engineers. None of these factors alone caused the outage—it required their precise combination at that specific moment.
In complex adaptive systems, causality is not linear. There’s no single mistake to point to, no clear lesson to extract. The system is a web of interacting components where small changes can have outsized effects, where the same action can produce wildly different outcomes depending on context, and where the context itself is always shifting.
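The post-mortem scenario above can be sketched as a toy model. The numbers and factor names here are invented purely for illustration; the point is structural: each factor alone leaves plenty of headroom, and only the full combination crosses the limit.

```python
def outage(leak_days: int, new_config: bool, unusual_traffic: bool) -> bool:
    """Toy model of a combinatorial failure. Memory headroom shrinks
    slowly with the leak; the config change and the unusual traffic
    pattern each consume more of it. No single factor is fatal."""
    headroom_mb = 1024 - 4 * leak_days   # slow leak: 4 MB/day (invented)
    if new_config:
        headroom_mb -= 200               # larger cache entries (invented)
    if unusual_traffic:
        headroom_mb -= 300               # burst of cold-cache misses (invented)
    return headroom_mb <= 0

# Six months of leak alone: fine. Config change plus traffic spike on a
# fresh process: fine. Any two of the three together: still fine.
assert not outage(180, False, False)
assert not outage(0, True, True)
assert not outage(180, True, False)
assert not outage(180, False, True)

# All three at once: outage.
assert outage(180, True, True)
```

Notice what this implies for the 'lesson': fixing any one factor would have prevented *this* outage, which is exactly why post-mortems so confidently identify a 'root cause' that was never sufficient on its own.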
The Context Problem
Here’s what makes this especially insidious: even if we could perfectly understand what went wrong, that understanding is locked to a specific moment in time. Software systems don’t stand still. By the time we’ve finished our post-mortem, the team composition has changed, two dependencies have been updated, traffic patterns have evolved, and three new features have been deployed. The system we’re analysing no longer exists.
This is why the most confident proclamations—'We'll never let this happen again'—are often followed by remarkably similar failures. Not because teams are incompetent or negligent, but because they're trying to apply lessons from System A to System B, when System B only superficially resembles its predecessor. The lesson learnt is 'don't deploy configuration changes on Fridays without additional review', but the next incident happens on a Tuesday, with a code change that went through extensive review. Was the lesson wrong? Or was it just irrelevant to the new context?
The Narrative Fallacy
Humans are storytelling machines. When something goes wrong, we instinctively construct a narrative that makes sense of the chaos. We identify villains (the junior developer who made the change), heroes (the senior engineer who diagnosed the issue), and a moral (the importance of code review). These narratives feel true because they’re coherent.
But coherence is not the same as accuracy. In the aftermath of failure, we suffer from hindsight bias—knowing the outcome, we see a clear path from cause to effect that was never actually clear at the time. We say ‘the warning signs were there’ when in reality those same ‘warning signs’ are present all the time without incident. We construct a story that couldn’t have been written before the fact.
This is why war stories in software development are simultaneously compelling and useless. The grizzled veteran who regales you with tales of production disasters is imparting wisdom that feels profound but often amounts to 'this specific thing went wrong in this specific way in this specific system at this specific time'. And those specifics rarely generalise. The lesson learnt is over-fitted to a single data point.
Emergence and Irreducibility
Complex adaptive systems exhibit emergence—behaviour that arises from the interaction of components but cannot be predicted by analysing those components in isolation (cf. Buckminster Fuller's Synergetics). Your microservices architecture might work perfectly in testing, under load simulation, and even in production for months. Then one day, a particular sequence of requests, combined with a specific distribution of data across shards, triggers a cascade failure that brings down the entire system.
You can’t ‘learn’ to prevent emergent failures because you can’t predict them. They arise from the system’s complexity itself. Adding more tests, more monitoring, more safeguards—these changes don’t eliminate emergence, they just add new components to the complex system, creating new possibilities for emergent behaviour.
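A classic example of safeguards creating emergence is the retry storm. The sketch below is a deliberately simplified model, with invented numbers: each client's policy (retry failed requests) is sensible in isolation, but collectively the retries amplify load on an already-struggling server.

```python
def effective_load(clients: int, capacity: int, retries: int = 1) -> int:
    """Toy retry-storm model. Requests beyond capacity time out; each
    timed-out request is retried, so the offered load in later rounds is
    the original clients plus retries of the previous round's failures."""
    load = clients
    for _ in range(retries):
        failed = max(0, load - capacity)
        load = clients + failed
    return load

# Under capacity, retries are invisible: the safeguard looks free.
assert effective_load(90, 100, retries=3) == 90

# A mere 10% overload snowballs: retries of failures cause more failures,
# which cause more retries. Three rounds turn 110 requests into 140.
assert effective_load(110, 100, retries=3) == 140
```

The retry logic is a per-component safeguard, and in testing (where the server is never saturated) it only ever helps. The amplification exists nowhere in any single component; it emerges from their interaction under conditions nobody simulated.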
The Adaptation Trap
Here’s the final twist: complex adaptive systems adapt. When you implement a lesson learnt, you’re not just fixing a problem—you’re changing the system. And when the system changes, the behaviours that emerge from it change too.
Add comprehensive monitoring after an outage? Now developers start relying on monitoring as a crutch, writing less defensive code because they know they’ll be alerted to issues. Implement mandatory code review after a bad deployment? Now developers become complacent, assuming that anything that passed review must be safe. The system adapts around your interventions, often in ways that undermine their original purpose.
This isn’t a failure of implementation—it’s a fundamental characteristic of complex adaptive systems. They don’t have stable equilibrium points. Every intervention shifts the system to a new state with its own unique vulnerabilities.
So What Do We Do?
If we can’t learn from our mistakes in any straightforward way, what’s the alternative? Are we doomed to repeat the same failures for ever?
Not quite. The solution is to stop pretending we can extract universal lessons from specific failures and instead focus on building systems that are resilient to the inevitable surprises we can’t predict.
This means designing for graceful degradation rather than preventing all failures. It means building systems that can absorb shocks and recover quickly rather than systems that need to be perfect. It means accepting that production is fundamentally different from any testing environment and that the only way to understand system behaviour is to observe it in production with real users and real data.
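One minimal form of graceful degradation is the fallback: absorb a failure in a non-critical dependency by serving a known-good (if stale or empty) result instead of failing the whole request. The sketch below is illustrative; the function names are invented, and a production system would layer on timeouts, circuit breakers, and observability.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

def with_fallback(primary: Callable[[], T], fallback: T) -> T:
    """Degrade gracefully: try the primary call, and on any failure
    return a safe fallback value rather than propagating the error."""
    try:
        return primary()
    except Exception:
        return fallback

def flaky_recommendations() -> list:
    # Hypothetical non-critical dependency that is currently down.
    raise TimeoutError("recommendation service unavailable")

# The page renders without recommendations instead of returning a 500.
assert with_fallback(flaky_recommendations, fallback=[]) == []
assert with_fallback(lambda: ["a", "b"], fallback=[]) == ["a", "b"]
```

The design choice worth noticing: this code makes no attempt to predict *why* the dependency fails. It assumes failure will happen for reasons nobody anticipated, and bounds the blast radius instead.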
It also means being humble. Every post-mortem that ends with ‘we’ve identified the root cause and implemented fixes to prevent this from happening again’ is cosplaying certainty in a domain defined by uncertainty. A more honest conclusion might be: ‘This is what we think happened, given our limited ability to understand complex systems. We’re making some changes that might help, but we acknowledge that we’re also potentially introducing new failure modes we haven’t imagined yet.’
The Productivity of Failure
None of this means that failures are useless. Incidents do provide value—they reveal the system’s boundaries, expose hidden assumptions, and force us to confront our mental models. But the value isn’t in extracting a tidy lesson that we can apply next time. The value is in the ongoing process of engaging with complexity, building intuition through repeated exposure, and developing a mindset that expects surprise rather than seeking certainty.
The developer who has been through multiple production incidents isn’t valuable because they’ve learnt ‘lessons’ they can enumerate. They’re valuable because they’ve internalised a posture of humility, an expectation that systems will fail in ways they didn’t anticipate, and a comfort with operating in conditions of uncertainty.
That’s not the same as learning from mistakes. It’s something both more modest and more useful: developing wisdom about the limits of what we can learn.
The next time you hear someone confidently declare that they’ve learnt from a mistake, especially in a complex domain like software development, be sceptical. Not because they’re lying or incompetent, but because they’re human—and we all want to believe that our suffering has purchased something more substantial than just the experience of suffering. The truth is messier and less satisfying: in complex adaptive systems, the best we can hope for is not wisdom, but the wisdom to know how little wisdom we can extract from any single experience.
Further Reading
Allspaw, J. (2012). Fault injection in production: Making the case for resilience testing. Queue, 10(8), 30-35. https://doi.org/10.1145/2346916.2353017
Dekker, S. (2011). Drift into failure: From hunting broken components to understanding complex systems. Ashgate Publishing.
Dekker, S., & Pruchnicki, S. (2014). Drifting into failure: Theorising the dynamics of disaster incubation. Theoretical Issues in Ergonomics Science, 15(6), 534-544. https://doi.org/10.1080/1463922X.2013.856495
Fischhoff, B. (1975). Hindsight ≠ foresight: The effect of outcome knowledge on judgment under uncertainty. Journal of Experimental Psychology: Human Perception and Performance, 1(3), 288-299. https://doi.org/10.1037/0096-1523.1.3.288
Hollnagel, E., Woods, D. D., & Leveson, N. (Eds.). (2006). Resilience engineering: Concepts and precepts. Ashgate Publishing.
Kahneman, D. (2011). Thinking, fast and slow. Farrar, Straus and Giroux.
Kahneman, D., Slovic, P., & Tversky, A. (Eds.). (1982). Judgment under uncertainty: Heuristics and biases. Cambridge University Press.
Leveson, N. G. (2012). Engineering a safer world: Systems thinking applied to safety. MIT Press.
Perrow, C. (1999). Normal accidents: Living with high-risk technologies (Updated ed.). Princeton University Press. (Original work published 1984)
Roese, N. J., & Vohs, K. D. (2012). Hindsight bias. Perspectives on Psychological Science, 7(5), 411-426. https://doi.org/10.1177/1745691612454303
Woods, D. D., & Allspaw, J. (2020). Revealing the critical role of human performance in software. Queue, 18(2), 48-71. https://doi.org/10.1145/3406065.3394867