System Design for Bunnies: The Case of the Missing Messenger
Why your code crashes when the "Other Guy" stops working... and how to build a burrow that survives.
Welcome to 2026!
If your New Year’s resolution was to “write better code,”
I have bad news... even perfect code breaks.
In the world of “Scale”, where we handle billions of events and traffic spikes that look like vertical walls, the biggest threat to your system usually isn’t your code. It’s the code you depend on.
Today, we’re going to talk about Resilience Engineering. But because we keep things simple here, we’re going to talk about the FastPaws Carrot Delivery Service.
The Happy Path: A Bunny with a Plan
Imagine a simple feature. A customer orders a carrot. We deliver the carrot. Then, we send them an SMS saying, “Enjoy your crunch!”
In the code, it looks innocent enough:
```js
async function deliverCarrot(order) {
  await database.save(order);
  await inventory.decrement();
  // The line that will ruin your life
  await messengerBunny.sendSMS("Your carrot is here!");
  return "Success!";
}
```
On a normal Tuesday, this works perfectly. The Messenger Bunny runs to the third-party SMS Burrow, drops off the message, and comes back in ~50 milliseconds.
But today isn’t Tuesday. Today is the Great Carrot Festival (think Christmas Sale). Traffic is up 100x. And suddenly, the SMS Burrow goes silent.
The Crisis: The Infinite Wait
When the SMS provider (let’s call them BunnySMS) has an outage, they don’t always reject your connection immediately. Sometimes, they just... hang.
In our code, that await messengerBunny.sendSMS() line is a trap.
Your Delivery Bunny runs to the SMS office and knocks on the door.
No answer. So, the bunny waits... and waits... and waits!!
Meanwhile, a new customer orders a carrot. A new Delivery Bunny is dispatched. They also run to the SMS office, knock, and wait... and wait!!
Within minutes, you have 5,000 Delivery Bunnies standing outside the SMS office, doing absolutely nothing. You see where this is going?
The Engineering Reality: This is called Resource Starvation.
Your database is fine. Your servers are healthy. But your application has crashed because every single available bunny (or thread) is stuck waiting on a third-party service that isn’t responding.
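To see why this is so dangerous in code, here is a toy sketch (purely illustrative, not the real FastPaws service): a promise that never settles keeps the request awaiting it pending forever.

```js
// Toy illustration only: a downstream call with no timeout.
// The socket "opens" but BunnySMS never replies, so nothing
// ever calls resolve() or reject() and the await never returns.
function sendSMSNoTimeout(message) {
  return new Promise(() => { /* silence from the SMS Burrow */ });
}

async function handleOrder(order) {
  // This handler is now stuck for the life of the process,
  // holding its memory and socket. Multiply by 5,000 orders.
  await sendSMSNoTimeout("Your carrot is here!");
}
```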
Solution 1: The Stopwatch (Timeouts)
The first rule of distributed systems: Never trust the network.
We need to give every Delivery Bunny a stopwatch. We tell them: “If the SMS office doesn’t open the door in 2 seconds, walk away.”
```js
await messengerBunny.sendSMS(message, {
  timeout: 2000 // 2 seconds
});
```
This prevents the pile-up. The bunnies wait 2 seconds, fail, and free themselves up to handle the next order.
In plain language: most HTTP clients have no timeout by default; they will happily wait forever. Always set a limit!
Advanced tip: Your timeout should be shorter than your SLA. If your API promises a response in 500ms, your internal timeout cannot be 2 seconds.
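As a concrete example, the fetch built into modern Node.js and browsers supports this via AbortSignal.timeout. The BunnySMS endpoint and payload below are made up for illustration.

```js
// A minimal sketch: abort the request if the SMS Burrow
// doesn't answer within 2 seconds. The endpoint is hypothetical.
async function sendSMS(message) {
  const response = await fetch("https://api.bunnysms.example/send", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: message }),
    signal: AbortSignal.timeout(2000), // rejects with a TimeoutError on expiry
  });
  if (!response.ok) {
    throw new Error(`BunnySMS responded with ${response.status}`);
  }
  return response;
}
```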
Solution 2: The Trap of Retries
So the bunny walks away. But the message wasn’t sent! A diligent bunny might think: “I should try again immediately.”
Do not do this.
If 10,000 bunnies fail and immediately retry, they will hit the SMS office again at the exact same millisecond. This is called a Thundering Herd. If the SMS office was struggling before, you just murdered it.
The Fix: Exponential Backoff with Jitter
Instead of knocking immediately, the bunny waits.
Attempt 1: Wait 1 second.
Attempt 2: Wait 2 seconds.
Attempt 3: Wait 4 seconds.
And we add Jitter (randomness): e.g., Bunny A waits 1.1 seconds while Bunny B waits 0.9 seconds.
This spreads the load so the retries don’t all land at the same instant.
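Here is a minimal sketch of that idea using "full jitter" (each delay is a random fraction of the exponential cap), assuming the sendSMS from earlier throws on failure:

```js
// Exponential backoff with full jitter: 1s, 2s, 4s caps,
// each scaled by a random factor so retries don't align.
async function sendWithRetry(message, maxAttempts = 3) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await sendSMS(message);
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err; // out of retries
      const cap = 1000 * 2 ** attempt;            // 1000, 2000, 4000 ms
      const delay = Math.random() * cap;          // full jitter
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```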
I wrote an article on queues earlier. (This is also a great time to remind you that failed messages shouldn’t be lost; they should be pushed to a Dead Letter Queue!)
Solution 3: The Circuit Breaker (The Guard Bunny)
If the SMS office is on fire, why are we still sending bunnies there to check? It’s a waste of time.
Enter the Circuit Breaker pattern. Imagine we hire a Guard Bunny to stand at the gate.
Closed State (Normal): The Guard lets bunnies through.
Open State (Broken): The Guard notices that 5 bunnies came back crying in the last minute. He locks the gate.
Now, when you call sendSMS(), the Guard immediately says “No. It’s down.” No waiting. No timeouts. Instant failure. Your system’s speed returns to normal.
Half-Open State (Testing): After 5 minutes, the Guard lets one brave bunny through.
If they succeed? The Guard unlocks the gate and normal traffic resumes.
If they fail? The gate stays locked.
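In code, a toy Guard Bunny might look like this sketch; the thresholds and cooldown are illustrative, and production-grade libraries handle many more edge cases.

```js
// A toy circuit breaker (the Guard Bunny). Thresholds and
// timings are illustrative, not production-tuned.
class CircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 5 * 60 * 1000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = null; // non-null means the gate is locked
  }

  async call(fn) {
    if (this.openedAt !== null &&
        Date.now() - this.openedAt < this.cooldownMs) {
      // Open state: fail fast. No waiting, no timeouts.
      throw new Error("Circuit open: BunnySMS is down");
    }
    // Closed state, or half-open: one brave bunny goes through.
    try {
      const result = await fn();
      this.failures = 0;    // success: unlock the gate
      this.openedAt = null;
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) {
        this.openedAt = Date.now(); // lock the gate
      }
      throw err;
    }
  }
}

// Usage: const guard = new CircuitBreaker();
// await guard.call(() => sendWithRetry("Your carrot is here!"));
```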
Solution 4: Fallbacks (The Plan B)
So the Circuit Breaker is open. We can’t send SMS. Do we crash the order?
No.
You have to ask yourself: What is the core value?
The core value is The Carrot. The SMS is just a nice-to-have side effect.
Graceful Degradation:
If the SMS fails, we catch the error and:
Log it to a file to retry tonight.
Send an email instead (using a different system).
Or simply do nothing.
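Putting the pieces together, a sketch of that catch-and-degrade flow might look like this, reusing the earlier sketches (emailBunny and deadLetterQueue are hypothetical stand-ins for your own secondary channel and queue):

```js
// Graceful degradation: the carrot is the core value; the SMS
// is best-effort. emailBunny and deadLetterQueue are hypothetical.
async function deliverCarrot(order) {
  await database.save(order);
  await inventory.decrement();
  try {
    await guard.call(() => sendWithRetry("Your carrot is here!"));
  } catch (smsError) {
    // Plan B: try a different channel, then park the message for later.
    try {
      await emailBunny.send(order.email, "Your carrot is here!");
    } catch {
      await deadLetterQueue.push({ order, reason: smsError.message });
    }
  }
  return "Success!"; // the order succeeds either way
}
```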
The user gets their carrot.
They might not get the text, but they aren’t hungry.
That is a win.
Conclusion: Optimizing for Failure
When you are building for scale... whether it’s for a startup or for a billion cricket fans... resilience isn’t about preventing failure. Things will fail. Cables get cut, data centers lose power, and vendors push bad code.
To the Junior Engineer: Your job isn’t just to make it work on the “Happy Path.” It’s to make sure that when the path breaks, your users don’t fall off a cliff.
To the Senior Engineer: Optimizing for success is engineering; optimizing for failure is architecture.
Here’s to a resilient 2026. May your latencies be low and your circuit breakers always reset. Everyone’s approach can differ depending on the scenario; System Design is never “Absolute”. It is a set of decisions you make based on the tools you have at the time. The tools keep evolving, and you need to keep learning the new ones 🙂
Have you battled a rogue downstream dependency recently? Come share your war stories at the next Impromptu Meetup. I’ll bring the coffee; you bring the architecture diagrams.
You can also follow me on other social platforms: https://bio.link/frontend