- Full-Stack Observability: the alert bot notifies the team when Redis memory usage reaches high levels, ensuring proactive detection of backend bottlenecks
- Stress Test Alert: the monitoring stack successfully captures a P95 latency spike to 1.76 s during a high-concurrency user simulation
- Chaos Engineering: when the Flask app is detected as DOWN, Docker Swarm automatically triggers the self-healing process to restore service
- Grafana dashboard
- Metrics page
- JSON logs
Inspiration
In a world where digital services are the backbone of society, "it works on my machine" is no longer enough. We were inspired by the concept of Anti-fragility—the idea that a system shouldn't just withstand stress, but get better (or at least self-heal) because of it. We set out to build ProdBreaker, a fortress-like infrastructure where uptime is a guarantee, not a guess.
What it does
ProdBreaker is a web ecosystem that handles high-traffic data ingestion through a robust Flask backend, balances load across multiple service replicas, and features a proactive "Observer" layer that detects failures and heals the system in real time, without human intervention.
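A minimal sketch of the backend's health surface, assuming a Flask app (the `/health` route name is an illustrative assumption, not necessarily the actual ProdBreaker endpoint):

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health")
def health():
    # Liveness probe polled by the observer layer / orchestrator.
    # A non-200 response (or a timeout) marks this replica unhealthy
    # and lets Docker Swarm replace it automatically.
    return jsonify(status="ok"), 200
```

An endpoint like this is what lets the "Observer" layer distinguish a live replica from a dead one without human intervention.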
How we built it
We built a multi-layered "Defense-in-Depth" architecture:
- Orchestration: Leveraged Docker Swarm to manage 3 healthy replicas of our web service.
- Traffic Management: Configured an Nginx Reverse Proxy to act as an intelligent load balancer.
- Data Guardrails: Engineered a fault-tolerant CSV ingestion pipeline to prevent malformed data from causing system-wide crashes.
- Observability Stack: Integrated Prometheus for metric scraping, Grafana for visualization, and Alertmanager for real-time Discord notifications.
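The "Data Guardrails" idea above can be sketched as a tolerant CSV reader that quarantines bad rows instead of crashing; function and field names here are illustrative assumptions, not the actual pipeline:

```python
import csv
import io

def ingest_csv(raw_text, required_fields=("id", "value")):
    """Parse CSV text, returning (good_rows, bad_rows).

    Malformed or incomplete rows are collected with their line number
    rather than raised, so one corrupt record cannot take down the
    whole ingestion pipeline.
    """
    good, bad = [], []
    reader = csv.DictReader(io.StringIO(raw_text))
    for lineno, row in enumerate(reader, start=2):  # header is line 1
        try:
            # Reject rows missing a required field.
            if any(row.get(f) in (None, "") for f in required_fields):
                raise ValueError("missing required field")
            # DictReader files extra columns under the key None.
            if None in row:
                raise ValueError("too many columns")
            row["value"] = float(row["value"])  # type guardrail
            good.append(row)
        except (ValueError, TypeError) as exc:
            bad.append({"line": lineno, "error": str(exc)})
    return good, bad
```

Downstream code then processes `good` normally and logs or alerts on `bad`, which is the behavior the stress and chaos tests rely on.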
Challenges we ran into
- The Integration Maze: Merging local development environments with a production-grade Swarm cluster led to complex networking and configuration conflicts that required disciplined version control.
- Failure Prediction: It’s easy to monitor "if" something is down; it’s hard to define "why." We had to conduct a deep Failure Mode and Effects Analysis (FMEA) to map out recovery paths for edge cases like database connection timeouts and memory leaks.
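For connection-timeout edge cases like those mapped in the FMEA, one common recovery path is retry with exponential backoff. A minimal sketch, where the `connect` callable and the limits are illustrative assumptions:

```python
import time

def connect_with_retry(connect, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call `connect()` up to `attempts` times, doubling the delay each try.

    Re-raises the last error if every attempt fails, so the caller can
    escalate (e.g. fire an Alertmanager -> Discord notification) instead
    of hanging forever on a dead database.
    """
    last_err = None
    for attempt in range(attempts):
        try:
            return connect()
        except (ConnectionError, TimeoutError) as exc:
            last_err = exc
            sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
    raise last_err
```

The injectable `sleep` parameter keeps the backoff testable without real delays.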
Accomplishments that we're proud of
- Gold Tier Achievement: Successfully met the rigorous requirements for the Scalability and Incident Response quests.
- Validated Self-Healing: We successfully simulated "Chaos Mode" where the system detected its own death and resurrected itself while notifying the team via Discord.
- Robustness: Our backend survived intentionally corrupted data inputs that would have crashed a standard implementation.
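The self-healing behavior validated in "Chaos Mode" typically comes from a Swarm service healthcheck combined with a restart policy. A hedged docker-compose sketch, where the service name, image, port, and thresholds are assumptions:

```yaml
services:
  web:
    image: prodbreaker-web:latest   # illustrative image name
    deploy:
      replicas: 3
      restart_policy:
        condition: on-failure       # Swarm reschedules a dead replica
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://localhost:5000/health || exit 1"]
      interval: 10s
      timeout: 3s
      retries: 3                    # three failures => container marked unhealthy
```

With a config along these lines, killing a container makes Swarm detect the failure and start a replacement, while Alertmanager handles the Discord notification in parallel.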
What we learned
We learned that Reliability is a culture, not just a config file. We realized that writing a Failure Manual and documenting recovery procedures is just as critical to a project's success as the code itself. High availability is only possible when you "design for failure."
What's next for ProdBreaker
We plan to adopt Kubernetes for more granular auto-scaling and explore canary deployments so that even our updates don't compromise the rock-solid stability we've built.
Built With
- alertmanager
- discordwebhooks
- docker
- flask
- grafana
- gunicorn
- javascript
- k6
- nginx
- peewee
- postgresql
- prometheus
- pytest
- python
- redis7
- uv