- Full-Stack Observability: the alert bot notifies the team when Redis memory usage reaches high levels, ensuring proactive detection of backend bottlenecks
- Stress Test Alert: the monitoring stack successfully captures a P95 latency spike to 1.76 s during a high-concurrency user simulation
- Chaos Engineering: when the Flask app is detected as DOWN, Docker Swarm automatically triggers the self-healing process to restore service
- Grafana dashboard
- Metrics page
- JSON logs
Inspiration
In a world where digital services are the backbone of society, "it works on my machine" is no longer enough. We were inspired by the concept of Anti-fragility—the idea that a system shouldn't just withstand stress, but get better (or at least self-heal) because of it. We set out to build ProdBreaker, a fortress-like infrastructure where uptime is a guarantee, not a guess.
What it does
ProdBreaker is a web ecosystem that handles high-traffic data ingestion through a robust Flask backend, balances load across multiple service replicas, and features a proactive "Observer" layer that detects failures and heals the system in real time, without human intervention.
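A minimal sketch of the backend's health surface, assuming a Flask app (the `/health` route name is an illustrative assumption, not necessarily the actual ProdBreaker endpoint):

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health")
def health():
    # Liveness probe polled by the observer layer / orchestrator.
    # A non-200 response (or a timeout) marks this replica unhealthy
    # and lets Docker Swarm replace it automatically.
    return jsonify(status="ok"), 200
```

An endpoint like this is what lets the "Observer" layer distinguish a live replica from a dead one without human intervention.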
How we built it
We built a multi-layered "Defense-in-Depth" architecture:
- Orchestration: Leveraged Docker Swarm to manage 3 healthy replicas of our web service.
- Traffic Management: Configured an Nginx Reverse Proxy to act as an intelligent load balancer.
- Data Guardrails: Engineered a fault-tolerant CSV ingestion pipeline to prevent malformed data from causing system-wide crashes.
- Observability Stack: Integrated Prometheus for metric scraping, Grafana for visualization, and Alertmanager for real-time Discord notifications.
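The "Data Guardrails" idea above can be sketched as a tolerant CSV reader that quarantines bad rows instead of crashing; function and field names here are illustrative assumptions, not the actual pipeline:

```python
import csv
import io

def ingest_csv(raw_text, required_fields=("id", "value")):
    """Parse CSV text, returning (good_rows, bad_rows).

    Malformed or incomplete rows are collected with their line number
    rather than raised, so one corrupt record cannot take down the
    whole ingestion pipeline.
    """
    good, bad = [], []
    reader = csv.DictReader(io.StringIO(raw_text))
    for lineno, row in enumerate(reader, start=2):  # header is line 1
        try:
            # Reject rows missing a required field.
            if any(row.get(f) in (None, "") for f in required_fields):
                raise ValueError("missing required field")
            # DictReader files extra columns under the key None.
            if None in row:
                raise ValueError("too many columns")
            row["value"] = float(row["value"])  # type guardrail
            good.append(row)
        except (ValueError, TypeError) as exc:
            bad.append({"line": lineno, "error": str(exc)})
    return good, bad
```

Downstream code then processes `good` normally and logs or alerts on `bad`, which is the behavior the stress and chaos tests rely on.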
Challenges we ran into
- The Integration Maze: Merging local development environments with a production-grade Swarm cluster led to complex networking and configuration conflicts that required disciplined version control.
- Failure Prediction: It’s easy to monitor "if" something is down; it’s hard to define "why." We had to conduct a deep Failure Mode and Effects Analysis (FMEA) to map out recovery paths for edge cases like database connection timeouts and memory leaks.
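For connection-timeout edge cases like those mapped in the FMEA, one common recovery path is retry with exponential backoff. A minimal sketch, where the `connect` callable and the limits are illustrative assumptions:

```python
import time

def connect_with_retry(connect, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call `connect()` up to `attempts` times, doubling the delay each try.

    Re-raises the last error if every attempt fails, so the caller can
    escalate (e.g. fire an Alertmanager -> Discord notification) instead
    of hanging forever on a dead database.
    """
    last_err = None
    for attempt in range(attempts):
        try:
            return connect()
        except (ConnectionError, TimeoutError) as exc:
            last_err = exc
            sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
    raise last_err
```

The injectable `sleep` parameter keeps the backoff testable without real delays.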
Accomplishments that we're proud of
- Gold Tier Achievement: Successfully met the rigorous requirements for the Scalability and Incident Response quests.
- Validated Self-Healing: We successfully simulated "Chaos Mode" where the system detected its own death and resurrected itself while notifying the team via Discord.
- Robustness: Our backend survived intentionally corrupted data inputs that would have crashed a standard implementation.
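The self-healing behavior validated in "Chaos Mode" typically comes from a Swarm service healthcheck combined with a restart policy. A hedged docker-compose sketch, where the service name, image, port, and thresholds are assumptions:

```yaml
services:
  web:
    image: prodbreaker-web:latest   # illustrative image name
    deploy:
      replicas: 3
      restart_policy:
        condition: on-failure       # Swarm reschedules a dead replica
    healthcheck:
      test: ["CMD-SHELL", "curl -fsS http://localhost:5000/health || exit 1"]
      interval: 10s
      timeout: 3s
      retries: 3                    # three failures => container marked unhealthy
```

With a config along these lines, killing a container makes Swarm detect the failure and start a replacement, while Alertmanager handles the Discord notification in parallel.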
What we learned
We learned that Reliability is a culture, not just a config file. We realized that writing a Failure Manual and documenting recovery procedures is just as critical to a project's success as the code itself. High availability is only possible when you "design for failure."
What's next for ProdBreaker
We plan to adopt Kubernetes for more granular auto-scaling and explore canary deployments so that even our updates don't compromise the rock-solid stability we've built.
Built With
- alertmanager
- discordwebhooks
- docker
- flask
- grafana
- gunicorn
- javascript
- k6
- nginx
- peewee
- postgresql
- prometheus
- pytest
- python
- redis7
- uv