Inspiration
In the real world, code that works on your laptop often collapses under production load. We wanted to build something that doesn't just function — it survives. The MLH Production Engineering Hackathon challenged us to think like SREs: what happens when 500 users hit it at once? What happens at 3 AM when the app crashes? Kadi is our answer.
What it does
Kadi is a production-grade URL shortener. You give it a long URL, it gives you a short code. Click the short code, get redirected instantly. Under the hood it handles user management, click analytics, and event tracking.
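The write-up doesn't say how Kadi derives its short codes; a common scheme, sketched here purely as an assumption, is base62-encoding the database row ID so every URL gets a compact, unique slug:

```python
import string

# 0-9, a-z, A-Z: 62 symbols, so codes stay short even for large IDs.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode(n: int) -> str:
    """Encode a database row ID as a short base62 code."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def decode(code: str) -> int:
    """Invert encode: map a short code back to the row ID."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

Because the code is just the primary key in another base, the redirect handler can decode it and do a single indexed lookup.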
How we built it
- Flask + Peewee ORM for the API layer
- PostgreSQL for persistent storage
- Redis for caching hot redirects (eliminating DB reads on repeat clicks)
- Nginx as a reverse proxy load-balancing across 3 app instances
- Docker Compose to orchestrate the full stack
- Prometheus + Grafana for metrics and dashboards
- Alertmanager + Discord for real-time alerting
- pytest + GitHub Actions for CI with 76% code coverage
- k6 for load testing up to 500 concurrent users
Challenges we ran into
- Docker restart policy — `docker kill` sends SIGKILL, which Docker treats as a manual stop and doesn't trigger a restart. We had to simulate real crashes with `kill -TERM 1` inside the container to demonstrate auto-recovery.
- Alertmanager Discord integration — the native `discord_configs` in Alertmanager v0.31 had breaking changes, so we built a lightweight Flask bridge that converts Alertmanager webhooks to Discord's format.
- PostgreSQL sequence drift — after bulk-loading CSV seed data with explicit IDs, the auto-increment sequences were out of sync. New inserts collided with existing IDs until we reset the sequences post-import.
- Redis caching with inactive URLs — we had to invalidate the cache when a URL is deactivated, so stale cached entries don't serve redirects for dormant routes.
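The sequence-drift fix comes down to realigning each serial sequence with `setval` after the import. A sketch of the statement as a small helper; the table name `urls` and Peewee usage shown are assumptions, not Kadi's actual schema:

```python
def reset_sequence_sql(table: str, column: str = "id") -> str:
    """Build the PostgreSQL statement that realigns a serial sequence
    with the current MAX(id) after a bulk load that used explicit IDs."""
    return (
        f"SELECT setval(pg_get_serial_sequence('{table}', '{column}'), "
        f"COALESCE((SELECT MAX({column}) FROM {table}), 1));"
    )

# Assumed Peewee usage, one call per seeded table:
#   database.execute_sql(reset_sequence_sql("urls"))
```

`pg_get_serial_sequence` resolves the sequence that backs the column, so the same helper works for every seeded table without hard-coding sequence names.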
Accomplishments that we're proud of
- 29/29 automated tests passing, including all hidden edge-case challenges
- 0% error rate at 500 concurrent users with 143 req/sec throughput
- Full observability stack — metrics, structured JSON logs, alerting, and a live Grafana dashboard all wired together
- Sub-40ms p95 latency at 50 users — the Redis cache makes redirects nearly instant after the first hit
- Complete documentation: API reference, runbook, decision log, capacity plan, failure modes, and deploy guide
What we learned
- Production engineering is less about writing code and more about anticipating failure
- Caching is the single highest-leverage optimization — Redis cut our DB load by ~50% under heavy traffic
- Observability isn't optional — without Prometheus and Grafana we would have been flying blind during load tests
- Docker restart policies only trigger on non-zero exit codes from natural crashes, not from `docker kill`
What's next for Kadi
- Custom short codes — let users choose their own slug
- Link expiry — TTL-based deactivation
- Dashboard UI — a frontend for managing links and viewing analytics
- Async event logging — write click events to a Redis queue and flush to DB in background to reduce write latency under load
- Auto-scaling — move to ECS or Kubernetes for dynamic horizontal scaling based on traffic
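The async event logging idea above amounts to enqueueing click events on the hot path and bulk-writing them later. A minimal sketch using a stdlib queue as a stand-in for the Redis queue (the real design would push to Redis and bulk-insert into PostgreSQL; all names here are illustrative):

```python
import queue

events: queue.Queue = queue.Queue()
flushed: list = []  # stands in for batched DB writes

def record_click(code: str) -> None:
    """Hot path: enqueue the click event instead of writing to the DB inline."""
    events.put({"code": code})

def flush_once(batch_size: int = 100) -> list:
    """Drain up to batch_size queued events and write them as one batch.
    A background worker (e.g. a daemon thread) would call this on a timer."""
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(events.get_nowait())
        except queue.Empty:
            break
    if batch:
        flushed.append(batch)  # real code: one bulk INSERT per batch
    return batch
```

The redirect handler only pays the cost of a queue push; the expensive write amortizes across the whole batch in the background.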
Built With
- amazon-web-services
- docker
- flask
- grafana
- prometheus
- redis