Inspiration

In the real world, code that works on your laptop often collapses under production load. We wanted to build something that doesn't just function — it survives. The MLH Production Engineering Hackathon challenged us to think like SREs: what happens when 500 users hit your service at once? What happens at 3 AM when the app crashes? Kadi is our answer.

What it does

Kadi is a production-grade URL shortener. You give it a long URL, it gives you a short code. Click the short code, get redirected instantly. Under the hood it handles user management, click analytics, and event tracking.
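Kadi's exact scheme isn't reproduced here, but a typical short-code generator is just a random draw over a base62 alphabet. A minimal sketch (generate_code and the 7-character default are illustrative assumptions, not Kadi's actual parameters):

```python
import secrets

# Base62 alphabet: the usual URL-safe choice for short codes.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def generate_code(length: int = 7) -> str:
    """Generate a cryptographically random base62 short code."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

Seven base62 characters give 62^7 (about 3.5 trillion) possible codes, so collisions stay rare; a real service would still retry on a unique-constraint violation at insert time.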

How we built it

  • Flask + Peewee ORM for the API layer
  • PostgreSQL for persistent storage
  • Redis for caching hot redirects (eliminating DB reads on repeat clicks)
  • Nginx as a reverse proxy load-balancing across 3 app instances
  • Docker Compose to orchestrate the full stack
  • Prometheus + Grafana for metrics and dashboards
  • Alertmanager + Discord for real-time alerting
  • pytest + GitHub Actions for CI with 76% code coverage
  • k6 for load testing up to 500 concurrent users
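The Redis layer above follows the classic cache-aside pattern: check the cache, fall back to Postgres on a miss, then populate the cache so repeat clicks never touch the database. A minimal sketch, assuming a redis-py-style client (get/setex) and an injected DB loader; the function and parameter names are our own, not Kadi's:

```python
def resolve_short_code(code, cache, load_from_db, ttl_seconds=3600):
    """Cache-aside lookup for a redirect target.

    cache        -- redis-py-style client exposing get() and setex()
    load_from_db -- callable that queries Postgres (e.g. a Peewee query)
    """
    cached = cache.get(code)
    if cached is not None:
        # Hot path: repeat clicks are served straight from Redis.
        return cached.decode() if isinstance(cached, bytes) else cached
    target = load_from_db(code)
    if target is not None:
        # Populate the cache with a TTL so entries age out on their own.
        cache.setex(code, ttl_seconds, target)
    return target
```

The TTL matters: it bounds how stale a cached redirect can get, and deactivating a URL should also delete its cache key explicitly (the invalidation issue described under Challenges below).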

Challenges we ran into

  • Docker restart policy — docker kill sends SIGKILL, which Docker treats as a manual stop, so the restart policy doesn't fire. We had to simulate real crashes with kill -TERM 1 inside the container to demonstrate auto-recovery.
  • Alertmanager Discord integration — The native discord_configs in Alertmanager v0.31 had breaking changes. We built a lightweight Flask bridge that converts Alertmanager webhooks to Discord's format.
  • PostgreSQL sequence drift — After bulk-loading CSV seed data with explicit IDs, the auto-increment sequence was out of sync. New inserts collided with existing IDs until we reset the sequences post-import.
  • Redis caching with inactive URLs — Had to ensure the cache is invalidated when a URL is deactivated, so stale cached entries don't serve redirects for dormant routes.
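Our actual Alertmanager-to-Discord bridge isn't shown here, but the core of it is a small payload translation. A sketch of that translation, assuming the standard Alertmanager webhook schema (status, labels.alertname, annotations.summary per alert) and Discord's simple {"content": ...} webhook message format; a thin Flask route would wrap this and POST the result to the Discord webhook URL:

```python
def alertmanager_to_discord(payload: dict) -> dict:
    """Translate an Alertmanager webhook payload into a Discord message body."""
    lines = []
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname", "unknown")
        summary = alert.get("annotations", {}).get("summary", "")
        # Firing alerts get a fire emoji; resolved ones a checkmark.
        emoji = "🔥" if alert.get("status") == "firing" else "✅"
        lines.append(f"{emoji} **{name}**: {summary}")
    return {"content": "\n".join(lines) or "(empty alert batch)"}
```

Keeping the translation as a pure function made it trivial to unit-test without standing up either Alertmanager or Discord.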

Accomplishments that we're proud of

  • 29/29 automated tests passing, including all hidden edge-case challenges
  • 0% error rate at 500 concurrent users with 143 req/sec throughput
  • Full observability stack — metrics, structured JSON logs, alerting, and a live Grafana dashboard all wired together
  • Sub-40ms p95 latency at 50 users — the Redis cache makes redirects nearly instant after the first hit
  • Complete documentation: API reference, runbook, decision log, capacity plan, failure modes, and deploy guide

What we learned

  • Production engineering is less about writing code and more about anticipating failure
  • Caching is the single highest-leverage optimization — Redis cut our DB load by ~50% under heavy traffic
  • Observability isn't optional — without Prometheus and Grafana we would have been flying blind during load tests
  • Docker restart policies only trigger on non-zero exit codes from natural crashes, not from docker kill

What's next for Kadi

  • Custom short codes — let users choose their own slug
  • Link expiry — TTL-based deactivation
  • Dashboard UI — a frontend for managing links and viewing analytics
  • Async event logging — write click events to a Redis queue and flush to DB in background to reduce write latency under load
  • Auto-scaling — move to ECS or Kubernetes for dynamic horizontal scaling based on traffic
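The async event-logging idea above could look like the following in-process sketch. Everything here is illustrative: the class name, the batch size, and the plain Python list standing in for the Redis queue (in production the list would be a Redis LPUSH/BRPOP pair and the flusher a separate worker doing bulk inserts via Peewee):

```python
import json
import time

class ClickEventQueue:
    """Buffered click logging: enqueue fast on the request path,
    flush to the database in batches off the hot path."""

    def __init__(self, flush_batch, batch_size=100):
        self._queue = []                  # stand-in for a Redis list
        self._flush_batch = flush_batch   # e.g. a bulk INSERT callback
        self._batch_size = batch_size

    def record(self, short_code):
        # Hot path: append only, no synchronous DB write.
        self._queue.append(json.dumps({"code": short_code, "ts": time.time()}))
        if len(self._queue) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._queue:
            batch, self._queue = self._queue, []
            self._flush_batch([json.loads(e) for e in batch])
```

The trade-off is durability: events buffered but not yet flushed are lost on a crash, which is usually acceptable for click analytics but worth stating in a runbook.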
