MLH Production Engineering: URL Shortener

Inspiration

Building a production-grade system isn't just about code; it's about discipline, observability, and planning. We set out to prove that operational excellence is learnable and measurable by building across all MLH PE tracks: Can your service survive failure? Can it handle 500+ users? Can it auto-diagnose issues? Does it document itself?


What It Does

Production-grade URL Shortener API serving sub-100ms requests with full observability, automatic scaling, and incident response playbooks.

  • User Management — CSV bulk import, metadata tracking
  • URL Shortening — Base62 codes, collision detection, inactive-URL (soft-delete) support
  • Event Audit Trail — Full tracking of creates, updates, deletes
  • Multi-tier Caching — Redis with LFU eviction and circuit breaker; graceful fallback to PostgreSQL
  • Distributed Tracing — OpenTelemetry + Jaeger for end-to-end request visibility
  • Health Endpoints — Liveness + readiness (Kubernetes-compatible)
  • Prometheus Metrics — Request counts, latency histogram, memory/CPU
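The Base62 short codes from the feature list above can be sketched as follows. This is a minimal illustration: the hash-seeded counter and retry-on-collision strategy here are assumptions, not the project's actual implementation.

```python
import string

# 0-9, a-z, A-Z: 62 URL-safe symbols, no percent-encoding needed
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def shorten(url: str, existing: set[str]) -> str:
    """Derive a candidate code from the URL hash, perturbing on collision."""
    seed = hash(url) & 0x7FFFFFFF
    code = base62_encode(seed)[:7]
    while code in existing:   # collision detected: perturb the seed and retry
        seed += 1
        code = base62_encode(seed)[:7]
    existing.add(code)
    return code
```

In the real service the `existing` check would be a unique-index lookup against PostgreSQL rather than an in-memory set.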

Scale Targets: Bronze: 50 users. Silver: 200+ users. Gold: 500+ users with <5% errors.


How We Built It

Architecture: Nginx proxy → 2–5 Flask/Gunicorn replicas → PostgreSQL + Redis

Tech Stack:

  • Python 3.13, Flask 3.1, Peewee ORM, Gunicorn 23 (gthread), uv package manager
  • PostgreSQL 16 with 9 indexes, Redis 7 with LFU eviction
  • Docker Compose with custom autoscaler sidecar (2–5 replicas, CPU-based)
  • Prometheus + Grafana (8-panel dashboard) + Loki 3.4.2 (logs) + Alertmanager (SMTP + Discord) + Jaeger 2.16.0 (traces)
  • GitHub Actions CI/CD (lint → test 91% coverage → build → Trivy scan → load test → deploy)
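The autoscaler sidecar's core decision can be sketched as a pure function. The 60 s / 120 s cooldowns and 2–5 replica bounds come from the writeup; the CPU thresholds and the function's signature are illustrative assumptions.

```python
MIN_REPLICAS, MAX_REPLICAS = 2, 5
SCALE_UP_CPU, SCALE_DOWN_CPU = 70.0, 30.0    # assumed thresholds (percent)
UP_COOLDOWN, DOWN_COOLDOWN = 60.0, 120.0     # seconds, per the writeup

def decide_replicas(avg_cpu: float, current: int,
                    last_scale_ts: float, now: float) -> int:
    """Return the target replica count given average CPU across replicas.

    Scale-up is allowed after a 60 s cooldown, scale-down only after 120 s,
    so the fleet grows quickly under load but shrinks conservatively.
    """
    elapsed = now - last_scale_ts
    if avg_cpu > SCALE_UP_CPU and elapsed >= UP_COOLDOWN:
        return min(current + 1, MAX_REPLICAS)
    if avg_cpu < SCALE_DOWN_CPU and elapsed >= DOWN_COOLDOWN:
        return max(current - 1, MIN_REPLICAS)
    return current
```

The sidecar would poll container CPU stats on an interval and apply the result with something like `docker compose up --scale app=N`.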

Key Implementation: Circuit breaker detects Redis failure in 0.5s, disables the cache for 30s, and falls back to direct DB queries. Zero user-facing errors; latency spiked 6.5× but stayed within SLO.
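The circuit-breaker behavior described above can be sketched like this. The 30 s open window mirrors the writeup; the single-failure trip threshold and the callable interface are illustrative assumptions.

```python
import time

class RedisCircuitBreaker:
    """Trip open on a Redis failure and route reads to the DB while open.

    After the open window elapses the breaker half-opens and the next call
    tries Redis again. One failure re-trips it.
    """
    def __init__(self, open_seconds: float = 30.0):
        self.open_seconds = open_seconds
        self.opened_at = None

    @property
    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.open_seconds:
            self.opened_at = None          # half-open: allow a trial call
            return False
        return True

    def call(self, redis_fn, db_fallback):
        if self.is_open:
            return db_fallback()           # cache disabled: direct DB query
        try:
            return redis_fn()
        except ConnectionError:
            self.opened_at = time.monotonic()   # trip the breaker
            return db_fallback()
```

This is why a Redis outage degrades latency instead of availability: every code path still returns a correct answer, just from the slower store.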


Challenges We Solved

  1. Broken Alert Notifications — Alertmanager webhook pointed to itself. Fixed: Resend SMTP with severity routing (critical: instant, warnings: batched 10s) plus Discord webhook for real-time paging.

  2. Connection Pool Exhaustion — 500 concurrent users exhausted PostgreSQL's connection limit. Solution: Peewee connection pooling (40 max), stale timeout 300s, 9 indexes on frequently filtered columns.
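The pooling fix above maps to Peewee's `playhouse.pool` extension roughly as follows. The `max_connections` and `stale_timeout` values match the writeup; database name, host, and credentials are placeholders.

```python
# Pooled PostgreSQL connection for Peewee (playhouse extension).
from playhouse.pool import PooledPostgresqlDatabase

db = PooledPostgresqlDatabase(
    "shortener",            # placeholder database name
    user="app",
    password="secret",
    host="postgres",
    max_connections=40,     # hard ceiling: excess requests wait instead of
                            # exhausting the server's connection limit
    stale_timeout=300,      # recycle connections idle longer than 5 minutes
)
```

With gthread Gunicorn workers, each thread checks a connection out of this pool per request and returns it on close, so 500 concurrent users share 40 connections instead of opening 500.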

  3. Cache Invalidation — Switched from TTL-based expiry to LFU eviction without TTL: hot URLs stay cached indefinitely, one-off hits decay naturally. Explicit pattern-based deletion on write guarantees read consistency and avoids thundering-herd storms.
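The cache-aside read path with eager invalidation on write can be sketched like this. A dict stands in for Redis here (an assumption for illustration); in production the same pattern uses GET/SET on reads and DELETE (or SCAN plus DELETE for key patterns) on writes.

```python
class CacheAsideStore:
    """Cache-aside reads with explicit invalidation on write.

    No TTL is set: LFU eviction handles capacity, so hot keys live
    indefinitely, and writes delete their keys eagerly so a read after
    a write can never observe a stale entry.
    """
    def __init__(self, db: dict):
        self.db = db         # stand-in for PostgreSQL
        self.cache = {}      # stand-in for Redis

    def get_url(self, code: str):
        if code in self.cache:
            return self.cache[code]        # cache hit
        value = self.db.get(code)          # cache miss: fall through to DB
        if value is not None:
            self.cache[code] = value       # populate without a TTL
        return value

    def update_url(self, code: str, target: str):
        self.db[code] = target             # write through to the DB
        self.cache.pop(code, None)         # invalidate eagerly
```

Because invalidation is tied to the write rather than a clock, there is no moment where a large batch of keys expires at once and stampedes the database.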

  4. Stale Dashboards — Manual refresh showed 15-minute-old data. Fixed: Auto-refresh every 10s + alert annotations (vertical lines on metric graphs).

  5. RCA Narrative — Judges needed to see how to diagnose, not just what happened. We created a step-by-step Grafana walkthrough: latency spiked 6.5×, error rate stayed at 0% (circuit breaker), and the Loki query {job="app"} |= "Redis unavailable" confirmed the fallback path.


What We're Proud Of

Track 1 (Reliability): 91% test coverage (177 tests, Gold threshold is 70%), CI/CD pipeline blocks merges on failures, chaos testing (kill container → auto-recover), N+1 query regression detection, Trivy CVE scanning.

Track 2 (Scalability): 5 k6 load test scripts (stress/soak/sustained/breakpoint/smoke); 500-VU stress test: 482 req/s, 0.00% errors, p95=42ms; multi-replica Nginx least-connections load balancing; Redis LFU cache (6× latency improvement on cache hits); custom Docker autoscaler sidecar (60s scale-up / 120s scale-down cooldown).

Track 3 (Incident Response): Structured JSON logging, 7 Prometheus alert rules (all fire <90s), SMTP + Discord notifications, 8-panel Grafana dashboard, OpenTelemetry + Jaeger distributed tracing, chaos-test.sh script, RCA-001 with Grafana panel walkthroughs, postmortem template (Google SRE format), 20-section incident playbook.
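The structured JSON logging from Track 3 can be sketched with the standard library alone. This is a minimal formatter; the exact field names are assumptions, not the project's schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so Loki/LogQL can filter on fields."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("Redis unavailable, falling back to PostgreSQL")
```

Machine-parseable lines are what make queries like {job="app"} |= "Redis unavailable" reliable: the string match hits a stable message field instead of free-form prose.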

Cross-track: Auto-scaling (2–5 replicas, CPU-based), GitHub Pages documentation site, OpenAPI contract, 21+ docs (architecture, runbooks, RCA, capacity plan, decision log).


What We Learned

  1. Observability is non-negotiable — We couldn't optimize latency until we saw which queries were slow. Structured logging + Prometheus + Jaeger traces = clarity.

  2. Graceful degradation beats hard failures — Circuit breaker meant Redis failure = slower (not broken). That's good design.

  3. Alert on business impact, not metrics — "CPU > 50%" fired constantly. "Error rate > 1%" actually mattered. SLOs are your north star.

  4. Documentation multiplies your team — Runbook meant anyone could diagnose ServiceDown in 5min, not 30. Write for your 3 AM self.

  5. Load test early — Hitting 500 concurrent users revealed connection pool bugs we'd never catch in code review.


Built With

  • Backend: Python 3.13, Flask 3.1, Peewee ORM, Gunicorn 23, uv
  • Database: PostgreSQL 16, Redis 7
  • Proxy: Nginx (rate limiting, gzip, TLS)
  • Observability: Prometheus, Grafana, Loki 3.4.2, Alertmanager, Jaeger 2.16.0, OpenTelemetry
  • Orchestration: Docker Compose with custom autoscaler sidecar
  • Testing: pytest (91% coverage), GitHub Actions CI/CD, k6 load tests (5 scripts)

Try It

docker compose up -d --build
curl -X POST http://localhost/users \
  -H "Content-Type: application/json" \
  -d '{"username": "alice", "email": "alice@example.com"}'

# Monitor: Grafana (localhost:3000), Prometheus (localhost:9090), Alertmanager (localhost:9093), Jaeger (localhost:16686)
./scripts/chaos-test.sh --service-down  # See alerts fire in real-time

One-Liner

A production-grade URL shortener built to win MLH PE by demonstrating reliability (91% test coverage + CI/CD), scalability (auto-scaling, LFU caching, 482 req/s at zero errors), and operational excellence (7 alerts, distributed tracing, dashboards, runbooks, incident playbooks).


Docs: See docs/Incident Response/README.md for incident response track details, docs/ARCHITECTURE.md for architecture, docs/DEPLOYMENT.md for setup, or the full GitHub Pages site.
