MLH Production Engineering: URL Shortener
Inspiration
Building a production-grade system isn't just about code: it's about discipline, observability, and planning. We set out to prove that operational excellence is learnable and measurable by building across all MLH PE tracks. Can your service survive failure? Handle 500+ users? Auto-diagnose issues? Document itself?
What It Does
Production-grade URL Shortener API serving sub-100ms requests with full observability, automatic scaling, and incident response playbooks.
- User Management — CSV bulk import, metadata tracking
- URL Shortening — Base62 short codes, collision detection, support for deactivating links
- Event Audit Trail — Full tracking of creates, updates, deletes
- Multi-tier Caching — Redis with LFU eviction and circuit breaker; graceful fallback to PostgreSQL
- Distributed Tracing — OpenTelemetry + Jaeger for end-to-end request visibility
- Health Endpoints — Liveness + readiness (Kubernetes-compatible)
- Prometheus Metrics — Request counts, latency histogram, memory/CPU
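The Base62 shortening and collision handling listed above can be sketched in a few lines. This is an illustrative encoder with a hypothetical `next_free_code` collision check, not the project's exact implementation:

```python
import string

# 62-character alphabet: 0-9, a-z, A-Z
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def next_free_code(seed: int, taken: set[str]) -> str:
    """Walk forward from `seed` until an unused code is found (collision detection)."""
    code = base62_encode(seed)
    while code in taken:
        seed += 1
        code = base62_encode(seed)
    return code
```

A 7-character Base62 code covers 62^7 ≈ 3.5 trillion URLs, which is why collisions are rare but still worth detecting.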
Scale Targets: Bronze: 50 users. Silver: 200+ users. Gold: 500+ users with <5% errors.
How We Built It
Architecture: Nginx proxy → 2–5 Flask/Gunicorn replicas → PostgreSQL + Redis
Tech Stack:
- Python 3.13, Flask 3.1, Peewee ORM, Gunicorn 23 (gthread), uv package manager
- PostgreSQL 16 with 9 indexes, Redis 7 with LFU eviction
- Docker Compose with custom autoscaler sidecar (2–5 replicas, CPU-based)
- Prometheus + Grafana (8-panel dashboard) + Loki 3.4.2 (logs) + Alertmanager (SMTP + Discord) + Jaeger 2.16.0 (traces)
- GitHub Actions CI/CD (lint → test 91% coverage → build → Trivy scan → load test → deploy)
Key Implementation: The circuit breaker detects a Redis failure in 0.5s, disables the cache for 30s, and falls back to direct DB queries. Zero user-facing errors; latency spiked 6.5× but stayed within SLO.
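A minimal sketch of that circuit-breaker fallback, assuming `cache_get` and `db_get` stand in for the real Redis and PostgreSQL calls; the 30s cool-down mirrors the number above:

```python
import time

class CircuitBreaker:
    """Trip on a cache failure; route reads to the DB until the cool-down elapses."""
    def __init__(self, cooldown_s: float = 30.0):
        self.cooldown_s = cooldown_s
        self.open_until = 0.0  # breaker is open (cache disabled) until this time

    @property
    def is_open(self) -> bool:
        return time.monotonic() < self.open_until

    def trip(self) -> None:
        self.open_until = time.monotonic() + self.cooldown_s

def get_url(code: str, cache_get, db_get, breaker: CircuitBreaker) -> str:
    if not breaker.is_open:
        try:
            value = cache_get(code)  # the real service uses a 0.5s Redis timeout here
            if value is not None:
                return value
        except ConnectionError:
            breaker.trip()  # disable the cache for the cool-down window
    return db_get(code)  # graceful fallback: slower, but no user-facing error
```

While the breaker is open, requests skip Redis entirely instead of paying the timeout on every call, which is what kept the error rate at zero during the failure.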
Challenges We Solved
Broken Alert Notifications — Alertmanager webhook pointed to itself. Fixed: Resend SMTP with severity routing (critical: instant, warnings: batched 10s) plus Discord webhook for real-time paging.
Connection Pool Exhaustion — 500 concurrent users exhausted PostgreSQL's connections. Solution: Peewee connection pooling (40 max), a 300s stale timeout, and 9 indexes on frequently filtered columns.
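The fix above used Peewee's built-in pooling; as a rough stdlib-only illustration of the pattern (bounded connections plus a stale timeout), not the actual Peewee code:

```python
import queue
import time

class ConnectionPool:
    """Bounded pool with a stale timeout, mirroring max_connections=40 / stale_timeout=300."""
    def __init__(self, factory, max_connections: int = 40, stale_timeout: float = 300.0):
        self.factory = factory
        self.stale_timeout = stale_timeout
        self.idle = queue.Queue()
        self.slots = queue.Queue()      # one token per allowed connection
        for _ in range(max_connections):
            self.slots.put(None)

    def acquire(self):
        self.slots.get(timeout=5)       # blocks when all slots are in use
        while True:
            try:
                conn, returned_at = self.idle.get_nowait()
            except queue.Empty:
                return self.factory()   # no idle connection: open a new one
            if time.monotonic() - returned_at < self.stale_timeout:
                return conn             # reuse a fresh idle connection
            conn.close()                # drop stale connections instead of reusing them

    def release(self, conn):
        self.idle.put((conn, time.monotonic()))
        self.slots.put(None)
```

The slot queue is what prevents exhaustion: the 41st concurrent request waits instead of opening a connection PostgreSQL will refuse.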
Cache Invalidation — Switched from TTL-based expiry to LFU eviction without TTL: hot URLs stay cached indefinitely, one-off hits decay naturally. Explicit pattern-based deletion on write guarantees read consistency and avoids thundering-herd storms.
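The delete-on-write rule above can be sketched with a dict standing in for Redis (which handles LFU eviction itself); `fnmatch` approximates Redis's glob-style key matching, and the key names here are illustrative:

```python
from fnmatch import fnmatch

cache = {}   # stand-in for Redis
db = {"u1": "https://example.com/old"}   # stand-in for PostgreSQL

def invalidate(pattern: str) -> None:
    """Pattern-based deletion: drop every cached key matching the glob pattern."""
    for key in [k for k in cache if fnmatch(k, pattern)]:
        del cache[key]

def update_url(code: str, new_target: str) -> None:
    db[code] = new_target
    invalidate(f"url:{code}*")   # delete before anyone can read the stale entry

def get_url(code: str) -> str:
    key = f"url:{code}"
    if key not in cache:
        cache[key] = db[code]    # cache miss: repopulate from the database
    return cache[key]
```

Because writes delete rather than update the cached entry, the next read repopulates from the database, so reads can never observe a stale value.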
Stale Dashboards — Manual refresh showed 15m old data. Fixed: Auto-refresh every 10s + alert annotations (vertical lines on metric graphs).
RCA Narrative — Judges needed to see how we diagnose, not just what happened. Created a step-by-step Grafana walkthrough: latency spiked 6.5×, error rate stayed at 0% (circuit breaker), and the Loki query `{job="app"} |= "Redis unavailable"` confirmed the fallback path.
What We're Proud Of
Track 1 (Reliability): 91% test coverage (177 tests, Gold threshold is 70%), CI/CD pipeline blocks merges on failures, chaos testing (kill container → auto-recover), N+1 query regression detection, Trivy CVE scanning.
Track 2 (Scalability): 5 k6 load test scripts (stress/soak/sustained/breakpoint/smoke); 500-VU stress test: 482 req/s, 0.00% errors, p95=42ms; multi-replica Nginx least-connections load balancing; Redis LFU cache (6× latency improvement on cache hits); custom Docker autoscaler sidecar (60s scale-up / 120s scale-down cooldown).
Track 3 (Incident Response): Structured JSON logging, 7 Prometheus alert rules (all fire <90s), SMTP + Discord notifications, 8-panel Grafana dashboard, OpenTelemetry + Jaeger distributed tracing, chaos-test.sh script, RCA-001 with Grafana panel walkthroughs, postmortem template (Google SRE format), 20-section incident playbook.
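The structured JSON logging mentioned in Track 3 needs nothing beyond the stdlib; a minimal formatter sketch (the field names here are our choice, not necessarily the project's):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so Loki/LogQL can filter on fields."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        return json.dumps(payload)

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Redis unavailable, falling back to PostgreSQL")
```

One JSON object per line is what makes queries like the RCA's `{job="app"} |= "Redis unavailable"` cheap to run in Loki.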
Cross-track: Auto-scaling (2–5 replicas, CPU-based), GitHub Pages documentation site, OpenAPI contract, 21+ docs (architecture, runbooks, RCA, capacity plan, decision log).
What We Learned
Observability is non-negotiable — We couldn't optimize latency until we saw which queries were slow. Structured logging + Prometheus + Jaeger traces = clarity.
Graceful degradation beats hard failures — Circuit breaker meant Redis failure = slower (not broken). That's good design.
Alert on business impact, not metrics — "CPU > 50%" fired constantly. "Error rate > 1%" actually mattered. SLOs are your north star.
Documentation multiplies your team — A runbook meant anyone could diagnose a ServiceDown alert in 5 minutes instead of 30. Write for your 3 AM self.
Load test early — Hitting 500 concurrent users revealed connection pool bugs we'd never catch in code review.
Built With
- Backend: Python 3.13, Flask 3.1, Peewee ORM, Gunicorn 23, uv
- Database: PostgreSQL 16, Redis 7
- Proxy: Nginx (rate limiting, gzip, TLS)
- Observability: Prometheus, Grafana, Loki 3.4.2, Alertmanager, Jaeger 2.16.0, OpenTelemetry
- Orchestration: Docker Compose with custom autoscaler sidecar
- Testing: pytest (91% coverage), GitHub Actions CI/CD, k6 load tests (5 scripts)
Try It
```bash
docker compose up -d --build

curl -X POST http://localhost/users \
  -H "Content-Type: application/json" \
  -d '{"username": "alice", "email": "alice@example.com"}'

# Monitor: Grafana (localhost:3000) → Prometheus (9090) → Alertmanager (9093) → Jaeger (localhost:16686)
./scripts/chaos-test.sh --service-down  # See alerts fire in real time
```
One-Liner
A production-grade URL shortener built to win MLH PE by demonstrating reliability (91% test coverage + CI/CD), scalability (auto-scaling, LFU caching, 482 req/s at zero errors), and operational excellence (7 alerts, distributed tracing, dashboards, runbooks, incident playbooks).
Docs: See docs/Incident Response/README.md for incident response track details, docs/ARCHITECTURE.md for architecture, docs/DEPLOYMENT.md for setup, or the full GitHub Pages site.