MLH Production Engineering: URL Shortener
Inspiration
Building a production-grade system isn't just about code: it's about discipline, observability, and planning. We set out to prove that operational excellence is learnable and measurable by building across all MLH PE tracks. Can your service survive failure? Handle 500+ users? Auto-diagnose issues? Document itself?
What It Does
Production-grade URL Shortener API serving sub-100ms requests with full observability, automatic scaling, and incident response playbooks.
- User Management — CSV bulk import, metadata tracking
- URL Shortening — Base62 short codes, collision detection, support for deactivating links
- Event Audit Trail — Full tracking of creates, updates, deletes
- Multi-tier Caching — Redis with LFU eviction and circuit breaker; graceful fallback to PostgreSQL
- Distributed Tracing — OpenTelemetry + Jaeger for end-to-end request visibility
- Health Endpoints — Liveness + readiness (Kubernetes-compatible)
- Prometheus Metrics — Request counts, latency histogram, memory/CPU
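The Base62 shortening and collision handling listed above can be sketched in a few lines. This is an illustrative encoder with a hypothetical `next_free_code` collision check, not the project's exact implementation:

```python
import string

# 62-character alphabet: 0-9, a-z, A-Z
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n: int) -> str:
    """Encode a non-negative integer as a Base62 string."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def next_free_code(seed: int, taken: set[str]) -> str:
    """Walk forward from `seed` until an unused code is found (collision detection)."""
    code = base62_encode(seed)
    while code in taken:
        seed += 1
        code = base62_encode(seed)
    return code
```

A 7-character Base62 code covers 62^7 ≈ 3.5 trillion URLs, which is why collisions are rare but still worth detecting.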
Scale Targets: Bronze: 50 users. Silver: 200+ users. Gold: 500+ users with <5% errors.
How We Built It
Architecture: Nginx proxy → 2–5 Flask/Gunicorn replicas → PostgreSQL + Redis
Tech Stack:
- Python 3.13, Flask 3.1, Peewee ORM, Gunicorn 23 (gthread), uv package manager
- PostgreSQL 16 with 9 indexes, Redis 7 with LFU eviction
- Docker Compose with custom autoscaler sidecar (2–5 replicas, CPU-based)
- Prometheus + Grafana (8-panel dashboard) + Loki 3.4.2 (logs) + Alertmanager (SMTP + Discord) + Jaeger 2.16.0 (traces)
- GitHub Actions CI/CD (lint → test 91% coverage → build → Trivy scan → load test → deploy)
Key Implementation: The circuit breaker detects a Redis failure in 0.5s, disables the cache for 30s, and falls back to direct DB queries. Zero user-facing errors; latency spiked 6.5× but stayed within SLO.
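A minimal sketch of that circuit-breaker fallback, assuming `cache_get` and `db_get` stand in for the real Redis and PostgreSQL calls; the 30s cool-down mirrors the number above:

```python
import time

class CircuitBreaker:
    """Trip on a cache failure; route reads to the DB until the cool-down elapses."""
    def __init__(self, cooldown_s: float = 30.0):
        self.cooldown_s = cooldown_s
        self.open_until = 0.0  # breaker is open (cache disabled) until this time

    @property
    def is_open(self) -> bool:
        return time.monotonic() < self.open_until

    def trip(self) -> None:
        self.open_until = time.monotonic() + self.cooldown_s

def get_url(code: str, cache_get, db_get, breaker: CircuitBreaker) -> str:
    if not breaker.is_open:
        try:
            value = cache_get(code)  # the real service uses a 0.5s Redis timeout here
            if value is not None:
                return value
        except ConnectionError:
            breaker.trip()  # disable the cache for the cool-down window
    return db_get(code)  # graceful fallback: slower, but no user-facing error
```

While the breaker is open, requests skip Redis entirely instead of paying the timeout on every call, which is what kept the error rate at zero during the failure.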
Challenges We Solved
Broken Alert Notifications — Alertmanager webhook pointed to itself. Fixed: Resend SMTP with severity routing (critical: instant, warnings: batched 10s) plus Discord webhook for real-time paging.
Connection Pool Exhaustion — 500 concurrent users exhausted PostgreSQL's connections. Solution: Peewee connection pooling (40 max), a 300s stale timeout, and 9 indexes on frequently filtered columns.
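The fix above used Peewee's built-in pooling; as a rough stdlib-only illustration of the pattern (bounded connections plus a stale timeout), not the actual Peewee code:

```python
import queue
import time

class ConnectionPool:
    """Bounded pool with a stale timeout, mirroring max_connections=40 / stale_timeout=300."""
    def __init__(self, factory, max_connections: int = 40, stale_timeout: float = 300.0):
        self.factory = factory
        self.stale_timeout = stale_timeout
        self.idle = queue.Queue()
        self.slots = queue.Queue()      # one token per allowed connection
        for _ in range(max_connections):
            self.slots.put(None)

    def acquire(self):
        self.slots.get(timeout=5)       # blocks when all slots are in use
        while True:
            try:
                conn, returned_at = self.idle.get_nowait()
            except queue.Empty:
                return self.factory()   # no idle connection: open a new one
            if time.monotonic() - returned_at < self.stale_timeout:
                return conn             # reuse a fresh idle connection
            conn.close()                # drop stale connections instead of reusing them

    def release(self, conn):
        self.idle.put((conn, time.monotonic()))
        self.slots.put(None)
```

The slot queue is what prevents exhaustion: the 41st concurrent request waits instead of opening a connection PostgreSQL will refuse.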
Cache Invalidation — Switched from TTL-based expiry to LFU eviction without TTL: hot URLs stay cached indefinitely, one-off hits decay naturally. Explicit pattern-based deletion on write guarantees read consistency and avoids thundering-herd storms.
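The delete-on-write rule above can be sketched with a dict standing in for Redis (which handles LFU eviction itself); `fnmatch` approximates Redis's glob-style key matching, and the key names here are illustrative:

```python
from fnmatch import fnmatch

cache = {}   # stand-in for Redis
db = {"u1": "https://example.com/old"}   # stand-in for PostgreSQL

def invalidate(pattern: str) -> None:
    """Pattern-based deletion: drop every cached key matching the glob pattern."""
    for key in [k for k in cache if fnmatch(k, pattern)]:
        del cache[key]

def update_url(code: str, new_target: str) -> None:
    db[code] = new_target
    invalidate(f"url:{code}*")   # delete before anyone can read the stale entry

def get_url(code: str) -> str:
    key = f"url:{code}"
    if key not in cache:
        cache[key] = db[code]    # cache miss: repopulate from the database
    return cache[key]
```

Because writes delete rather than update the cached entry, the next read repopulates from the database, so reads can never observe a stale value.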
Stale Dashboards — Manual refresh showed 15m old data. Fixed: Auto-refresh every 10s + alert annotations (vertical lines on metric graphs).
RCA Narrative — Judges needed to see how we diagnose, not just what happened. Created a step-by-step Grafana walkthrough: latency spiked 6.5×, error rate stayed at 0% (circuit breaker), and the Loki query `{job="app"} |= "Redis unavailable"` confirmed the fallback path.
What We're Proud Of
Track 1 (Reliability): 91% test coverage (177 tests, Gold threshold is 70%), CI/CD pipeline blocks merges on failures, chaos testing (kill container → auto-recover), N+1 query regression detection, Trivy CVE scanning.
Track 2 (Scalability): 5 k6 load test scripts (stress/soak/sustained/breakpoint/smoke); 500-VU stress test: 482 req/s, 0.00% errors, p95=42ms; multi-replica Nginx least-connections load balancing; Redis LFU cache (6× latency improvement on cache hits); custom Docker autoscaler sidecar (60s scale-up / 120s scale-down cooldown).
Track 3 (Incident Response): Structured JSON logging, 7 Prometheus alert rules (all fire <90s), SMTP + Discord notifications, 8-panel Grafana dashboard, OpenTelemetry + Jaeger distributed tracing, chaos-test.sh script, RCA-001 with Grafana panel walkthroughs, postmortem template (Google SRE format), 20-section incident playbook.
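The structured JSON logging mentioned in Track 3 needs nothing beyond the stdlib; a minimal formatter sketch (the field names here are our choice, not necessarily the project's):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so Loki/LogQL can filter on fields."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        return json.dumps(payload)

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Redis unavailable, falling back to PostgreSQL")
```

One JSON object per line is what makes queries like the RCA's `{job="app"} |= "Redis unavailable"` cheap to run in Loki.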
Cross-track: Auto-scaling (2–5 replicas, CPU-based), GitHub Pages documentation site, OpenAPI contract, 21+ docs (architecture, runbooks, RCA, capacity plan, decision log).
What We Learned
Observability is non-negotiable — We couldn't optimize latency until we saw which queries were slow. Structured logging + Prometheus + Jaeger traces = clarity.
Graceful degradation beats hard failures — Circuit breaker meant Redis failure = slower (not broken). That's good design.
Alert on business impact, not metrics — "CPU > 50%" fired constantly. "Error rate > 1%" actually mattered. SLOs are your north star.
Documentation multiplies your team — A runbook meant anyone could diagnose a ServiceDown alert in 5 minutes instead of 30. Write for your 3 AM self.
Load test early — Hitting 500 concurrent users revealed connection pool bugs we'd never catch in code review.
Built With
- Backend: Python 3.13, Flask 3.1, Peewee ORM, Gunicorn 23, uv
- Database: PostgreSQL 16, Redis 7
- Proxy: Nginx (rate limiting, gzip, TLS)
- Observability: Prometheus, Grafana, Loki 3.4.2, Alertmanager, Jaeger 2.16.0, OpenTelemetry
- Orchestration: Docker Compose with custom autoscaler sidecar
- Testing: pytest (91% coverage), GitHub Actions CI/CD, k6 load tests (5 scripts)
Try It
```bash
docker compose up -d --build

curl -X POST http://localhost/users \
  -H "Content-Type: application/json" \
  -d '{"username": "alice", "email": "alice@example.com"}'

# Monitor: Grafana (localhost:3000) → Prometheus (9090) → Alertmanager (9093) → Jaeger (localhost:16686)
./scripts/chaos-test.sh --service-down  # See alerts fire in real time
```
One-Liner
A production-grade URL shortener built to win MLH PE by demonstrating reliability (91% test coverage + CI/CD), scalability (auto-scaling, LFU caching, 482 req/s at zero errors), and operational excellence (7 alerts, distributed tracing, dashboards, runbooks, incident playbooks).
Docs: See docs/Incident Response/README.md for incident response track details, docs/ARCHITECTURE.md for architecture, docs/DEPLOYMENT.md for setup, or the full GitHub Pages site.