Snip — Production-Grade URL Shortener

Elevator pitch: 11-container URL shortener surviving 600 concurrent users at 0% errors on a $6/mo server. Full Gold across Reliability, Scalability, Incident Response, and Documentation.

About the project:

## Inspiration

The MLH Production Engineering Hackathon challenged us to go beyond "it works on my laptop" and prove our code can survive real production pressure. We wanted to dominate every quest track — not just pass, but build something an SRE team would be proud to operate.

## What it does

Snip is a URL shortener that creates short links, tracks redirect analytics, and manages users — deployed as a production-grade service with horizontal scaling, caching, monitoring, alerting, and comprehensive documentation.

## How we built it

Starting from the MLH Flask+Peewee template, we built an 11-container Docker Compose architecture:

- 3 Flask instances behind an Nginx load balancer (least_conn routing)
- PostgreSQL with connection pooling and synchronous_commit=off for write performance
- Redis caching with a 600s TTL and full cache warm-up on startup (95%+ hit ratio)
- Prometheus scraping metrics from all instances, with 12 custom alert rules
- Grafana dashboard with 8 panels covering all four golden signals
- Alertmanager routing to a custom webhook receiver that forwards to Discord with rich embeds
- GitHub Actions CI/CD with ruff linting, 259 pytest tests, and a 70% coverage gate that blocks deploys
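
The Redis caching layer described above (600s TTL, warm-up on startup) follows what is usually called a cache-aside pattern. Here is a minimal sketch of that idea, not our actual code: `TTLCache` is a hypothetical in-memory stand-in for Redis GET/SETEX/DEL so the example runs without a server, and the `link:` key prefix, `warm_cache`, `resolve`, and `db_lookup` are illustrative names assumed for this example.

```python
import time
from typing import Callable, Dict, Optional, Tuple

CACHE_TTL_SECONDS = 600  # TTL quoted in the write-up


class TTLCache:
    """Hypothetical in-memory stand-in for Redis GET / SETEX / DEL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily drop expired entries
            return None
        return value

    def setex(self, key: str, ttl: float, value: str) -> None:
        self._data[key] = (value, self._clock() + ttl)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def warm_cache(cache: TTLCache, links: Dict[str, str]) -> int:
    """Preload every known short code so the first requests are cache hits."""
    for code, url in links.items():
        cache.setex(f"link:{code}", CACHE_TTL_SECONDS, url)
    return len(links)


def resolve(
    cache: TTLCache,
    db_lookup: Callable[[str], Optional[str]],
    code: str,
) -> Optional[str]:
    """Cache-aside read: try the cache, fall back to the database on a miss."""
    key = f"link:{code}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    url = db_lookup(code)
    if url is not None:
        cache.setex(key, CACHE_TTL_SECONDS, url)  # repopulate for later reads
    return url
```

With a real Redis client the same three operations map onto GET, SETEX, and DEL; calling `delete` when a link is updated or removed is one simple way to keep cached redirects from going stale before the TTL expires.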

## Challenges we ran into

- $6/mo constraint: Fitting 11 containers on 1 vCPU / 1 GB RAM required careful tuning: Gunicorn gthread workers, PostgreSQL synchronous_commit=off, Nginx microcaching, and 2 GB of swap
- Cache invalidation: Balancing cache freshness with performance under 600 concurrent users
- Alert pipeline: Getting Prometheus → Alertmanager → webhook-receiver → Discord working end-to-end inside Docker networking
- Oracle hidden tests: Reverse-engineering 6 undocumented evaluator checks from cryptic hints
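
The translation step in the alert pipeline above (Alertmanager webhook body in, Discord embed payload out) can be sketched as a pure function. This is an illustration under assumptions rather than our receiver's actual code: the function names, the colors, and the choice of `severity` and `instance` as embed fields are made up for the example, and the HTTP handling and the POST to the Discord webhook URL are left out.

```python
from typing import Any, Dict

COLOR_FIRING = 0xE74C3C    # red, arbitrary choice for this sketch
COLOR_RESOLVED = 0x2ECC71  # green, arbitrary choice for this sketch
MAX_EMBEDS = 10            # Discord accepts at most 10 embeds per message


def alert_to_embed(alert: Dict[str, Any]) -> Dict[str, Any]:
    """Map one Alertmanager alert object to one Discord embed object."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    status = alert.get("status", "firing")
    return {
        "title": f"[{status.upper()}] {labels.get('alertname', 'unknown')}",
        "description": annotations.get("summary", "No summary provided."),
        "color": COLOR_RESOLVED if status == "resolved" else COLOR_FIRING,
        "fields": [
            {"name": "Severity", "value": labels.get("severity", "n/a"), "inline": True},
            {"name": "Instance", "value": labels.get("instance", "n/a"), "inline": True},
        ],
    }


def to_discord_payload(alertmanager_body: Dict[str, Any]) -> Dict[str, Any]:
    """Build a Discord webhook JSON body from an Alertmanager webhook body."""
    alerts = alertmanager_body.get("alerts", [])
    return {"embeds": [alert_to_embed(a) for a in alerts[:MAX_EMBEDS]]}
```

Keeping the mapping free of network calls makes it easy to unit test, which helps when the failure could be anywhere along the Prometheus → Alertmanager → receiver → Discord chain.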

## Accomplishments that we're proud of

- 0% error rate at 600 concurrent users on a $6/month Droplet
- p95 = 2,970ms at Gold tier (about 1.7x headroom under the 5s threshold)
- 217 req/s sustained throughput, a 9.9x improvement over baseline
- 29/29 evaluator tests passed, including all 6 Oracle hidden tests
- 25-second alert detection, from container failure to Discord notification
- 6,500+ lines of documentation across 14 files
- 88% test coverage with 259 tests
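
As a quick sanity check on the ratios quoted above, the arithmetic works out as follows. The baseline throughput is not stated in this write-up, so the value computed here is only what the 9.9x figure implies, not a measured number.

```python
P95_MS = 2970            # reported p95 latency at Gold tier
THRESHOLD_MS = 5000      # Gold tier latency threshold
THROUGHPUT_RPS = 217     # reported sustained throughput
IMPROVEMENT_FACTOR = 9.9 # reported improvement over baseline

# Headroom: how many times the p95 fits under the threshold.
headroom = THRESHOLD_MS / P95_MS

# Baseline throughput implied by the reported improvement (derived, not measured).
implied_baseline_rps = THROUGHPUT_RPS / IMPROVEMENT_FACTOR

print(f"headroom ≈ {headroom:.2f}x")                        # headroom ≈ 1.68x
print(f"implied baseline ≈ {implied_baseline_rps:.1f} req/s")  # implied baseline ≈ 21.9 req/s
```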

## What we learned

- Production engineering is about the system around the code, not just the code itself
- synchronous_commit=off was our single biggest performance win (37% latency reduction); understanding your database's durability tradeoffs matters
- Chaos engineering isn't optional: we found real bugs by killing containers on purpose
- Good documentation is a force multiplier: our runbook saved us hours during debugging

## What's next for PE Hackathon

- Horizontal multi-node scaling with a DigitalOcean Load Balancer
- PgBouncer for connection pooling at scale
- Distributed tracing with OpenTelemetry
- Automated chaos engineering in the CI/CD pipeline