Project Story — MLH Production Engineering Hackathon 2026

Inspiration

We were given a simple URL shortener and one goal: make it production-ready. Not just "it works on my machine" production-ready — but the real kind. Load balanced. Observable. Deployable. Resilient to failure. The kind of system you'd actually trust to run in production.

That challenge pushed us to think beyond just writing code. We had to think like production engineers: how does this system behave under load? What happens when a container crashes? How do we know when something is wrong at 3am?


What We Built

We took a barebones Flask URL shortener and transformed it into a production-grade system capable of handling 500 concurrent users with near-zero error rates. The final stack includes:

  • Flask + Gunicorn — multi-worker, multi-threaded WSGI server
  • PostgreSQL — primary data store with connection pooling
  • Redis — shared cache layer across all replicas for URL lookups
  • Nginx — reverse proxy and load balancer with least_conn routing
  • 2 web replicas — horizontal scaling with automatic failover
  • Fluent Bit + Better Stack — structured log shipping and aggregation
  • Prometheus + Grafana — real-time metrics dashboards
  • Better Stack Uptime — alerting for service-down and high error rate events
  • GitHub Actions CI/CD — automated testing and deployment to a DigitalOcean Droplet on every push to main

The Scaling Challenge

The biggest technical challenge was getting from a broken system under load to one that could handle 500 concurrent users with a 0% error rate. We got there iteratively, and every step taught us something.

Where we started: Running Flask's built-in development server. At 500 virtual users (VUs) in our k6 load test, we saw a ~43% error rate. The server simply couldn't handle concurrent requests.

Fix 1 — Gunicorn with gthread workers. Switching to Gunicorn with --workers=4 --threads=4 dropped the error rate dramatically. The dev server is single-threaded by design — it was never meant for this.
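The setup above fits in a small gunicorn.conf.py. This is a sketch: the bind address and timeout are illustrative assumptions, not values from our repo; the workers/threads counts are the ones described above.

```python
# gunicorn.conf.py -- sketch of the gthread setup described above.
# The bind address and timeout are illustrative assumptions.
bind = "0.0.0.0:8000"
worker_class = "gthread"   # threaded workers instead of the default sync
workers = 4                # 4 worker processes
threads = 4                # 4 threads per worker
timeout = 30               # kill a worker stuck longer than 30s

# 4 workers x 4 threads = 16 concurrent requests per replica,
# versus exactly 1 for Flask's single-threaded dev server.
concurrency = workers * threads
```

Note that passing --threads with a value above 1 is what switches Gunicorn into gthread mode; the dev server has no equivalent.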

Fix 2 — Redis TTL tuning. We extended the cache TTL from 60 seconds to 300 seconds. URL redirects are read-heavy and rarely change, so keeping them in cache longer reduced DB pressure significantly under sustained load.
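The read path follows the classic cache-aside pattern. A minimal runnable sketch, using an in-memory dict with expiry timestamps as a stand-in for Redis (in production this would be redis.Redis with setex); function names here are illustrative:

```python
import time

TTL_SECONDS = 300  # extended from 60s, per the tuning above

# Stand-in for Redis: maps short_code -> (url, expiry timestamp).
_cache: dict[str, tuple[str, float]] = {}

def cache_get(short_code: str):
    entry = _cache.get(short_code)
    if entry is None:
        return None
    url, expires_at = entry
    if time.monotonic() > expires_at:   # expired entry counts as a miss
        del _cache[short_code]
        return None
    return url

def cache_set(short_code: str, url: str, ttl: float = TTL_SECONDS):
    _cache[short_code] = (url, time.monotonic() + ttl)

def resolve(short_code: str, db_lookup):
    """Cache-aside read: try the cache, fall back to the DB, repopulate."""
    url = cache_get(short_code)
    if url is None:
        url = db_lookup(short_code)     # hits Postgres only on a miss
        if url is not None:
            cache_set(short_code, url)
    return url
```

Because redirects are read-heavy and effectively immutable, a longer TTL trades a little staleness risk for a much higher hit rate across both replicas.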

Fix 3 — Connection pooling. At 500 VUs we started hitting Postgres connection limits. Every Gunicorn worker was opening its own connection, and with 2 replicas × 4 workers × 4 threads, we were exhausting the pool fast. Switching to PooledPostgresqlDatabase with a shared pool of 32 connections per instance dropped the error rate from ~20% to 2.6%.
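The arithmetic behind the exhaustion is worth spelling out. Postgres defaults to max_connections = 100; without pooling, every request-handling thread can hold its own connection, and connection churn under load eats the rest. A quick back-of-the-envelope check:

```python
# Worst case without pooling: every thread opens its own DB connection.
replicas, workers, threads = 2, 4, 4
peak_handlers = replicas * workers * threads   # 32 concurrent handlers

# With a shared pool of 32 connections per instance, the app's total
# footprint is capped deterministically, well under Postgres's default
# max_connections of 100, leaving headroom for migrations and psql.
pool_per_instance = 32
total_app_connections = replicas * pool_per_instance
```

Capping the pool per instance (rather than per worker) is what made the ceiling predictable as we scaled.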

Fix 4 — Nginx DNS caching bug. The final 2.6% error rate turned out to be a subtle Nginx issue. Nginx resolves Docker service names at startup and caches them. After a redeploy that creates new containers, Nginx kept routing all traffic to the original container IP — effectively sending 100% of requests to web-1 while web-2 sat idle. Forcing a full Nginx restart on every deploy so it re-resolves DNS dropped the error rate from 2.6% to 0%.
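For reference, the standard in-config alternative to restarting Nginx is to force runtime DNS resolution with a resolver directive and a variable upstream. This fragment is a sketch (the service name and port are illustrative); 127.0.0.11 is Docker's embedded DNS server:

```nginx
# Alternative to restarting Nginx on deploy: re-resolve DNS at runtime.
resolver 127.0.0.11 valid=10s;

server {
    location / {
        # Using a variable defeats Nginx's startup-time DNS caching,
        # so new container IPs are picked up within valid=10s.
        set $upstream http://web:8000;
        proxy_pass $upstream;
    }
}
```

The trade-off is that a variable proxy_pass bypasses the upstream block, and with it least_conn balancing — one reason a restart-on-deploy can be the simpler fix.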


Productionizing the System

Scaling was only half the work. The other half was making the system observable, recoverable, and deployable.

Observability — We configured structured JSON logging from the Flask app, shipped via Fluent Bit to Better Stack for aggregation and search. Prometheus scrapes a /metrics endpoint on each web replica every 15 seconds, and Grafana dashboards give us real-time visibility into CPU, memory, request rates, and error rates.
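The structured-logging side needs no third-party library. A minimal JSON formatter sketch using only the stdlib (field names are illustrative, not the exact schema we ship to Better Stack):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, ready for Fluent Bit to ship."""
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:  # surface tracebacks instead of swallowing them
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("shortener")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("redirect served")
```

One JSON object per line is exactly the shape log shippers parse without custom regexes.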

Alerting — Better Stack uptime monitors hit our /health endpoint every minute. If the service goes down or error rates spike, alerts fire to a Discord channel within 3 minutes.
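A /health endpoint is only useful if it degrades gracefully when a dependency is down. A sketch of the aggregation logic, with hypothetical injected probes standing in for the real DB and Redis checks:

```python
def health(check_db, check_cache):
    """Aggregate dependency probes into a /health response.

    check_db / check_cache are hypothetical callables that return
    truthy when the dependency answers. Returns (body, http_status).
    """
    checks = {"db": False, "cache": False}
    for name, probe in (("db", check_db), ("cache", check_cache)):
        try:
            checks[name] = bool(probe())
        except Exception:
            checks[name] = False   # a failing probe must not crash /health
    status = 200 if all(checks.values()) else 503
    body = {"status": "ok" if status == 200 else "degraded", **checks}
    return body, status
```

Returning 503 on partial failure is what lets an uptime monitor distinguish "degraded" from "fine" with a plain HTTP check.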

Resilience — Docker's restart: on-failure policy automatically restarts crashed containers. With 2 replicas running, Nginx continues routing traffic to the healthy replica while the failed one recovers — no user-facing outage for a single container failure.
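In Compose terms, the resilience settings above amount to a few lines. A sketch (service and image names are illustrative, not our exact compose file):

```yaml
services:
  web:
    image: shortener-web
    restart: on-failure   # Docker restarts a crashed container
    deploy:
      replicas: 2         # Nginx keeps routing to the healthy replica
```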

CI/CD — Every push to main triggers a GitHub Actions pipeline that runs the full test suite (unit + system tests) and, on success, SSHs into our DigitalOcean Droplet, rebuilds the web image, and redeploys with zero-downtime scaling.
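The pipeline shape looks roughly like this. A sketch only: job names, the test script path, secret names, and the SSH action are illustrative assumptions, not our exact workflow:

```yaml
name: ci-cd
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./run_tests.sh          # unit + system tests
  deploy:
    needs: test                      # deploy only if tests pass
    runs-on: ubuntu-latest
    steps:
      - uses: appleboy/ssh-action@v1
        with:
          host: ${{ secrets.DROPLET_HOST }}
          username: ${{ secrets.DROPLET_USER }}
          key: ${{ secrets.SSH_KEY }}
          script: |
            cd app && git pull
            docker compose up -d --build web
```

The needs: test edge is the whole gate — a red test suite never reaches the Droplet.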


What We Learned

  • Silence is dangerous. Our biggest debugging headaches came from errors being silently swallowed — bare except: pass blocks, wrong logger arguments, cache imports that captured None at startup. Structured logging and proper error surfacing would have saved hours.
  • Each layer of the stack has its own failure modes. Fixing the app server wasn't enough — we also had to fix the connection pool, the cache TTL, and a DNS bug in the load balancer. Production readiness means understanding every layer.
  • Observability is not optional. The Nginx DNS bug would have been nearly invisible without per-container stats from docker stats — one replica pegged while the other sat idle. It reinforced why we track per-replica load in Grafana.

What's Next

  • Pin uv:latest to a specific version in the Dockerfile for reproducible builds
  • Add DB-level retry logic for transient connection failures
  • Implement cache pre-warming for high-traffic short codes
  • Add per-endpoint latency metrics (p50, p95, p99) to the Grafana dashboard
  • Explore PgBouncer as a connection pooler to support higher replica counts without saturating Postgres

Built With

Python · Flask · Gunicorn · PostgreSQL · Redis · Nginx · Docker · Fluent Bit · Better Stack · Prometheus · Grafana · GitHub Actions · DigitalOcean · k6 · uv
