Lazarus

Screenshots: dashboard; container and network metrics; system metrics; load testing with Locust; load test results with 25 concurrent users; simulation with 50 concurrent users.

1. Implementing the URL shortener

Using the template: Flask + peewee + PostgreSQL
Initial Setup: Put PostgreSQL onto Docker.
Testing: Wrote unit and integration tests with pytest.
Documentation: Added Swagger for each endpoint (long-time FastAPI user habit! :D).

2. Database Connection Hardening

DB Pooling: Noticed the app opened a connection, ran its query, and closed the connection on every request, so I added connection pooling.
Timeouts: Added timeouts everywhere (query timeout, connection wait timeout, stale-connection timeout, etc.) so queries cannot hang indefinitely.
Load Testing: Added a Locust file to simulate user flows with weighted endpoints.
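
To make the pooling and timeout ideas concrete, here is a minimal stdlib-only sketch, not the project's actual code. It uses sqlite3 as a stand-in for PostgreSQL, and the names SimplePool, PoolTimeout, wait_timeout, and stale_timeout are illustrative assumptions. peewee ships its own pool in playhouse.pool (for example PooledPostgresqlDatabase, which takes max_connections, stale_timeout, and timeout); the write-up does not say which pool was used.

```python
import queue
import sqlite3
import time


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the wait timeout."""


class SimplePool:
    """Illustrative fixed-size pool (hypothetical, sqlite3 as a stand-in)."""

    def __init__(self, database, max_connections=4,
                 wait_timeout=1.0, stale_timeout=300.0):
        self.database = database
        self.wait_timeout = wait_timeout
        self.stale_timeout = stale_timeout
        self.opened = 0  # number of real connections created so far
        # Each slot is None (not opened yet) or (connection, last_used).
        self._slots = queue.Queue(maxsize=max_connections)
        for _ in range(max_connections):
            self._slots.put(None)

    def acquire(self):
        # Wait for a free slot instead of hanging forever.
        try:
            slot = self._slots.get(timeout=self.wait_timeout)
        except queue.Empty:
            raise PoolTimeout("no free connection") from None
        if slot is not None:
            conn, last_used = slot
            if time.monotonic() - last_used < self.stale_timeout:
                return conn  # reuse instead of reconnecting
            conn.close()  # idle too long: discard and reconnect
        self.opened += 1
        return sqlite3.connect(self.database)

    def release(self, conn):
        # Hand the connection back with a fresh last-used timestamp.
        self._slots.put((conn, time.monotonic()))
```

With max_connections=1, a second acquire() raises PoolTimeout after wait_timeout seconds, and a released connection is handed back on the next acquire() rather than reopened.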

3. Noticed Bottleneck

Observation: During load testing, the database was overloaded with connections, tanking performance.
Solution: Added Redis caching as a simple key-value layer between the application and the database.

Why Redis? Dead simple, lightweight, and very fast. It is a widely used default for key-value caching and a no-brainer for this use case.
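
The caching layer described above is commonly built as a cache-aside lookup. The sketch below shows that pattern under stated assumptions: TTLCache is an in-memory stand-in for Redis GET and SET-with-expiry, and Resolver, db_lookup, the "url:" key prefix, and the 300-second TTL are all hypothetical, not taken from the project.

```python
import time


class TTLCache:
    """In-memory stand-in for Redis GET / SET with an expiry."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # expired entries behave like misses
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, time.monotonic() + ttl_seconds)


class Resolver:
    """Cache-aside lookup of a short code (names are illustrative)."""

    def __init__(self, cache, db_lookup, ttl_seconds=300):
        self.cache = cache
        self.db_lookup = db_lookup  # callable: short code -> URL or None
        self.ttl_seconds = ttl_seconds
        self.hits = 0
        self.misses = 0

    def resolve(self, code):
        key = f"url:{code}"
        cached = self.cache.get(key)
        if cached is not None:
            self.hits += 1
            return cached  # served without touching the database
        self.misses += 1
        url = self.db_lookup(code)
        if url is not None:
            # Populate the cache so the next lookup is a hit.
            self.cache.set(key, url, self.ttl_seconds)
        return url
```

The hits and misses counters are the kind of numbers a cache hit-rate panel would be built from.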

4. Logging

Level-based logs: Included extra metadata like method, status code, and duration.
SQL Monitoring: Recorded SQL execution time specifically to detect N+1 queries.
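
A rough sketch of per-request SQL timing, assuming a hypothetical QueryLog helper that wraps each query. On top of recording durations, it flags statements that repeat many times within one request, which is one common N+1 heuristic; the repeat threshold and logger name are assumptions, and the project may detect N+1 queries differently.

```python
import logging
import time
from collections import Counter
from contextlib import contextmanager

logger = logging.getLogger("app.sql")


class QueryLog:
    """Collects (statement, duration_ms) pairs for a single request."""

    def __init__(self, repeat_threshold=5):
        self.repeat_threshold = repeat_threshold
        self.entries = []

    @contextmanager
    def timed(self, statement):
        start = time.perf_counter()
        try:
            yield
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            self.entries.append((statement, duration_ms))
            # Extra fields become attributes on the log record.
            logger.debug(
                "sql",
                extra={"statement": statement, "duration_ms": duration_ms},
            )

    def suspected_n_plus_one(self):
        # The same parameterized statement run many times in one request
        # usually means a query is being issued inside a loop.
        counts = Counter(statement for statement, _ in self.entries)
        return [s for s, n in counts.items() if n >= self.repeat_threshold]
```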

5. Prometheus + Grafana

Infrastructure: Added Prometheus and Grafana to Docker; exposed the /metrics endpoint.
Instrumentation: Added metrics recording (Histogram, Counter, etc.) within the logging logic.
Processing: Recorded metrics in before/after-request hooks, placing request processing times into histogram buckets.
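
To illustrate what "placing processing times into buckets" means, here is a stdlib-only sketch of histogram bucketing. It is not the prometheus_client API; the Histogram class and the bucket bounds are illustrative assumptions. Prometheus exposes buckets cumulatively, so each "le" (less than or equal) bucket also counts every observation from the smaller buckets.

```python
import bisect


class Histogram:
    """Illustrative latency histogram with cumulative 'le' buckets."""

    def __init__(self, buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0)):
        self.bounds = sorted(buckets)
        # One counter per upper bound, plus a final +Inf bucket.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, seconds):
        # Index of the first bound >= seconds; anything larger than
        # every bound lands in the +Inf bucket at the end.
        index = bisect.bisect_left(self.bounds, seconds)
        self.counts[index] += 1
        self.total += seconds

    def cumulative(self):
        """Return {upper_bound: count of observations <= upper_bound}."""
        result = {}
        running = 0
        for bound, count in zip(self.bounds, self.counts):
            running += count
            result[bound] = running
        result[float("inf")] = running + self.counts[-1]
        return result
```

A p95 estimate is then derived from these cumulative counts, which is why bucket bounds should bracket the latencies you care about.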

Why Prometheus + Grafana? Best open-source combo: very effective and free!

What are some alternatives? OpenTelemetry is a strong contender (enterprise-grade metrics, logging, and monitoring). However, given the hardware constraints (2-core CPU, 2 GB RAM), it would be overkill for this app.

6. Alerting Rules

Alertmanager: Configured to ping me about abnormalities (high rates of 4xx/5xx responses, high p95 latency, degraded DB, etc.).
Rules Logic: Set up selective rules (e.g., alert on p95 latency that stays high for 5 minutes rather than on single spikes).
Integration: Added a Discord webhook as the point of contact.
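
The "sustained for 5 minutes, not a single spike" logic can be sketched in plain Python. In Prometheus itself this behaviour comes from the `for` clause of an alerting rule; the SustainedAlert class below, its threshold, and its sample timestamps are illustrative assumptions rather than the project's rules.

```python
class SustainedAlert:
    """Fires only after the value has stayed above a threshold long enough."""

    def __init__(self, threshold, hold_seconds):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self.breach_started_at = None  # when the current breach began

    def observe(self, timestamp, value):
        """Feed one sample; return True when the alert should fire."""
        if value <= self.threshold:
            # Any healthy sample resets the timer, so spikes never fire.
            self.breach_started_at = None
            return False
        if self.breach_started_at is None:
            self.breach_started_at = timestamp
        return timestamp - self.breach_started_at >= self.hold_seconds
```

For example, with threshold=2.0 and hold_seconds=300, one 9-second p95 sample followed by a healthy one never fires, while samples that stay above 2 seconds fire once 300 seconds have passed since the breach began.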

7. CI/CD & Deployment

Host: Deployed to a DigitalOcean Droplet.
Containerization: Containerized all components, performed local dry runs, and included a deploy.sh script.
Workflows: Wrote ci.yml and deploy.yml files.
Discovery: Containerized the NGINX proxy for the first time and didn't realize how easy that was!
Effort: Spent a significant amount of time here, despite this not being my first rodeo.

8. Miscellaneous

Deployment Strategy: Attempted blue-green deployment to reach 99.99% uptime. It failed miserably due to the lack of Kubernetes; in this setup, in-place deployment is the better way to keep downtime low.
Log Aggregation: Added Loki + Promtail to view logs without SSH-ing into the server.
Resource Monitoring: Added cAdvisor to monitor resource usage on the VPS; discovered I could safely run 4 Gunicorn workers instead of 2.
Security: Used a Docker bridge network to expose only client-facing services (server, Grafana) while keeping internal services hidden from port scans.

Why Loki + Promtail? It took longer to set up than a simple tool like Dozzle, but the built-in Grafana dashboard integration and container-based querying make it much more powerful than a basic logging snapshot.

9. What’s Next

Scalability: The "elephant in the room." The app struggles to maintain sub-5s p95 latency with 50+ concurrent users.
Cache Optimization: The dashboard shows only a 50% hit rate, which should be much higher for a URL shortener.
Task Queue: Implement a proper distributed worker for the /users/bulk endpoint, with a DLQ (Dead Letter Queue) for failed jobs and SSE (Server-Sent Events) to notify clients instead of having them poll.
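
Since the task queue is still future work, the following is only one possible shape for the retry-then-dead-letter idea, written as a single-process sketch. The function process_with_dlq, the handler callable, and max_attempts are hypothetical; a real distributed worker would sit on a broker rather than an in-memory deque.

```python
from collections import deque


def process_with_dlq(jobs, handler, max_attempts=3):
    """Run handler on each job; retry failures, then park them in a DLQ."""
    pending = deque((job, 1) for job in jobs)  # (job, attempt number)
    done = []
    dead_letters = []
    while pending:
        job, attempt = pending.popleft()
        try:
            done.append(handler(job))
        except Exception as exc:
            if attempt >= max_attempts:
                # Out of retries: keep the job and the reason for inspection.
                dead_letters.append((job, repr(exc)))
            else:
                # Requeue at the back so other jobs are not blocked.
                pending.append((job, attempt + 1))
    return done, dead_letters
```

Jobs that land in dead_letters can be inspected or replayed later instead of being silently dropped or retried forever.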

Built With

docker
flask
git
github
grafana
locust
nginx
peewee
postgresql
prometheus
redis

Updates

Trung Nguyen (Tyler) started this project — Apr 05, 2026 09:39 AM EDT
