1. Implementing the URL shortener
- Using the template: Flask + peewee + PostgreSQL
- Initial Setup: Put PostgreSQL onto Docker.
- Testing: Wrote unit tests and integration tests using `pytest`.
- Documentation: Added Swagger for each endpoint (long-time FastAPI user habit! :D).
2. Database Connection Harnessing
- DB Pooling: Noticed the app was connecting, querying, then closing every time, so I implemented pooling.
- Timeouts: Added timeouts everywhere (query timeout, waiting for connection, stale timeout, etc.) to ensure queries do not hang.
- Load Testing: Added a `locustfile` to simulate user flows and weighted endpoints.
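To make the pooling and timeout ideas concrete, here is a minimal standard-library sketch. It is not the app's actual code and not peewee's own pool (peewee ships one in `playhouse.pool`); the class name `TinyPool` and its parameters are hypothetical, and `sqlite3` stands in for PostgreSQL so the example runs anywhere.

```python
import queue
import sqlite3
import time


class TinyPool:
    """Toy connection pool: reuse connections, bound the wait, drop stale ones."""

    def __init__(self, connect, max_connections=4, wait_timeout=2.0, stale_timeout=300.0):
        self._connect = connect              # hypothetical factory returning a new connection
        self._wait_timeout = wait_timeout    # max seconds to wait for a free slot
        self._stale_timeout = stale_timeout  # max idle seconds before a connection is discarded
        # LIFO so the most recently released connection is reused first.
        self._slots = queue.LifoQueue(maxsize=max_connections)
        for _ in range(max_connections):
            self._slots.put((None, 0.0))     # (connection or None, time it was released)
        self.opened = 0                      # how many real connections were created

    def acquire(self):
        try:
            conn, released_at = self._slots.get(timeout=self._wait_timeout)
        except queue.Empty:
            raise TimeoutError("no free connection within wait_timeout") from None
        if conn is not None and time.monotonic() - released_at > self._stale_timeout:
            conn.close()                     # idle too long: discard and reconnect
            conn = None
        if conn is None:
            conn = self._connect()
            self.opened += 1
        return conn

    def release(self, conn):
        self._slots.put((conn, time.monotonic()))
```

Usage would look like `pool = TinyPool(lambda: sqlite3.connect(":memory:"))`, then `conn = pool.acquire()` and `pool.release(conn)` around each query. The point is that a request borrows an already-open connection instead of paying for connect and close every time, and a full pool makes callers fail fast instead of hanging.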
3. Noticed Bottleneck
- Observation: During load testing, the database was overloaded with connections, tanking performance.
- Solution: Added Redis caching as a simple key-value layer between the client and the database.
Why Redis? Dead simple, lightweight, and very fast. It's the industry standard and a total no-brainer for this use case.
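The caching layer described above follows the usual cache-aside pattern, which can be sketched roughly like this. Everything here is an assumption for illustration: `FakeRedis` is an in-memory stand-in that only mimics the `get` and `setex` calls of a Redis client, and `resolve`, `load_from_db`, the `url:` key prefix, and the 300-second TTL are hypothetical names and values, not the app's real ones.

```python
import time


class FakeRedis:
    """In-memory stand-in exposing the two Redis-style calls used below (get, setex)."""

    def __init__(self):
        self._data = {}

    def get(self, name):
        value, expires_at = self._data.get(name, (None, 0.0))
        if value is not None and time.monotonic() >= expires_at:
            del self._data[name]  # expired entry behaves like a miss
            return None
        return value

    def setex(self, name, ttl_seconds, value):
        self._data[name] = (value, time.monotonic() + ttl_seconds)


def resolve(short_code, cache, load_from_db, ttl_seconds=300):
    """Cache-aside lookup: try the cache, fall back to the database, then fill the cache."""
    key = f"url:{short_code}"
    cached = cache.get(key)
    if cached is not None:
        return cached                      # cache hit: the database is never touched
    long_url = load_from_db(short_code)    # cache miss: one database query
    if long_url is not None:
        cache.setex(key, ttl_seconds, long_url)
    return long_url
```

With a real Redis client in place of `FakeRedis`, note that values typically come back as bytes unless the client is configured to decode responses.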
4. Logging
- Level-based logs: Included extra metadata like method, status code, and duration.
- SQL Monitoring: Recorded SQL execution time specifically to detect N+1 queries.
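As a rough illustration of level-based request logs with metadata, here is a sketch using only the standard `logging` module. The logger name, the `log_request` helper, the `slow_ms` threshold, and the field names in `extra` are all hypothetical; the real app hooks into Flask rather than wrapping a handler like this.

```python
import logging
import time

logger = logging.getLogger("shortener.requests")  # hypothetical logger name


def log_request(method, path, handler, slow_ms=500.0):
    """Run handler(), then log method, status code, and duration at a level based on the outcome."""
    start = time.perf_counter()
    status = handler()  # handler is assumed to return an HTTP status code
    duration_ms = (time.perf_counter() - start) * 1000
    if status >= 500:
        level = logging.ERROR
    elif status >= 400 or duration_ms > slow_ms:
        level = logging.WARNING
    else:
        level = logging.INFO
    logger.log(
        level,
        "%s %s -> %d in %.1f ms",
        method, path, status, duration_ms,
        # Extra fields are attached to the LogRecord so a JSON formatter can emit them.
        extra={"http_method": method, "status_code": status, "duration_ms": duration_ms},
    )
    return status
```

The same timing wrapper works around a SQL call: logging one line per query with its duration makes an N+1 pattern show up as a burst of near-identical entries inside a single request.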
5. Prometheus + Grafana
- Infrastructure: Added Prometheus and Grafana to Docker; exposed the `/metrics` endpoint.
- Instrumentation: Added metrics recording (HISTOGRAM, COUNTER, etc.) within the logging logic.
- Processing: Configured metrics to record before/after requests, placing processing times into buckets.
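To show what "placing processing times into buckets" means, here is a toy histogram written with the standard library only. It imitates the cumulative `le` (less than or equal) buckets that Prometheus histograms expose, but it is not the Prometheus client; `LatencyHistogram`, `timed`, and the bucket bounds are made up for the example.

```python
import bisect
import time


class LatencyHistogram:
    """Toy cumulative histogram in the style Prometheus uses (le = less than or equal)."""

    def __init__(self, bounds=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5)):
        self.bounds = sorted(bounds)
        self._counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0                             # sum of all observed durations

    def observe(self, seconds):
        # bisect_left finds the first bound >= seconds, i.e. the smallest bucket that fits.
        self._counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total += seconds

    def cumulative(self):
        """Return {upper_bound: observations <= upper_bound}, ending with +Inf."""
        out, running = {}, 0
        for bound, count in zip(self.bounds + [float("inf")], self._counts):
            running += count
            out[bound] = running
        return out


def timed(histogram, handler):
    """Record how long handler() takes, mirroring a before/after request hook pair."""
    start = time.perf_counter()
    try:
        return handler()
    finally:
        histogram.observe(time.perf_counter() - start)
```

Because the buckets are cumulative, a p95 estimate comes from finding the first bucket whose count covers 95% of the total, which is why the choice of bucket bounds limits how precise the latency percentiles on a dashboard can be.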
Why Prometheus + Grafana? Best open-source combo—very effective and free!
What are some alternatives? OpenTelemetry is a strong contender (Enterprise-grade metrics, logging, and monitoring). However, given the hardware constraints (2-core CPU, 2GB RAM), OpenTelemetry would be overkill for this app.
6. Alerting Rules
- Alertmanager: Configured to ping me for abnormalities (high 400+ error rates, high p95 latency, degraded DB, etc.).
- Rules Logic: Initialized selective rules (e.g., alert on high p95 over 5 minutes rather than single spikes).
- Integration: Added a Discord Webhook as the point of contact.
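The "sustained, not a single spike" idea can be sketched in a few lines. This is only one reading of the rule logic, assuming the alert should fire when p95 stays above a threshold for a continuous five minutes; the real rules live in Prometheus and Alertmanager config, and `SustainedAlert` with its parameters is a hypothetical illustration.

```python
class SustainedAlert:
    """Fire only when a condition has held continuously for `for_seconds`."""

    def __init__(self, threshold, for_seconds=300.0):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self._breach_started = None  # timestamp of the first consecutive breach

    def evaluate(self, value, now):
        """Feed one sample (e.g. p95 latency in seconds) taken at time `now`; return True if firing."""
        if value <= self.threshold:
            self._breach_started = None  # condition cleared: reset the timer
            return False
        if self._breach_started is None:
            self._breach_started = now
        return now - self._breach_started >= self.for_seconds
```

A single slow scrape resets as soon as the next sample is healthy, so the Discord webhook only gets pinged for problems that persist.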
7. CI/CD & Deployment
- Host: Deployed to a DigitalOcean Droplet.
- Containerization: Containerized all components, performed local dry runs, and included a `deploy.sh` script.
- Workflows: Wrote `ci.yml` and `deploy.yml` files.
- Discovery: Containerized the NGINX proxy for the first time and didn't realize how easy that was!
- Effort: Spent a significant amount of time here, despite this not being my "first rodeo."
8. Miscellaneous
- Deployment Strategy: Attempted Blue-Green deployment for 99.99% uptime. It failed miserably due to the lack of Kubernetes; stick to in-place deployment for lower downtime in this setup.
- Log Aggregation: Added Loki + Promtail to view logs without SSH-ing into the server.
- Resource Monitoring: Added `cadvisor` to monitor VPS usage; discovered I could safely run 4 Gunicorn workers instead of 2.
- Security: Used a Docker network bridge to expose only client-facing services (Server, Grafana) while hiding internal services from port discovery.
Why Loki + Promtail? It took longer to set up than a simple tool like Dozzle, but the built-in Grafana dashboard integration and container-based querying make it much more powerful than a basic logging snapshot.
9. What’s Next
- Scalability: The "elephant in the room." The app struggles to maintain sub-5s p95 latency with 50+ concurrent users.
- Cache Optimization: The dashboard shows only a 50% hit rate—this should be much higher for a URL shortener.
- Task Queue: Implement a proper distributed worker for the `/users/bulk` endpoint using a DLQ (Dead Letter Queue) and SSE (Server-Sent Events) to notify clients instead of polling.
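Since the task queue is still on the to-do list, the following is only a sketch of the dead-letter idea, not a design decision: jobs that keep failing are parked instead of being retried forever. It runs in a single process with the standard library; `run_jobs`, `handler`, and `max_attempts` are hypothetical names, and a real setup would use a proper broker and worker.

```python
from collections import deque


def run_jobs(jobs, handler, max_attempts=3):
    """Process jobs in order; retry failures, and park jobs that keep failing in a dead-letter list."""
    pending = deque((job, 1) for job in jobs)  # (job, attempt number)
    done, dead_letters = [], []
    while pending:
        job, attempt = pending.popleft()
        try:
            handler(job)
            done.append(job)
        except Exception as exc:
            if attempt >= max_attempts:
                dead_letters.append((job, repr(exc)))  # give up, keep the job for inspection
            else:
                pending.append((job, attempt + 1))     # requeue at the back for another try
    return done, dead_letters
```

The dead-letter list is what makes failures visible: each entry keeps the job and the last error, so a bad row in a bulk upload can be inspected or replayed later without blocking the rest of the batch.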