As a full-stack developer and DevOps engineer, having deep visibility into all aspects of your application performance is critical. Understanding metrics like request rates, error rates, response times, resource usage etc. allows you to optimize bottlenecks and quickly resolve issues.

In this comprehensive guide, I will share my perspective on best practices for monitoring Python applications with Prometheus.

Why Prometheus?

Over the last several years, Prometheus has emerged as the de facto standard open source monitoring and alerting solution for modern cloud-native applications.

Here are some key capabilities that make Prometheus an ideal choice:

Multi-dimensional data model: Prometheus has an extremely powerful data model that lets you slice and dice metrics across multiple dimensions like instance, endpoint, and response code. This is invaluable when troubleshooting sporadic issues.

Customizable dashboards: While Prometheus has a built-in expression browser, tools like Grafana provide unlimited flexibility for custom monitoring dashboards to suit your needs.

Efficiency: The Prometheus server has a modest resource footprint; a small deployment can run comfortably in a few hundred megabytes of RAM, with durable time series storage on local disk.

Alerting: Complex alerting rules can be configured to send alerts only when critical conditions are met based on multiple metric criteria.

Scalability: Prometheus was designed from the ground up for monitoring large-scale environments with thousands of instances. Its federation capabilities make it possible to monitor even larger environments.

Simply put, you get an enterprise-grade solution for free that scales from a single server to large fleets, while being easier to operate than many commercial products.

Architecture Overview

At a high level, monitoring Python apps with Prometheus involves:

  1. Instrumenting application code via Prometheus client libraries to expose metrics
  2. Scraping and storing metrics in a Prometheus server
  3. Analyzing metrics through Prometheus UI and Grafana dashboards
  4. Creating alerting rules for notifications

The Python client libraries expose an easy API to create Counters, Gauges, Summaries and Histograms. The Prometheus server scrapes these over HTTP when it discovers the app endpoints.

Here is a diagram depicting the workflow:

[Diagram: Prometheus monitoring workflow]

Now that you understand the basics, let's go through the components one by one.

Installing Prometheus Server

The Prometheus server can be easily deployed using docker with just one command:

docker run -d --name=prometheus -p 9090:9090 prom/prometheus  

This launches Prometheus in a docker container with the following defaults:

  • Local data stored in /prometheus/data (persists container restarts; mount a volume to survive container removal)
  • Metrics exposed on port 9090
  • Web UI available on port 9090

The default config monitors just the Prometheus server itself, exposing stats like its own memory usage, scrape durations, etc.

In Kubernetes-based environments, service discovery mechanisms can detect targets automatically with little additional configuration.

For our demo, we will add scrape targets manually pointing to our sample Python app which will be deployed separately.
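A common way to supply a custom config to the containerized server is to mount a prometheus.yml from the host into the container (the host path here is illustrative):

```
docker run -d --name=prometheus -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
```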

Python Prometheus Client

The Prometheus Python client library is available as the prometheus_client package, installable via pip.

It has no external dependencies, making it very easy to integrate:

pip install prometheus_client

The client exposes a metrics registry that can be used to create Counters, Gauges, Summaries and Histograms, along with an exposition endpoint that the server scrapes.
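Before wiring metrics into a web framework, here is a minimal standalone sketch of the exposition endpoint using the client's built-in start_http_server helper (the port and metric name are illustrative, not from the sample app):

```python
from prometheus_client import Gauge, start_http_server

# Illustrative metric; the name is an assumption for this sketch
QUEUE_DEPTH = Gauge('app_queue_depth', 'Current number of queued jobs')

if __name__ == '__main__':
    # Serves the Prometheus text format at http://localhost:8000/metrics
    # from a background daemon thread
    start_http_server(8000)
    QUEUE_DEPTH.set(42)
```

Any metric registered in the default registry is automatically included in the /metrics output.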

Instrumentation Example

Here is a sample Flask application that illustrates what instrumenting with a Counter and a Histogram looks like:

from flask import Flask, Response

from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import random
import time

app = Flask(__name__)

REQUEST_COUNT = Counter('app_requests_total', 'Total Request Count')
RESPONSE_TIME = Histogram('app_response_time_seconds', 'Response time distribution')

@app.route('/')
def hello():
    REQUEST_COUNT.inc()

    # Simulate random response time
    rand = random.random()
    time.sleep(rand)

    # Observe request time
    RESPONSE_TIME.observe(rand)

    return "Hello World!"

@app.route('/metrics')
def metrics():
    # Exposition endpoint that the Prometheus server scrapes
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=80)

This defines two metrics:

  • app_requests_total: Increments per request
  • app_response_time_seconds: Observes response time

That's all that's needed to instrument a Flask application. A few lines of code give deep visibility. More advanced instrumentation would involve:

  • Breaking down requests counter across endpoints
  • Adding error counters
  • Capturing outlier response times

But this shows just how easy the Python Prometheus client makes the process.
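As a sketch of the first item above, per-endpoint breakdowns are done with labels. The metric name and label set here are illustrative, not part of the sample app:

```python
from prometheus_client import Counter

# Hypothetical labeled counter: one time series per (endpoint, status) pair
REQUESTS = Counter(
    'app_requests_by_endpoint_total',
    'Requests broken down by endpoint and HTTP status',
    ['endpoint', 'status'],
)

def record_request(endpoint, status):
    # labels() selects (or creates) the child series for this combination
    REQUESTS.labels(endpoint=endpoint, status=str(status)).inc()

record_request('/', 200)
record_request('/', 200)
record_request('/login', 500)
```

In PromQL this enables queries like sum by (endpoint) (rate(app_requests_by_endpoint_total[1m])). Keep label cardinality low: every unique label combination becomes its own time series.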

Now let's look at how the server configuration works.

Configuring Prometheus Server

For Prometheus to collect metrics from our example app, we need to add the app as a target.

The Prometheus config file prometheus.yml accepts a list of scrape configurations each having a set of targets.

So for our app listening on port 80, we can add:

scrape_configs:
  - job_name: 'pythonapp'
    static_configs:
      - targets: ['localhost:80']

This defines our app as part of the 'pythonapp' job with the metrics exposed on our local port 80. Multiple apps can be added under the same job.

Once the file is saved under /etc/prometheus in the container and the configuration is reloaded, our app's metrics will start being scraped automatically, every 15 seconds by default.

The new target then shows up under Status > Targets.
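Two common ways to trigger the reload, assuming the container is named prometheus as above:

```
# Send SIGHUP to the Prometheus process
docker kill --signal=HUP prometheus

# Or, if the server was started with --web.enable-lifecycle:
curl -X POST http://localhost:9090/-/reload
```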

Prometheus Query Language

One of the most powerful Prometheus capabilities is the ability to slice and dice through metrics using the Prometheus Query Language (PromQL).

Some examples of analysis that can be done on metrics collected from our sample app:

Total Requests

app_requests_total

Requests per second

rate(app_requests_total[1m]) 

99th Percentile Latency

histogram_quantile(0.99, rate(app_response_time_seconds_bucket[5m]))

The syntax allows mathematical and boolean operators on metrics enabling complex analysis.
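For example, assuming a request counter with a status label (not present in the minimal app above, but shown in the labeled-counter sketch earlier), an error-ratio query divides one rate by another:

```
sum(rate(app_requests_by_endpoint_total{status=~"5.."}[5m]))
  /
sum(rate(app_requests_by_endpoint_total[5m]))
```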

While running ad-hoc queries in the UI is helpful during investigations, standing up a Grafana instance connected to Prometheus data source allows creating persistent dashboards tailored to application needs.

Grafana for Metrics Visualization

Grafana has become the standard tool in monitoring stacks for building custom dashboards tied to data sources like Prometheus.

The out-of-the-box support for Prometheus in Grafana makes it very easy to get started. Simply:

  1. Install Grafana
  2. Add Prometheus data source
  3. Create dashboards/panels for target metrics

And you instantly get beautiful graphs displaying metrics like requests per second:

[Screenshot: Grafana panel showing requests per second]

Some types of panels you can build on Python app metrics:

Timeseries: Plots metric value over time

Graph: Displays metric correlation

Bar gauge: Indicates threshold crossing

Heatmaps: Visualize latency distribution

Tables: For a metric summary

The customization options are endless allowing you to build the exact dashboards for visibility needs.

Here I've put together a sample dashboard with 4 panels:

[Screenshot: sample dashboard with 4 panels]

  • Requests per second timeseries
  • Average response time gauge
  • Response time 95th percentile indicator
  • Latency distribution heatmap

The power here is the ability to slice and dice easily by endpoints, instances etc. using template variables.

Recording Rules for Aggregated Views

On high-traffic sites ingesting billions of samples each day, computing rates and latency percentiles directly from raw metrics can get expensive.

This is where recording rules come in very handy.

Recording rules allow pre-computing frequently needed aggregated views from raw metrics to optimize querying.

For example:

groups:
  - name: app_rules
    rules:
      - record: app_requests_per_second
        expr: rate(app_requests_total[1m])

This pre-computes the per-second request rate, saving that processing on every query. Any frequently used complex PromQL expression can be simplified in this way. (Rule files must nest rules under a named group, as shown, and are referenced from prometheus.yml via rule_files.)

Think of recording rules as OLAP cubes for metrics allowing fast access to aggregations.

Alerting Rules for Real-Time Notifications

Metrics and dashboards provide insight only when actively looked at. To get real-time notifications when critical conditions occur, Prometheus alerting rules need to be leveraged.

Here is a sample rule:

groups:
  - name: app_alerts
    rules:
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(app_response_time_seconds_bucket[1m])) > 0.25
        for: 1m

This fires an alert if the 99th percentile latency exceeds 250ms continuously for 1 minute.

Multiple criteria can be combined with boolean logic to trigger alerts only during actual fault scenarios.

The alerts can be routed to email, PagerDuty, Slack etc. with further integration with OpsGenie or ServiceNow possible for incident management.
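Routing itself is handled by Alertmanager, which is configured separately from Prometheus. A minimal sketch of an alertmanager.yml routing everything to a hypothetical Slack webhook:

```
route:
  receiver: slack-notifications

receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # placeholder webhook URL
        channel: '#alerts'
```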

Containerizing Python Apps

In production scenarios, Python apps are typically deployed as containers orchestrated by Kubernetes or Docker Swarm.

The good news is Prometheus works seamlessly in such containerized environments. Using a Prometheus discovery mechanism like file based, DNS or Kubernetes service discovery, targets can be detected automatically.

A common convention is to annotate pods with the following:

prometheus.io/scrape: "true"
prometheus.io/path: "/metrics"
prometheus.io/port: "8000"

This allows Prometheus to seamlessly pick up targets as apps scale up and down.
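A sketch of the scrape config that honors these annotations via Kubernetes service discovery (this relabeling pattern is a widely used convention, not behavior built into Prometheus):

```
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Override the metrics path if annotated
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Override the scrape port if annotated
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```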

Optimizing Prometheus for Scale

For high scale production workloads, Prometheus needs to be configured taking the following into account:

Federation: Sharding Prometheus by function, such as app vs database vs infrastructure, allows distributed scraping suitable even for clusters with tens of thousands of instances.

Remote storage: Local storage is best suited to holding the last few weeks of data. For long-term retention, remote write to systems like Thanos, Cortex or Mimir, which can be backed by object stores such as AWS S3, needs to be configured.

Recording rules: As discussed earlier, pre-aggregating metrics avoids heavy-duty PromQL querying. Dashboards should be connected to aggregated views rather than raw metrics.

Alert optimization: Too many unnecessary alerts cause alert fatigue and erode trust in the system. Carefully evaluate rules so they fire only during real infrastructure or application issues.

There are a whole host of optimizations worth mastering before running large deployments.

Closing Thoughts

I hope this post served as a good introduction to your journey with instrumenting, monitoring, visualizing and alerting on key Python application metrics using the Prometheus stack.

Mastering these concepts can save your team countless hours through faster troubleshooting of performance issues, and prevent revenue loss by letting you fix problems immediately.

Do check out my other in-depth posts on advanced Prometheus metrics analysis and building performant Prometheus deployments as you grow more proficient.
