Elasticsearch is one of the most popular open-source search and analytics engines used by developers. Its scalability and rich feature set make Elasticsearch a great fit for full-text search, log analytics, infrastructure monitoring, application performance management, and more.

This comprehensive technical guide will cover how to integrate Elasticsearch into Python applications and leverage it effectively in a production environment.

Overview of Elasticsearch

Here's a quick overview of Elasticsearch's capabilities:

  • Distributed document store where each "document" is a JSON object
  • Schemaless design, handles structured, unstructured and time-series data
  • Powerful search APIs with support for full-text, geo, filtering, faceting and more
  • Near real-time search and analytics with sub-second latency
  • Horizontally scalable to hundreds of servers and petabytes of data
  • High availability via replica shards, with no single point of failure (Elasticsearch favors availability and speed over strong consistency)

Under the hood, Elasticsearch builds on Apache Lucene and layers distributed capabilities on top with minimal setup required.

The following architecture diagram gives an overview of Elasticsearch:

            Client
              ^
              |
       Request/Response
              |
              v
 +---------------------------+
 |           Node            |
 |---------------------------|
 | Documents | Index | Cache |
 |---------------------------|
 |       Lucene Index        |
 +---------------------------+
              ^
              |
            Shards
              |
              v
 +---------------------------+
 |           Node            |
 |---------------------------|
 | Documents | Index | Cache |
 |---------------------------|
 |       Lucene Index        |
 +---------------------------+
Now let's see how we can leverage all this power from Python.

Installing Elasticsearch

I prefer to run Elasticsearch in Docker for development and testing:

docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" elasticsearch:7.9.2

This spins up Elasticsearch 7.9.2 in a Docker container and exposes the HTTP (9200) and transport (9300) ports. I've configured it in single-node mode for simplicity.

For production, you should deploy a multi-node Elasticsearch cluster with proper security, access control and hardware sizing based on your data volume and query patterns.

Installing the Python Client

The elasticsearch-py module allows connecting to Elasticsearch from Python code:

pip install elasticsearch

It is well-maintained by the Elasticsearch team and supports all the latest features.

Connecting to Elasticsearch

Connecting to our Elasticsearch container from Python is straightforward:

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

We can verify connectivity:

if es.ping():
    print('Connected')
else:
    print('Connection failed')

And check the Elasticsearch version details:

print(es.info())

With our single-node Docker setup, there is only one Elasticsearch endpoint. In production at scale, you would connect through a load balancer or pass the client a list of node hosts.

Now let's explore the developer experience…

Indexing Data

To index data for search, we can use the index() API and point it at the target index:

from datetime import datetime

doc = {
    'author': 'Mary Doe',
    'text': 'Sample document text',
    'timestamp': datetime.now()
}

res = es.index(index="my_index", body=doc)
print(res['result'])

Indexing performance tips:

  • Keep index documents under 1MB
  • Avoid indexing duplicates which increase index size without improving search
  • Index similar document types under the same index for efficiency
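Beyond these tips, bulk indexing is usually the single biggest throughput win: the Python client ships a `helpers.bulk` utility that batches many documents into one request. A minimal sketch, with the index name and documents made up for illustration:

```python
def build_actions(docs, index_name="my_index"):
    """Generate bulk actions in the shape expected by elasticsearch.helpers.bulk."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}

docs = [
    {"author": "Mary Doe", "text": "First sample document"},
    {"author": "John Doe", "text": "Second sample document"},
]

actions = list(build_actions(docs))
print(actions[0]["_index"])  # → my_index

# Against a live cluster you would run:
#   from elasticsearch import helpers
#   helpers.bulk(es, build_actions(docs))
```

Batching a few hundred to a few thousand documents per bulk request is a common starting point; tune the batch size against your document sizes and cluster capacity.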

We can customize field mappings and data types in the Elasticsearch index based on how we want to model and query the data.
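For example, an explicit mapping can be supplied when creating the index. The field names below are assumptions matching the earlier sample document:

```python
# Explicit mapping: "text" fields are analyzed for full-text search,
# "keyword" fields are stored verbatim for exact matches and aggregations,
# and "date" fields enable range queries and date histograms.
mapping = {
    "mappings": {
        "properties": {
            "author": {"type": "keyword"},
            "text": {"type": "text"},
            "timestamp": {"type": "date"},
        }
    }
}

print(sorted(mapping["mappings"]["properties"]))

# Against a live cluster:
#   es.indices.create(index="my_index", body=mapping)
```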

Searching Data

The simplest search syntax is the query string query over all fields:

res = es.search(index="my_index", body={"query": {"query_string": {"query": "sample text"}}})
print("Got %d hits:" % res['hits']['total']['value'])
for hit in res['hits']['hits']:
    print(hit["_source"])

This searches my_index for "sample text" and returns matching hits.

More advanced features such as boolean operators, fuzziness, and term boosting can be added to refine search relevance.
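As a sketch, a match query using fuzziness and a boost might look like this (the field name and query text are illustrative):

```python
# Fuzzy, boosted match query: "fuzziness": "AUTO" tolerates small typos,
# and "boost" weights matches on this clause more heavily in scoring.
fuzzy_query = {
    "query": {
        "match": {
            "text": {
                "query": "sampel text",  # typo on purpose; fuzziness still matches "sample"
                "fuzziness": "AUTO",
                "boost": 2.0,
            }
        }
    }
}

print(fuzzy_query["query"]["match"]["text"]["fuzziness"])  # → AUTO

# Against a live cluster:
#   es.search(index="my_index", body=fuzzy_query)
```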

Here is an example search request body combining full-text matching with structured filters:

{
  "query": { 
    "bool": {
      "must": [
        { "match": { "text":   "full text search"}}
      ],
      "filter": [ 
        { "range": { "date": { "gte": "2020-01-01" }}},
        { "term":  { "user": "kimchy"}}
      ]
    }
  }
}  

This does a full text search for "full text search", and filters by date range and exact term match.
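From Python, that request body is just a dict passed to `es.search()`. Here is a small helper for pulling documents out of a response; the sample response below is fabricated, trimmed down to show the shape a 7.x cluster returns:

```python
def parse_hits(res):
    """Extract the total hit count and source documents from a search response."""
    total = res["hits"]["total"]["value"]
    docs = [hit["_source"] for hit in res["hits"]["hits"]]
    return total, docs

# Fabricated response illustrating the 7.x hits structure:
sample_response = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [{"_source": {"user": "kimchy", "text": "full text search demo"}}],
    }
}

total, docs = parse_hits(sample_response)
print(total, docs[0]["user"])  # → 1 kimchy

# Against a live cluster:
#   total, docs = parse_hits(es.search(index="my_index", body=search_body))
```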

Search Best Practices

  • Analyze index usage and optimize search queries
  • Add correct data mappings to optimize text search
  • Finetune ranking by promoting key content signals
  • Store statistical data to drive better relevance signals

Properly tuned search delivers higher quality results and minimizes latency.

Analytics and Aggregations

Elasticsearch supports aggregations for analytics use cases:

{
  "size": 0,
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field": "timestamp",
        "calendar_interval": "day"
      }
    }
  }
} 

This calculates daily aggregates over the timestamp field without returning documents. The response contains per-day counts, useful for metrics and visualizations.
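From Python, those per-day buckets come back under `res["aggregations"]`. A sketch of reading them, using a fabricated response to show the shape:

```python
def daily_counts(res):
    """Map each date_histogram bucket to a (day, doc_count) pair."""
    buckets = res["aggregations"]["by_day"]["buckets"]
    return [(b["key_as_string"], b["doc_count"]) for b in buckets]

# Fabricated response illustrating the date_histogram bucket structure:
sample_response = {
    "aggregations": {
        "by_day": {
            "buckets": [
                {"key_as_string": "2020-01-01", "doc_count": 12},
                {"key_as_string": "2020-01-02", "doc_count": 7},
            ]
        }
    }
}

print(daily_counts(sample_response))

# Against a live cluster:
#   res = es.search(index="my_index", body=agg_body)
#   counts = daily_counts(res)
```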

Aggregations open up many analytics possibilities:

  • Report on trends – usage over time, sales by product etc
  • Summarize data – top keywords, average rating, 95th percentile order value
  • Generate charts and graphs for rich dashboards

For complex reporting, I recommend using Kibana, which provides great out-of-the-box visualizations over Elasticsearch data.

Monitoring Elasticsearch

Having visibility into Elasticsearch performance and health is critical in production.

Some useful stats APIs:

es.cat.indices()    # List indices
es.cat.nodes()      # View server nodes
es.cat.health()     # Get cluster health

es.indices.stats()  # Index metrics
es.nodes.stats()    # Node usage metrics

These provide metrics on data volume, document counts, CPU/Memory usage, node availability and more.

Setting up index usage reporting and cluster alerts is essential:

  • Track index growth over time
  • Set thresholds for node capacity usage e.g. 50%
  • Get alerts for cluster yellow/red events

This allows allocating sufficient capacity and investigating issues early.
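A simple polling check along these lines might look as follows; the threshold and the alerting logic are illustrative, not a prescribed setup:

```python
def should_alert(cluster_status, disk_used_pct, disk_threshold=50):
    """Return a list of alert messages for cluster status and capacity thresholds."""
    alerts = []
    if cluster_status in ("yellow", "red"):
        alerts.append(f"cluster status is {cluster_status}")
    if disk_used_pct >= disk_threshold:
        alerts.append(f"disk usage {disk_used_pct}% over {disk_threshold}% threshold")
    return alerts

print(should_alert("yellow", 62))

# Against a live cluster, the inputs would come from the stats APIs, e.g.:
#   status = es.cluster.health()["status"]
#   disk usage derived from es.nodes.stats() filesystem metrics
```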

Commercial offerings such as Elastic's paid tiers of the Elastic Stack provide richer capabilities here, including anomaly detection and predictive capacity planning.

Python Client Best Practices

When using the Python client in applications, keep these tips in mind:

  • Handle connection errors and timeouts gracefully
  • Set timeouts on requests to prevent stuck threads
  • Use connection pooling for frequent requests
  • Compress requests and responses to minimize network overhead
  • Consider a queue like Kafka for smoother ingestion spikes

Here is sample code demonstrating some patterns:

import threading

import backoff
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConnectionError as ESConnectionError

es = Elasticsearch(
    hosts=[{"host": "localhost", "port": 9200}],
    timeout=30,
    max_retries=10,
    retry_on_timeout=True,
)

# Retry with exponential backoff on connection errors
@backoff.on_exception(backoff.expo, ESConnectionError)
def safe_indexing(doc):
    es.index(index="mylogs", body=doc)

for log in incoming_logs():
    threading.Thread(target=safe_indexing, args=(log,)).start()

This wraps the indexing API in error handling and sends requests concurrently to maximize throughput.

Conclusion

Elasticsearch is feature-packed out of the box and Python makes it even more accessible for developers. With some care around system design, Elasticsearch can power everything from simple document search to large scale analytics applications.

I hope this guide gave you a good foundation on integrating Elasticsearch into Python apps and operating it reliably in production. Reach out in the comments with any other questions!
