Elasticsearch provides a bulk API that allows executing multiple index, create, update, and delete operations with a single API request. Using the bulk API can help reduce overhead and improve indexing performance compared to making a separate request for each operation. This guide illustrates how to leverage the bulk API to perform multiple Elasticsearch operations efficiently.
Bulk API Overview
The Elasticsearch bulk API allows batching multiple index, create, update, and delete operations into a single API call. This is more efficient than issuing a separate request for each action.
To use the bulk API, you send an HTTP POST request to the _bulk endpoint with the operations defined in the request body in newline delimited JSON format.
Here is an example bulk request:
POST _bulk
{"index":{"_index":"myindex","_id":1}}
{"title":"My First Document"}
{"update":{"_index":"myindex","_id":1}}
{"doc":{"title":"My First Updated Document"}}
{"delete":{"_index":"myindex","_id":1}}
This request performs three actions in one call:
- Indexes a document
- Updates the document
- Deletes the document
The bulk API will process each action sequentially and return a response containing the status for each operation.
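To see the mechanics end to end, here is a minimal Python sketch that assembles the newline-delimited body for the request above. The `build_bulk_body` helper is illustrative, not part of any client library; the resulting string would be POSTed to `_bulk` with a `Content-Type: application/x-ndjson` header.

```python
import json

def build_bulk_body(operations):
    """Serialize (action, source) pairs into the newline-delimited
    JSON body that the _bulk endpoint expects."""
    lines = []
    for action, source in operations:
        lines.append(json.dumps(action))
        if source is not None:  # delete actions have no source line
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"  # the body must end with a newline

body = build_bulk_body([
    ({"index": {"_index": "myindex", "_id": 1}},
     {"title": "My First Document"}),
    ({"update": {"_index": "myindex", "_id": 1}},
     {"doc": {"title": "My First Updated Document"}}),
    ({"delete": {"_index": "myindex", "_id": 1}}, None),
])
# POST this string to /_bulk with Content-Type: application/x-ndjson
```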
Bulk Request Body Format
The request body for the Elasticsearch bulk API uses the following newline delimited JSON structure:
<action_and_meta_data>\n
<optional_source>\n
<action_and_meta_data>\n
<optional_source>\n
...
Looking closer at the anatomy of a bulk request:
- Action and metadata: Each action line must include the action type (index, create, update, or delete) and metadata such as the target index and document ID.
- Source: The optional source line contains the document source for index and create actions. For update actions, it contains the partial document.
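For completeness, a create action follows the same action-plus-source pattern as index; the difference is that create is rejected with a 409 version conflict if a document with that ID already exists:

```
{"create":{"_index":"myindex","_id":2}}
{"title":"Only created if _id 2 does not already exist"}
```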
Action and Metadata Parameters
The action and metadata lines include key information like the index, id, and action type:
{
"index": {
"_index": "myindex",
"_id": 1
}
}
This indexes a document with an ID of 1 in the myindex index.
The metadata supports the following parameters:
- _index – The target index name
- _id – The document ID
- _type – The mapping type, deprecated from ES 6 onward and removed in later versions
Update Action Source
The update action type allows partial document updates by passing the doc parameter:
{ "update": { "_index": "myindex", "_id": 1 } }
{ "doc" : { "title" : "Updated" } }
This will update just the title field to "Updated" in that document.
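The update action also supports upserts. With doc_as_upsert set, the partial document is indexed as a new document when the target ID does not exist yet, instead of the update failing:

```
{ "update": { "_index": "myindex", "_id": 2 } }
{ "doc": { "title": "Updated" }, "doc_as_upsert": true }
```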
You can see the basic gist – define actions with metadata, and documents as the sources. Now let's discuss when you should use the bulk API instead of single requests.
When to Use Bulk vs. Single Requests
The bulk API improves performance by bundling requests into a single call rather than individual API hits. This reduces network round trips and minimizes serialization costs.
Benchmarks of bulk versus individual indexing consistently show significant gains in indexing speed over single requests:
Depending on the number and size of documents indexed, bulk indexing can be several times faster than issuing individual requests. These gains require properly formatted bulk requests within reasonable size limits.
In general, it is best practice to use the bulk API when:
- Indexing a batch of new documents
- Making lots of creates, updates or deletes
- Reindexing large datasets during migration
For simple queries and returning single documents, standard gets are just fine. The bulk API excels at writing, modifying and deleting documents in bulk.
Handling Bulk Request Failures
When executing a bulk request, pay close attention to the response to handle failures properly. The bulk API response will return the status for each operation.
A sample response containing a failed delete:
{
"took":107,
"errors": true,
"items":[
{
"index":{
"_index":"myindex",
"_id":"1",
"_version":1,
"result":"created",
"_shards":{"total":2,"successful":1,"failed":0},
"status":201
}
},
{
"delete":{
"_index":"myindex",
"_id":"1",
"status":404,
"error":"document not found"
}
}
]
}
Note the 404 status and error for the delete action, indicating that the document was not found.
When a failure occurs:
- The bulk request will continue executing other actions
- The entire request will be marked "errors":true however
- You must handle the failed actions appropriately in code
This highlights the need to check for errors before assuming bulk actions succeeded. Retry or recovery logic often needs to be implemented in application code.
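A small Python sketch of this error handling: it walks the items array and collects the operations that failed, so they can be retried or logged. The `failed_items` helper and the canned response dict are illustrative, not part of any client library.

```python
def failed_items(bulk_response):
    """Collect failed operations from a _bulk response.

    Each item in "items" is a dict keyed by the action type (index,
    create, update, delete); a failure carries an "error" field or a
    status of 400 or higher."""
    failures = []
    for item in bulk_response.get("items", []):
        action, result = next(iter(item.items()))
        if "error" in result or result.get("status", 200) >= 400:
            failures.append((action, result))
    return failures

# Canned response shaped like the sample above (stand-in for a live call)
response = {
    "took": 107,
    "errors": True,
    "items": [
        {"index": {"_index": "myindex", "_id": "1",
                   "result": "created", "status": 201}},
        {"delete": {"_index": "myindex", "_id": "1",
                    "status": 404, "error": "document not found"}},
    ],
}
for action, result in failed_items(response):
    print(action, result["status"])  # candidates for retry or alerting
```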
Idempotent Bulk Requests
Idempotence means that an operation can be performed multiple times without changing the result. This is an important consideration with the bulk API.
To make bulk requests idempotent:
- Uniquely identify documents through _id rather than relying on auto-generated IDs
- Check the response codes to avoid assuming success
- Design the indexing system to handle duplicate requests
This prevents documents being indexed twice if a bulk request is made multiple times.
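One common way to get stable IDs is to derive _id deterministically from document content, so replaying the same bulk request overwrites rather than duplicates. This is a sketch; the `deterministic_id` helper name and the choice of key fields are assumptions about your data.

```python
import hashlib
import json

def deterministic_id(doc, fields=("title",)):
    """Derive a stable _id from chosen document fields so that
    re-sending the same bulk request updates the same document
    instead of creating a duplicate."""
    key = json.dumps({f: doc.get(f) for f in fields}, sort_keys=True)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

doc = {"title": "My First Document"}
action = {"index": {"_index": "myindex", "_id": deterministic_id(doc)}}
# Replaying the same request re-indexes under the same _id: no duplicates.
```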
Bulk Request Import Tools
In addition to raw bulk requests via clients, Elasticsearch provides additional tools for bulk importing and ingesting data:
Logstash Bulk Import
Logstash is a popular tool for collecting and transforming data before loading it into Elasticsearch. To bulk import from Logstash:
- Configure an elasticsearch output in the Logstash pipeline
- Logstash batches events and sends them to the Elasticsearch _bulk endpoint automatically
This streams the transformed data into Elasticsearch efficiently, with bulk requests handled under the hood.
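A minimal Logstash pipeline along these lines might look like the following; the host, index name, and stdin input are placeholders for your own sources:

```
input  { stdin { codec => json_lines } }
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "myindex"
  }
}
```

The elasticsearch output plugin accumulates events and flushes them to _bulk in batches, so no manual bulk formatting is needed.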
Kafka Bulk Import
Similarly, Kafka Connect can be used to stream data from Kafka topics directly into Elasticsearch using the bulk endpoint. This leverages Kafka's distributed streaming architecture for scalable ingestion pipelines.
There are also other open source libraries and tools that wrap the bulk API and simplify sending bulk requests from various data stores.
Performance Considerations
When importing data, be aware of the following performance considerations:
Thread Pool
Bulk requests run on a dedicated thread pool (named bulk in older releases, merged into the write pool in Elasticsearch 6.3+). Its size and queue depth can be tuned to scale ingestion:
thread_pool.write.size: 16
thread_pool.write.queue_size: 1000
Refresh Setting
Raise the refresh interval (or disable refresh with -1) while bulk loading, so the cluster spends less time making new documents searchable during the import, then restore the setting afterwards:
POST /my_index/_settings
{
"refresh_interval": "30s"
}
Document Size
Average document size should be kept reasonable to avoid hitting the HTTP request size limit or causing out of memory errors:
- Target 10-15MB payload sizes
- Avoid documents over 100KB
Properly tuning based on document count, size, and throughput requirements ensures stable bulk indexing performance.
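Chunking logic that respects a payload budget can be sketched in Python; `chunk_by_size` is an illustrative helper, not an Elasticsearch API, and the 10 MB default follows the guidance above.

```python
import json

def chunk_by_size(pairs, max_bytes=10 * 1024 * 1024):
    """Split (action, source) pairs into newline-delimited bulk
    bodies no larger than max_bytes each."""
    chunk, size = [], 0
    for action, source in pairs:
        lines = [json.dumps(action)]
        if source is not None:  # delete actions have no source line
            lines.append(json.dumps(source))
        nbytes = sum(len(l.encode("utf-8")) + 1 for l in lines)
        if chunk and size + nbytes > max_bytes:
            yield "\n".join(chunk) + "\n"
            chunk, size = [], 0
        chunk.extend(lines)
        size += nbytes
    if chunk:
        yield "\n".join(chunk) + "\n"

# 1000 small documents split into ~4 KB request bodies
docs = [({"index": {"_index": "myindex"}}, {"n": i}) for i in range(1000)]
bodies = list(chunk_by_size(docs, max_bytes=4096))
```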
Monitoring Bulk Requests
To monitor running bulk requests, metrics can be checked through the _stats and _tasks APIs:
Bulk Statistics
GET /_stats/bulk
Returns cumulative bulk request counts, payload sizes, and timing statistics per index.
Active Tasks API
GET /_tasks?detailed=true&actions=*bulk
This lists active bulk import tasks including status and runtime statistics.
Additional metrics like thread pool utilization, segment counts, and search latency indicate how well clusters are handling bulk import workloads.
Paginating Bulk Requests
Very large bulk imports can be broken into pages to avoid overloading the cluster. There are two common approaches:
Scrolling
The scroll API can walk over an entire index, exporting documents into bulk files for re-importing:
POST /_search?scroll=1m
{
"size": 10000,
"query": {
"match_all": {}
}
}
This scrolls 10K documents per page, which can be exported and re-imported via the bulk API.
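The scroll loop itself can be sketched as a generator. Here `initial_search` and `continue_scroll` are assumed wrappers around whatever HTTP client you use to call `_search?scroll=1m` and `_search/scroll`; the demo below feeds in canned responses in place of a live cluster.

```python
def scroll_pages(initial_search, continue_scroll):
    """Yield each page of hits from a scrolled search until the
    cluster returns an empty page."""
    page = initial_search()
    while True:
        hits = page["hits"]["hits"]
        if not hits:
            break
        yield hits
        # _search/scroll takes the scroll_id from the previous page
        page = continue_scroll(page["_scroll_id"])

# Canned responses standing in for a live cluster:
pages = [
    {"_scroll_id": "s1", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}},
    {"_scroll_id": "s2", "hits": {"hits": [{"_id": "3"}]}},
    {"_scroll_id": "s3", "hits": {"hits": []}},
]
it = iter(pages)
collected = [h["_id"]
             for hits in scroll_pages(lambda: next(it), lambda sid: next(it))
             for h in hits]
```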
Search After
The search_after parameter paginates deep result sets by passing the sort values of the last document from the previous page. It is sent in the request body and requires a sort with a tiebreaker field:
POST /_search
{
"size": 1000,
"query": {
"match_all": {}
},
"sort": [{"_id": "asc"}],
"search_after": ["996"]
}
This fetches the next 1K documents after the document whose sort value is "996".
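Building the follow-up request from the previous page can be sketched like this; `next_search_body` and the timestamp field are illustrative, and the tiebreaker field in the sort is what makes the pagination deterministic.

```python
def next_search_body(size, sort, last_hit=None):
    """Build a search request body; when last_hit is given, resume
    after it using its sort values via search_after."""
    body = {"size": size, "query": {"match_all": {}}, "sort": sort}
    if last_hit is not None:
        body["search_after"] = last_hit["sort"]  # sort values of the last hit
    return body

sort = [{"timestamp": "asc"}, {"_id": "asc"}]  # _id as tiebreaker
first = next_search_body(1000, sort)
# Each hit in the response carries a "sort" array to resume from:
last_hit = {"_id": "996", "sort": [1620000000, "996"]}
nxt = next_search_body(1000, sort, last_hit)
```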
Paginating using scrolling or search after works well for breaking up huge bulk imports.
Summary of Bulk API Benefits
Some key benefits of using Elasticsearch's bulk API:
- Increased throughput – Bulk consolidates requests for better network and CPU efficiency
- Faster indexing – Documents index substantially faster than individual create/index calls
- Failure isolation – A failed operation won't prevent other actions in the bulk request from executing (note that the request as a whole is not atomic)
- Background ingestion – Imports can run in the background without blocking searches
When indexing, updating, or deleting batches of documents, the bulk API should be leveraged to improve throughput and reduce latency.
Conclusion
The Elasticsearch bulk API provides an efficient method for performing multiple index, create, update, and delete operations within a single request. Batching actions reduces overhead and improves document indexing speed compared to individual requests.
This guide covered bulk API syntax, performance comparisons, error handling, tools, best practices and considerations when importing or ingesting data in bulk.
Key takeaways include:
- Structure requests using newline delimited JSON
- Check responses for failed actions
- Tune threads and memory for large imports
- Use Logstash/Kafka pipelines for scalable ingestion
- Monitor performance using statistical APIs
Refer to the Elasticsearch documentation on the bulk API for additional technical details.


