Description
To ensure that only the relevant data is iterated during query execution, we suggest collocating related data in the same segment or group of segments. The group of a segment is determined by a grouping criteria function. The goal is to align segment boundaries with anticipated query patterns, so that documents frequently queried together reside in the same segments. For example, in log analytics scenarios, users often query for anomalies (logs with 4xx and 5xx status codes) rather than success logs (2xx). By applying a grouping criteria function based on status code, anomaly logs and success logs are segregated into distinct segments (or groups of segments). Search queries like “number of faults in the last hour” or “number of errors in the last three hours” then become more efficient, because they only need to process segments with 4xx or 5xx status codes, which form a much smaller dataset.
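As an illustration of the status code example, a grouping criteria function could look like the sketch below. The class name, method name and group labels are hypothetical, and the "other" bucket is an assumption, since the description only covers 2xx, 4xx and 5xx codes.

```java
/** Hypothetical grouping criteria function: maps an HTTP status code to a group name. */
final class StatusCodeGrouping {
    private StatusCodeGrouping() {}

    static String groupFor(int statusCode) {
        if (statusCode >= 400 && statusCode <= 599) {
            return "anomaly";   // 4xx and 5xx logs share one group of segments
        }
        if (statusCode >= 200 && statusCode <= 299) {
            return "success";   // 2xx logs live in a separate group
        }
        return "other";         // assumption: the description does not say where other codes go
    }
}
```

With such a function, a query for faults only needs to visit segments tagged with the "anomaly" group.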
In this approach, we maintain context aware segments using a pool of disposable IndexWriters. We model the disposable IndexWriters on how Lucene models DWPTs (DocumentsWriterPerThread) inside an IndexWriter. At any point in time, OpenSearch maintains a map of IndexWriters, one for each group, along with a common accumulating IndexWriter. An indexing request for a document belonging to a group is redirected to the respective group specific IndexWriter. Operations such as opening a reader for search or getting checkpoints during replication or snapshot are performed on the accumulating IndexWriter. To keep the accumulating IndexWriter in sync with the group specific child writers, we periodically sync the child writers with the accumulating writer during refresh via IndexWriter’s addIndexes API call.
Disposable IndexWriters
All write operations will now be handled by a pool of group specific disposable IndexWriters. These disposable IndexWriters will be modelled after Lucene’s DWPTs.
States of disposable IndexWriters
Similar to DWPTs, these disposable IndexWriters will have three states:
Active
IndexWriters in this state handle all write requests coming to InternalEngine. For each group/tenant, at most one IndexWriter is in the active state. OpenSearch maintains a mapping of active IndexWriters, each associated with a specific group. During indexing, the IndexWriter selected for a document depends on the result of evaluating the grouping criteria function on that document. If there is no active IndexWriter for a group, a new IndexWriter is instantiated for that group and added to the pool.
Mark for Refresh
During refresh, we transition all group specific active IndexWriters from the active pool to an intermediate refresh pending state. At this stage, these IndexWriters do not accept new writes, but continue to handle any ongoing operations.
Close
At this stage, OpenSearch syncs the content of the group specific IndexWriters with the accumulating parent IndexWriter via Lucene’s addIndexes API call. After the sync, we remove all group specific IndexWriters from the Mark for Refresh stage and close them.
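The three states can be read as a one-way lifecycle. The enum below is a minimal sketch of that lifecycle; the type and method names are hypothetical and only illustrate which transitions are allowed.

```java
/** Hypothetical lifecycle of a disposable, group specific IndexWriter. */
enum WriterState {
    ACTIVE, MARK_FOR_REFRESH, CLOSED;

    /** Only active writers are handed out for new write requests. */
    boolean acceptsNewWrites() {
        return this == ACTIVE;
    }

    /** States only move forward: ACTIVE -> MARK_FOR_REFRESH -> CLOSED. */
    WriterState next() {
        switch (this) {
            case ACTIVE:
                return MARK_FOR_REFRESH;
            case MARK_FOR_REFRESH:
                return CLOSED;
            default:
                throw new IllegalStateException("a closed writer cannot change state");
        }
    }
}
```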
CompositeIndexWriter
InternalEngine will now delegate all IndexWriter specific operations to a new CompositeIndexWriter class rather than interacting with IndexWriter directly. This wrapper serves as a unified interface that coordinates write operations with the group specific IndexWriters and manages read operations through the accumulating parent IndexWriter. It also takes care of syncing the group specific IndexWriters with the accumulating IndexWriter during refresh by implementing the RefreshListener interface.
In addition to managing group-specific IndexWriters, CompositeIndexWriter tracks all updates and deletions applied during each refresh cycle. This state is maintained in a refresh-rotating map structure analogous to LiveVersionMap’s implementation.

Class Diagram
Indexing
During indexing, CompositeIndexWriter first evaluates the group for the document using the grouping criteria function. The IndexWriter selected for indexing the document depends on the result. If the map has no IndexWriter entry for that group, a new IndexWriter is instantiated for the group and added to the map.
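A minimal sketch of this routing step, assuming the active writers are kept in a concurrent map keyed by group. GroupWriterPoolSketch and GroupWriter are hypothetical stand-ins; a real implementation would hold Lucene IndexWriter instances rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class GroupWriterPoolSketch<D> {
    /** Stand-in for a group specific IndexWriter; it only buffers documents. */
    static final class GroupWriter<T> {
        final String group;
        final List<T> buffered = new ArrayList<>();

        GroupWriter(String group) {
            this.group = group;
        }

        synchronized void addDocument(T doc) {
            buffered.add(doc);
        }
    }

    private final Function<D, String> groupingCriteria;
    private final Map<String, GroupWriter<D>> activeWriters = new ConcurrentHashMap<>();

    GroupWriterPoolSketch(Function<D, String> groupingCriteria) {
        this.groupingCriteria = groupingCriteria;
    }

    /** Evaluates the group, lazily creates its writer if missing, then indexes the document. */
    GroupWriter<D> index(D doc) {
        String group = groupingCriteria.apply(doc);
        GroupWriter<D> writer = activeWriters.computeIfAbsent(group, g -> new GroupWriter<>(g));
        writer.addDocument(doc);
        return writer;
    }

    int activeWriterCount() {
        return activeWriters.size();
    }
}
```

Using computeIfAbsent keeps the "at most one active writer per group" property even when several indexing threads hit a new group at the same time.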
Resolve version
InternalEngine currently resolves the current version of a document before indexing it, to determine whether the request is an index or an update request. It does this by first doing a lookup in the version map. In case no version of the document is present in the version map, it queries Lucene via a searcher to look for the current version of the document. Since the version map is maintained for an entire refresh cycle, there is no change in how versions are resolved in this approach. InternalEngine will look up the document first in the version map, and then query the documents associated with the parent IndexWriter.
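The two-step lookup can be sketched as follows. VersionResolverSketch is a hypothetical class; the searcher over the parent IndexWriter is abstracted as a function so the example stays self-contained.

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class VersionResolverSketch {
    /** Versions written during the current refresh cycle (stand-in for the version map). */
    private final Map<String, Long> versionMap = new ConcurrentHashMap<>();
    /** Stand-in for a searcher opened on the accumulating parent IndexWriter. */
    private final Function<String, OptionalLong> parentSearcherLookup;

    VersionResolverSketch(Function<String, OptionalLong> parentSearcherLookup) {
        this.parentSearcherLookup = parentSearcherLookup;
    }

    void recordIndexed(String id, long version) {
        versionMap.put(id, version);
    }

    /** Version map first; fall back to the parent writer's searcher only on a miss. */
    OptionalLong resolve(String id) {
        Long recent = versionMap.get(id);
        if (recent != null) {
            return OptionalLong.of(recent);
        }
        return parentSearcherLookup.apply(id);
    }

    /** A request is an update when some version of the document already exists. */
    boolean isUpdate(String id) {
        return resolve(id).isPresent();
    }
}
```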
Locks
OpenSearch currently uses a ReentrantReadWriteLock to ensure the underlying IndexWriter is not closed during active indexing. With context aware segments, we will use an extra lock for each IndexWriterMap inside CompositeIndexWriter.
During each write/update/delete operation, we take the read lock of the ReentrantReadWriteLock associated with this map. This lock is released when indexing completes. During refresh, we obtain the write lock on the same ReentrantReadWriteLock just before rotating the WriterMap. Since the write lock can be acquired only when no read lock is held on it, all the writers of a map are synced with the parent writer and closed only when there are no active writes happening on these IndexWriters.
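The locking pattern can be sketched with java.util.concurrent's ReentrantReadWriteLock. For brevity this sketch uses a single lock that guards whichever map is currently active, whereas the proposal describes one lock per map; LockedWriterMapSketch and its in-memory lists are hypothetical stand-ins for the real writers.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class LockedWriterMapSketch {
    private final ReentrantReadWriteLock mapLock = new ReentrantReadWriteLock();
    // Guarded by mapLock: read under the read lock, replaced under the write lock.
    private Map<String, List<String>> activeWriters = new ConcurrentHashMap<>();

    /** Write path: many indexing threads may hold the read lock at the same time. */
    void index(String group, String doc) {
        mapLock.readLock().lock();
        try {
            activeWriters
                .computeIfAbsent(group, g -> Collections.synchronizedList(new ArrayList<>()))
                .add(doc);
        } finally {
            mapLock.readLock().unlock();
        }
    }

    /** Refresh path: the write lock is granted only once no indexing call holds the read lock. */
    Map<String, List<String>> rotate() {
        mapLock.writeLock().lock();
        try {
            Map<String, List<String>> retired = activeWriters;
            activeWriters = new ConcurrentHashMap<>();
            return retired; // no in-flight writes remain, so these can be synced and closed
        } finally {
            mapLock.writeLock().unlock();
        }
    }
}
```

Because indexing only takes the read lock, concurrent writes do not block each other; they are paused only for the short moment in which the map is swapped.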
Updates and deletes
Current Scenario
OpenSearch currently handles both updates and deletes by performing a soft update on the document, where the previous version of the document is marked as soft deleted in the live docs bit set.
For Updates:
- OpenSearch makes a soft update document call on the IndexWriter.
- This softUpdate call indexes the new version of the document. It also attaches a DocValueUpdate node to the delete queue of the IndexWriter (global and local) containing the term (id) of the document being deleted. This removes any previous version of the document for this term during Lucene flush (OpenSearch refresh).
- When OpenSearch does a refresh, Lucene’s flush is called internally, after which the DirectoryReader is refreshed. During Lucene’s prepare flush call, all updates present in the DeleteQueue are frozen.
- These FrozenBufferedUpdates are applied when the DirectoryReader is reopened after Lucene’s flush. To apply these updates, we fetch the current live docs bit set. Then, for each update, we query the doc ids for its term (_id of the document being deleted). For each of these doc ids, we clear the corresponding bit in the live docs bit set.
For Deletes:
- A delete is similar to an update, except that the softUpdate call indexes a tombstone entry, along with adding a delete node for this term (document _id) to the delete queue.
Issue for updates with Context Aware Segments
For context aware segments, this works perfectly in case both indexing and updates happen on the same IndexWriter. The issue arises when indexing and updates happen on different IndexWriters, since the delete queue is now maintained and applied at the level of individual writers. For example, say there are three active IndexWriters, one for each group, and a single accumulating parent IndexWriter. Now an update request comes in for a document whose previous version exists in the parent accumulating IndexWriter. Say the group evaluated for this document is 3, so it is routed to IW3 for indexing.

After refresh, two versions of the document will exist in the parent accumulating IndexWriter, because the delete queued in IW3 is never applied to the previous version held by the parent.
Proposed Solution
We can eliminate the version synchronization complexity across IndexWriters discussed above by preventing updates to the document fields that determine the group of the document. This simplifies version synchronisation, as the latest version always exists in the group specific writer marked for refresh, and the previous version may or may not exist in the parent writer.
We continue to track updates/deletes inside CompositeIndexWriter using a map that records, per group, the id/term of each document on which an update has been performed. During refresh, for each update, if a previous version of the document exists in the parent writer, we delete it. After that, we sync the mark-for-refresh IndexWriter with the accumulating IndexWriter. Similarly, for each delete, we remove the term entry from both the parent writer and the group specific writer marked for refresh.
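The bookkeeping described above can be sketched with plain collections. UpdateSyncSketch and all of its members are hypothetical. Writers are modelled as in-memory lists, and "delete" is plain removal here, whereas the proposal soft deletes the previous version. A delete followed by a re-index of the same id within one cycle is not handled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class UpdateSyncSketch {
    static final class Doc {
        final String id;
        final long version;

        Doc(String id, long version) {
            this.id = id;
            this.version = version;
        }
    }

    final List<Doc> parent = new ArrayList<>();
    private final Map<String, List<Doc>> groupWriters = new HashMap<>();
    private final Map<String, Set<String>> updatedIds = new HashMap<>();
    private final Map<String, Set<String>> deletedIds = new HashMap<>();

    /** Indexes a new version in the group writer and remembers the id for the next refresh. */
    void update(String group, Doc doc) {
        List<Doc> writer = groupWriters.computeIfAbsent(group, g -> new ArrayList<>());
        writer.removeIf(d -> d.id.equals(doc.id)); // same-writer case: the writer's own delete queue
        writer.add(doc);
        updatedIds.computeIfAbsent(group, g -> new HashSet<>()).add(doc.id);
    }

    void delete(String group, String id) {
        deletedIds.computeIfAbsent(group, g -> new HashSet<>()).add(id);
    }

    /** Drops stale versions from the parent, then folds each group writer into it. */
    void refresh() {
        Set<String> groups = new HashSet<>(groupWriters.keySet());
        groups.addAll(deletedIds.keySet());
        for (String group : groups) {
            Set<String> updated = updatedIds.getOrDefault(group, Collections.emptySet());
            Set<String> deleted = deletedIds.getOrDefault(group, Collections.emptySet());
            parent.removeIf(d -> updated.contains(d.id) || deleted.contains(d.id));
            List<Doc> writer = groupWriters.getOrDefault(group, new ArrayList<>());
            writer.removeIf(d -> deleted.contains(d.id));
            parent.addAll(writer); // stand-in for addIndexes
        }
        groupWriters.clear();
        updatedIds.clear();
        deletedIds.clear();
    }
}
```

Without the removal of stale ids from the parent, the refresh would leave both versions of an updated document in the parent, which is exactly the issue described earlier.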
Soft Update
Currently, OpenSearch handles both updates and deletes through soft update operations. For updates, it inserts a new document version into the IndexWriter and attaches a soft_delete node to delete the previous version of the document. Similarly, for deletes, it inserts a tombstone entry (ID = -1) into the writer and attaches a soft_delete node to remove all versions of the document. Internally, both operations are translated to a Lucene DocValueUpdate operation, which inserts a new document and attaches a soft delete node that marks all versions of this term as deleted during refresh.
To achieve version synchronisation across IndexWriters, simply performing a soft update with tombstone entries (mirroring delete operations) in the IndexWriters containing previous versions would create issues, as it also inserts an unnecessary tombstone entry into each of these IndexWriters. Since peer recovery replays documents as operations between two sequence numbers, each of these extra tombstone entries would be replayed as a DELETE operation on the replica, causing consistency issues.
We could instead hard delete the previous versions of the document in the IndexWriter, but then all previous versions of the document would be lost.
To support soft deleting document versions across IndexWriters, we handle version synchronisation by performing two sets of operations just before refresh:
- For each update operation, we perform a DocValueUpdate on the IndexWriters containing a previous version of the document, inserting a special tombstone entry with id = -2 and attaching a delete node for this term to the IndexWriter.
- After this, we perform a hard delete of the document with id = -2.
Since the tombstone entry is inserted only temporarily and is cleared before refresh, we end up attaching only a delete node for the deleted term on the IndexWriters containing the previous version of the document, and avoid inserting any unnecessary tombstone entry. This delete node is applied to the IndexWriter during Lucene’s flush, soft deleting the previous version of the document.
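The effect of the two operations can be illustrated with a toy model of a writer and its delete queue. This is not Lucene code: SoftDeleteSyncSketch, its states and its methods are hypothetical simplifications, in which a delete node applies only to documents that were added before it was queued.

```java
import java.util.ArrayList;
import java.util.List;

final class SoftDeleteSyncSketch {
    enum State { LIVE, SOFT_DELETED, HARD_DELETED }

    static final class Doc {
        final String id;
        State state = State.LIVE;

        Doc(String id) {
            this.id = id;
        }
    }

    /** Buffered delete: targets docs with this id that sit before position upTo. */
    private static final class DeleteNode {
        final String id;
        final int upTo;
        final State target;

        DeleteNode(String id, int upTo, State target) {
            this.id = id;
            this.upTo = upTo;
            this.target = target;
        }
    }

    final List<Doc> docs = new ArrayList<>();
    private final List<DeleteNode> deleteQueue = new ArrayList<>();

    void addDocument(String id) {
        docs.add(new Doc(id));
    }

    /** Toy soft update: queue a soft delete for the term, then index the new document. */
    void softUpdate(String term, String newDocId) {
        deleteQueue.add(new DeleteNode(term, docs.size(), State.SOFT_DELETED));
        docs.add(new Doc(newDocId));
    }

    void hardDelete(String term) {
        deleteQueue.add(new DeleteNode(term, docs.size(), State.HARD_DELETED));
    }

    /** The two operations from the list above, applied to a writer holding a previous version. */
    void syncPreviousVersion(String docId) {
        softUpdate(docId, "-2"); // temporary tombstone carries the delete node for docId
        hardDelete("-2");        // the tombstone itself is removed again
    }

    /** Applies the queued delete nodes, as happens during flush. */
    void flush() {
        for (DeleteNode node : deleteQueue) {
            for (int i = 0; i < node.upTo; i++) {
                Doc doc = docs.get(i);
                if (doc.id.equals(node.id) && doc.state == State.LIVE) {
                    doc.state = node.target;
                }
            }
        }
        deleteQueue.clear();
    }

    long count(State state) {
        return docs.stream().filter(d -> d.state == state).count();
    }
}
```

After the flush, the previous version is soft deleted and therefore still available for history, while the temporary tombstone is hard deleted and does not surface as an extra DELETE operation.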
Refresh
CompositeIndexWriter will periodically sync the group level IndexWriters with the accumulating parent writer during refresh. This synchronisation is achieved through the following operations:
- Freeze writes on all current group specific IndexWriters by marking these writers for refresh.
- Invalidate the map of active IndexWriters so that all new writes go to a new set of writers. We also invalidate the update/delete maps so that all new deletes go to the new IndexWriters.
- Apply any pending deletes/updates on the group specific/parent IndexWriters to sync versions across IndexWriters.
- Sync the group level writers with the parent IndexWriter via Lucene’s addIndexes API call.
- Take the write lock on the group specific writers and close them.
All updates/deletes will be applied within the same refresh cycle as their corresponding IndexWriter rotation to maintain data integrity. This atomicity requirement prevents data inconsistencies. It can be handled by calling an explicit refresh just before commit.
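The ordering of the refresh steps can be sketched as below. RefreshCycleSketch is a hypothetical, heavily simplified model: writers are in-memory lists of document ids, addIndexes is modelled as addAll, and one monitor stands in for the locks described in the Locks section.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RefreshCycleSketch {
    /** Stand-in for an IndexWriter; documents are represented by their ids. */
    static final class Writer {
        final List<String> docs = new ArrayList<>();
        boolean closed;
    }

    final Writer parent = new Writer();
    private Map<String, Writer> activeWriters = new HashMap<>();
    private Set<String> pendingDeletes = new HashSet<>();

    synchronized void index(String group, String id) {
        activeWriters.computeIfAbsent(group, g -> new Writer()).docs.add(id);
    }

    synchronized void delete(String id) {
        pendingDeletes.add(id);
    }

    /** Runs one refresh cycle and returns the group writers that were closed. */
    synchronized List<Writer> refresh() {
        // Freeze the current writers and swap in fresh maps for new writes and deletes.
        Map<String, Writer> markedForRefresh = activeWriters;
        activeWriters = new HashMap<>();
        Set<String> deletes = pendingDeletes;
        pendingDeletes = new HashSet<>();

        // Apply pending deletes to the parent and to every frozen group writer.
        parent.docs.removeIf(deletes::contains);
        for (Writer writer : markedForRefresh.values()) {
            writer.docs.removeIf(deletes::contains);
        }

        // Fold each frozen writer into the parent, then close it.
        List<Writer> closedWriters = new ArrayList<>();
        for (Writer writer : markedForRefresh.values()) {
            parent.docs.addAll(writer.docs);
            writer.closed = true;
            closedWriters.add(writer);
        }
        return closedWriters;
    }
}
```

Swapping the delete set together with the writer map is what keeps each update or delete in the same cycle as the writer rotation it belongs to.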
Segment Merge
To ensure that the group invariant is maintained during segment merges, we will introduce a new context aware merge policy. This merge policy will ensure that segments corresponding to the same group are merged together. We will create a separate codec for context aware segments, which will attach a tenant specific SegmentInfo attribute value and a field attribute to the segment. This codec will also ensure that the tenant attribute is persisted in the segment even after segment merges.
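The core of such a merge policy is to bucket segments by their group attribute before selecting merge candidates. The sketch below shows only that bucketing step. GroupAwareMergeSketch and Segment are hypothetical, and the group is a plain field here, whereas the proposal stores it as a SegmentInfo attribute written by the codec.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupAwareMergeSketch {
    static final class Segment {
        final String name;
        final String group; // stand-in for the tenant specific segment attribute

        Segment(String name, String group) {
            this.name = name;
            this.group = group;
        }
    }

    private GroupAwareMergeSketch() {}

    /** Returns one candidate merge per group that has at least two segments. */
    static List<List<Segment>> findMerges(List<Segment> segments) {
        Map<String, List<Segment>> byGroup = new TreeMap<>();
        for (Segment segment : segments) {
            byGroup.computeIfAbsent(segment.group, g -> new ArrayList<>()).add(segment);
        }
        List<List<Segment>> merges = new ArrayList<>();
        for (List<Segment> candidates : byGroup.values()) {
            if (candidates.size() > 1) {
                merges.add(candidates); // never mixes segments from different groups
            }
        }
        return merges;
    }
}
```

A real policy would additionally apply size and tiering rules within each bucket; those are left out of this sketch.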
Checkpoints
Local, global and processed checkpoints will continue to be updated in the same way as they are now. During commit, the local checkpoint persisted as commit data will equal the maximum local checkpoint across the group level IndexWriters.
Related component
Indexing
Additional context
RFC Link: #18576