
Feature Proposal: Pluggable Translog #1319

@Bukhtawar

Description


Overview

The purpose of this document is to propose various mechanisms for achieving durability of uncommitted indexing operations and to outline the advantages of making the translog implementation pluggable. This is not a complete design document; it is a proposal intended to seek community feedback.

Motivation

  1. Provide flexible user controlled durability options
  2. Support for highly durable translogs

Background

What is the role of a translog?
All operations that affect an OpenSearch index (i.e. adding, updating, or removing documents) are added to an in-memory buffer, which is periodically flushed to create segments on disk on a Lucene commit. These operations also need to be durably written to a write-ahead transaction log (aka translog) before they are acknowledged back to the client since Lucene commits are too expensive to be performed per request. In the event of a process crash, recent operations that have been acknowledged but not yet included in the last Lucene commit are recovered from the translog.

The following are the various mechanisms through which the engine interacts with the translog:

  1. Ingestion Interactions
    For efficiency, translog operations are first buffered into an in-memory queue so that not all indexing threads have to perform the costly fsync operation to disk. One of the indexing threads acquires a lock to dequeue, and subsequently fsyncs, the buffered translog operations to disk. Once the fsync operation is complete, for operation consistency, OpenSearch also writes a checkpoint to disk with the last sequence number that was written to the translog, which helps prevent translog corruption. Once all translog operations for a given request have been durably persisted to disk, OpenSearch sends an acknowledgment back to the client.

    When the translog grows beyond a configurable size, the OpenSearch engine triggers a flush in the background in order to prevent recoveries from taking too long. An OpenSearch flush performs a Lucene commit, moving segments to disk, and also rolls over to start a new translog file.
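The group-fsync pattern described above can be sketched roughly as follows. This is a minimal in-memory illustration; `BufferedTranslog`, `Operation`, and the simulated disk writes are hypothetical names, not the actual OpenSearch classes:

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the group-fsync pattern: many indexing threads
// enqueue operations, but only one thread at a time drains the buffer,
// "fsyncs" it, and advances the checkpoint.
class BufferedTranslog {
    record Operation(long seqNo, String source) {}

    private final Queue<Operation> buffer = new ArrayDeque<>();
    private final Object syncLock = new Object();
    private long checkpointSeqNo = -1;   // last seqNo durably on disk

    // Called by any indexing thread: cheap in-memory enqueue.
    void add(Operation op) {
        synchronized (buffer) { buffer.add(op); }
    }

    // Called by one of the indexing threads; returns the new checkpoint.
    long sync() {
        synchronized (syncLock) {           // only one thread fsyncs at a time
            Operation[] batch;
            synchronized (buffer) {
                batch = buffer.toArray(new Operation[0]);
                buffer.clear();
            }
            for (Operation op : batch) {
                // a real implementation would write the op bytes plus a
                // checksum to the translog file here, then fsync the file
                checkpointSeqNo = Math.max(checkpointSeqNo, op.seqNo());
            }
            // a real implementation would then write and fsync the checkpoint file
            return checkpointSeqNo;
        }
    }

    long checkpoint() { return checkpointSeqNo; }
}
```

Only after `sync()` covers a request's operations can that request be acknowledged back to the client.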


  2. Shard Recovery
    In the event of a crash, existing segment files are first recovered from the local disk. Translog files are then opened, and recent operations that have not been committed to Lucene are recovered by replaying all translog operations beyond the sequence number of the last Lucene commit checkpoint. A flush is then triggered, creating Lucene segments on disk and purging all unreferenced translog files.
    Recovery mechanisms

    1. Local recovery / primary failover: When recovering from an existing store (one that has in-sync translog and segments locally on disk), all uncommitted operations are replayed from the local translog on disk.
    2. Peer recovery: The translog does not play a role in peer recovery. Instead, the peer shard copy (primary-primary or primary-replica) is built via shard retention leases, by replaying indexing operations that have been indexed into Lucene along with delete operations from the preserved soft-deletes on the peer primary.
  3. Cross cluster replication
    Cross-cluster replication in OpenSearch uses a pull-based replication model to pull translog operations into the follower index. Once the bootstrap from committed segments is complete, the leader's translog is pulled incrementally, from the follower shard's current sequence number up to the global checkpoint of the leader shard being replicated.
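Both the replay step in local recovery and the incremental pull in cross-cluster replication reduce to selecting translog operations in a sequence-number range. A minimal sketch, with hypothetical names for illustration:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of translog replay during local recovery: only
// operations beyond the sequence number of the last Lucene commit are
// re-applied to the engine.
class TranslogReplay {
    record Operation(long seqNo, String source) {}

    // Select the operations that must be replayed into Lucene, in seqNo order.
    // The same range semantics apply to cross-cluster replication, which pulls
    // the range (followerSeqNo, leaderGlobalCheckpoint].
    static List<Operation> opsToReplay(List<Operation> translog, long lastCommittedSeqNo) {
        return translog.stream()
                .filter(op -> op.seqNo() > lastCommittedSeqNo)
                .sorted((a, b) -> Long.compare(a.seqNo(), b.seqNo()))
                .collect(Collectors.toList());
    }
}
```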

Downsides of current translog implementation

  1. Less flexible: The engine is tightly coupled with local disk-based translogs, forcing users to rely heavily on them for durability.
  2. Fewer durability options: The only supported option for translog persistence is the local disk, which makes full recovery impossible, even with frequent snapshots, if a single node is lost in a non-replicated setup.
  3. Slow ingestion: The present primary-backup replication model requires translogs to be synchronously written to all shard copies in the replication group before acknowledging a write request, so a single degraded node can stall ingestion. With a remote store and a segment-based replication model, operations would not need to be replicated to all copies synchronously.
  4. Costly: Users achieve high durability by replicating translogs to additional shard copies, even when those copies are not needed for read availability.

Use Cases

The pluggable translog support is meant to cater to a broad spectrum of users who would want to customise their durability options based on specific needs.

  1. Translogs are essential for durability, but they are not the only way to achieve it. Some users might choose not to use a translog at all, preferring to self-manage durability by periodically checkpointing indexing requests and issuing calls to commit Lucene segments to disk. In the event of a primary node crash, all uncommitted changes could be re-driven from the previous checkpoint, which is what Yelp does today. Alternatively, the commits can be made at the end of a user-controlled batch of requests.
  2. Users might want a separate durability semantic tied to document searchability (refresh): changes since the last refresh that have not been made visible to search are also not guaranteed to persist.
  3. Users might need high durability guarantees on their data, which can be achieved by configuring remote storage to persist translogs in addition to segments.
  4. In the future, we might want to support streaming APIs that would trigger an explicit commit on events like connection close.

High Level Proposal

We propose decoupling the translog from the engine by abstracting the translog code out of the server code base and moving it to a separate module. The engine would provide knobs to choose the translog option, including whether translogs need to be maintained by the system at all. The translog module would further expose extension points for remote stores, which would be implemented as separate plugins. The local store would remain the default so that the code is fully backward compatible, while users could customize their durability by installing specific durability extension plugins. All I/O across the different storage implementations would be checksummed to detect and prevent translog corruption.
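As a rough illustration of the extension point, the module could expose an interface along these lines. `TranslogStore` and its methods are hypothetical names for this sketch, not an existing OpenSearch SPI:

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical pluggable translog contract: the default implementation would
// write to local disk, while durability extension plugins could provide
// remote-store implementations behind the same interface.
interface TranslogStore {
    void append(long seqNo, byte[] operation);   // persist one operation
    void sync();                                 // fsync locally / flush remotely
    Iterable<byte[]> readFrom(long seqNo);       // for recovery replay
    void trimUnreferenced(long belowSeqNo);      // purge after a Lucene commit
}

// Trivial in-memory stand-in, only to illustrate the contract.
class InMemoryTranslogStore implements TranslogStore {
    private final NavigableMap<Long, byte[]> ops = new TreeMap<>();
    public void append(long seqNo, byte[] op) { ops.put(seqNo, op); }
    public void sync() { /* no-op in memory; a local store would fsync here */ }
    public Iterable<byte[]> readFrom(long seqNo) { return ops.tailMap(seqNo, true).values(); }
    public void trimUnreferenced(long belowSeqNo) { ops.headMap(belowSeqNo, false).clear(); }
}
```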

Translog durability

  1. Local: This is the same durability option we have today. Translogs are co-located on the local disk with the segments and behave the same for both segment- and document-based replication, i.e. every indexing operation requires the translog to be replicated synchronously within the shard replication group. For a non-replicated index, shards cannot be recovered on a different node after a node failure.

  2. Remote: Translogs can be stored on a remote store provided by a durability extension plugin, in which case translogs need not be redundantly written to replicas. With segment-based replication, indexing operations need not even be replicated to the replicas, as the primary is responsible for both segment creation and translog persistence to a durable store.

    The local recovery process will need to pull translogs from the remote store and replay the needed translog operations once segments have been restored.
    For a non-replicated index, shards can be recovered on a different node after a node failure, provided segments are also present on the remote store: first the segments are restored, then the translogs are pulled and the missing operations are replayed.

  3. No translog: Users can choose to call commit periodically to make recent changes durable on disk. In the event of a primary node failure, all changes since the previous Lucene commit would be lost and would need to be re-driven.
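A pluggable setup might be surfaced through index settings along these lines. The setting names and values below are purely illustrative, not existing OpenSearch settings:

```yaml
# Hypothetical index settings; names are illustrative only
index.translog.durability_provider: remote        # one of: local (default) | remote | none
index.translog.remote_store.repository: my-translog-repo
```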

Low-level proposal

(Diagram: low-level translog design.)

Future Work

  1. Support for request checkpointing: For users who do not want to use translogs for durability, we would need to support additional durability semantics based on checkpoints. This would require support for checkpointing requests, with (backward-compatible) API changes to return a monotonically increasing sequence number corresponding to the last indexing operation (per shard). The same seqno would be made available via APIs like flush, which would return the seqno corresponding to the last Lucene commit. Indexing requests up to the last successful commit checkpoint can then be safely purged.
  2. Performance benchmarks with remote stores: The remote translog durability options could have an indexing performance impact. We will benchmark indexing performance with the supported durability extensions.

FAQs

What is a Lucene commit?
A Lucene commit is the process of flushing all pending changes (added and deleted documents, segment merges, added indexes, etc.) to the index and syncing all referenced index files. After a commit, the data on disk is ready to be used for searching, and the index updates will survive an OS or machine crash or power loss.

What are shard retention leases?
Shard retention leases are a mechanism used for peer recoveries, aimed at preventing shard recoveries from falling back to expensive file-copy operations when shard history is available from a certain sequence number. A lease is acquired by the recovering replica copy to prevent operations on the primary beyond this sequence number from being merged away.

What is a soft-delete?
Lucene has functionality to keep deleted documents alive in the index, based on time or other constraints. The soft-deletes merge policy controls when soft-deleted documents are reclaimed by merges.

Labels: Indexing:Replication, RFC, enhancement, lucene, v3.0.0