[RFC] Introduce concurrent translog recovery to accelerate segment replication primary promotion #20131
Description
Background
The purpose of this RFC is to discuss the solution to the long primary promotion time for segment replication mentioned in #20118. Primary promotion may take up to 15 minutes or even longer. During this period, write throughput significantly decreases.
Reproduction
Primary promotion takes a long time
- The test cluster includes 4 data nodes (16 vCPU / 64 GB each), with segment replication and the merged segment warmer enabled. Execute the following command:

`opensearch-benchmark execute-test --target-hosts http://[fdbd:dc05:b:235::33]:9463 --workload=nyc_taxis --workload-params="search_clients: 4, bulk_indexing_clients: 32, number_of_shards:4, number_of_replicas: 1" --kill-running-processes`
- After restarting the data-0 node, a write throughput decline lasting more than 15 minutes can be observed.
- From the logs, we can see that a total of 3,911,465 operations were recovered from the translog:

`[2025-11-22T01:29:43,325][TRACE][o.o.i.e.Engine ] [byte-es-test-native-data-1] [nyc_taxis][1] flushing post recovery from translog: ops recovered [3911465], current translog generation [1693]`
Analysis
Reasons for the long translog recovery time
In the segment replication scenario, replicas only record the translog. The default refresh interval for the nyc_taxis workload is 30 seconds, and a refresh interval of 30 to 60 seconds is also a common configuration in production. This means that any document that has not yet been synchronized to the replica through segment replication must be recovered from the translog during primary promotion.
Solution
Accelerate translog recovery
- Add a dedicated thread pool `translog_recovery` for translog recovery.
- Provide index-level dynamic settings to enable concurrent translog recovery and to specify the number of translog operations processed by a single thread.
- When performing translog recovery, if the total number of translog operations exceeds the single-thread threshold, split the work across multiple threads for concurrent execution.
- We can consider first restricting the concurrency mechanism to the segment replication scenario. Meanwhile, some backoff strategies can be considered, such as falling back to the current sequential logic when the `translog_recovery` thread pool is busy.
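A minimal sketch of the splitting and fallback logic described above. All names here (class, pool size, the stubbed per-operation work) are illustrative assumptions, not the actual OpenSearch implementation; the real engine would replay each chunk's slice of translog operations instead of incrementing a counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of concurrent translog recovery.
public class ConcurrentTranslogRecovery {
    // Index-level dynamic setting: ops handled by a single thread.
    static final int OPS_PER_THREAD = 200_000;
    // Dedicated "translog_recovery" pool (size is an assumption).
    static final ExecutorService TRANSLOG_RECOVERY_POOL = Executors.newFixedThreadPool(4);

    // Stand-in for applying one translog operation to the engine.
    static void applyOperation(long op, AtomicLong recovered) {
        recovered.incrementAndGet();
    }

    // Sequential recovery below the threshold; concurrent chunks above it.
    static long recover(long totalOps) throws Exception {
        AtomicLong recovered = new AtomicLong();
        if (totalOps <= OPS_PER_THREAD) {
            for (long i = 0; i < totalOps; i++) applyOperation(i, recovered);
        } else {
            List<Future<?>> futures = new ArrayList<>();
            for (long start = 0; start < totalOps; start += OPS_PER_THREAD) {
                final long from = start;
                final long to = Math.min(start + OPS_PER_THREAD, totalOps);
                try {
                    futures.add(TRANSLOG_RECOVERY_POOL.submit(() -> {
                        for (long i = from; i < to; i++) applyOperation(i, recovered);
                    }));
                } catch (RejectedExecutionException busy) {
                    // Backoff: pool is busy, fall back to sequential recovery.
                    for (long i = from; i < to; i++) applyOperation(i, recovered);
                }
            }
            for (Future<?> f : futures) f.get(); // wait for all chunks
        }
        return recovered.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("recovered=" + recover(650_000));
        TRANSLOG_RECOVERY_POOL.shutdown();
    }
}
```

The fallback branch preserves today's sequential behavior whenever the dedicated pool rejects work, which is one way to implement the backoff strategy mentioned above.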
Evaluation
Recovery speed increased by 13 times
Using the same test setup, enable translog recovery concurrency, with each thread handling 200,000 operations.
- After restarting the data-0 node, the write throughput decline lasted only 1 min 7 s.
- From the logs, we can see that a total of 3,804,928 operations were recovered from the translog:

`[2025-11-27T14:45:17,623][TRACE][o.o.i.e.Engine ] [byte-es-test-native-data-1] [nyc_taxis][1] flushing post recovery from translog: ops recovered [3804928], current translog generation [1285]`
- During this period, the newly introduced translog recovery threads can be observed working.
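As a rough sanity check on the headline number, taking the baseline decline as roughly 15 minutes (the source says "more than 15 minutes", so this is a lower-bound assumption) against 1 min 7 s with concurrent recovery:

```java
// Back-of-the-envelope speedup check for the numbers quoted above.
public class SpeedupCheck {
    public static void main(String[] args) {
        double baselineSeconds = 15 * 60;  // assumed baseline: ~15 min decline
        double concurrentSeconds = 67;     // measured: 1 min 7 s
        double speedup = baselineSeconds / concurrentSeconds;
        System.out.printf("speedup ~= %.1fx%n", speedup); // about 13.4x
    }
}
```

This is consistent with the roughly 13x improvement claimed, given that the two runs recovered a comparable number of operations (3.91M vs. 3.80M).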
Related component
Indexing:Replication
Describe alternatives you've considered
No response
Additional context
No response