[RFC] Introduce concurrent translog recovery to accelerate segment replication primary promotion #20131

@guojialiang92

Description

Background

The purpose of this RFC is to discuss a solution to the long primary promotion time for segment replication described in #20118. Primary promotion may take 15 minutes or even longer, and write throughput drops significantly during this period.

Reproduction

Primary promotion takes a long time

  • The test cluster includes 4 16U64G data nodes, with segment replication and merged segment warmer enabled. Execute the following command.
opensearch-benchmark execute-test --target-hosts http://[fdbd:dc05:b:235::33]:9463 --workload=nyc_taxis --workload-params="search_clients: 4, bulk_indexing_clients: 32, number_of_shards:4, number_of_replicas: 1" --kill-running-processes
  • After restarting the data-0 node, a write throughput decline lasting more than 15 minutes can be observed.
  • Through the logs, we can see that a total of 3,911,465 operations were recovered from the translog.
[2025-11-22T01:29:43,325][TRACE][o.o.i.e.Engine           ] [byte-es-test-native-data-1] [nyc_taxis][1] flushing post recovery from translog: ops recovered [3911465], current translog generation [1693]

Analysis

Reasons for the long translog recovery time

In the segment replication scenario, replicas only record the translog. The default refresh interval for the nyc_taxis workload is 30 seconds, and a refresh interval of 30 to 60 seconds is also a common configuration in production. This means that any document that has not yet been synchronized to the replica through segment replication must be recovered via the translog during primary promotion.
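To make the backlog concrete, here is an illustrative back-of-the-envelope estimate (the function name, rates, and lag value are hypothetical, not measured from the benchmark): on a replica that only records the translog, the operations that must be replayed on promotion are roughly the operations indexed since the last segment-replication round reached the replica.

```python
def estimated_replay_ops(indexing_rate_ops_per_sec: float,
                         refresh_interval_sec: float,
                         replication_lag_sec: float = 0.0) -> float:
    """Rough count of ops a promoted replica must replay from the translog:
    everything indexed since the last replicated refresh landed."""
    return indexing_rate_ops_per_sec * (refresh_interval_sec + replication_lag_sec)

# e.g. 50k ops/s with a 30 s refresh interval -> 1,500,000 ops to replay
print(estimated_replay_ops(50_000, 30))
```

With a longer effective lag (e.g. replication falling behind under heavy indexing), the backlog grows proportionally, which is consistent with the millions of operations seen in the logs above.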

Solution

Accelerate translog recovery

  1. Add a dedicated thread pool translog_recovery for translog recovery.
  2. Provide dynamic, index-level settings to enable concurrent translog recovery and to specify the number of translog operations processed by a single thread.
  3. When performing translog recovery, if the total number of translog operations exceeds the single-thread threshold, the work is split across multiple threads and executed concurrently.
  4. Initially, the concurrency mechanism could be restricted to the segment replication scenario. Fallback strategies can also be considered, such as reverting to the current single-threaded logic when the translog_recovery thread pool is busy.
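The steps above can be sketched as follows. This is a minimal illustration, not the actual OpenSearch implementation: `recover_translog`, `apply_op`, and the pool size are hypothetical names and parameters, and the dedicated pool here merely stands in for the proposed translog_recovery thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def recover_translog(ops, apply_op, ops_per_thread=200_000, pool_size=4):
    # Small backlog: keep the current single-threaded replay path (step 4's
    # fallback idea applies the same way when the pool is busy).
    if len(ops) <= ops_per_thread:
        for op in ops:
            apply_op(op)
        return

    # Step 3: split the backlog into chunks of `ops_per_thread` operations.
    chunks = [ops[i:i + ops_per_thread]
              for i in range(0, len(ops), ops_per_thread)]

    def replay(chunk):
        for op in chunk:
            apply_op(op)

    # Step 1: a dedicated pool (stand-in for the proposed translog_recovery
    # thread pool) replays the chunks concurrently.
    with ThreadPoolExecutor(max_workers=pool_size,
                            thread_name_prefix="translog_recovery") as pool:
        list(pool.map(replay, chunks))
```

Note that a real engine must preserve per-document ordering and sequence-number/version semantics across threads, so the chunking strategy in the actual implementation has to be more careful than this sketch.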

Evaluation

Recovery speed increased by roughly 13 times

Using the same test setup, concurrent translog recovery was enabled, with each thread handling 200,000 operations.

  • After restarting the data-0 node, the write throughput decline lasted only 1 min 7 s.
  • Through the logs, we can see that a total of 3,804,928 operations were recovered from the translog.
[2025-11-27T14:45:17,623][TRACE][o.o.i.e.Engine           ] [byte-es-test-native-data-1] [nyc_taxis][1] flushing post recovery from translog: ops recovered [3804928], current translog generation [1285]
  • During this period, the newly introduced translog recovery threads can be observed doing work.
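As a quick arithmetic check of the headline speedup, treating the ">15 minutes" throughput decline as the baseline and 1 min 7 s as the concurrent-recovery result:

```python
# Sanity-check of the reported ~13x speedup, using the durations quoted above.
baseline_sec = 15 * 60    # >15 min throughput decline before the change
concurrent_sec = 67       # 1 min 7 s with concurrent recovery enabled
print(round(baseline_sec / concurrent_sec, 1))  # -> 13.4
```

Since the baseline was "15 minutes or longer", 13x is a conservative lower bound for this run.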

Related component

Indexing:Replication

Describe alternatives you've considered

No response

Additional context

No response
