[RFC] Introduce concurrent translog recovery to accelerate segment replication primary promotion #20131
Description
Background
The purpose of this RFC is to discuss the solution to the long primary promotion time for segment replication mentioned in #20118. Primary promotion may take up to 15 minutes or even longer. During this period, write throughput significantly decreases.
Reproduction
Primary promotion takes a long time
- The test cluster includes 4 data nodes (16 vCPU / 64 GB each), with segment replication and the merged segment warmer enabled. Execute the following command:

`opensearch-benchmark execute-test --target-hosts http://[fdbd:dc05:b:235::33]:9463 --workload=nyc_taxis --workload-params="search_clients: 4, bulk_indexing_clients: 32, number_of_shards:4, number_of_replicas: 1" --kill-running-processes`
- After restarting the data-0 node, a write throughput decline lasting more than 15 minutes can be observed.
- From the logs, we can see that a total of 3,911,465 operations were recovered from the translog:

`[2025-11-22T01:29:43,325][TRACE][o.o.i.e.Engine ] [byte-es-test-native-data-1] [nyc_taxis][1] flushing post recovery from translog: ops recovered [3911465], current translog generation [1693]`
Analysis
Reasons for the long translog recovery time
In the segment replication scenario, replicas only record the translog. The default refresh interval for the nyc_taxis workload is 30 seconds, and a refresh interval of 30 to 60 seconds is also a common configuration in production. This means that any document that has not yet been synchronized to the replica through segment replication must be recovered from the translog during primary promotion.
Solution
Accelerate translog recovery
- Add a dedicated thread pool `translog_recovery` for translog recovery.
- Provide index-level dynamic settings to enable concurrent translog recovery and to specify the number of translog operations processed by a single thread.
- When performing translog recovery, if the total number of translog operations exceeds the single-thread threshold, split the work across multiple threads for concurrent execution.
- We can consider first restricting the concurrency mechanism to the segment replication scenario. Meanwhile, some backoff strategies can be considered, such as falling back to the current sequential logic when the `translog_recovery` thread pool is busy.
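A minimal sketch of the splitting and fallback logic described above. All names here (class, pool size, the stubbed per-operation work) are illustrative assumptions, not the actual OpenSearch implementation; the real engine would replay each chunk's slice of translog operations instead of incrementing a counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of concurrent translog recovery.
public class ConcurrentTranslogRecovery {
    // Index-level dynamic setting: ops handled by a single thread.
    static final int OPS_PER_THREAD = 200_000;
    // Dedicated "translog_recovery" pool (size is an assumption).
    static final ExecutorService TRANSLOG_RECOVERY_POOL = Executors.newFixedThreadPool(4);

    // Stand-in for applying one translog operation to the engine.
    static void applyOperation(long op, AtomicLong recovered) {
        recovered.incrementAndGet();
    }

    // Sequential recovery below the threshold; concurrent chunks above it.
    static long recover(long totalOps) throws Exception {
        AtomicLong recovered = new AtomicLong();
        if (totalOps <= OPS_PER_THREAD) {
            for (long i = 0; i < totalOps; i++) applyOperation(i, recovered);
        } else {
            List<Future<?>> futures = new ArrayList<>();
            for (long start = 0; start < totalOps; start += OPS_PER_THREAD) {
                final long from = start;
                final long to = Math.min(start + OPS_PER_THREAD, totalOps);
                try {
                    futures.add(TRANSLOG_RECOVERY_POOL.submit(() -> {
                        for (long i = from; i < to; i++) applyOperation(i, recovered);
                    }));
                } catch (RejectedExecutionException busy) {
                    // Backoff: pool is busy, fall back to sequential recovery.
                    for (long i = from; i < to; i++) applyOperation(i, recovered);
                }
            }
            for (Future<?> f : futures) f.get(); // wait for all chunks
        }
        return recovered.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("recovered=" + recover(650_000));
        TRANSLOG_RECOVERY_POOL.shutdown();
    }
}
```

The fallback branch preserves today's sequential behavior whenever the dedicated pool rejects work, which is one way to implement the backoff strategy mentioned above.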
Evaluation
Recovery speed increased by 13 times
Using the same test setup, enable translog recovery concurrency, with each thread handling 200,000 operations.
- After restarting the data-0 node, the write throughput decline lasted only 1 min 7 s.
- From the logs, we can see that a total of 3,804,928 operations were recovered from the translog:

`[2025-11-27T14:45:17,623][TRACE][o.o.i.e.Engine ] [byte-es-test-native-data-1] [nyc_taxis][1] flushing post recovery from translog: ops recovered [3804928], current translog generation [1285]`
- During this period, the newly introduced translog recovery threads can be observed working.
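As a rough sanity check on the headline number, taking the baseline decline as roughly 15 minutes (the source says "more than 15 minutes", so this is a lower-bound assumption) against 1 min 7 s with concurrent recovery:

```java
// Back-of-the-envelope speedup check for the numbers quoted above.
public class SpeedupCheck {
    public static void main(String[] args) {
        double baselineSeconds = 15 * 60;  // assumed baseline: ~15 min decline
        double concurrentSeconds = 67;     // measured: 1 min 7 s
        double speedup = baselineSeconds / concurrentSeconds;
        System.out.printf("speedup ~= %.1fx%n", speedup); // about 13.4x
    }
}
```

This is consistent with the roughly 13x improvement claimed, given that the two runs recovered a comparable number of operations (3.91M vs. 3.80M).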
Related component
Indexing:Replication
Describe alternatives you've considered
No response
Additional context
No response