Adjust dynamic timeout for get_segment_files operation to prevent request timeouts #4392

@dreamer-89

Description

The GetSegmentFiles transport request times out under the current timeout of 1 minute, which is taken from the recovery setting indices.recovery.internal_action_retry_timeout.

A better option is to set the timeout dynamically, based on the total size of the segment files (from FileStoreMetadata) and the cluster's network bandwidth.

Because the cluster's network bandwidth is not known, we can experiment with a timeout value that accounts for the size of the segment files.
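One way the size-based timeout could be computed is sketched below. This is only an illustration, not OpenSearch code: the class SegmentFilesTimeoutSketch, the method timeoutFor, the floor and ceiling values, the safety factor, and the assumed bandwidth are all hypothetical. The file sizes are passed in as plain numbers; in a real change they would presumably come from the segment file metadata mentioned above.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical sketch: derive a request timeout from total segment file size.
final class SegmentFilesTimeoutSketch {
    // Assumed bounds and multiplier, chosen for illustration only.
    static final Duration FLOOR = Duration.ofMinutes(1);
    static final Duration CEILING = Duration.ofMinutes(30);
    static final double SAFETY_FACTOR = 2.0;

    /**
     * Estimates the transfer time for the given files at an assumed
     * bandwidth, multiplies it by a safety factor, and clamps the result
     * to the range [FLOOR, CEILING].
     */
    static Duration timeoutFor(List<Long> fileSizesInBytes, long assumedBytesPerSecond) {
        if (assumedBytesPerSecond <= 0) {
            throw new IllegalArgumentException("bandwidth must be positive");
        }
        long totalBytes = 0;
        for (long size : fileSizesInBytes) {
            totalBytes = Math.addExact(totalBytes, size); // fail loudly on overflow
        }
        double seconds = (double) totalBytes / assumedBytesPerSecond * SAFETY_FACTOR;
        Duration estimate = Duration.ofMillis((long) Math.ceil(seconds * 1000.0));
        if (estimate.compareTo(FLOOR) < 0) {
            return FLOOR;
        }
        if (estimate.compareTo(CEILING) > 0) {
            return CEILING;
        }
        return estimate;
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024 * 1024;
        long assumedBandwidth = 64L * 1024 * 1024; // 64 MiB/s, an assumption
        // Three segment files totalling 8 GiB.
        List<Long> sizes = List.of(4 * gib, 3 * gib, 1 * gib);
        System.out.println(timeoutFor(sizes, assumedBandwidth));
    }
}
```

The floor keeps small transfers from getting a timeout shorter than the current 1 minute, and the ceiling stops a very large shard from waiting indefinitely on a stalled connection. Both bounds and the assumed bandwidth would need to be tuned through the experiments described above.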

Caused by: org.opensearch.transport.ReceiveTimeoutTransportException: [seed][10.9.0.166:9300][internal:index/shard/replication/get_segment_files] request_id [552738] timed out after [599988ms]

Failure stack trace from benchmarking

[2022-09-02T09:34:08,220][ERROR][o.o.i.r.SegmentReplicationTargetService] [data-e20223d0] replication failure
org.opensearch.OpenSearchException: Segment Replication failed
        at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:293) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-2.2.0.jar:2.2.0]
        at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
        at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:201) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:193) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1379) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1270) [opensearch-2.2.0.jar:2.2.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.2.0.jar:2.2.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.ReceiveTimeoutTransportException: [seed][10.9.0.166:9300][internal:index/shard/replication/get_segment_files] request_id [552738] timed out after [599988ms]
        at org.opensearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1273) ~[opensearch-2.2.0.jar:2.2.0]
        ... 4 more
