Describe the bug
In some circumstances, if the client TCP connection is closed before a running search operation completes, OpenSearch crashes with:
[2022-06-09T16:54:39,763][ERROR][org.opensearch.bootstrap.OpenSearchUncaughtExceptionHandler] fatal error in thread [opensearch[mycluster-elasticsearch-5][search][T#10]], exiting
java.lang.AssertionError: unexpected higher total ops [18] compared to expected [17]
To Reproduce
Steps to reproduce the behaviour:
We can consistently reproduce this behaviour in several environments.
- run a long-running search query, e.g.
curl -s "http://localhost:9200/myindex*/_search" (the same happens with any client/library)
- kill the client connection after a couple of seconds, e.g. stop the curl command with Ctrl+C (equivalent to the client closing the connection on a timeout, before the OpenSearch node replies)
- the OpenSearch node crashes
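For a scripted reproduction without curl, the steps above can be sketched roughly as follows. This is a minimal sketch, not something taken from our setup: the helper names `build_request` and `send_and_abort` are hypothetical, the 2-second delay is an assumption standing in for "a couple of seconds", and it assumes the node listens on plain HTTP without authentication at `localhost:9200`.

```python
import socket
import time


def build_request(host: str, port: int, path: str) -> bytes:
    """Build a minimal HTTP/1.1 GET request for the given path (hypothetical helper)."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}:{port}\r\n"
        "Accept: application/json\r\n"
        "\r\n"
    ).encode("ascii")


def send_and_abort(host: str, port: int, path: str, delay: float) -> None:
    """Send a search request, wait `delay` seconds, then close without reading the reply."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_request(host, port, path))
        # Assumed delay: long enough for the search to start, short enough
        # that the node has not replied yet.
        time.sleep(delay)
    # Leaving the `with` block closes the socket, like Ctrl+C on curl.


if __name__ == "__main__":
    send_and_abort("localhost", 9200, "/myindex*/_search", delay=2.0)
```

The delay may need tuning so that the connection closes while the query is still running on the node.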
Expected behavior
No crashes.
Plugins
- repository-s3
- repository-gcs
Host/Environment (please complete the following information):
- OS: Red Hat Enterprise Linux 8 (in a container, UBI minimal)
- Version: OpenSearch 1.2.4
Additional context
Relevant logs:
Jun 9 18:54:18 mycluster-elasticsearch-1 elasticsearch fatal fatal error in thread [opensearch[mycluster-elasticsearch-1][search][T#6]], exiting
Jun 9 18:54:18 mycluster-elasticsearch-1 elasticsearch java.lang.AssertionError: unexpected higher total ops [19] compared to expected [18]
at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:461)
at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$2(AbstractSearchAsyncAction.java:303)
at org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:338)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:57)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: Failed to execute phase [query], Shard failures; shardFailures {[L_zfhb4RRHm2BRsxkzqkig][myindex-2022.06.09][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[L_zfhb4RRHm2BRsxkzqkig][myindex-2022.06.08][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.07][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[L_zfhb4RRHm2BRsxkzqkig][myindex-2022.06.06][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.05][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.04][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.03][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.02][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.01][0]: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]}
... 10 more
Caused by: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]
at org.opensearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:534)
at org.opensearch.tasks.TaskManager.registerChildNode(TaskManager.java:226)
See attachment for complete logs.
log.txt