[BUG] opensearch crashes on closed client connection before search reply #3557

@dbonf

Describe the bug
In some circumstances, if the client's TCP connection is closed before a running search operation completes, OpenSearch crashes with:

[2022-06-09T16:54:39,763][ERROR][org.opensearch.bootstrap.OpenSearchUncaughtExceptionHandler] fatal error in thread [opensearch[mycluster-elasticsearch-5][search][T#10]], exiting
java.lang.AssertionError: unexpected higher total ops [18] compared to expected [17]

To Reproduce
Steps to reproduce the behaviour:

We can consistently reproduce this behaviour in several environments.

  1. Run a long-running search query, e.g. curl -s "http://localhost:9200/myindex*/_search" (the result is the same with any client/library).
  2. Kill the client connection after a couple of seconds, e.g. stop the curl command with Ctrl+C (a connection closed by the client after a timeout, before the OpenSearch node replies, has the same effect).
  3. The OpenSearch node crashes.
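
For anyone who wants to script the steps above instead of interrupting curl by hand, here is a minimal Python sketch. It is not part of the original report: the function names `build_search_request` and `abort_search_midflight` are hypothetical, the host, port, and index pattern are taken from the curl example, and the two-second wait is an assumption matching "a couple of seconds". Whether it triggers the crash depends on the query still running when the socket is closed.

```python
import socket
import time


def build_search_request(host: str, port: int, index_pattern: str) -> bytes:
    """Build a bare HTTP/1.1 GET request for /<index_pattern>/_search."""
    lines = [
        f"GET /{index_pattern}/_search HTTP/1.1",
        f"Host: {host}:{port}",
        "Accept: */*",
        "",  # blank line terminates the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def abort_search_midflight(
    host: str, port: int, index_pattern: str, wait_seconds: float = 2.0
) -> int:
    """Send the search request, wait, then close without reading a reply.

    Returns the number of request bytes sent. The wait duration is an
    assumption; it must be shorter than the time the search takes.
    """
    request = build_search_request(host, port, index_pattern)
    with socket.create_connection((host, port), timeout=5.0) as sock:
        sock.sendall(request)
        # Give the node time to start the search before dropping the client.
        time.sleep(wait_seconds)
    # Leaving the with-block closes the socket; no response was ever read.
    return len(request)
```

Calling `abort_search_midflight("localhost", 9200, "myindex*")` against a disposable test node mimics step 2; run it only where a node crash is acceptable.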

Expected behavior
No crashes.

Plugins

  • repository-s3
  • repository-gcs

Host/Environment (please complete the following information):

  • OS: Red Hat Enterprise Linux 8 (in a container, UBI minimal)
  • Version: OpenSearch 1.2.4

Additional context
Relevant logs:

Jun 9 18:54:18 mycluster-elasticsearch-1 elasticsearch fatal fatal error in thread [opensearch[mycluster-elasticsearch-1][search][T#6]], exiting
Jun 9 18:54:18 mycluster-elasticsearch-1 elasticsearch java.lang.AssertionError: unexpected higher total ops [19] compared to expected [18]
	at org.opensearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:461)
	at org.opensearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$2(AbstractSearchAsyncAction.java:303)
	at org.opensearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:338)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
	at org.opensearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:57)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:792)
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:50)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)
Caused by: Failed to execute phase [query], Shard failures; shardFailures {[L_zfhb4RRHm2BRsxkzqkig][myindex-2022.06.09][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[L_zfhb4RRHm2BRsxkzqkig][myindex-2022.06.08][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.07][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[L_zfhb4RRHm2BRsxkzqkig][myindex-2022.06.06][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.05][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.04][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.03][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.02][0]: TransportException[failure to send]; nested: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]; }{[RtTqNO27TSitOPJX41HW5g][myindex-2022.06.01][0]: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]}
	... 10 more
Caused by: TaskCancelledException[The parent task was cancelled, shouldn't start any child tasks]
	at org.opensearch.tasks.TaskManager$CancellableTaskHolder.registerChildNode(TaskManager.java:534)
	at org.opensearch.tasks.TaskManager.registerChildNode(TaskManager.java:226)

See attachment for complete logs.
log.txt

Labels

  • bug: Something isn't working
  • v2.1.0: Issues and PRs related to version 2.1.0
  • v3.0.0: Issues and PRs related to version 3.0.0
