Fix: Connection-reset crashes the workflow #2174

Merged
pditommaso merged 5 commits into nextflow-io:master from
Lehmann-Fabian:fix-connection-reset
Jun 18, 2021

@Lehmann-Fabian
Contributor

While running Nextflow on Kubernetes, I sometimes got a "Connection reset" socket exception, which causes the whole workflow to fail.

This is the exception trace:

...
Jun-17 12:07:25.401 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'preprocessing:preprocess (LT05_L1TP_183035_19891106_20180210_01_T1)'

Caused by:
  Connection reset

java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:210)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
    at sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
    at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:185)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1564)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
    at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:347)
    at nextflow.k8s.client.K8sClient.makeRequest(K8sClient.groovy:366)
    at nextflow.k8s.client.K8sClient.makeRequest(K8sClient.groovy)
    at nextflow.k8s.client.K8sClient.get(K8sClient.groovy:379)
    at nextflow.k8s.client.K8sClient.podStatus(K8sClient.groovy:196)
    at nextflow.k8s.client.K8sClient.podState(K8sClient.groovy:228)
    at nextflow.k8s.K8sTaskHandler.getState(K8sTaskHandler.groovy:261)
    at nextflow.k8s.K8sTaskHandler.checkIfCompleted(K8sTaskHandler.groovy:318)
    at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:605)
    at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:530)
    at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:409)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1268)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1035)
    at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:1029)
    at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:1012)
    at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:101)
    at nextflow.processor.TaskPollingMonitor$_start_closure2.doCall(TaskPollingMonitor.groovy:293)
    at nextflow.processor.TaskPollingMonitor$_start_closure2.call(TaskPollingMonitor.groovy)
    at groovy.lang.Closure.run(Closure.java:493)
    at java.lang.Thread.run(Thread.java:748)
Jun-17 12:07:25.412 [Task monitor] DEBUG nextflow.Session - Session aborted -- Cause: Connection reset
Jun-17 12:07:25.435 [Task monitor] DEBUG nextflow.Session - The following nodes are still active:
...

Here is my proposed solution: I wrap the call and retry the API request if it fails with a socket exception. This solved the problem for me.
We can discuss whether to increase or decrease the sleep time, and whether to catch more than SocketException.
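
For readers who want to see the idea in isolation, here is a minimal sketch of wrapping a call and retrying it on a socket exception. It is not the code merged in this PR: the class name RetryOnSocketError, the method withRetry, and the maxAttempts and sleepMillis parameters are all hypothetical, and the real change lives in the Groovy K8sClient.

```java
import java.net.SocketException;
import java.util.concurrent.Callable;

public class RetryOnSocketError {

    /**
     * Runs the given call and retries it when it throws a SocketException.
     * maxAttempts and sleepMillis are illustrative knobs (assumptions),
     * not values taken from the PR.
     */
    public static <T> T withRetry(Callable<T> call, int maxAttempts, long sleepMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        int attempt = 1;
        while (true) {
            try {
                return call.call();
            } catch (SocketException e) {
                // Give up and rethrow once the attempt budget is used up.
                if (attempt >= maxAttempts) {
                    throw e;
                }
                attempt++;
                // Fixed pause before the next attempt; the right length is open for discussion.
                Thread.sleep(sleepMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated API request that fails twice with "Connection reset", then succeeds.
        String state = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new SocketException("Connection reset");
            }
            return "Running";
        }, 5, 10L);
        System.out.println(state + " after " + calls[0] + " attempts");
    }
}
```

Only SocketException is caught here, so any other failure still propagates immediately. Widening the catch would be the place to handle more exception types.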

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Member

@pditommaso pditommaso left a comment

Hi, thanks for this. It indeed makes sense. I left a couple of comments.

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
@Lehmann-Fabian
Contributor Author

Hi @pditommaso, thanks for your feedback. I addressed both and changed it accordingly.

Signed-off-by: Lehmann-Fabian <fabian.lehmann@informatik.hu-berlin.de>
Member

@pditommaso pditommaso left a comment

This looks good. I don't understand why the tests are failing, though 😕

@Lehmann-Fabian
Contributor Author

Me neither. I ran it again locally and it passed all tests.

@pditommaso pditommaso merged commit 42c1153 into nextflow-io:master Jun 18, 2021
@pditommaso
Member

No idea. I've merged manually. Thanks for this PR
