Skip to content

comm/tcp listener: do not pass comm with failed handshake to comm_handler#4240

Merged
jrbourbeau merged 3 commits intodask:masterfrom
jochen-ott-by:tcp-listener-do-not-handle-failed-connections
Nov 20, 2020
Merged

comm/tcp listener: do not pass comm with failed handshake to comm_handler#4240
jrbourbeau merged 3 commits intodask:masterfrom
jochen-ott-by:tcp-listener-do-not-handle-failed-connections

Conversation

@jochen-ott-by
Copy link
Copy Markdown
Contributor

A return was missing here, which would pass on comm to the handler, even if the handshake had errors.

Copy link
Copy Markdown
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jochen-ott-by for spotting this!

Will wait a bit to give other maintainers the possibility to chime in before I merge but I think this is a safe thing

Comment on lines 461 to +462
logger.info("Connection closed before handshake completed")
return
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. BTW this has been added to the inproc handling of the handshake, already. A pity that this diverges

try:
await self.on_connection(comm)
except CommClosedError:
logger.debug("Connection closed before handshake completed")
return

@fjetter
Copy link
Copy Markdown
Member

fjetter commented Nov 12, 2020

Both py3.7/3.8 fail in test_dynamic_workloads_sync which doesn't seem to be a known flaky test

@jochen-ott-by
Copy link
Copy Markdown
Contributor Author

Both py3.7/3.8 fail in test_dynamic_workloads_sync which doesn't seem to be a known flaky test

They fail only on windows. Not sure how to debug this tbh; I don't have access to a Windows machine here and the tests run fine for me on my linux machine for python3.6 and python3.8. I'd need some advice here from someone more familiar with the test setup and the differences between Windows and Linux that may come in here.

@mrocklin
Copy link
Copy Markdown
Member

Restarting Windows CI to see if they fail differently next time

It looks like test_broken_worker_during_computation failed on Travis/Linux

@mrocklin
Copy link
Copy Markdown
Member

Thank you for finding this @jochen-ott-by

@fjetter
Copy link
Copy Markdown
Member

fjetter commented Nov 12, 2020

test_broken_worker_during_computation is a known issue #4173

@jochen-ott-by
Copy link
Copy Markdown
Contributor Author

Restarting Windows CI to see if they fail differently next time

Indeed, they did: the first time, python3.7/python3.8 had errors while python3.6 was fine. The second run was the the exact opposite: python3.6 had errors (not test failure, only 1 error: "ERROR at teardown of test_get_task_stream_save"), python3.7/python3.8 were fine. So we've seen exactly one green build for each of python3.6/3.7/3.8 on windows.

@fjetter
Copy link
Copy Markdown
Member

fjetter commented Nov 18, 2020

I tried restarting the GH actions one more time since I haven't seen the other breaking test anywhere, yet. However, the github security changes now break the builds. The fix is already on master. @jochen-ott-by can you please rebase once more?

Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this @jochen-ott-by! Just a couple of small comments


assert not handle_comm_called

writer.close()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also wait until writer has closed

Suggested change
writer.close()
writer.close()
await writer.wait_closed()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added it in the second commit, if the wait_closed method exists (it does not for python3.6).

host, port = listener.get_host_port()
# connect without handshake:
reader, writer = await asyncio.open_connection(host=host, port=port)
await asyncio.sleep(0.02)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious what prompted the addition of this asyncio.sleep

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ensures we actually hit the timeout of the handshake path on the server side. Without this, the test would be useless. I'll add a comment.

@jochen-ott-by jochen-ott-by force-pushed the tcp-listener-do-not-handle-failed-connections branch from ad0e50d to 181ac92 Compare November 19, 2020 06:25
Co-authored-by: Lucas Rademaker <44430780+lr4d@users.noreply.github.com>
Copy link
Copy Markdown
Member

@jrbourbeau jrbourbeau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @jochen-ott-by! And @fjetter @lr4d for reviewing

@jrbourbeau jrbourbeau merged commit 2ba2731 into dask:master Nov 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants