Fix excess threading on missing connections #2403
Conversation
Interesting. First, thanks for taking the initiative and submitting this. I appreciate it. I'm a little curious about that while loop. Under what conditions are we iterating many times? What is the `EnvironmentError` that you're getting that allows the loop to continue? I am also a little surprised that calling
I'm not entirely sure, but if the IP exists and there's no Dask scheduler there, it basically keeps checking until a timeout error, just in case the scheduler is busy responding to something else.
Unfortunately, I have no idea. I have never used Tornado before, but it looks like something to do with the way it does:
OK, I printed out the error. Given the sleep that's there, it looks like we're trying this every 10 ms. I'm not surprised that this is somewhat challenging on the system, but I am surprised that it is manifesting as too many threads. However, this also means that once that connection fails we will want to start a new one, so I'm not sure that we can pull the future outside of the loop. Things still work for you because at some point this function stops entirely, and something above it tries again.
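The general shape of the leak can be reproduced with a small stdlib-only sketch (no Dask or Tornado involved; the names here are illustrative, not distributed's actual code): every retry iteration builds a fresh executor and never shuts it down, so worker threads accumulate until some limit is hit.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

before = threading.active_count()
pools = []
for _ in range(10):
    # Anti-pattern resembling the bug: each retry iteration creates a
    # fresh executor and never shuts it down, so its worker thread
    # stays alive long after the iteration ends.
    pool = ThreadPoolExecutor(max_workers=1)
    pool.submit(time.sleep, 0.5)  # keep the worker thread busy
    pools.append(pool)

time.sleep(0.1)
leaked = threading.active_count() - before
print("leaked threads:", leaked)  # → leaked threads: 10

for pool in pools:
    pool.shutdown(wait=False)
```

With one leaked thread per iteration, a retry loop polling every 10 ms reaches a ~300-thread limit in a few seconds.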
@mrocklin The solution (I think) is to somehow hook `distributed.comm.tcp.BaseTCPConnector.connect()` up with an argument that gives the `TCPClient` init call your own (global, or the worker's/scheduler's) `ThreadPoolExecutor` that respects our desired number of threads. It will need its own
Hrm, that's really interesting. Thank you for spending the effort to track that down. It looks like that code should create and reuse a single `ThreadPoolExecutor` (it seems to be attaching it to the IOLoop for future reuse). I'm surprised that it seems to be leaking threads. One thing you could do here, if you were interested, is to verify that it continues to create executors, and if so why, and then raise that upstream to core CPython (if indeed you find that it is a bug). My guess is that it will be difficult to pass a `ThreadPoolExecutor` down from Dask, through Tornado, to asyncio (indeed it looks as though Tornado is hard-coding this). You might also want to try Tornado master to see if they have resolved the problem upstream.
This new implementation *might* be a possible option. Some tests still fail, though, so it is probably filling up the thread queue so much that it isn't able to get through all of them.
Those errors might also be intermittent (there are unfortunately a few of these). I don't know enough about Tornado internals to be able to judge this PR effectively. Do you have any thoughts on how hard it would be to add a test for this behavior? Presumably it would call `connect` on an address that needed resolving, wait for a while, and then check that there were not significantly more threads than when we started.
Previously, on an 8 core machine:

Now the test passes.

From the Tornado mailing list:
So in the future, if this becomes a more complicated problem, it might be necessary to set the loop's default executor to a specified number of threads, but for now allocating 2 threads to comms seems fine... I think 1 would also be fine, but just to be cautious...
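Setting the loop's default executor can be done with plain asyncio; here is a minimal stdlib-only sketch (the two-thread cap mirrors the number chosen here, everything else is illustrative):

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor

def noop():
    pass

async def main():
    loop = asyncio.get_running_loop()
    # Cap the loop's default executor so anything offloaded via
    # run_in_executor (e.g. DNS resolution during connect retries)
    # shares two worker threads instead of growing without bound.
    loop.set_default_executor(
        ThreadPoolExecutor(max_workers=2, thread_name_prefix="comm"))
    before = threading.active_count()
    # 50 concurrent offloaded calls still use at most 2 worker threads.
    await asyncio.gather(*(loop.run_in_executor(None, noop)
                           for _ in range(50)))
    return threading.active_count() - before

extra = asyncio.run(main())
print("extra threads:", extra)
```

However many calls are offloaded, the thread count grows by at most the executor's `max_workers`.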
mrocklin
left a comment
In principle this seems good to me. Thank you for constructing the test. It's very clear.
I've left a couple of small comments about the test, but they're relatively minor so we should be able to merge this soon.
distributed/comm/tests/test_comms.py
Outdated
```python
future_connect = gen.convert_yielded(connect("tcp://localhost:28400", 0.6))
yield future_connect
max_thread_count = yield sleep_future
assert max_thread_count == 4
```
I recommend testing that no more than a few threads were created since starting this test. Otherwise this test will fail if some other test leaks a thread (which does happen with a few of our tests).
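A relative assertion of that sort might look like this stdlib-only sketch (`simulated_connect_attempts` is a hypothetical stand-in for the real connect calls under test):

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def simulated_connect_attempts():
    # Stand-in for the connect() retries under test; a well-behaved
    # implementation cleans up its worker threads when it finishes.
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(10):
            pool.submit(time.sleep, 0.01)

start_count = threading.active_count()
simulated_connect_attempts()
time.sleep(0.1)  # let any finishing threads wind down
delta = threading.active_count() - start_count
# Assert a small delta rather than an absolute count, so unrelated
# tests that leak a thread can't cause spurious failures here.
assert delta <= 4, delta
```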
distributed/comm/tests/test_comms.py
Outdated
```python
    return max_thread_count


# tcp.TCPConnector()
sleep_future = gen.convert_yielded(sleep_for_500ms())
```
I think that you can drop the `gen.convert_yielded` here. Tornado semantics are a bit different from normal async-def semantics: `gen.coroutine` objects start running immediately once called.
This is also true below with the `future_connect` lines. I would merge them into a single `yield connect(...)` statement.
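The eager-start behavior can be seen in a small self-contained example (assuming Tornado is installed; the names here are illustrative):

```python
from tornado import gen
from tornado.ioloop import IOLoop

events = []

@gen.coroutine
def task():
    events.append("started")  # runs as soon as task() is called
    yield gen.sleep(0.01)
    return "done"

@gen.coroutine
def main():
    future = task()               # body is already running here,
    assert events == ["started"]  # before we ever yield the future
    result = yield future
    return result

result = IOLoop.current().run_sync(main)
print(result)  # → done
```

Because a `gen.coroutine` runs up to its first yield point as soon as it is called, wrapping it in `gen.convert_yielded` before yielding it adds nothing.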
This looks good to me. Thank you for coming back to this, @danpf. Two comments.
@mrocklin
The Python 2 failure is interesting. Unfortunately, I don't have a good idea of what is going on there. Do you have any idea about what's happening?
I added one more thread to the `TCPClient`, and it fixed it (on my machine). However, I have no idea why. It worries me that this is an incorrect fix that might have broader repercussions, but I'm not sure of alternatives. On the other hand, I've been using Python 3.7 asynchronous clients (with this patch, 2 threads) for a few weeks now with no problems...
I'm glad to hear it.
I'm not particularly concerned about issues with Python 2. There are other known issues that we just don't fix because Python 2 doesn't have as high a priority. If the CI passes then I'm fine with it.
So if I bump the thread count up to 100 then things run smoothly, so maybe there is some issue deep within an old version of Tornado. My suggestion at this point would be to only assign a client to
Eventually, when we drop Python 2, we can clear this out.
Thank you for fixing my silly extra inclusion!
Merging. Thanks for identifying and resolving this issue. I hope that you run into many more so that you can fix them ;) Also, I notice that this is your first code contribution to this repository. Welcome!

When you start a worker, and there is no scheduler at the location you expect it to be, it continuously generates threads until it hits some limit (~300 for me) and then fails.
The problem is this `while True` loop. We can solve this by making the future outside the loop, and then tracking that one instead of all of the ones from each `while` loop iteration. I *think* this is an okay way to do this, but maybe someone that knows Tornado more can say otherwise.
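A stdlib asyncio sketch of that idea (illustrative names, not distributed's actual code): the connection attempt is created once, outside the loop, and the retry loop only polls that same future.

```python
import asyncio

async def connect_with_retry(make_connection, timeout=1.0):
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    # Create the connection attempt ONCE, outside the retry loop,
    # instead of spawning a fresh attempt (and its helper threads)
    # on every iteration.
    attempt = asyncio.ensure_future(make_connection())
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            attempt.cancel()
            raise OSError("connection timed out")
        try:
            # Poll the same future every 10 ms; shield() keeps the
            # poll timeout from cancelling the underlying attempt.
            return await asyncio.wait_for(asyncio.shield(attempt),
                                          min(0.01, remaining))
        except asyncio.TimeoutError:
            continue

async def fake_connection():
    await asyncio.sleep(0.05)  # pretend the scheduler is slow to answer
    return "connected"

result = asyncio.run(connect_with_retry(fake_connection))
print(result)  # → connected
```

Since only one attempt exists at a time, a missing scheduler costs one pending future rather than one new future (and its threads) per 10 ms iteration.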
possible fix for #2398