Thank you for submitting your thoughts here @michaelnarodovitch. This is helpful. I'm curious why there are 100,000 `who_has` calls. Looking through the Worker implementation it looks like we call … Regardless, it seems like we might want to ease back on the 512 connection limit on workers generally, given that they might all be talking to the scheduler simultaneously.
Thanks again for looking into that :) Probably I was not entirely correct with the explanation. What I see clear evidence for in the logs are 51,988 stack traces thrown from distributed/distributed/worker.py Lines 2078 to 2085 in b2f594e. The way I understand it, the workers would query the scheduler to get the keys for the re-partition step. Maybe it would be a good idea to refine the … Do you see any other types of requests which might stress the scheduler in a similar way and may require special care?
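For readers following along, the pattern being described can be sketched roughly as follows. This is a toy model, not the distributed API: the function name and state layout are assumptions, but it shows why every worker issuing one such lookup per gather multiplies into a large request count during a shuffle.

```python
# Hypothetical sketch (NOT the actual distributed implementation):
# each worker asks the scheduler which peers hold the keys it needs
# before gathering the data for the re-partition step.
def who_has(scheduler_state, keys):
    """Return {key: [worker addresses holding that key]}."""
    return {k: sorted(scheduler_state.get(k, ())) for k in keys}

# Toy scheduler state: key -> set of worker addresses.
state = {
    "x-0": {"tcp://w1"},
    "x-1": {"tcp://w1", "tcp://w2"},
}

print(who_has(state, ["x-0", "x-1", "x-2"]))
# → {'x-0': ['tcp://w1'], 'x-1': ['tcp://w1', 'tcp://w2'], 'x-2': []}
```

With many partitions and many workers, these lookups all land on the single scheduler, which is where the request volume below comes from.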
#4080
I did more digging with the reproducer and identified the failure mode which is causing connection timeouts and failed tasks on our 50-worker cluster.
There are ~100,000 `who_has` requests in quite a short period of time. These go through the `ConnectionPool` of `Worker.scheduler`, which has the default connection limit of 512. This will result in ~25,000 concurrent connection attempts and saturate the scheduler listener socket. Logs would typically output:
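As a back-of-the-envelope check of the numbers above (the per-worker limit and worker count are taken from this report; the multiplication is the only assumption):

```python
# Rough arithmetic behind the "~25,000 concurrent connection attempts" figure:
# each of the 50 workers can open up to 512 connections to the scheduler
# through its ConnectionPool, all at roughly the same time during the shuffle.
n_workers = 50
pool_limit_per_worker = 512  # default ConnectionPool limit

concurrent_attempts = n_workers * pool_limit_per_worker
print(concurrent_attempts)  # → 25600, i.e. ~25,000 simultaneous connects
```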
- Tornado connection attempt, message `Connect start: <SCHEDULER>`: 170,000 events (worker side)
- Tornado connection success, message `Connect done`: 16,000 events (worker side)
- Tornado connection accept, message `On connection`: 13,000 events (scheduler side)
- Dask handshake success, message `handshake:`: 6,000 events (worker and scheduler side)

Workers would typically fail tasks or crash due to connection timeouts.
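To illustrate the back-pressure idea behind the connection limit, here is a minimal semaphore-based sketch. This is an assumption-laden toy, not the real `ConnectionPool`: it only shows how a per-worker cap bounds how many requests are in flight at once, which is the knob the discussion above is about tuning.

```python
import asyncio

async def limited_request(sem, stats):
    # Acquire a pool slot; at most `limit` requests are in flight at once.
    async with sem:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0)  # stand-in for the network round-trip
        stats["in_flight"] -= 1

async def main(n_requests=1000, limit=32):
    # `limit` plays the role of the ConnectionPool connection limit.
    sem = asyncio.Semaphore(limit)
    stats = {"in_flight": 0, "peak": 0}
    await asyncio.gather(*(limited_request(sem, stats) for _ in range(n_requests)))
    return stats["peak"]

print(asyncio.run(main()))  # peak in-flight requests never exceeds the limit
```

With 50 workers each capped at 512 slots, the aggregate load on the scheduler listener is still 50 × 512, which is why lowering the per-worker limit (or throttling the request pattern) is being considered.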
I doubt that this is a good solution and do not intend to do further work on this PR. Nevertheless, I believe this 'fix' helps to shed light on the root cause of the issue.