quiet cancelledError for scheduler shutdown#2477
Conversation
this actually might work
|
Thanks for the possible fix @danpf . Does this resolve the problem for you? Lets see if others on that issue have any feedback. |
|
yes, I started getting this problem a few days ago, not really sure why.... now i don't see the error anymore. |
|
I used my snippet from #2273 (comment): # test-distributed.py
import time
from distributed import Client, LocalCluster
def func(x):
time.sleep(1)
if __name__ == '__main__':
cluster = LocalCluster()
client = Client(cluster)
# I don't think you need to .submit + .gather to trigger the errors below
# but this was intended to leave some time for the workers to come online
future = client.submit(func, 1)
client.gather(future)
# cluster.close is called first
cluster.close()
client.close()I ran it 100 times and can did not get CanceledError so I think this is a very nice improvement. A few things I got, which I am mentioning for completeness (not suggesting they should be fixed in this PR at all): 37 times out of 100: 1 times out of 100: |
| yield gen.with_timeout(timedelta(milliseconds=100), | ||
| self._handle_scheduler_coroutine) | ||
| self._handle_scheduler_coroutine, | ||
| CancelledError) |
There was a problem hiding this comment.
Nitpicks:
- maybe add the parameter name for the last argument since it is optional in
gen.with_timeout - maybe use a tuple of length 1 to be closer to the
gen.with_timeoutsignature (usingCancelledErrorworks because the argument is only used in aisinstancecall)
i.e. something like this:
yield gen.with_timeout(..., quiet_exceptions=(CancelledError,))`|
Thank you for testing this @lesteve . Do you have time to submit a test with what you did? It would be useful to make sure that this doesn't sneak back in again. |
|
(no obligation though, of course) |
|
Here is a test, as I tried to explain in the comment it passes by chance on master in about 10% of the cases (at least on my machine ...).
def test_quiet_client_close_when_cluster_is_closed_before_client(loop):
n_attempts = 5
# Trying a few times to reduce the flakiness of the test. Without the bug
# fix in #2477 and with 5 attempts, this test passes by chance in about 10%
# of the cases.
for _ in range(n_attempts):
with captured_logger(logging.getLogger('tornado.application')) as logger:
cluster = LocalCluster(loop=loop)
client = Client(cluster, loop=loop)
cluster.close()
client.close()
out = logger.getvalue()
assert 'CancelledError' not in out |
gen.with_timeoutunfortunately permanently adds a callback to futures, and when the timeout is over, it doesn't remove that callback.with python asyncio futures, if you get the result of them when they are cancelled they raise
CancelledErrors..... This isn't caught byerror_callbackinwith_timeout, and so it is printed.we cancel the futures a few lines down.
for reference: (the tornado callback)
possible fix for #2273