Skip to content

quiet cancelledError for scheduler shutdown#2477

Closed
danpf wants to merge 1 commit intodask:masterfrom
danpf:patch-2
Closed

quiet cancelledError for scheduler shutdown#2477
danpf wants to merge 1 commit intodask:masterfrom
danpf:patch-2

Conversation

@danpf
Copy link
Contributor

@danpf danpf commented Jan 23, 2019

gen.with_timeout unfortunately permanently adds a callback to futures, and when the timeout is over, it doesn't remove that callback.

with python asyncio futures, if you get the result of them when they are cancelled they raise CancelledErrors..... This isn't caught by error_callback in with_timeout, and so it is printed.

we cancel the futures a few lines down.

for reference: (the tornado callback)

    def error_callback(future):
        try:
            future.result()
        except Exception as e:
            if not isinstance(e, quiet_exceptions):
                app_log.error("Exception in Future %r after timeout",
                              future, exc_info=True)

possible fix for #2273

this actually might work
@danpf danpf changed the title quiet timeout for scheduler quiet cancelledError for scheduler shutdown Jan 23, 2019
@mrocklin
Copy link
Member

Thanks for the possible fix @danpf . Does this resolve the problem for you? Lets see if others on that issue have any feedback.

@danpf
Copy link
Contributor Author

danpf commented Jan 23, 2019

yes, I started getting this problem a few days ago, not really sure why.... now i don't see the error anymore.

@lesteve
Copy link
Member

lesteve commented Jan 24, 2019

I used my snippet from #2273 (comment):

# test-distributed.py
import time

from distributed import Client, LocalCluster


def func(x):
    time.sleep(1)


if __name__ == '__main__':
    cluster = LocalCluster()
    client = Client(cluster)

    # I don't think you need to .submit + .gather to trigger the errors below 
    # but this was intended to leave some time for the workers to come online
    future = client.submit(func, 1)
    client.gather(future)
    # cluster.close is called first
    cluster.close()
    client.close()

I ran it 100 times and can did not get CanceledError so I think this is a very nice improvement.

A few things I got, which I am mentioning for completeness (not suggesting they should be fixed in this PR at all):

37 times out of 100:

Exception ignored in: <generator object add_client at 0x7f2c83f72eb8>
RuntimeError: generator ignored GeneratorExit

1 times out of 100:

Future exception was never retrieved
future: <Future finished exception=StreamClosedError('Stream is closed',)>
Traceback (most recent call last):
  File "/home/local/lesteve/miniconda3/lib/python3.6/site-packages/tornado/tcpclient.py", line 112, in on_connect_done
    stream = future.result()
tornado.iostream.StreamClosedError: Stream is closed

yield gen.with_timeout(timedelta(milliseconds=100),
self._handle_scheduler_coroutine)
self._handle_scheduler_coroutine,
CancelledError)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicks:

  • maybe add the parameter name for the last argument since it is optional in gen.with_timeout
  • maybe use a tuple of length 1 to be closer to the gen.with_timeout signature (using CancelledError works because the argument is only used in a isinstance call)

i.e. something like this:

yield gen.with_timeout(..., quiet_exceptions=(CancelledError,))`

@mrocklin
Copy link
Member

Thank you for testing this @lesteve . Do you have time to submit a test with what you did? It would be useful to make sure that this doesn't sneak back in again.

@mrocklin
Copy link
Member

(no obligation though, of course)

@lesteve
Copy link
Member

lesteve commented Jan 28, 2019

Here is a test, as I tried to explain in the comment it passes by chance on master in about 10% of the cases (at least on my machine ...).

  • in case it is more convenient, I opened Fix spurious CancelledError #2485 with the fix in this PR, the test below and
    the fix for the nitpick I had above
  • You may want to double-check whether I used the loop fixture correctly. I used loop both in LocalCluster and Client but I have to admit I don't understand the subtleties involved here.
  • The test takes about 4s, not sure it should be labelled slow or not
def test_quiet_client_close_when_cluster_is_closed_before_client(loop):
    n_attempts = 5
    # Trying a few times to reduce the flakiness of the test. Without the bug
    # fix in #2477 and with 5 attempts, this test passes by chance in about 10%
    # of the cases.
    for _ in range(n_attempts):
        with captured_logger(logging.getLogger('tornado.application')) as logger:
            cluster = LocalCluster(loop=loop)
            client = Client(cluster, loop=loop)
            cluster.close()
            client.close()

        out = logger.getvalue()
        assert 'CancelledError' not in out

@mrocklin
Copy link
Member

Thanks @danpf and @lesteve . This is in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants