
Unclear user feedback in case Client-Scheduler connection breaks up #5666

@fjetter

Description

In case of a disconnect between Client and Scheduler, the logging messages are not very helpful to the typical user: they do not help in debugging the issue but rather cause more confusion. The feedback is sometimes delayed (only triggered after a timeout), the causation is not transparent, and log messages are missing or emitted at the wrong level.

A) CancelledError confusing

The code below raises a CancelledError after a timeout of 30 s. This is typically a cryptic exception most users cannot relate to. If a persist is in the chain of commands, the timeout is not even logged; only a CancelledError is raised, so the disconnect is entirely obfuscated.

Instead of a CancelledError, I would expect a more user-friendly exception telling the user about the non-running status of the client, which is most likely caused by a disconnect from the scheduler.

CancelledError after timeout
import dask.array as da
import distributed

with distributed.LocalCluster(n_workers=1) as cluster:
    client = distributed.Client(cluster)

# The cluster (and its scheduler) shuts down when the `with` block exits,
# so the client is disconnected from here on.
x = da.ones(10, chunks=5)
x.compute()  # raises CancelledError after the 30 s reconnect timeout

Causes

distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
CancelledError with persist
import dask.array as da
import distributed

with distributed.LocalCluster(n_workers=1) as cluster:
    client = distributed.Client(cluster)
    x = da.ones(10, chunks=5).persist()

# The cluster is gone, so the persisted results can never be gathered.
y = x.sum()
y.compute()
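As a user-side workaround until the messaging improves, the opaque CancelledError can be translated into an explicit error that names the client state. The sketch below relies only on the public `Client.status` attribute and `Client.compute`; `compute_with_clear_errors` itself is a hypothetical helper, not part of distributed.

```python
from asyncio import CancelledError


def compute_with_clear_errors(client, collection):
    """Hypothetical wrapper: surface a readable error instead of a bare
    CancelledError when the client has lost its scheduler connection."""
    if client.status != "running":
        raise RuntimeError(
            f"Client is {client.status!r}, not 'running'; it has likely "
            "lost its connection to the scheduler."
        )
    try:
        return client.compute(collection, sync=True)
    except CancelledError as err:
        raise RuntimeError(
            "Computation was cancelled; this often means the client "
            "lost its connection to the scheduler."
        ) from err
```

This only papers over the symptom for compute calls; the underlying reporting would still need to improve inside distributed itself.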

B) Client reconnecting to (new) scheduler

In case a reconnect finishes successfully, it can go entirely unnoticed:

import dask.array as da
import distributed

with distributed.LocalCluster(n_workers=1, scheduler_port=5678) as cluster:
    client = distributed.Client(cluster)
    x = da.ones(10, chunks=5).persist()

# New cluster, same address
cluster = distributed.LocalCluster(n_workers=1, scheduler_port=5678)
assert client.status == "running"

x.sum().compute()  # Boom
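A user-side guard for this scenario could compare the scheduler's identity before and after: `Client.scheduler_info()` returns a dict containing an `id` field that differs for a newly started scheduler. The helper below is a sketch under that assumption, not an official API.

```python
def scheduler_changed(client, expected_id):
    """Hypothetical check: has the client silently reconnected to a
    different scheduler than the one work was submitted to?"""
    return client.scheduler_info().get("id") != expected_id
```

Usage would be to record `client.scheduler_info()["id"]` right after connecting and call this check before gathering results, raising a clear error when it returns True.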

(screenshot of the resulting traceback)

Expected behaviour

  • Clear logging stating that a Scheduler<->Client disconnect happened, at a log level high enough that it is not dropped in common applications (e.g. JupyterLab / default config)
  • Clear exception messages indicating what is wrong and where to look
  • Both persist and compute calls provide a clear message
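To make the second point concrete, the expected exception could look roughly like the sketch below; `ClientDisconnectedError` is a hypothetical name and message, not something distributed currently raises.

```python
class ClientDisconnectedError(RuntimeError):
    """Hypothetical exception type illustrating the clearer feedback
    requested above; not part of distributed's public API."""

    def __init__(self, client_status, scheduler_address):
        super().__init__(
            f"Client is {client_status!r}; the connection to the scheduler "
            f"at {scheduler_address} appears to be lost. Futures created "
            "before the disconnect can no longer be gathered."
        )
```

An exception of this shape would tell the user both what state the client is in and which scheduler address to investigate, instead of a bare CancelledError.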

Note:
When only using persist, the log only appears after the 30 s timeout (after the reconnect attempt is given up).

Metadata


    Labels

    bug (Something is broken) · diagnostics · good second issue (Clearly described, educational, but less trivial than "good first issue")
