
Unclear user feedback in case Client-Scheduler connection breaks up #5666

@fjetter

Description

In case of a disconnect between Client and Scheduler, the logging messages are not very helpful to the typical user: they do not help in debugging the issue but rather cause more confusion. The feedback is sometimes delayed (only triggered after a timeout), the causation is not transparent, and log messages are missing or emitted at the wrong level.

A) CancelledError confusing

The code below raises a CancelledError after a timeout of 30 s. This is typically a cryptic exception most users cannot relate to. If a persist is in the chain of commands, the timeout is not even logged; only a CancelledError is raised, so the disconnect is entirely obfuscated.

Instead of a CancelledError, I would expect a more user-friendly exception telling the user about the non-running status of the client, which is most likely caused by a disconnect from the scheduler.

CancelledError after timeout
import dask.array as da
import distributed

with distributed.LocalCluster(n_workers=1) as cluster:
    client = distributed.Client(cluster)

# The cluster (and its scheduler) shuts down when the `with` block exits,
# so the client is disconnected from here on.
x = da.ones(10, chunks=5)
x.compute()  # raises CancelledError after the 30 s reconnect timeout

Causes

distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
CancelledError with persist
import dask.array as da
import distributed

with distributed.LocalCluster(n_workers=1) as cluster:
    client = distributed.Client(cluster)
    x = da.ones(10, chunks=5).persist()

# The cluster is gone, so the persisted results can never be gathered.
y = x.sum()
y.compute()
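As a user-side workaround until the messaging improves, the opaque CancelledError can be translated into an explicit error that names the client state. The sketch below relies only on the public `Client.status` attribute and `Client.compute`; `compute_with_clear_errors` itself is a hypothetical helper, not part of distributed.

```python
from asyncio import CancelledError


def compute_with_clear_errors(client, collection):
    """Hypothetical wrapper: surface a readable error instead of a bare
    CancelledError when the client has lost its scheduler connection."""
    if client.status != "running":
        raise RuntimeError(
            f"Client is {client.status!r}, not 'running'; it has likely "
            "lost its connection to the scheduler."
        )
    try:
        return client.compute(collection, sync=True)
    except CancelledError as err:
        raise RuntimeError(
            "Computation was cancelled; this often means the client "
            "lost its connection to the scheduler."
        ) from err
```

This only papers over the symptom for compute calls; the underlying reporting would still need to improve inside distributed itself.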

B) Client reconnecting to (new) scheduler

In case a reconnect finishes successfully, it can go entirely unnoticed:

import dask.array as da
import distributed

with distributed.LocalCluster(n_workers=1, scheduler_port=5678) as cluster:
    client = distributed.Client(cluster)
    x = da.ones(10, chunks=5).persist()

# New cluster, same address
cluster = distributed.LocalCluster(n_workers=1, scheduler_port=5678)
assert client.status == "running"

x.sum().compute()  # Boom
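A user-side guard for this scenario could compare the scheduler's identity before and after: `Client.scheduler_info()` returns a dict containing an `id` field that differs for a newly started scheduler. The helper below is a sketch under that assumption, not an official API.

```python
def scheduler_changed(client, expected_id):
    """Hypothetical check: has the client silently reconnected to a
    different scheduler than the one work was submitted to?"""
    return client.scheduler_info().get("id") != expected_id
```

Usage would be to record `client.scheduler_info()["id"]` right after connecting and call this check before gathering results, raising a clear error when it returns True.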

(screenshot of the resulting traceback)

Expected behaviour

  • Clear logging stating that a Scheduler<->Client disconnect happened, at a log level high enough that it is not dropped in common applications (e.g. JupyterLab / default config)
  • Clear exception messages indicating what is wrong and where to look
  • Both persist and compute calls provide a clear message
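To make the second point concrete, the expected exception could look roughly like the sketch below; `ClientDisconnectedError` is a hypothetical name and message, not something distributed currently raises.

```python
class ClientDisconnectedError(RuntimeError):
    """Hypothetical exception type illustrating the clearer feedback
    requested above; not part of distributed's public API."""

    def __init__(self, client_status, scheduler_address):
        super().__init__(
            f"Client is {client_status!r}; the connection to the scheduler "
            f"at {scheduler_address} appears to be lost. Futures created "
            "before the disconnect can no longer be gathered."
        )
```

An exception of this shape would tell the user both what state the client is in and which scheduler address to investigate, instead of a bare CancelledError.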

Note:
When only using persist, the log only appears after the 30 s timeout (after the reconnect attempt is given up).

Metadata


    Labels

    bug (Something is broken) · diagnostics · good second issue (Clearly described, educational, but less trivial than "good first issue")
