Unclear user feedback in case Client-Scheduler connection breaks up #5666
Description
When the connection between Client and Scheduler breaks, the log messages are not very helpful for a typical user: instead of aiding debugging, they add confusion. The feedback is sometimes delayed (only triggered after a timeout), the cause is not transparent, and some messages are missing or logged at the wrong level.
A) CancelledError confusing
The code below raises a CancelledError after a timeout of 30 s. This is a cryptic exception most users cannot relate to. If a persist is in the chain of commands, the timeout is not even logged; only a CancelledError is raised, so the disconnect is entirely obfuscated.
Instead of a CancelledError, I would expect a more user-friendly exception telling the user that the client is no longer running, which is most likely caused by a disconnect from the scheduler.
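For illustration, such a user-friendly failure could be as simple as a fail-fast status check before submitting work. This is a hypothetical helper sketched here, not part of the distributed API; it only relies on the public `Client.status` attribute:

```python
def ensure_running(client):
    # Hypothetical user-side guard: fail fast with a readable message
    # instead of a late, bare CancelledError. `status` is the public
    # attribute on distributed.Client ("running", "closing", "closed", ...).
    status = getattr(client, "status", "unknown")
    if status != "running":
        raise RuntimeError(
            f"Client status is {status!r}, not 'running'; the connection "
            "to the scheduler was probably lost or never established."
        )
```

Something along these lines, raised from within compute()/persist(), would make the failure mode obvious instead of surfacing 30 s later as a CancelledError.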
CancelledError after timeout
```python
import dask.array as da
import distributed

with distributed.LocalCluster(n_workers=1) as cluster:
    client = distributed.Client(cluster)

x = da.ones(10, chunks=5)
x.compute()
```

Causes
```
distributed.client - ERROR - Failed to reconnect to scheduler after 30.00 seconds, closing client
_GatheringFuture exception was never retrieved
future: <_GatheringFuture finished exception=CancelledError()>
asyncio.exceptions.CancelledError
```
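The obscurity of that traceback comes from plain asyncio mechanics: when the client closes, its pending gather futures are cancelled, and awaiting a cancelled future raises a bare CancelledError that carries no context about the disconnect. A minimal stdlib sketch of that mechanism (simplified; not distributed's actual code path):

```python
import asyncio

async def wait_for_scheduler():
    # Stands in for the client waiting on a result from the scheduler.
    await asyncio.sleep(60)

async def main():
    gathered = asyncio.gather(wait_for_scheduler())
    await asyncio.sleep(0)  # let the task start
    gathered.cancel()       # simulate "failed to reconnect, closing client"
    try:
        await gathered
    except asyncio.CancelledError:
        # All context about *why* is gone; the caller only sees this.
        return "bare CancelledError"

print(asyncio.run(main()))
```

Nothing at the point of cancellation attaches the "scheduler is gone" cause to the exception, which is exactly what makes the user-visible error cryptic.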
CancelledError with persist
```python
import dask.array as da
import distributed

with distributed.LocalCluster(n_workers=1) as cluster:
    client = distributed.Client(cluster)

x = da.ones(10, chunks=5)
x = x.persist()
y = x.sum()
y.compute()
```

B) Client reconnecting to (new) scheduler
If a reconnect finishes successfully, this can go entirely unnoticed:
```python
import dask.array as da
import distributed

with distributed.LocalCluster(n_workers=1, scheduler_port=5678) as cluster:
    client = distributed.Client(cluster)
    x = da.ones(10, chunks=5).persist()

# New cluster, same address
cluster = distributed.LocalCluster(n_workers=1, scheduler_port=5678)

assert client.status == "running"
x.sum().compute()  # Boom
```

Expected behaviour
- Clear logging stating that a Scheduler<->Client disconnect happened, at a log level high enough that it is not dropped in common setups (e.g. JupyterLab / default config)
- Clear exception messages indicating what's wrong and where to look
- Both persist and compute calls should produce a clear message
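For scenario B, a user can at least detect a scheduler swap manually today. Assuming `Client.scheduler_info()` returns the scheduler's identity dict with an `"id"` field that changes across scheduler restarts (this holds in current distributed, but treat it as an assumption), a check could look like:

```python
def scheduler_changed(client, remembered_id):
    """Return True if the client now talks to a different scheduler
    than the one whose id we remembered right after connecting."""
    return client.scheduler_info()["id"] != remembered_id
```

Here `remembered_id` would be captured right after the initial connect, e.g. `remembered_id = client.scheduler_info()["id"]`. Something like this, done internally and logged, would make the silent reconnect visible.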
Note:
When only using persist, the log message appears only after the 30 s timeout, i.e. after the client has given up on reconnecting.
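As a stopgap until the log levels are revisited, users can make the `distributed.client` logger (the logger name visible in the output above) louder with the standard library `logging` module, so reconnect messages are not dropped in JupyterLab or with the default config:

```python
import logging

# Attach an explicit handler so messages are emitted even when the
# root logger is not configured (common in notebook environments).
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(name)s - %(levelname)s - %(message)s"))

logger = logging.getLogger("distributed.client")
logger.addHandler(handler)
logger.setLevel(logging.INFO)  # surface INFO-level reconnect chatter too
```

This does not fix the delayed or missing messages themselves, but it ensures that whatever the client does log actually reaches the user.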
