distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38001
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 205, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 492, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 319, in connect
    handshake = await asyncio.wait_for(comm.read(), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 494, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:38001 after 30 s
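The nested exception chain above is standard `asyncio` behaviour rather than anything Dask-specific: `asyncio.wait_for` cancels the pending read when the deadline expires, then raises `TimeoutError` chained "from" the `CancelledError`. A minimal stdlib-only sketch (the `silent_peer`/`connect` names are illustrative, not part of `distributed`):

```python
import asyncio

async def silent_peer():
    """Stand-in for a worker that accepts the TCP connection but never
    completes the handshake, so comm.read() would block forever."""
    await asyncio.sleep(3600)

async def connect(timeout=0.05):
    try:
        await asyncio.wait_for(silent_peer(), timeout)
    except asyncio.TimeoutError as exc:
        # wait_for cancels the pending coroutine, then raises TimeoutError
        # chained from the CancelledError -- the exact chain in the log.
        return type(exc).__name__, type(exc.__cause__).__name__

print(asyncio.run(connect()))
```

In `distributed`, `comm/core.py:connect` then catches that `TimeoutError` and re-raises it as the `OSError: Timed out during handshake ...` seen here.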
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38215
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/tcp.py", line 245, in write
    async def write(self, msg, serializers=None, on_error="message"):
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 452, in wait_for
    fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 320, in connect
    await asyncio.wait_for(comm.write(local_info), time_left())
  File "/root/miniconda3/lib/python3.9/asyncio/tasks.py", line 454, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2654, in gather_dep
    response = await get_data_from_worker(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3982, in get_data_from_worker
    return await retry_operation(_get_data, operation="get_data_from_worker")
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 385, in retry_operation
    return await retry(
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils_comm.py", line 370, in retry
    return await coro()
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 3959, in _get_data
    comm = await rpc.connect(worker)
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1048, in connect
    raise exc
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/core.py", line 1032, in connect
    comm = await fut
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/comm/core.py", line 324, in connect
    raise OSError(
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:38215 after 30 s
[The next six failures produced write-timeout handshake tracebacks identical to the one for tcp://127.0.0.1:38215 above; only the log and OSError lines are kept.]
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:39419
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:39419 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:42233
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:42233 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33333
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:33333 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38215
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:38215 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33333
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:33333 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:33333
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:33333 after 30 s
[The next two failures produced read-timeout handshake tracebacks identical to the one for tcp://127.0.0.1:38001 at the top; only the log and OSError lines are kept.]
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:36893
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:36893 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:41795
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:41795 after 30 s
[Two further write-timeout handshake tracebacks, identical to the one for tcp://127.0.0.1:38215 above; only the log and OSError lines are kept.]
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:41795
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:41795 after 30 s
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:41795
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:41795 after 30 s
[One further read-timeout handshake traceback, identical to the one for tcp://127.0.0.1:38001 at the top; only the log and OSError lines are kept.]
distributed.worker - ERROR - Worker stream died during communication: tcp://127.0.0.1:38707
OSError: Timed out during handshake while connecting to tcp://127.0.0.1:38707 after 30 s
distributed.utils - ERROR - ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", [("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'register-replica', 'released', 'compute-task-1635104458.1960478', 1635104468.872155), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'fetch', {}, 'compute-task-1635104458.1960478', 1635104468.8722844), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104468.872853), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104468.8727045', 1635104468.8728676), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104470.384826), ('receive-dep-failed', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104503.914738), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing-dep'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'fetch', {}, 'ensure-communicating-1635104468.8727045', 1635104503.9147604), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'missing', {}, 'ensure-communicating-1635104503.91477', 1635104503.9147818), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing', 'fetch', {}, 'find-missing-1635104504.5391095', 1635104504.5398495), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104504.5398893), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104504.5398538', 1635104504.5398993), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104504.5399654), 
("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {}, 'processing-released-1635104504.5400004', 1635104504.5406826), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.5406876), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.540689), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'release-key', 'processing-released-1635104504.5400004'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'released', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'forgotten'}, 'processing-released-1635104504.5400004', 1635104504.5407002), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'forgotten', {}, 'processing-released-1635104504.5400004', 1635104504.5407038), ('lost-during-gather', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538'), ('receive-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104548.287821)])
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/utils.py", line 648, in log_errors
    yield
  File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2745, in gather_dep
    assert ts, (d, self.story(d))
AssertionError: ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", [("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'register-replica', 'released', 'compute-task-1635104458.1960478', 1635104468.872155), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'fetch', {}, 'compute-task-1635104458.1960478', 1635104468.8722844), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104468.872853), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104468.8727045', 1635104468.8728676), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104470.384826), ('receive-dep-failed', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104503.914738), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing-dep'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'fetch', {}, 'ensure-communicating-1635104468.8727045', 1635104503.9147604), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'missing', {}, 'ensure-communicating-1635104503.91477', 1635104503.9147818), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing', 'fetch', {}, 'find-missing-1635104504.5391095', 1635104504.5398495), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104504.5398893), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104504.5398538', 1635104504.5398993), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104504.5399654), 
("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {}, 'processing-released-1635104504.5400004', 1635104504.5406826), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.5406876), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.540689), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'release-key', 'processing-released-1635104504.5400004'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'released', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'forgotten'}, 'processing-released-1635104504.5400004', 1635104504.5407002), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'forgotten', {}, 'processing-released-1635104504.5400004', 1635104504.5407038), ('lost-during-gather', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538'), ('receive-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104548.287821)])
tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOLoop object at 0x7f5f45925ac0>>, <Task finished name='Task-8548' coro=<Worker.gather_dep() done, defined at /root/miniconda3/lib/python3.9/site-packages/distributed/worker.py:2588> exception=AssertionError(("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", [("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'register-replica', 'released', 'compute-task-1635104458.1960478', 1635104468.872155), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'fetch', {}, 'compute-task-1635104458.1960478', 1635104468.8722844), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104468.872853), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104468.8727045', 1635104468.8728676), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104470.384826), ('receive-dep-failed', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104503.914738), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing-dep'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'fetch', {}, 'ensure-communicating-1635104468.8727045', 1635104503.9147604), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'missing', {}, 'ensure-communicating-1635104503.91477', 1635104503.9147818), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing', 'fetch', {}, 'find-missing-1635104504.5391095', 1635104504.5398495), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104504.5398893), 
("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104504.5398538', 1635104504.5398993), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104504.5399654), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {}, 'processing-released-1635104504.5400004', 1635104504.5406826), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.5406876), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.540689), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'release-key', 'processing-released-1635104504.5400004'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'released', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'forgotten'}, 'processing-released-1635104504.5400004', 1635104504.5407002), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'forgotten', {}, 'processing-released-1635104504.5400004', 1635104504.5407038), ('lost-during-gather', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538'), ('receive-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104548.287821)]))>)
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.9/site-packages/tornado/ioloop.py", line 741, in _run_callback
ret = callback()
File "/root/miniconda3/lib/python3.9/site-packages/tornado/ioloop.py", line 765, in _discard_future_result
future.result()
File "/root/miniconda3/lib/python3.9/site-packages/distributed/worker.py", line 2745, in gather_dep
assert ts, (d, self.story(d))
AssertionError: ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", [("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'register-replica', 'released', 'compute-task-1635104458.1960478', 1635104468.872155), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'fetch', {}, 'compute-task-1635104458.1960478', 1635104468.8722844), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104468.872853), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104468.8727045', 1635104468.8728676), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104470.384826), ('receive-dep-failed', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104468.8727045', 1635104503.914738), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing-dep'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'fetch', {}, 'ensure-communicating-1635104468.8727045', 1635104503.9147604), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'missing', {}, 'ensure-communicating-1635104503.91477', 1635104503.9147818), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'missing', 'fetch', {}, 'find-missing-1635104504.5391095', 1635104504.5398495), ('gather-dependencies', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'stimulus', 1635104504.5398893), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'fetch', 'flight', {}, 'ensure-communicating-1635104504.5398538', 1635104504.5398993), ('request-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104504.5399654), 
("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {}, 'processing-released-1635104504.5400004', 1635104504.5406826), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.5406876), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'flight', 'cancelled', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'released'}, 'processing-released-1635104504.5400004', 1635104504.540689), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'release-key', 'processing-released-1635104504.5400004'), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'cancelled', 'released', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)": 'forgotten'}, 'processing-released-1635104504.5400004', 1635104504.5407002), ("('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)", 'released', 'forgotten', {}, 'processing-released-1635104504.5400004', 1635104504.5407038), ('lost-during-gather', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538'), ('receive-dep', 'tcp://127.0.0.1:41795', {"('re-quantiles-1-59cbe1d46b9bfe7bf5a9da9c07036ff5', 75)"}, 'ensure-communicating-1635104504.5398538', 1635104548.287821)])
I have tried #8223 on a ~3.4TB gzipped Parquet dataset.
I tried four runs so far, with two different behaviours:

- One run made it through the shuffle (up to `to_parquet`), but then ran out of disk space, with over 4TB being used by the shuffle (on top of the data loaded with `read_parquet`).
- The other runs did not get past the `set_index` step. Workers seem to die and the computation hangs, always towards the end. Errors below.

Based on the first run, this does look like it could've actually been successful if I had more disk space. That's quite exciting, as external sorting has been a big issue for me.
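Since disk space rather than the shuffle itself looks like the limiting factor, the workers' scratch/spill location can be pointed at a larger volume via Dask's configuration. A minimal sketch (the `/mnt/bigdisk` mount point is hypothetical; substitute a real path):

```python
import dask

# Workers place spilled data and shuffle scratch files under
# Dask's temporary directory; point it at a larger volume.
# "/mnt/bigdisk" is a hypothetical path, not from the original run.
dask.config.set({"temporary-directory": "/mnt/bigdisk"})

assert dask.config.get("temporary-directory") == "/mnt/bigdisk"
```

The same can be done per-worker with the `--local-directory` flag of `dask-worker`.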
## Code used
Sadly I cannot share the data, but I am at least sharing the code I'm using.
## Errors
### Original error
### Another error during `to_parquet`
## Environment
- Dask version: `'2021.09.1+26.gfd1b02b6'` ([Never Merge] Prototype for scalable dataframe shuffle #8223)
- 1.3.1
- Python version: 3.9.5
- Operating System: Ubuntu 18.04.5 LTS
- Install method: conda