Issues starting workers #2506

@bpmweel

Description

I'm trying to start a dask cluster on my local machine. LocalCluster works fine, but when I start the scheduler and worker from the command line I run into issues:

$ dask-scheduler
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:  tcp://145.90.225.10:8786
distributed.scheduler - INFO -       bokeh at:                     :8787
distributed.scheduler - INFO - Local Directory: /var/folders/h6/ck24x_854wd94jzwy9r7gvl40000gn/T/scheduler-gjtgzvz2
distributed.scheduler - INFO - -----------------------------------------------
$ dask-worker 145.90.225.10:8786
distributed.nanny - INFO -         Start Nanny at: 'tcp://145.90.225.10:51273'
distributed.worker - INFO -       Start worker at:  tcp://145.90.225.10:51274
distributed.worker - INFO -          Listening to:  tcp://145.90.225.10:51274
distributed.worker - INFO -              nanny at:        145.90.225.10:51273
distributed.worker - INFO -              bokeh at:        145.90.225.10:51275
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                   17.18 GB
distributed.worker - INFO -       Local Directory: /Users/bweel/worker-73osop8n
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
distributed.worker - INFO - Waiting to connect to:   tcp://145.90.225.10:8786
... (the "Waiting to connect" line repeats indefinitely)

I suspect this is related to the machine having multiple network interfaces, combined with the nanny process. The following combination does work on the command line:

$ dask-scheduler --interface en0
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at: tcp://145.100.116.139:8786
distributed.scheduler - INFO -       bokeh at:      145.100.116.139:8787
distributed.scheduler - INFO - Local Directory: /var/folders/h6/ck24x_854wd94jzwy9r7gvl40000gn/T/scheduler-3evaz473
distributed.scheduler - INFO - -----------------------------------------------
distributed.scheduler - INFO - Register tcp://145.100.116.139:62873
distributed.scheduler - INFO - Starting worker compute stream, tcp://145.100.116.139:62873
distributed.core - INFO - Starting established connection
$ dask-worker 145.100.116.139:8786 --interface en0 --no-nanny
distributed.diskutils - INFO - Found stale lock file and directory '/Users/bweel/worker-o4mon_8t', purging
distributed.worker - INFO -       Start worker at: tcp://145.100.116.139:62873
distributed.worker - INFO -          Listening to: tcp://145.100.116.139:62873
distributed.worker - INFO -              bokeh at:      145.100.116.139:62874
distributed.worker - INFO - Waiting to connect to: tcp://145.100.116.139:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          8
distributed.worker - INFO -                Memory:                   17.18 GB
distributed.worker - INFO -       Local Directory: /Users/bweel/worker-br866x7c
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -         Registered to: tcp://145.100.116.139:8786
distributed.worker - INFO - -------------------------------------------------
distributed.core - INFO - Starting established connection
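To see which interfaces the machine exposes and which address the OS would pick for outbound traffic (which is how I settled on en0), a small standard-library sketch, not dask-specific, can be used:

```python
import socket

# List the network interfaces the kernel reports (Unix only).
for index, name in socket.if_nameindex():
    print(index, name)

# Address the OS would use for outbound traffic. Connecting a UDP socket
# sends no packets; it only selects a route (8.8.8.8 is an arbitrary
# public address used purely for route selection).
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
try:
    s.connect(("8.8.8.8", 80))
    print("outbound address:", s.getsockname()[0])
except OSError:
    print("no route to the outside")
finally:
    s.close()
```

If the outbound address differs from the one the scheduler binds to by default, that would explain why workers on other interfaces cannot reach it.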

However, now I cannot connect to the scheduler from IPython:

from dask.distributed import Client
client = Client('tcp://145.100.116.139:8786')

---------------------------------------------------------------------------
TimeoutError                              Traceback (most recent call last)
/usr/local/lib/python3.6/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, connection_args)
    203                                           future,
--> 204                                           quiet_exceptions=EnvironmentError)
    205         except FatalCommClosedError:

/usr/local/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:

/usr/local/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:

/usr/local/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

TimeoutError: Timeout

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
<ipython-input-7-2c6e48226add> in <module>()
----> 1 client = Client('tcp://145.100.116.139:8786')

/usr/local/lib/python3.6/site-packages/distributed/client.py in __init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, **kwargs)
    636             ext(self)
    637
--> 638         self.start(timeout=timeout)
    639
    640         from distributed.recreate_exceptions import ReplayExceptionClient

/usr/local/lib/python3.6/site-packages/distributed/client.py in start(self, **kwargs)
    759             self._started = self._start(**kwargs)
    760         else:
--> 761             sync(self.loop, self._start, **kwargs)
    762
    763     def __await__(self):

/usr/local/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, *args, **kwargs)
    275             e.wait(10)
    276     if error[0]:
--> 277         six.reraise(*error[0])
    278     else:
    279         return result[0]

/usr/local/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
    684         if value.__traceback__ is not tb:
    685             raise value.with_traceback(tb)
--> 686         raise value
    687
    688 else:

/usr/local/lib/python3.6/site-packages/distributed/utils.py in f()
    260             if timeout is not None:
    261                 future = gen.with_timeout(timedelta(seconds=timeout), future)
--> 262             result[0] = yield future
    263         except Exception as exc:
    264             error[0] = sys.exc_info()

/usr/local/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/usr/local/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/usr/local/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/usr/local/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

/usr/local/lib/python3.6/site-packages/distributed/client.py in _start(self, timeout, **kwargs)
    847         self.scheduler_comm = None
    848
--> 849         yield self._ensure_connected(timeout=timeout)
    850
    851         for pc in self._periodic_callbacks.values():

/usr/local/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/usr/local/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/usr/local/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/usr/local/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

/usr/local/lib/python3.6/site-packages/distributed/client.py in _ensure_connected(self, timeout)
    885         try:
    886             comm = yield connect(self.scheduler.address, timeout=timeout,
--> 887                                  connection_args=self.connection_args)
    888             if timeout is not None:
    889                 yield gen.with_timeout(timedelta(seconds=timeout),

/usr/local/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/usr/local/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/usr/local/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/usr/local/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1061                     if exc_info is not None:
   1062                         try:
-> 1063                             yielded = self.gen.throw(*exc_info)
   1064                         finally:
   1065                             # Break up a reference to itself

/usr/local/lib/python3.6/site-packages/distributed/comm/core.py in connect(addr, timeout, deserialize, connection_args)
    213                 _raise(error)
    214         except gen.TimeoutError:
--> 215             _raise(error)
    216         else:
    217             break

/usr/local/lib/python3.6/site-packages/distributed/comm/core.py in _raise(error)
    193         msg = ("Timed out trying to connect to %r after %s s: %s"
    194                % (addr, timeout, error))
--> 195         raise IOError(msg)
    196
    197     # This starts a thread

OSError: Timed out trying to connect to 'tcp://145.100.116.139:8786' after 10 s: connect() didn't finish in time
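To rule out a plain network problem before blaming dask, one can check whether the scheduler port is reachable at the TCP level at all (a stdlib sketch; the host and port below are the ones from the logs above):

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# From the client machine, for the setup above, one would check:
# port_reachable("145.100.116.139", 8786)
```

If this returns False while the scheduler is running, the timeout comes from routing or binding (the scheduler listening on a different interface than the client connects to), not from dask itself.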
