BufferError: - ERROR - Existing exports of data: object cannot be re-sized #1704

@TheCodeCache

Description


I am running a dask-scheduler on one node and a dask-worker on another node, and I submit a task to the dask-scheduler from a third node.

It sometimes throws the error below. I am using Python 2.7, tornado 4.5.2, and tensorflow 1.3.0.

The following is a minimal script that reproduces the mentioned error more often than not.

import subprocess
import shlex
import time

from dask.distributed import Variable, Client

## The following function/task will be executed on the dask-worker, which runs
## on a separate node in the cluster.
def my_task(stop, is_alive):
  proc = None
  proc_started = False
  try:
    while True:
      if stop.get():
        if proc is not None:
          proc.terminate()
        return
      if not proc_started:

        ### train_image_classifier.py trains a classifier on a set of images;
        ### it runs smoothly when executed as a standalone script.

        ### Start a child process in which the training will run.
        proc = subprocess.Popen(shlex.split("python train_image_classifier.py"))
        proc_started = True
      is_alive.set(proc.poll())
  finally:
    if proc is not None:
      is_alive.set(proc.poll())

### dask-client script: it submits the task to the dask-worker through the
### dask-scheduler, polls the running status of the task, and sends a stop
### signal to the worker to terminate the live task.
if __name__ == '__main__':

  client = Client("198.152.1.2:8786")  # creating a dask client

  ### These two distributed variables are used for two-way communication
  ### between the dask-client and the dask-worker.
  stop = Variable("stop_", client=client)
  is_alive = Variable("is_alive_", client=client)
  stop.set(False)
  is_alive.set(None)

  future = client.submit(my_task, stop, is_alive)

  ### Poll whether the running task is alive or not; once the child process
  ### has exited (poll() returns a return code instead of None), send the
  ### stop signal and return to the caller.
  while True:
    if is_alive.get() is not None:
      stop.set(True)
      break
    time.sleep(1)
  print("Execution over! Returning to the caller..")
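As a side note, the script above relies on `subprocess.Popen.poll()` to report liveness. A minimal sketch of those semantics, using `sys.executable` and a short sleep as a stand-in for the training command:

```python
import subprocess
import sys

# Spawn a short-lived child process (a stand-in for train_image_classifier.py).
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.5)"])

# While the child is still running, poll() returns None.
print(proc.poll())  # None

# After the child exits, poll() returns its return code.
proc.wait()
print(proc.poll())  # 0
```

This is why `is_alive` holds `None` while the training is in flight and an integer once it has stopped.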

And here is the error description from the trace. It appears during execution of the training process; sometimes it appears and sometimes it does not, in which case the training completes.

distributed.utils - ERROR - Existing exports of data: object cannot be re-sized
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/distributed/utils.py", line 238, in f
    result[0] = yield make_coro()
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/site-packages/distributed/variable.py", line 179, in _get
    client=self.client.id)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/site-packages/distributed/core.py", line 464, in send_recv_from_rpc
    result = yield send_recv(comm=comm, op=key, **kwargs)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/site-packages/distributed/core.py", line 348, in send_recv
    yield comm.write(msg)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/usr/lib/python2.7/site-packages/distributed/comm/tcp.py", line 218, in write
    future = stream.write(frame)
  File "/usr/lib64/python2.7/site-packages/tornado/iostream.py", line 406, in write
    self._handle_write()
  File "/usr/lib64/python2.7/site-packages/tornado/iostream.py", line 872, in _handle_write
    del self._write_buffer[:self._write_buffer_pos]
BufferError: Existing exports of data: object cannot be re-sized

distributed.worker - WARNING - Compute Failed
Function: my_task
args: ({'upper': '1.4', 'trainable_scopes': 'InceptionV3/Logits,InceptionV3/AuxLogits', 'checkpoint_path': '/home/mapr/mano/slim_data/flowers/model/inception/inception_v3.ckpt', 'log_every_n_steps': '1', 'dataset_split_name': 'train', 'learning_rate': '0.01', 'train_dir': '/home/mapr/mano/slim_data/flowers/train_dir/train_outs_19', 'clone_on_cpu': 'True', 'batch_size': '32', 'resize_method': '3', 'hue_max_delta': '0.3', 'lower': '0.6', 'trace_every_n_steps': '1', 'script_name': 'train_image_classifier.py', 'checkpoint_exclude_scopes': 'InceptionV3/Logits,InceptionV3/AuxLogits', 'dataset_dir': '/home/mapr/mano/slim_data/flowers/slim_data_dir', 'max_number_of_steps': '4', 'model_name': 'inception_v3', 'dataset_name': 'flowers'})
kwargs: {}
Exception: BufferError('Existing exports of data: object cannot be re-sized',)

INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /home/mapr/mano/slim_data/flowers/train_dir/train_outs_19/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global step 1: loss = 2.6281 (19.799 sec/step)
INFO:tensorflow:Recording summary at step 1.
INFO:tensorflow:global step 2: loss = nan (7.406 sec/step)
INFO:tensorflow:global step 3: loss = nan (6.953 sec/step)
INFO:tensorflow:global step 4: loss = nan (6.840 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
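For context, the BufferError itself is raised by CPython whenever a bytearray is re-sized while a memoryview still exports its underlying buffer, which is what the last traceback frame (`del self._write_buffer[...]` in tornado's iostream) runs into. A minimal sketch of that mechanism, independent of dask and tornado:

```python
buf = bytearray(b"hello world")
view = memoryview(buf)  # exports the bytearray's internal buffer

try:
    del buf[:6]  # re-sizing while a memoryview export exists is illegal
except BufferError as exc:
    print(exc)  # Existing exports of data: object cannot be re-sized
```

So the error is not specific to the training script; it indicates that tornado's write buffer was mutated while a view of it was still alive.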
