BufferError: - ERROR - Existing exports of data: object cannot be re-sized #1704

@TheCodeCache

Description


I am running a dask-scheduler on one node and a dask-worker on another node, and I submit a task to the dask-scheduler from a third node.

It sometimes throws the error below. I am using Python 2.7, tornado 4.5.2, and tensorflow 1.3.0.

The following is a minimal script that reproduces the mentioned error more often than not.

import subprocess
import shlex
import time

from dask.distributed import Variable, Client

## The following function/task will be executed on the dask-worker, which runs
## on a separate node in the cluster.
def my_task(stop, is_alive):
  proc = None
  proc_started = False
  try:
    while True:
      if stop.get():
        if proc is not None:
          proc.terminate()
        return
      if not proc_started:

        ### train_image_classifier.py trains a classifier on a set of images;
        ### it runs smoothly when executed as a standalone script.

        ### Start a child process in which the training will run.
        proc = subprocess.Popen(shlex.split("python train_image_classifier.py"))
        proc_started = True
      is_alive.set(proc.poll())
  finally:
    if proc is not None:
      is_alive.set(proc.poll())

### dask-client script: it submits the task to the dask-worker through the
### dask-scheduler, polls the running status of the task, and sends a stop
### signal to the worker to terminate the live task.
if __name__ == '__main__':

  client = Client("198.152.1.2:8786")  # creating a dask client

  ### These two distributed variables are used for two-way communication
  ### between the dask-client and the dask-worker.
  stop = Variable("stop_", client=client)
  is_alive = Variable("is_alive_", client=client)
  stop.set(False)
  is_alive.set(None)

  future = client.submit(my_task, stop, is_alive)

  ### Poll whether the running task is alive or not; once the child process
  ### has exited (poll() returns a return code instead of None), send the
  ### stop signal and return to the caller.
  while True:
    if is_alive.get() is not None:
      stop.set(True)
      break
    time.sleep(1)
  print("Execution over! Returning to the caller..")
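As a side note, the script above relies on `subprocess.Popen.poll()` to report liveness. A minimal sketch of those semantics, using `sys.executable` and a short sleep as a stand-in for the training command:

```python
import subprocess
import sys

# Spawn a short-lived child process (a stand-in for train_image_classifier.py).
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.5)"])

# While the child is still running, poll() returns None.
print(proc.poll())  # None

# After the child exits, poll() returns its return code.
proc.wait()
print(proc.poll())  # 0
```

This is why `is_alive` holds `None` while the training is in flight and an integer once it has stopped.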

And here is the error description from the trace. It appears during execution of the training process; sometimes it appears and sometimes it does not, in which case the training completes.

distributed.utils - ERROR - Existing exports of data: object cannot be re-sized
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/distributed/utils.py", line 238, in f
    result[0] = yield make_coro()
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/site-packages/distributed/variable.py", line 179, in _get
    client=self.client.id)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/site-packages/distributed/core.py", line 464, in send_recv_from_rpc
    result = yield send_recv(comm=comm, op=key, **kwargs)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/lib/python2.7/site-packages/distributed/core.py", line 348, in send_recv
    yield comm.write(msg)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/usr/lib64/python2.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "/usr/lib64/python2.7/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/usr/lib/python2.7/site-packages/distributed/comm/tcp.py", line 218, in write
    future = stream.write(frame)
  File "/usr/lib64/python2.7/site-packages/tornado/iostream.py", line 406, in write
    self._handle_write()
  File "/usr/lib64/python2.7/site-packages/tornado/iostream.py", line 872, in _handle_write
    del self._write_buffer[:self._write_buffer_pos]
BufferError: Existing exports of data: object cannot be re-sized

distributed.worker - WARNING - Compute Failed
Function: my_task
args: ({'upper': '1.4', 'trainable_scopes': 'InceptionV3/Logits,InceptionV3/AuxLogits', 'checkpoint_path': '/home/mapr/mano/slim_data/flowers/model/inception/inception_v3.ckpt', 'log_every_n_steps': '1', 'dataset_split_name': 'train', 'learning_rate': '0.01', 'train_dir': '/home/mapr/mano/slim_data/flowers/train_dir/train_outs_19', 'clone_on_cpu': 'True', 'batch_size': '32', 'resize_method': '3', 'hue_max_delta': '0.3', 'lower': '0.6', 'trace_every_n_steps': '1', 'script_name': 'train_image_classifier.py', 'checkpoint_exclude_scopes': 'InceptionV3/Logits,InceptionV3/AuxLogits', 'dataset_dir': '/home/mapr/mano/slim_data/flowers/slim_data_dir', 'max_number_of_steps': '4', 'model_name': 'inception_v3', 'dataset_name': 'flowers'})
kwargs: {}
Exception: BufferError('Existing exports of data: object cannot be re-sized',)

INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /home/mapr/mano/slim_data/flowers/train_dir/train_outs_19/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global step 1: loss = 2.6281 (19.799 sec/step)
INFO:tensorflow:Recording summary at step 1.
INFO:tensorflow:global step 2: loss = nan (7.406 sec/step)
INFO:tensorflow:global step 3: loss = nan (6.953 sec/step)
INFO:tensorflow:global step 4: loss = nan (6.840 sec/step)
INFO:tensorflow:Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
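For context, the BufferError itself is raised by CPython whenever a bytearray is re-sized while a memoryview still exports its underlying buffer, which is what the last traceback frame (`del self._write_buffer[...]` in tornado's iostream) runs into. A minimal sketch of that mechanism, independent of dask and tornado:

```python
buf = bytearray(b"hello world")
view = memoryview(buf)  # exports the bytearray's internal buffer

try:
    del buf[:6]  # re-sizing while a memoryview export exists is illegal
except BufferError as exc:
    print(exc)  # Existing exports of data: object cannot be re-sized
```

So the error is not specific to the training script; it indicates that tornado's write buffer was mutated while a view of it was still alive.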
