
GCS downloads are randomly truncated about 1 time in 1000 #4129

Description

@johnrausersc

I have a process that consumes data from GCS. There are about 70k objects in the bucket, each several KB to several MB in size. A separate process continuously recomputes and refreshes the objects. My issue is that the consumer will occasionally download a file using download_to_filename() only to find that the local copy is corrupt. Specifically, the file is truncated to a small multiple of 15 KB plus a few bytes. download_to_filename() does not raise an exception when this happens. I do not bother checking the return value from download_to_filename(), as my understanding from reading the source is that it is always None.

I do not have a simple, clean repro, but am filing this issue at the request of @jonparrott.
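Since download_to_filename() returns None and raises nothing on a short read, the only way I can detect the problem is to validate the downloaded file myself. A minimal sketch of the kind of check I mean (not my production code, and the md5_hash comparison is just an illustration of a stricter check):

import base64
import hashlib
import os

from google.cloud import storage


def download_and_verify(bucket_name, src_path, dst_path):
    """Download one object and verify it against the size/MD5 reported by GCS."""
    client = storage.Client()
    blob = client.bucket(bucket_name).get_blob(src_path)  # fetches size and md5_hash metadata
    if blob is None:
        raise RuntimeError("No GCS object at gs://%s/%s" % (bucket_name, src_path))

    blob.download_to_filename(dst_path)  # returns None; no exception on a short read

    # Cheap check: local size vs. the size GCS reports for the object.
    local_size = os.stat(dst_path).st_size
    if local_size != blob.size:
        raise RuntimeError("Truncated download: got %d bytes, expected %d"
                           % (local_size, blob.size))

    # Stricter check: blob.md5_hash is the base64-encoded MD5 digest stored by GCS.
    with open(dst_path, "rb") as fh:
        local_md5 = base64.b64encode(hashlib.md5(fh.read()).digest()).decode("ascii")
    if local_md5 != blob.md5_hash:
        raise RuntimeError("MD5 mismatch for gs://%s/%s" % (bucket_name, src_path))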

  1. OS type and version
# uname -a
Linux XXXXXXX 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux
  2. Python version and virtual environment information
# python --version
Python 2.7.9
  3. google-cloud-python version (pip show google-cloud, pip show google-<service> or pip freeze)
# pip freeze | grep google
gapic-google-cloud-datastore-v1==0.15.3
gapic-google-cloud-error-reporting-v1beta1==0.15.3
gapic-google-cloud-logging-v2==0.91.3
gapic-google-cloud-pubsub-v1==0.15.4
gapic-google-cloud-spanner-admin-database-v1==0.15.3
gapic-google-cloud-spanner-admin-instance-v1==0.15.3
gapic-google-cloud-spanner-v1==0.15.3
google-api-python-client==1.6.2
google-auth==1.1.1
google-auth-httplib2==0.0.2
google-cloud==0.27.0
google-cloud-bigquery==0.26.0
google-cloud-bigtable==0.26.0
google-cloud-core==0.26.0
google-cloud-datastore==1.2.0
google-cloud-dns==0.26.0
google-cloud-error-reporting==0.26.0
google-cloud-happybase==0.22.0
google-cloud-language==0.27.0
google-cloud-logging==1.2.0
google-cloud-monitoring==0.26.0
google-cloud-pubsub==0.27.0
google-cloud-resource-manager==0.26.0
google-cloud-runtimeconfig==0.26.0
google-cloud-spanner==0.26.0
google-cloud-speech==0.28.0
google-cloud-storage==1.3.2
google-cloud-translate==1.1.0
google-cloud-videointelligence==0.25.0
google-cloud-vision==0.26.0
google-compute-engine==2.3.2
google-gax==0.15.15
google-resumable-media==0.2.3
googleapis-common-protos==1.5.3
grpc-google-cloud-datastore-v1==0.14.0
grpc-google-cloud-logging-v2==0.90.0
grpc-google-cloud-pubsub-v1==0.14.0
grpc-google-iam-v1==0.11.4
proto-google-cloud-datastore-v1==0.90.4
proto-google-cloud-error-reporting-v1beta1==0.15.3
proto-google-cloud-logging-v2==0.91.3
proto-google-cloud-pubsub-v1==0.15.4
proto-google-cloud-spanner-admin-database-v1==0.15.3
proto-google-cloud-spanner-admin-instance-v1==0.15.3
proto-google-cloud-spanner-v1==0.15.3

Note that this problem appeared only after I updated my Google Cloud libraries (which I was forced to do because of a bug on Google's side involving pubsub). Here's a diff of my requirements.txt (with context removed):

+cachetools==2.0.1
+certifi==2017.7.27.1
-chardet==2.3.0
+chardet==3.0.4
-dill==0.2.6
+dill==0.2.7.1
-futures==3.0.5
-gapic-google-cloud-datastore-v1==0.14.1
-gapic-google-cloud-logging-v2==0.90.1
-gapic-google-cloud-pubsub-v1==0.14.1
+futures==3.1.1
+gapic-google-cloud-datastore-v1==0.15.3
+gapic-google-cloud-error-reporting-v1beta1==0.15.3
+gapic-google-cloud-logging-v2==0.91.3
+gapic-google-cloud-pubsub-v1==0.15.4
+gapic-google-cloud-spanner-admin-database-v1==0.15.3
+gapic-google-cloud-spanner-admin-instance-v1==0.15.3
+gapic-google-cloud-spanner-v1==0.15.3
-google-auth==0.5.0
+google-auth==1.1.1
-google-cloud==0.22.0
  4. Stacktrace if available

That it does not produce a stacktrace is the issue! 😃

  5. Steps to reproduce

As I mentioned, I don't have a simple repro case. I think it's something like:

  1. Create a bucket with 70k objects, each several MB in size. The objects can be random data. It might be important that another process continuously refreshes these objects.

  2. Write a program to download the objects. If you want to mimic my setup as closely as possible, do this using a multiprocessing.Pool of 8 processes on an 8-core GCE machine and Pool.map() over a few dozen objects at a time (see the sketch after the code example below).

  3. Check that the downloaded file size matches the file size reported by GCS.

  4. Note that for about 1 in 1,000 downloads, the local size will be a small multiple of 15 KB plus a few bytes, not the size reported by GCS.

  6. Code example

Use something like this to download. Note that I've hacked this up from my actual code and haven't tested it.

import logging
import os
import tempfile

from google.cloud import storage


def copy_from_gcs(bucket_name, src_path, dst_path):
    """Copies the file at src_path in GCS to dst_path locally."""

    logging.info("Copying gs://%s/%s to %s", bucket_name, src_path, dst_path)
    bucket = storage.Client().bucket(bucket_name)
    if not bucket.exists():
        raise RuntimeError("GCS bucket %s doesn't exist!" % (bucket_name,))
    blob = bucket.get_blob(src_path)
    if blob is None:
        raise RuntimeError("No GCS object at %s/%s" % (bucket_name, src_path))
    remote_size = blob.size  # size reported by GCS metadata

    # Download to a temporary file first so dst_path never holds a partial copy.
    (tempfd, temp_fname) = tempfile.mkstemp()
    os.close(tempfd)

    blob.download_to_filename(temp_fname)  # returns None; raises nothing on truncation
    local_size = os.stat(temp_fname).st_size

    if local_size != remote_size:
        logging.warning("When downloading gs://%s/%s to %s, the local file's size (%d) "
                        "didn't match the remote size (%d).",
                        bucket_name, src_path, dst_path, local_size, remote_size)
        os.remove(temp_fname)
        raise RuntimeError("Downloaded file size (%d) doesn't match GCS file size (%d)"
                           % (local_size, remote_size))

    os.rename(temp_fname, dst_path)
    return True
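
And roughly how the consumer drives copy_from_gcs() in parallel (this is the sketch referenced in step 2 above; again untested, and BUCKET_NAME plus the destination paths are placeholders, not my real setup):

import multiprocessing

BUCKET_NAME = "my-bucket"  # placeholder bucket name


def _download_one(src_path):
    # Each worker process copies one object to a local path derived from its name.
    dst_path = "/data/%s" % src_path.replace("/", "_")  # placeholder destination
    return copy_from_gcs(BUCKET_NAME, src_path, dst_path)


def download_batch(object_names):
    """Download a few dozen objects at a time with a pool of 8 worker processes."""
    pool = multiprocessing.Pool(processes=8)
    try:
        return pool.map(_download_one, object_names)
    finally:
        pool.close()
        pool.join()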

Labels

api: bigquery (Issues related to the BigQuery API), api: storage (Issues related to the Cloud Storage API)
