Description
I have a process that consumes data from GCS. There are about 70k objects in the bucket, each several KB to several MB in size. A separate process continuously recomputes and refreshes the objects. My issue is that the consumer will occasionally download a file using download_to_filename() only to find that the local copy is corrupt: specifically, the file is truncated to a small multiple of 15 KB plus a few bytes. download_to_filename() does not raise an exception when this happens, and I do not bother to check its return value, since my understanding from reading the source is that it is always None.
I do not have a simple, clean repro, but am filing this issue at the request of @jonparrott.
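For what it's worth, here is a minimal sketch of how I detect the truncation by comparing the downloaded bytes against the base64-encoded MD5 that GCS stores for the object. This is not my actual code; download_and_verify and the 1 MB chunk size are made up for illustration. Note that because the objects are refreshed continuously, a mismatch could in principle also be a false positive if the object was overwritten between the metadata fetch and the download.

import base64
import hashlib

from google.cloud import storage


def download_and_verify(bucket_name, blob_name, dst_path):
    """Download a blob and raise if the local bytes don't match GCS's stored MD5."""
    blob = storage.Client().bucket(bucket_name).get_blob(blob_name)
    blob.download_to_filename(dst_path)

    # Compute the MD5 of the local file in 1 MB chunks.
    md5 = hashlib.md5()
    with open(dst_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            md5.update(chunk)
    local_md5 = base64.b64encode(md5.digest()).decode("ascii")

    # blob.md5_hash is the base64-encoded MD5 digest reported by GCS.
    if local_md5 != blob.md5_hash:
        raise RuntimeError(
            "MD5 mismatch for gs://%s/%s: local %s != remote %s"
            % (bucket_name, blob_name, local_md5, blob.md5_hash))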
- OS type and version
# uname -a
Linux XXXXXXX 3.16.0-4-amd64 #1 SMP Debian 3.16.43-2+deb8u5 (2017-09-19) x86_64 GNU/Linux
- Python version and virtual environment information (python --version)
# python --version
Python 2.7.9
- google-cloud-python version (pip show google-cloud, pip show google-<service>, or pip freeze)
pip freeze | grep google
gapic-google-cloud-datastore-v1==0.15.3
gapic-google-cloud-error-reporting-v1beta1==0.15.3
gapic-google-cloud-logging-v2==0.91.3
gapic-google-cloud-pubsub-v1==0.15.4
gapic-google-cloud-spanner-admin-database-v1==0.15.3
gapic-google-cloud-spanner-admin-instance-v1==0.15.3
gapic-google-cloud-spanner-v1==0.15.3
google-api-python-client==1.6.2
google-auth==1.1.1
google-auth-httplib2==0.0.2
google-cloud==0.27.0
google-cloud-bigquery==0.26.0
google-cloud-bigtable==0.26.0
google-cloud-core==0.26.0
google-cloud-datastore==1.2.0
google-cloud-dns==0.26.0
google-cloud-error-reporting==0.26.0
google-cloud-happybase==0.22.0
google-cloud-language==0.27.0
google-cloud-logging==1.2.0
google-cloud-monitoring==0.26.0
google-cloud-pubsub==0.27.0
google-cloud-resource-manager==0.26.0
google-cloud-runtimeconfig==0.26.0
google-cloud-spanner==0.26.0
google-cloud-speech==0.28.0
google-cloud-storage==1.3.2
google-cloud-translate==1.1.0
google-cloud-videointelligence==0.25.0
google-cloud-vision==0.26.0
google-compute-engine==2.3.2
google-gax==0.15.15
google-resumable-media==0.2.3
googleapis-common-protos==1.5.3
grpc-google-cloud-datastore-v1==0.14.0
grpc-google-cloud-logging-v2==0.90.0
grpc-google-cloud-pubsub-v1==0.14.0
grpc-google-iam-v1==0.11.4
proto-google-cloud-datastore-v1==0.90.4
proto-google-cloud-error-reporting-v1beta1==0.15.3
proto-google-cloud-logging-v2==0.91.3
proto-google-cloud-pubsub-v1==0.15.4
proto-google-cloud-spanner-admin-database-v1==0.15.3
proto-google-cloud-spanner-admin-instance-v1==0.15.3
proto-google-cloud-spanner-v1==0.15.3
Note that this problem appeared only after I updated my Google Cloud libraries (which I was forced to do because of a bug on Google's side involving Pub/Sub). Here's a diff of my requirements.txt (with context removed):
+cachetools==2.0.1
+certifi==2017.7.27.1
-chardet==2.3.0
+chardet==3.0.4
-dill==0.2.6
+dill==0.2.7.1
-futures==3.0.5
-gapic-google-cloud-datastore-v1==0.14.1
-gapic-google-cloud-logging-v2==0.90.1
-gapic-google-cloud-pubsub-v1==0.14.1
+futures==3.1.1
+gapic-google-cloud-datastore-v1==0.15.3
+gapic-google-cloud-error-reporting-v1beta1==0.15.3
+gapic-google-cloud-logging-v2==0.91.3
+gapic-google-cloud-pubsub-v1==0.15.4
+gapic-google-cloud-spanner-admin-database-v1==0.15.3
+gapic-google-cloud-spanner-admin-instance-v1==0.15.3
+gapic-google-cloud-spanner-v1==0.15.3
-google-auth==0.5.0
+google-auth==1.1.1
-google-cloud==0.22.0
- Stacktrace if available
The fact that it does not produce a stacktrace is the issue! 😃
- Steps to reproduce
As I mentioned, I don't have a simple repro case. I think it's something like:
- Create a bucket with 70k objects, each several MB in size. The objects can be random data. It might be important that another process continuously refreshes these objects.
- Write a program to download the objects. If you want to mimic my setup as closely as possible, use a multiprocessing.Pool of 8 processes on an 8-core GCE machine and Pool.map() over a few dozen objects at a time (see the sketch after this list).
- Check that each downloaded file's size matches the size reported by GCS.
- Note that on about 1 in 1,000 downloads, the local size will be a small multiple of 15 KB plus a few bytes, not the size reported by GCS.
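Here is a rough sketch of the kind of harness I mean. It is not my actual code; the bucket name and the batch size of 48 are placeholders.

import multiprocessing
import os
import tempfile

from google.cloud import storage

BUCKET = "my-test-bucket"  # placeholder bucket name


def fetch_and_check(blob_name):
    """Download one object to a temp file and report whether its size matches GCS."""
    blob = storage.Client().bucket(BUCKET).get_blob(blob_name)
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        blob.download_to_filename(path)
        return os.stat(path).st_size == blob.size
    finally:
        os.remove(path)


if __name__ == "__main__":
    names = [b.name for b in storage.Client().bucket(BUCKET).list_blobs()]
    pool = multiprocessing.Pool(8)
    # Work through the bucket a few dozen objects at a time.
    for start in range(0, len(names), 48):
        if not all(pool.map(fetch_and_check, names[start:start + 48])):
            print("Size mismatch detected in batch starting at %d" % start)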
- Code example
Use something like this to download. Note that I've hacked this up from my actual code and haven't tested it.
import logging
import os
import tempfile

from google.cloud import storage


def copy_from_gcs(bucket_name, src_path, dst_path):
    """Copies the file at src_path in GCS to dst_path locally."""
    logging.info("Copying gs://%s/%s to %s", bucket_name, src_path, dst_path)
    bucket = storage.Client().bucket(bucket_name)
    if not bucket.exists():
        raise RuntimeError("GCS bucket %s doesn't exist!" % (bucket_name,))
    blob = bucket.get_blob(src_path)
    if blob is None:
        raise RuntimeError("No GCS object at %s/%s" % (bucket_name, src_path))
    remote_size = blob.size
    (tempfd, temp_fname) = tempfile.mkstemp()
    os.close(tempfd)
    blob.download_to_filename(temp_fname)
    local_size = os.stat(temp_fname).st_size
    if local_size != remote_size:
        logging.warning(
            "When downloading gs://%s/%s to %s, the local file's size (%d) "
            "didn't match the remote size (%d).",
            bucket_name, src_path, dst_path, local_size, remote_size)
        os.remove(temp_fname)
        raise RuntimeError(
            "Downloaded file size (%d) doesn't match GCS file size (%d)"
            % (local_size, remote_size))
    os.rename(temp_fname, dst_path)
    return True