feat: safely resume interrupted downloads #294
Conversation
tritone left a comment:
Couple thoughts, generally looking really good to me!
General comment, can you give some more details about how you tested this out with the emulator?
    # data corruption for that byte range alone.
    if self._expected_checksum is None and self._checksum_object is None:
        # `_get_expected_checksum()` may return None even if a checksum was
        # requested, in which case it will emit an info log _MISSING_CHECKSUM.
What causes this case to happen? Transcoding?
This is due to retried requests being range requests. For range requests, as noted here, there's no way to detect data corruption for that byte range alone.
Therefore, here we retrieve the expected checksum/checksum object only once for the initial download request. Then we calculate and validate the checksum when the download completes.
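The once-per-download checksum flow described above can be sketched as follows. This is illustrative only, not the library's actual implementation: the class and method names are made up, and `hashlib.md5` stands in for the CRC32C/MD5 checksum objects the real code obtains via `_get_expected_checksum()`. The key idea is that one running checksum object survives across the initial request and every retried range request, and validation happens exactly once, at completion.

```python
import hashlib


class ResumableDownload:
    """Sketch: validate a checksum across resumed range requests.

    Individual range responses cannot be validated against the object's
    checksum, so every chunk is hashed as it arrives and the result is
    compared only once the full object has been received.
    """

    def __init__(self, expected_checksum):
        # Fetched once, from the initial (non-range) download response.
        self._expected_checksum = expected_checksum
        # Stand-in for a google-crc32c checksum object.
        self._checksum_object = hashlib.md5()

    def write_chunk(self, chunk: bytes):
        # Called for the initial response and for every retried range response.
        self._checksum_object.update(chunk)

    def finalize(self):
        actual = self._checksum_object.hexdigest()
        if self._expected_checksum is not None and actual != self._expected_checksum:
            raise ValueError(
                f"checksum mismatch: expected {self._expected_checksum}, got {actual}"
            )


# First request yields bytes 0-4; a retry resumes with bytes 5-10.
dl = ResumableDownload(hashlib.md5(b"hello world").hexdigest())
dl.write_chunk(b"hello")
dl.write_chunk(b" world")
dl.finalize()  # no exception: the combined stream matches the expected digest
```

If `_expected_checksum` is `None` (the `_MISSING_CHECKSUM` case mentioned in the diff), `finalize()` simply skips validation.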
    if self._stream is not None:
        request_kwargs["stream"] = True

    # Assign object generation if generation is specified in the media url.
Would this happen via a user specifying a generation on the object? Were we not respecting this previously?
Yep, this would happen via a user specifying a generation on the object. Previously, we respected that only through download.media_url.
A property download._object_generation is added. It records the object generation from either (1) the generation query param in the media_url, or (2) the object generation from the initial response header. This specific line of code does (1), retrieving it from the media_url.
P.S. It's tricky how little information is passed from python-storage to resumable-media-python. A resumable-media-python download instance only knows the specified object generation from its media_url; the "object" itself isn't retained in a download.
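For illustration, extracting the generation pinned in the media URL could look like the sketch below. The helper name is hypothetical (the real code stores the result on `download._object_generation`); only the `generation` query parameter itself is a real GCS JSON API parameter.

```python
from urllib.parse import urlparse, parse_qs


def generation_from_media_url(media_url):
    """Return the object generation pinned in the media URL's query
    string, or None if the user did not specify one."""
    query = parse_qs(urlparse(media_url).query)
    values = query.get("generation")
    return int(values[0]) if values else None


url = (
    "https://storage.googleapis.com/download/storage/v1/b/my-bucket"
    "/o/my-object?alt=media&generation=1641234567890123"
)
print(generation_from_media_url(url))  # → 1641234567890123
```

When the URL carries no generation, the download would fall back to case (2): recording the generation reported in the initial response header, so later range retries can still be pinned to the same object version.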
    self._process_response(result)

    # With decompressive transcoding, GCS serves back the whole file regardless of the range request,
Wondering if this should be highlighted as a shortcoming in the decompressive transcoding docs; not being able to resume a download may be costly.
It's mentioned in the very bottom section of the decompressive transcoding docs. I agree we can add notes on how retries may be impacted in this sense.
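A minimal sketch of how a client could recognize this situation from response headers. The header names are real GCS headers; the helper function itself is illustrative, not the library's API. When the stored object is gzip-encoded but the response is served decompressed, the bytes on the wire no longer correspond to stored byte offsets, so a Range retry cannot safely resume and the download must restart from the beginning.

```python
def is_decompressive_transcoding(headers):
    """Sketch: guess whether GCS decompressed the object in transit.

    If the stored object is gzip-encoded but the response itself is not
    gzip-encoded, GCS served the decompressed bytes. A subsequent Range
    request would not line up with what was already written to the
    stream, so the download cannot be resumed mid-way.
    """
    stored = headers.get("X-Goog-Stored-Content-Encoding", "identity")
    served = headers.get("Content-Encoding", "identity")
    return stored == "gzip" and served != "gzip"


# Transcoded: stored gzip, served decompressed.
print(is_decompressive_transcoding(
    {"X-Goog-Stored-Content-Encoding": "gzip"}))  # → True
# Not transcoded: served as-is, still gzip-encoded.
print(is_decompressive_transcoding(
    {"X-Goog-Stored-Content-Encoding": "gzip",
     "Content-Encoding": "gzip"}))  # → False
```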
Thanks for the review! I've added data integrity checks and test cases to the retry conformance test (open PR). The changes in this PR are tested against the testbench using the above-mentioned tests. Before the changes, the conformance tests fail as below; with the changes in this PR, they pass when run locally.

This is looking really good in general. Based on offline discussion I would recommend moving the decompressive transcoding feature to a TODO and moving ahead with the rest of this PR. There may be some details that take a while to resolve for transcoding, and it's important that we still move ahead with the rest of this PR, which is a major fix to retry logic for downloads.

Thanks Chris and Andrew! I've moved the transcoding feature, tracking in #303.
If a retryable error occurs mid-download, the download resumes sending data to the stream from the offset_of_last_byte_received rather than starting from the beginning of the file, and resolves data integrity issues via the object_generation and the number of bytes downloaded.

Fixes #284
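The resume behavior summarized above can be sketched as building a retry request from the offset of the last byte received, with the object generation pinned so a concurrent overwrite is detected rather than silently mixed into the stream. The helper and parameter names below are illustrative, not the library's API; only the `Range` header and the `generation` query parameter are real protocol elements.

```python
def build_resume_headers(offset_of_last_byte_received, object_generation=None):
    """Sketch of a resumed download request.

    Requests only the bytes not yet written to the stream, and pins the
    generation so the retry fails fast if the object was overwritten
    instead of corrupting the partially written file.
    """
    # Open-ended range: from the resume offset to the end of the object.
    headers = {"Range": f"bytes={offset_of_last_byte_received}-"}
    params = {}
    if object_generation is not None:
        params["generation"] = object_generation
    return headers, params


headers, params = build_resume_headers(1024, object_generation=1641234567890123)
print(headers)  # → {'Range': 'bytes=1024-'}
print(params)   # → {'generation': 1641234567890123}
```

As discussed in the review thread, the checksum for the combined stream is still computed incrementally across all such range responses and validated once at completion, since a single range response cannot be checked on its own.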