
Use streaming reads for GCS #55506

Merged — ywelsch merged 4 commits into elastic:master from ywelsch:gcs-read on Apr 21, 2020
Conversation

Contributor

@ywelsch ywelsch commented Apr 21, 2020

To read from GCS repositories we currently use the Google SDK's official BlobReadChannel, which issues a new range request every 2MB (the default chunk size for BlobReadChannel) and fully downloads each chunk before exposing it through the returned InputStream. This means the SDK issues a very large number of requests to download large blobs. Increasing the chunk size is not an option, as that would make the download process consume a correspondingly large amount of heap memory.
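To make the request overhead concrete, here is a minimal sketch of the arithmetic: with a fixed chunk size, the number of range requests grows linearly with blob size. The class and method names are illustrative, not part of the SDK.

```java
// Illustrative only: how many range requests a chunked download issues,
// assuming each request fetches exactly one chunk (2 MB is the
// BlobReadChannel default mentioned above).
public class ChunkedRequestCount {
    static long requestsFor(long blobSize, long chunkSize) {
        return (blobSize + chunkSize - 1) / chunkSize; // ceiling division
    }

    public static void main(String[] args) {
        long chunk = 2L * 1024 * 1024;       // 2 MB default chunk size
        long oneGiB = 1024L * 1024 * 1024;   // a 1 GiB blob
        System.out.println(requestsFor(oneGiB, chunk)); // prints 512
    }
}
```

So restoring a single 1 GiB blob costs 512 separate range requests at the default chunk size, which is the request amplification the PR is eliminating.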

The Google SDK does not provide the right abstractions for a streaming download. This PR uses the lower-level primitives of the SDK to implement a streaming download, similar to what S3's SDK does.
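The streaming idea can be sketched roughly as follows: open a single ranged request and expose its response body directly as an InputStream, consuming it through a small constant-size buffer instead of materializing 2MB chunks on the heap. This is a simplified illustration of the concept, not the actual SDK or Elasticsearch code; `openRangedRequest` stands in for one HTTP GET with a Range header.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of a streaming download: one request for the whole
// blob, read incrementally. Names here are illustrative only.
public class StreamingBlobReader {
    // Stands in for a single ranged HTTP request; the response body is
    // consumed incrementally rather than buffered chunk by chunk.
    static InputStream openRangedRequest(byte[] blob, int offset) {
        return new ByteArrayInputStream(blob, offset, blob.length - offset);
    }

    // Reads the whole blob via one streaming request.
    // Returns {requestsIssued, bytesRead}.
    static long[] readAll(byte[] blob) throws IOException {
        long requests = 0;
        long bytesRead = 0;
        byte[] buf = new byte[8192]; // small, constant heap footprint
        try (InputStream in = openRangedRequest(blob, 0)) {
            requests++;
            int n;
            while ((n = in.read(buf)) != -1) {
                bytesRead += n;
            }
        }
        return new long[] { requests, bytesRead };
    }

    public static void main(String[] args) throws IOException {
        byte[] blob = new byte[10 * 1024 * 1024]; // a 10 MB blob
        long[] result = readAll(blob);
        // One request for the whole blob, versus 5 at 2 MB chunks.
        System.out.println(result[0] + " request, " + result[1] + " bytes");
    }
}
```

The key property is that heap usage is bounded by the read buffer, not the chunk size, while the request count stays at one per blob (plus retries on failure).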

Also closes #55505

@ywelsch ywelsch added the :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs), v8.0.0, and v7.8.0 labels on Apr 21, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

Contributor

@original-brownbear original-brownbear left a comment


As discussed: Besides the fact that I think we should validate this against real GCS to make sure the chunk sizing is supported by the GCS API (especially that huge chunks work), LGTM :)

@ywelsch ywelsch merged commit 9233438 into elastic:master Apr 21, 2020
ywelsch added a commit that referenced this pull request Apr 21, 2020

Labels

:Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs), >enhancement, v7.8.0, v8.0.0-alpha1


Development

Successfully merging this pull request may close these issues.

[CI] GoogleCloudStorageBlobContainerRetriesTests.testReadRangeBlobWithRetries reproducibly fails with 504 Gateway error

5 participants