
Implement S3 snapshot manager#4150

Merged
generall merged 59 commits into qdrant:dev from kemkemG0:feature/make-s3-available-for-snapshots
May 10, 2024

Conversation


@kemkemG0 kemkemG0 commented May 1, 2024

What I changed

  • I added the aws-sdk-s3 crate to the collection crate for managing operations on S3. I chose this SDK because it is the official one, and version 1.24.0 is well-established and stable. The implementation was based on the official sample code at https://github.com/awslabs/aws-sdk-rust/tree/main/examples/examples/s3/src/bin.

  • Removed aws_s3_rust and replaced it with object_store. This change allows the same abstract code to be used with S3, GCS, and Azure Storage. Currently, the implementation is only for S3, but it can be easily extended to support other services.

  • Next, I implemented the S3 functionalities in the existing snapshots_manager.rs. The necessary functions using aws-sdk-s3 were implemented in snapshots_s3_ops.rs and called from there.

  • Deleted the previously created snapshot_s3_ops.rs and introduced a more abstract snapshot_storage_ops.rs.

  • One challenging aspect, different from what was initially expected, involved operations like deleting and downloading snapshots. The process used get_snapshot_path for path verification before the delete and download actions, which invariably failed when a snapshot was stored on S3 and no corresponding path existed locally.

  • To address this, I implemented get_s3_snapshot_path and used a match statement to handle different scenarios.

  • The same changes have been made in four places: delete and download of snapshots, and delete and download of full snapshots. (Note: Additional changes were also necessary for downloading shards.)

  • Addressed a review comment by removing S3-specific functions such as get_s3_snapshot_path and get_full_s3_snapshot_path. Replaced them with more abstract functions get_snapshot_path and get_full_snapshot_path located in snapshot_storage_ops.rs to enhance abstraction.

  • The use of object_store has resulted in a reduced build size.
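The local/S3 dispatch described in the bullets above can be sketched with a plain enum. This is a minimal, std-only illustration of the idea; the names SnapshotStorage and snapshot_path are illustrative, not the PR's exact API:

```rust
// Sketch of dispatching snapshot operations over two storage backends,
// in the spirit of the abstract snapshot_storage_ops described above.
// All names here are illustrative, not the PR's actual types.
enum SnapshotStorage {
    Local { base_dir: String },
    S3 { bucket: String },
}

impl SnapshotStorage {
    /// Resolve where a named snapshot lives for this backend.
    fn snapshot_path(&self, name: &str) -> String {
        match self {
            SnapshotStorage::Local { base_dir } => format!("{base_dir}/{name}"),
            SnapshotStorage::S3 { bucket } => format!("s3://{bucket}/{name}"),
        }
    }
}

fn main() {
    let local = SnapshotStorage::Local { base_dir: "./snapshots".to_string() };
    let s3 = SnapshotStorage::S3 { bucket: "qdrant-snapshots".to_string() };
    assert_eq!(local.snapshot_path("col.snapshot"), "./snapshots/col.snapshot");
    assert_eq!(s3.snapshot_path("col.snapshot"), "s3://qdrant-snapshots/col.snapshot");
    println!("ok");
}
```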

Test Modifications

  • The test script tests/snapshots/snapshots-recovery.sh was modified to allow switching between local and s3 based on the input arguments. For S3 tests, the config file is automatically modified using the yq command before execution.

✅ I have verified that all tests pass on my forked repository.

resolve: #4109

/claim #4109

All Submissions:

  • Contributions should target the dev branch. Did you create your branch from dev?
  • Have you followed the guidelines in our Contributing document?
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?

New Feature Submissions:

  1. Does your submission pass tests?
  2. Have you formatted your code locally using cargo +nightly fmt --all command prior to submission?
  3. Have you checked your code using cargo clippy --all --all-features command?

Changes to Core Features:

  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

@kemkemG0 kemkemG0 marked this pull request as ready for review May 1, 2024 06:12
@kemkemG0 kemkemG0 changed the title from Implement S3 snapshot manager to [WIP] Implement S3 snapshot manager May 1, 2024

kemkemG0 commented May 1, 2024

Implementation is done.

I am going to add unit tests and integration tests now.

Comment on lines +78 to +82
# AWS
aws-config = { version = "1.1.7", features = ["behavior-version-latest"] }
aws-sdk-s3 = "1.24.0"
aws-smithy-types = "1.1.8"
aws-smithy-types-convert = { version = "0.60.8", features = ["convert-chrono"] }
Contributor Author

I used the official AWS SDK for S3

Member

Any particular reason for that?

Comment on lines +16 to +44
#[derive(Clone, Deserialize, Debug, Default)]
pub struct SnapShotsConfig {
    pub snapshots_storage: SnapshotsStorageConfig,
    pub s3_config: Option<S3Config>,
}

#[derive(Clone, Debug, Default)]
pub enum SnapshotsStorageConfig {
    #[default]
    Local,
    S3,
}

impl<'de> Deserialize<'de> for SnapshotsStorageConfig {
    fn deserialize<D>(deserializer: D) -> Result<SnapshotsStorageConfig, D::Error>
    where
        D: serde::Deserializer<'de>,
    {
        let s: String = Deserialize::deserialize(deserializer)?;
        match s.as_str() {
            "local" => Ok(SnapshotsStorageConfig::Local),
            "s3" => Ok(SnapshotsStorageConfig::S3),
            _ => Err(serde::de::Error::custom(
                "Invalid snapshots_storage. Use 'local' or 's3'",
            )),
        }
    }
}

Contributor Author

These are for deserializing the config YAML.
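The custom deserializer above boils down to validating a string and mapping it onto the enum. As a standalone, std-only sketch of the same mapping (using FromStr in place of the serde wrapper, so it runs without external crates):

```rust
use std::str::FromStr;

// Mirror of the enum above; PartialEq derived only so the sketch can assert.
#[derive(Clone, Debug, Default, PartialEq)]
enum SnapshotsStorageConfig {
    #[default]
    Local,
    S3,
}

impl FromStr for SnapshotsStorageConfig {
    type Err = String;

    // Same accept/reject logic as the serde Deserialize impl above.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "local" => Ok(SnapshotsStorageConfig::Local),
            "s3" => Ok(SnapshotsStorageConfig::S3),
            other => Err(format!(
                "Invalid snapshots_storage '{other}'. Use 'local' or 's3'"
            )),
        }
    }
}

fn main() {
    assert_eq!("s3".parse(), Ok(SnapshotsStorageConfig::S3));
    assert_eq!("local".parse(), Ok(SnapshotsStorageConfig::Local));
    assert!("gcs".parse::<SnapshotsStorageConfig>().is_err());
    println!("ok");
}
```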

Comment on lines +32 to +33
const CHUNK_SIZE: u64 = 1024 * 1024 * 5;
const MAX_CHUNKS: u64 = 10000;
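These two constants match S3's documented multipart-upload limits: a 5 MiB minimum part size and at most 10,000 parts per upload. The commit log later mentions a get_appropriate_chunk_size refactor; a sketch of how a chunk size could be derived from these limits (the function name and signature here are illustrative, not the PR's exact code):

```rust
const CHUNK_SIZE: u64 = 1024 * 1024 * 5; // S3's 5 MiB minimum part size
const MAX_CHUNKS: u64 = 10_000; // S3's maximum number of parts per upload

// Illustrative sketch: pick the smallest part size that keeps the upload
// under MAX_CHUNKS parts (ceiling division), never going below the minimum.
fn appropriate_chunk_size(file_size: u64) -> u64 {
    let min_for_limit = (file_size + MAX_CHUNKS - 1) / MAX_CHUNKS;
    CHUNK_SIZE.max(min_for_limit)
}

fn main() {
    // 1 GiB fits comfortably in 5 MiB parts.
    assert_eq!(appropriate_chunk_size(1u64 << 30), CHUNK_SIZE);
    // 100 GiB would need more than 10,000 parts at 5 MiB, so parts grow.
    assert!(appropriate_chunk_size(100u64 << 30) > CHUNK_SIZE);
    println!("ok");
}
```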
Contributor Author

key.map(|k| k.trim_start_matches("./").to_string())
}

pub async fn multi_part_upload(
Contributor Author


generall commented May 7, 2024

A few things we would need to finalize this PR:

  • In do_get_snapshot and similar functions we download the snapshot from S3 and only then respond to the user. This might not work so well with large files, where the user would have to wait indefinitely without receiving a single response byte. Additionally, it leaves temporary files behind that are never cleaned up.

My suggestion is to stream files from either local or S3 as an

HttpResponse::Ok().content_type("application/octet-stream").streaming(stream)

In order to do that, we would need to implement something like

pub struct SnapshotStreamer {
    snapshot_manager: SnapshotStorageManager,
    snapshot_path: PathBuf,
}

impl Stream for SnapshotStreamer {
    type Item = Result<bytes::Bytes, CollectionError>;

NamedFile from actix does a lot of nice things, maybe we could either keep it for local files, or re-implement in the octet-stream.

  • make sure tests work

Please note that the main issue bounty should still be paid once this PR is merged
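The streaming suggestion above can be sketched std-only: the real code would implement futures' Stream and yield bytes::Bytes items from poll_next, but an Iterator producing fixed-size chunks shows the same shape without external crates. ChunkedReader and its chunk size are illustrative names/values, not the PR's code:

```rust
use std::io::{self, Read};

/// Yields a reader's contents as fixed-size chunks, one per `next()` call,
/// the same shape a Stream's poll_next would produce one Bytes item at a time.
struct ChunkedReader<R: Read> {
    reader: R,
    chunk_size: usize,
}

impl<R: Read> Iterator for ChunkedReader<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut buf = vec![0u8; self.chunk_size];
        match self.reader.read(&mut buf) {
            Ok(0) => None, // EOF: end of the stream
            Ok(n) => {
                buf.truncate(n);
                Some(Ok(buf))
            }
            Err(e) => Some(Err(e)),
        }
    }
}

fn main() {
    // &[u8] implements Read, so it stands in for a snapshot file here.
    let data: &[u8] = b"0123456789";
    let chunks: Vec<Vec<u8>> = ChunkedReader { reader: data, chunk_size: 4 }
        .map(|c| c.unwrap())
        .collect();
    assert_eq!(chunks, vec![b"0123".to_vec(), b"4567".to_vec(), b"89".to_vec()]);
    println!("ok");
}
```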

@kemkemG0 kemkemG0 changed the title from Implement S3 snapshot manager to [WIP] Implement S3 snapshot manager May 9, 2024

kemkemG0 commented May 10, 2024

@generall
I also noticed that shard snapshots had separate implementations, and I would need to add create_shard_snapshot, recover_shard_snapshot, and delete_shard_snapshot under snapshot_storage_manager.
These look complicated and may take some time.

Never mind, it was just an internal thing; we just need to call the common create_store after these are done.


kemkemG0 commented May 10, 2024

  • Fixed bugs; tests (should) pass
  • Implemented downloading with Stream
  • Added an integration test for the Shard Snapshot API with S3 storage

@kemkemG0 kemkemG0 changed the title from [WIP] Implement S3 snapshot manager to Implement S3 snapshot manager May 10, 2024
sleep 10
./tests/shard-snapshot-api.sh test-all

test-shard-snapshot-api-s3-minio:
Contributor Author

@kemkemG0 kemkemG0 May 10, 2024


Add a shard snapshot API integration test for the S3 version.

Comment on lines +6 to +42

pub struct SnapShotStreamLocalFS {
    pub snapshot_path: PathBuf,
    pub req: HttpRequest,
}

pub struct SnapShotStreamCloudStrage {
    pub streamer:
        std::pin::Pin<Box<dyn Stream<Item = Result<bytes::Bytes, object_store::Error>> + Send>>,
}

pub enum SnapshotStream {
    LocalFS(SnapShotStreamLocalFS),
    CloudStorage(SnapShotStreamCloudStrage),
}

impl Responder for SnapshotStream {
    type Body = actix_web::body::BoxBody;

    fn respond_to(self, _: &actix_web::HttpRequest) -> HttpResponse<Self::Body> {
        match self {
            SnapshotStream::LocalFS(stream) => match NamedFile::open(stream.snapshot_path) {
                Ok(file) => file.into_response(&stream.req),
                Err(e) => match e.kind() {
                    std::io::ErrorKind::NotFound => {
                        HttpResponse::NotFound().body(format!("File not found: {}", e))
                    }
                    _ => HttpResponse::InternalServerError()
                        .body(format!("Failed to open file: {}", e)),
                },
            },

            SnapshotStream::CloudStorage(stream) => HttpResponse::Ok()
                .content_type("application/octet-stream")
                .streaming(stream.streamer),
        }
    }
}
Contributor Author

Add SnapshotStream for downloading snapshots with streaming.

@kemkemG0

@generall
I fixed them up and added an extra integration test!

@generall generall merged commit 0d46aeb into qdrant:dev May 10, 2024
generall pushed a commit that referenced this pull request May 26, 2024
* Add SnapshotsStorageConfig enum(Local or S3) and deserialize implementation

* [refactor]  use snapshots_config instead of s3_config

* update config

* add AWS official `aws-sdk-s3`

* implement store_file() WITHOUT error handling

* implement list_snapshots

* implement delete_snapshot

* run `cargo +nightly fmt`

* delete println

* implement get_stored_file

* Add error handlings

* Refactor AWS S3 configuration and error handling

* fix bugs

* create an empty test file

* fix `alias_test.rs` for StorageConfig type

* temporarily delete some tests and try s3 test

* Update integration-tests.yml to use snap instead of apt-get for installing yq

* Update integration-tests.yml to use sudo when installing yq

* add sudo

* make (full/non-full) snapshots downloadable

* debug

* small fix

* Add S3 endpoint URL configuration option

* fix

* fix

* debug

* fix endpoint

* update to http://127.0.0.1:9000/

* update

* fix

* fix `#[get("/collections/{collection}/shards/{shard}/snapshots/{snapshot}")]` for s3

* put original tests back

* refactor

* small fix (delete println & echo)

* use object_store and refactor

* create snapshot_storage_ops and implement

* Refactor get_appropriate_chunk_size function to adjust chunk size based on service limits and file size

* cargo +nightly fmt --all

* make it more abstract

* Refactor SnapshotsStorageConfig deserialization in SnapShotsConfig

* small update

* small fix

* Update dependencies in Cargo.lock

* Update minio image to satantime/minio-server

* Refactor snapshot storage paths in snapshots_manager.rs and snapshot_storage_ops.rs

* Fix issue with downloaded file size not matching expected size in download_snapshot function

* add flush

* Use streaming instead of downloading all at once

* apply `cargo +nightly fmt --all`

* Fix issue with opening file in SnapshotStream::LocalFS variant

* Fix error handling in SnapshotStream::LocalFS variant

* Add integration test for Shard Snapshot API with S3 storage (#7)
@ghost ghost mentioned this pull request Nov 7, 2024
HighBestCoder pushed a commit to HighBestCoder/qdrant that referenced this pull request Dec 14, 2025
Summary:
Pull Request resolved: facebookresearch/faiss#4150

Creates a sharding convenience function for IVF indexes.
- The __**centroids on the quantizer**__ are sharded based on the given sharding function. (Not the data, as data sharding by ids is already implemented by copy_subset_to, https://github.com/facebookresearch/faiss/blob/main/faiss/IndexIVF.h#L408)
- The output is written to files based on the template filename generator param.
- The default sharding function is simply the ith vector mod the total shard count.

This would be called by Laser here: https://www.internalfb.com/code/fbsource/[ce1f2e028e79]/fbcode/fblearner/flow/projects/laser/laser_sim_search/knn_trainer.py?lines=295-296. This convenience function will do the file writing and return the created file names.

There are a few key required changes in FAISS:
1. Allow `std::vector<std::string>` to be used. Updates swigfaiss.swig and array_conversions.py to accommodate. These have to be numpy dtype of `object` instead of the more correct `unicode`, because the unicode dtype is fixed length. I couldn't figure out how to create a numpy array holding output file names of different lengths. (Say the file names are file1, file11, file111. The dtype would need to be U5, U6, U7 respectively, as the unicode dtype encodes the length.) I tried structured arrays: this does not work either, as numpy turns it into a matrix instead; the `file1 file11 file111` example with explicit U5, U6, U7 becomes `[[file1 file1 file1], [file1 file11 file11], [file1 file11 file111]]`, which we do not want. If someone knows the right syntax, please yell at me.
2. Create Python callbacks for sharding and template filename: `PyCallbackFilenameTemplateGenerator` and `PyCallbackShardingFunction`. Users of this function would inherit from the FilenameTemplateGenerator or ShardingFunction in C++ to pass to `shard_ivf_index_centroids`. See the other examples in python_callbacks.cpp. This is required because Python functions cannot be passed through SWIG to C++ (i.e. no std::function or function pointers), so we have to use this approach. This approach allows it to be called from both C++ and Python. test_sharding.py shows the Python calling, test_utils.cpp shows the C++ calling.

Reviewed By: asadoughi

Differential Revision: D68534991

fbshipit-source-id: b857e20c6cc4249a2ab7792db4c93dd4fb8403fd

2 participants