Conversation
generall commented Feb 2, 2026
- download tar
- compute sha256 for stream download
- wip: propagate unpacking down into the logic, todo: validation
- unpacked snapshot validation
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@lib/collection/src/shards/shard_holder/mod.rs`:
- Around line 1268-1270: The Unpacked branch calls move_all(snapshot_dir.path(),
&snapshot_temp_dir) without the cancellation check used in the Packed branch;
add the same cancellation check (the identical call used before ar.unpack() in
the Packed branch) immediately before invoking move_all in the
SnapshotData::Unpacked arm so the code aborts promptly if cancelled.
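The guard described above can be sketched with a std-only stand-in for the cancellation token. This is a minimal sketch: the real code uses qdrant's own `cancel` crate with `is_cancelled()` and a `Cancelled` error, so the `CancelToken` type and `move_snapshot` function here are illustrative.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative stand-in for the project's cancellation token.
#[derive(Clone)]
struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    fn new() -> Self {
        CancelToken(Arc::new(AtomicBool::new(false)))
    }
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

// Check the token immediately before the expensive step, mirroring the
// guard the Packed branch already has before `ar.unpack()`.
fn move_snapshot(cancel: &CancelToken) -> Result<(), String> {
    if cancel.is_cancelled() {
        return Err("cancelled".to_string());
    }
    // move_all(snapshot_dir.path(), &snapshot_temp_dir)?; // expensive IO here
    Ok(())
}

fn main() {
    let token = CancelToken::new();
    assert!(move_snapshot(&token).is_ok());
    token.cancel();
    // After cancellation, the branch aborts before doing any IO.
    assert!(move_snapshot(&token).is_err());
    println!("ok");
}
```

The point of checking right before the IO call, rather than at function entry, is that cancellation may arrive while earlier steps of the restore are still running.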
🧹 Nitpick comments (3)
lib/shard/src/snapshots/snapshot_data.rs (1)
4-9: Consider adding `#[derive(Debug)]` for better diagnostics.

The `SnapshotData` enum would benefit from deriving `Debug` to aid in logging and troubleshooting during snapshot operations.

💡 Suggested improvement

```diff
+#[derive(Debug)]
 pub enum SnapshotData {
     /// Tar file containing the snapshot, needs to be unpacked
     Packed(MaybeTempPath),
     /// Directory containing the unpacked snapshot
     Unpacked(TempDir),
 }
```

lib/storage/src/content_manager/snapshots/download_result.rs (1)
3-7: Consider adding `#[derive(Debug)]` for logging and diagnostics.

As a data container that will be passed through download and recovery flows, having `Debug` would help with troubleshooting.

💡 Suggested improvement

```diff
+#[derive(Debug)]
 pub struct DownloadResult {
     pub snapshot: SnapshotData,
     /// Sha256 hash of the downloaded snapshot file, if computed
     pub hash: Option<String>,
 }
```

lib/storage/src/content_manager/snapshots/download_tar.rs (1)
133-166: Test depends on an external network resource.

This test downloads from `storage.googleapis.com`, making it dependent on network availability and external service uptime. Consider marking it with `#[ignore]` or gating it behind a feature flag to prevent CI flakiness.

💡 Suggested improvement

```diff
 #[cfg(test)]
 mod tests {
     use super::*;

+    #[ignore] // Requires network access to external service
     #[tokio::test]
     async fn test_download_and_unpack_tar() {
```
```rust
SnapshotData::Unpacked(snapshot_dir) => {
    move_all(snapshot_dir.path(), &snapshot_temp_dir)?;
}
```
Missing cancellation check before move_all in the Unpacked branch.
The Packed branch (lines 1260-1262) checks for cancellation before the expensive ar.unpack() operation, but the Unpacked branch proceeds directly to move_all() without a similar check. For consistency and proper cancel responsiveness, consider adding a cancellation check here.
Proposed fix

```diff
 SnapshotData::Unpacked(snapshot_dir) => {
+    if cancel.is_cancelled() {
+        return Err(cancel::Error::Cancelled.into());
+    }
     move_all(snapshot_dir.path(), &snapshot_temp_dir)?;
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
// Before
SnapshotData::Unpacked(snapshot_dir) => {
    move_all(snapshot_dir.path(), &snapshot_temp_dir)?;
}

// After
SnapshotData::Unpacked(snapshot_dir) => {
    if cancel.is_cancelled() {
        return Err(cancel::Error::Cancelled.into());
    }
    move_all(snapshot_dir.path(), &snapshot_temp_dir)?;
}
```
🤖 Prompt for AI Agents
In `@lib/collection/src/shards/shard_holder/mod.rs` around lines 1268 - 1270, The
Unpacked branch calls move_all(snapshot_dir.path(), &snapshot_temp_dir) without
the cancellation check used in the Packed branch; add the same cancellation
check (the identical call used before ar.unpack() in the Packed branch)
immediately before invoking move_all in the SnapshotData::Unpacked arm so the
code aborts promptly if cancelled.
timvisee left a comment
Vetted and tested. It works well and I didn't see any problems.
Note that this doesn't make shard snapshot recovery fully streaming. This only streams and unpacks the outer layer. We also have an inner layer of tars (each segment is a tar), which remains non-streaming.
On the good side this removes 1 out of 3 big IO steps.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@lib/storage/src/content_manager/snapshots/download_tar.rs`:
- Around line 129-161: The test test_download_and_unpack_tar in the tests module
performs a network call to a public URL and should not run by default in unit
test runs; either mark the test ignored by default or gate it behind an
integration/feature flag. Update the test declaration for
test_download_and_unpack_tar (the async tokio test in the tests module) to
either add an ignore attribute so it only runs when explicitly included, or wrap
it with a cfg feature (e.g., cfg(feature = "integration_tests")) and document
enabling that feature in Cargo.toml so CI can opt-in when network tests are
allowed.
In `@lib/storage/src/content_manager/snapshots/download.rs`:
- Around line 94-97: The error message in the download.rs branch uses the
incorrect word "schema"; update the string passed to StorageError::bad_request
to use "scheme" instead (locate the Err(StorageError::bad_request(...)) call
that formats "URL {url} with schema {} is not supported" and change it to "URL
{url} with scheme {} is not supported") so the message correctly refers to URL
schemes; leave the surrounding call and formatting (StorageError::bad_request,
url.scheme()) unchanged.
In `@src/common/snapshots.rs`:
- Around line 207-216: The code is hashing the original snapshot_path instead of
the actual file returned by get_snapshot_file() (a MaybeTempPath), so replace
the sha_256::hash_file call that uses snapshot_path with a call that hashes the
real restored file inside snapshot_file (use the MaybeTempPath's concrete path
accessor or conversion) when computing hash in the restore flow; specifically,
update the block that sets hash (currently referencing snapshot_path) to
open/hash the path inside snapshot_file (the value returned by
collection.get_snapshots_storage_manager().get_snapshot_file(...)) so
S3-downloaded temporary files are hashed correctly.
🧹 Nitpick comments (1)
lib/storage/src/content_manager/snapshots/download_tar.rs (1)
94-104: Path traversal is protected by tar 0.4.41's built-in guards.

The project uses `tar = "0.4.41"`, which skips archive entries containing `..` and prevents writing outside the destination directory. This addresses the original concern about path traversal attacks. However, for defense-in-depth with untrusted archives, consider using `Entry::unpack_in` (iterating per entry) instead of `Archive::unpack`, as it enforces stricter per-entry containment checks.
```rust
let snapshot_file = collection
    .get_snapshots_storage_manager()?
    .get_snapshot_file(&snapshot_path, &download_dir)
    .await?;

let hash = if checksum.is_some() {
    Some(sha_256::hash_file(&snapshot_path).await?)
} else {
    None
};
```
🧩 Analysis chain

🏁 Scripts executed (duplicate searches, empty results, and output-length lines omitted):

```shell
# Full context around lines 207-216 in the restore flow
cat -n src/common/snapshots.rs | sed -n '200,225p'

# Locate the storage manager and get_snapshot_file implementations
rg -n 'get_snapshots_storage_manager' src/ -B 2 -A 5
rg -n 'get_snapshot_file' . -A 5 -B 5 | head -150
sed -n '371,385p' lib/collection/src/common/snapshots_manager.rs   # LocalFS impl
sed -n '436,460p' lib/collection/src/common/snapshots_manager.rs   # S3 impl

# Understand MaybeTempPath and how SnapshotData / DownloadResult are used
rg -n 'MaybeTempPath' src/ -B 3 -A 5 | head -80
rg -n 'SnapshotData::Packed' src/ -B 3 -A 3
rg -n 'DownloadResult' src/common/snapshots.rs -B 5 -A 5
```
Hash the snapshot file actually used for restore, not the source path.
The code calls get_snapshot_file() which returns MaybeTempPath containing the actual file to be restored, but then hashes snapshot_path instead. For LocalFS, this happens to work since the returned path equals the input; for S3, get_snapshot_file() downloads to a temporary file and returns MaybeTempPath::Temporary(temp_path), making the hash of the remote snapshot_path incorrect. Hash snapshot_file (which is MaybeTempPath and contains the correct path) instead.
🤖 Prompt for AI Agents
In `@src/common/snapshots.rs` around lines 207 - 216, The code is hashing the
original snapshot_path instead of the actual file returned by
get_snapshot_file() (a MaybeTempPath), so replace the sha_256::hash_file call
that uses snapshot_path with a call that hashes the real restored file inside
snapshot_file (use the MaybeTempPath's concrete path accessor or conversion)
when computing hash in the restore flow; specifically, update the block that
sets hash (currently referencing snapshot_path) to open/hash the path inside
snapshot_file (the value returned by
collection.get_snapshots_storage_manager().get_snapshot_file(...)) so
S3-downloaded temporary files are hashed correctly.
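The distinction the comment draws can be sketched with a minimal stand-in for `MaybeTempPath` (the real type lives in qdrant's codebase; the names and the toy hash below are illustrative): hashing must target whatever path the storage manager actually returned, not the caller-supplied source path.

```rust
use std::path::{Path, PathBuf};

// Illustrative stand-in for the project's MaybeTempPath: either the original
// on-disk file (LocalFS) or a temporary file downloaded from remote storage (S3).
enum MaybeTempPath {
    Persistent(PathBuf),
    Temporary(PathBuf),
}

impl MaybeTempPath {
    // The concrete path to hash and restore from, regardless of variant.
    fn path(&self) -> &Path {
        match self {
            MaybeTempPath::Persistent(p) | MaybeTempPath::Temporary(p) => p,
        }
    }
}

// Toy stand-in for sha_256::hash_file: returns the file length.
fn hash_file(path: &Path) -> std::io::Result<u64> {
    Ok(std::fs::metadata(path)?.len())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir();
    let downloaded = dir.join("snapshot_tmp.bin");
    std::fs::write(&downloaded, b"actual downloaded snapshot bytes")?;

    // S3 case: the storage manager returns a temp file, not the source path.
    let snapshot_file = MaybeTempPath::Temporary(downloaded.clone());

    // Correct: hash the file that will actually be restored.
    let hash = hash_file(snapshot_file.path())?;
    assert_eq!(hash, std::fs::metadata(&downloaded)?.len());
    println!("ok");
    Ok(())
}
```

For the `Persistent` variant the two paths coincide, which is why the bug only surfaces with S3-backed snapshot storage.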
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@lib/storage/src/content_manager/snapshots/download_tar.rs`:
- Around line 88-114: Replace the direct call to archive.unpack(&target_dir)
with an explicit iteration over entries so each Entry's header.entry_type() is
validated (allow only Directory, Regular, and GNUSparse as
validate_snapshot_archive() does) before extracting; specifically, inside the
spawn_blocking closure (the code creating tar::Archive and using HashingReader),
call archive.entries()? to iterate entries, for each entry check
entry.header().entry_type() and return a StorageError::service_error on any
disallowed type, and for valid entries unpack them into target_dir (e.g.,
entry.unpack_in or equivalent) instead of archive.unpack; keep the existing
trailing-drain logic and hash retrieval via the HashingReader unchanged.
In `@lib/storage/src/content_manager/snapshots/download.rs`:
- Around line 23-52: Update the doc comment for _download_snapshot to describe
that it downloads and unpacks a tar snapshot into a temporary directory
(TempDir) rather than a single file; mention that the TempDir contains the
unpacked contents and will be deleted when dropped (use TempDir::keep() to
persist), and note that the function may compute and return an optional checksum
(Option<String>) via download_and_unpack_tar; also adjust wording to reference
snapshot_prefix and the unpacking behavior instead of "file".
```rust
// Use spawn_blocking because tar::Archive is synchronous
let hash = tokio::task::spawn_blocking(move || {
    // SyncIoBridge converts an AsyncRead into a sync Read
    // It must be used within a tokio runtime context (spawn_blocking provides this)
    let sync_reader = tokio_util::io::SyncIoBridge::new(async_reader);

    // Wrap the reader with optional hashing
    let hashing_reader = HashingReader::new(sync_reader, compute_checksum);

    // We need to keep access to the hashing reader to get the hash after unpacking,
    // but tar::Archive takes ownership. Use a RefCell-like pattern with take.
    let mut archive = tar::Archive::new(hashing_reader);
    archive.set_overwrite(false);

    archive.unpack(&target_dir).map_err(|e| {
        StorageError::service_error(format!("Failed to unpack tar archive: {e}"))
    })?;

    // Drain any remaining bytes to ensure the full stream is hashed.
    // Tar files have trailing padding that Archive doesn't read.
    let mut reader = archive.into_inner();
    if reader.hasher.is_some() {
        let mut buf = [0u8; 8192];
        while reader.read(&mut buf)? > 0 {}
    }

    // Get the hash from the inner reader
```
🧩 Analysis chain

🏁 Scripts executed (repeated searches, empty results, and output-length lines omitted):

```shell
find . -type f -name "download_tar.rs" | head -5
cat -n lib/storage/src/content_manager/snapshots/download_tar.rs
rg -n "validate_snapshot_archive" lib/storage/src/content_manager/snapshots/
cat lib/storage/src/content_manager/snapshots/mod.rs
rg -n "tar::" lib/storage/src/content_manager/snapshots/ -A 2 -B 2
cat lib/storage/src/content_manager/snapshots/download.rs
cat lib/storage/src/content_manager/snapshots/recover.rs
grep -rn "\.entries()\|entry_type\|EntryType" lib/storage/src/ -A 3 -B 3
rg -n "set_overwrite|unpack" lib/storage/src/ -B2 -A2
```

🌐 Web query: tar::Archive::unpack Rust symlink security validation

💡 Result:

In the Rust tar crate, `tar::Archive::unpack(dst)` is intended to be "safe-by-default" for untrusted archives in the basic sense that it won't write files outside `dst` and skips entries whose paths contain `..` [1].

However, symlinks/hardlinks have historically been a security footgun during extraction, and the safety you get depends on the crate version:

- tar < 0.4.16: the `unpack_in` family could be abused with hardlinks/symlinks to overwrite arbitrary filesystem files (RustSec RUSTSEC-2018-0002, CVE-2018-20990). Fixed in tar >= 0.4.16. [2]
- tar < 0.4.36: `tar::Archive::unpack` could be abused so that a tarball containing a symlink causes directories to be created outside the destination (RustSec RUSTSEC-2021-0080, CVE-2021-38511). Fixed in tar >= 0.4.36. [3]

Practical guidance for symlink security validation:

- Use tar >= 0.4.36 (this includes the earlier fix from 0.4.16). [2][3]
- Prefer `Archive::unpack` (bulk) or `Entry::unpack_in` (per-entry) for untrusted archives; avoid `Entry::unpack` because it explicitly has fewer sanity checks. [4]
- If you need stricter symlink handling than the crate provides (e.g., "never create symlinks" or "reject absolute/external link targets"), you must implement the policy yourself while iterating entries, since tar archives can legitimately contain symlinks. [4]

Sources: tar `Archive::unpack` docs [1], RustSec advisories RUSTSEC-2018-0002 [2] and RUSTSEC-2021-0080 [3], tar `Entry::unpack`/`unpack_in` docs note on sanity checks [4].

Follow-up scripts confirmed `tar = "0.4.41"` in Cargo.toml and located the existing validator:

```shell
cat Cargo.toml | grep -A 2 -B 2 "tar"
rg -n "tar::EntryType" lib/
cat lib/segment/src/common/validate_snapshot_archive.rs
```
Validate tar entries before unpacking to match snapshot safety checks.
The streaming extraction path uses archive.unpack() without entry-type validation, whereas the codebase has an established validate_snapshot_archive() function (in lib/segment/src/common/validate_snapshot_archive.rs) that rejects symlinks and special file types. The current code allows these entries to be extracted before post-extraction filesystem validation occurs in recover.rs, creating a window of opportunity for unsafe extraction.
Validate entries while iterating instead of extracting first. The suggested approach mirrors the existing validate_snapshot_archive() logic that only accepts Directory, Regular, and GNUSparse entry types.
🤖 Prompt for AI Agents
In `@lib/storage/src/content_manager/snapshots/download_tar.rs` around lines 88 -
114, Replace the direct call to archive.unpack(&target_dir) with an explicit
iteration over entries so each Entry's header.entry_type() is validated (allow
only Directory, Regular, and GNUSparse as validate_snapshot_archive() does)
before extracting; specifically, inside the spawn_blocking closure (the code
creating tar::Archive and using HashingReader), call archive.entries()? to
iterate entries, for each entry check entry.header().entry_type() and return a
StorageError::service_error on any disallowed type, and for valid entries unpack
them into target_dir (e.g., entry.unpack_in or equivalent) instead of
archive.unpack; keep the existing trailing-drain logic and hash retrieval via
the HashingReader unchanged.
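The allow-list the reviewer refers to can be sketched without the tar crate. This is a minimal std-only sketch: `EntryType` below is a local mirror of the tar crate's enum, and the real `validate_snapshot_archive()` matches on `tar::EntryType` inside an `archive.entries()?` loop.

```rust
// Illustrative mirror of tar::EntryType; the real code matches on the
// tar crate's enum while iterating archive entries.
#[derive(Debug, PartialEq)]
enum EntryType {
    Regular,
    Directory,
    GnuSparse,
    Symlink,
    Hardlink,
    CharDevice,
}

// Allow-list check mirroring validate_snapshot_archive(): only directories,
// regular files, and GNU sparse files may be extracted; everything else
// (symlinks, hardlinks, device nodes, ...) is rejected before unpacking.
fn validate_entry_type(ty: &EntryType) -> Result<(), String> {
    match ty {
        EntryType::Directory | EntryType::Regular | EntryType::GnuSparse => Ok(()),
        other => Err(format!("Malformed snapshot, disallowed entry type: {other:?}")),
    }
}

fn main() {
    assert!(validate_entry_type(&EntryType::Regular).is_ok());
    assert!(validate_entry_type(&EntryType::Directory).is_ok());
    assert!(validate_entry_type(&EntryType::GnuSparse).is_ok());
    assert!(validate_entry_type(&EntryType::Symlink).is_err());
    assert!(validate_entry_type(&EntryType::Hardlink).is_err());
    println!("ok");
}
```

Running this check per entry, before `unpack_in`, closes the window in which a symlink entry could be materialized on disk ahead of the post-extraction validation in recover.rs.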
```diff
 /// Download a remote file from `url` to `path`
 ///
-/// Returns a `TempPath` that will delete the downloaded file once it is dropped.
-/// To persist the file, use `download_file(...).keep()`.
-#[must_use = "returns a TempPath, if dropped the downloaded file is deleted"]
-async fn download_file(
+/// Returns a `TempDir` that will delete the downloaded file once it is dropped.
+/// To persist the file, use [`keep()`](TempDir::keep).
+#[must_use = "returns a TempDir, if dropped the downloaded file is deleted"]
+async fn _download_snapshot(
     client: &reqwest::Client,
     url: &Url,
     dir_path: &Path,
-) -> Result<TempPath, StorageError> {
+    compute_checksum: bool,
+) -> Result<(TempDir, Option<String>), StorageError> {
     let download_start_time = tokio::time::Instant::now();
-    let (file, temp_path) = tempfile::Builder::new()
-        .prefix(&snapshot_prefix(url))
-        .suffix(".download")
-        .tempfile_in(dir_path)?
-        .into_parts();
-    let file = fs::File::from_parts::<&Path>(file, temp_path.as_ref());
-
-    log::debug!("Downloading snapshot from {url} to {temp_path:?}");
-
-    let mut file = tokio_fs::File::from_std(file);
-
-    let response = client.get(url.clone()).send().await?;
-
-    if !response.status().is_success() {
-        return Err(StorageError::bad_input(format!(
-            "Failed to download snapshot from {}: status - {}",
-            url,
-            response.status()
-        )));
-    }
-
-    let mut stream = response.bytes_stream();
-    let mut total_bytes_downloaded = 0u64;
+    let snapshot_name = snapshot_prefix(url);
 
-    while let Some(chunk_result) = stream.next().await {
-        let chunk = chunk_result?;
-        total_bytes_downloaded += chunk.len() as u64;
-        file.write_all(&chunk).await?;
-    }
+    let tempdir = tempfile::Builder::new()
+        .prefix(&snapshot_name)
+        .suffix(".download")
+        .tempdir_in(dir_path)?;
 
-    file.flush().await?;
+    let hash = download_and_unpack_tar(client, url, tempdir.path(), compute_checksum).await?;
 
     let download_duration = download_start_time.elapsed();
-    let total_size_mb = total_bytes_downloaded as f64 / BYTES_IN_MB as f64;
-    let download_speed_mbps = total_size_mb / download_duration.as_secs_f64();
     log::debug!(
-        "Snapshot download completed: path={}, size={:.2} MB, duration={:.2}s, speed={:.2} MB/s",
-        temp_path.display(),
-        total_size_mb,
+        "Snapshot download completed: path={tempdir:?}, duration={:.2}s",
         download_duration.as_secs_f64(),
-        download_speed_mbps
     );
 
-    Ok(temp_path)
+    Ok((tempdir, hash))
 }
```
🧩 Analysis chain

🏁 Scripts executed (output-length lines omitted):

```shell
fd "download.rs" lib/storage/src/content_manager/snapshots/
cat -n lib/storage/src/content_manager/snapshots/download.rs | head -60
fd "download_tar.rs" lib/storage/src/content_manager/snapshots/
cat -n lib/storage/src/content_manager/snapshots/download_tar.rs | head -100
cat -n lib/storage/src/content_manager/snapshots/download_tar.rs | sed -n '100,130p'
```
Update the doc comment to reflect directory-based downloads and unpacking behavior.
The function downloads and unpacks a tar snapshot into a TempDir, but the doc comment uses "file" terminology which is misleading. The suggested documentation updates properly reflect that the function unpacks contents into a directory.
📝 Suggested doc update

```diff
-/// Download a remote file from `url` to `path`
+/// Download and unpack a snapshot from `url` into a temporary directory.
 ///
-/// Returns a `TempDir` that will delete the downloaded file once it is dropped.
-/// To persist the file, use [`keep()`](TempDir::keep).
+/// Returns a `TempDir` that will delete the unpacked snapshot directory when dropped.
```

🤖 Prompt for AI Agents
In `@lib/storage/src/content_manager/snapshots/download.rs` around lines 23 - 52,
Update the doc comment for _download_snapshot to describe that it downloads and
unpacks a tar snapshot into a temporary directory (TempDir) rather than a single
file; mention that the TempDir contains the unpacked contents and will be
deleted when dropped (use TempDir::keep() to persist), and note that the
function may compute and return an optional checksum (Option<String>) via
download_and_unpack_tar; also adjust wording to reference snapshot_prefix and
the unpacking behavior instead of "file".
📝 Walkthrough

Replaces path-centric snapshot handling with a SnapshotData enum (Packed | Unpacked) and propagates it through collection, shard, storage, and API layers. Adds tar_unpack utilities and a download-and-unpack streaming path (with optional SHA-256) that returns a DownloadResult containing SnapshotData and an optional hash. Restores now accept SnapshotData instead of MaybeTempPath/Path; validate_snapshot_archive and related helpers were removed; tests and imports updated to construct/forward SnapshotData. Several public method signatures were changed to accept SnapshotData.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 3 passed

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@lib/common/common/src/safe_unpack.rs`:
- Around line 8-27: The function safe_unpack silently ignores canonicalize
failures by using unwrap_or; change it to propagate the error: call
fs_err::canonicalize(target_dir)? and assign the resulting PathBuf (e.g., dst)
and use that for entry.unpack_in(&dst). Update any bindings that used dst to
accept the canonicalized PathBuf instead of falling back to target_dir; ensure
the canonicalize error is returned to the caller rather than being swallowed.
In `@lib/shard/Cargo.toml`:
- Line 46: Remove the redundant dev-dependency entry for "tempfile" from the
Cargo.toml dev-dependencies section because "tempfile" is already declared as a
regular dependency (used by snapshot_data.rs); locate the dev-dependencies block
and delete the "tempfile" entry so tests use the existing runtime dependency and
avoid duplication.
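For illustration, the duplication looks roughly like this in lib/shard/Cargo.toml (a sketch; the actual version specifiers in the repository may differ):

```toml
[dependencies]
tempfile = "3"        # used at runtime by snapshot_data.rs

[dev-dependencies]
# tempfile = "3"      # redundant: regular dependencies are already visible to tests
```

Cargo makes `[dependencies]` entries available to unit and integration tests, so the `[dev-dependencies]` entry adds nothing once the crate depends on tempfile at runtime.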
- Rename `safe_unpack.rs` into `tar_unpack.rs` so it would be listed near `tar_ext.rs` in IDEs.
- Replace calls like `ar = open_snapshot_archive(…); safe_unpack(ar, …);` with a single call to `tar_unpack_file(…)`.
- Put calls to `Archive::new(); Archive::set_overwrite(false);` inside `tar_unpack_reader` (was `safe_unpack`). So now it is the only place that does `set_overwrite`.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@lib/collection/src/collection/snapshots.rs`:
- Around line 171-175: When handling SnapshotData::Unpacked in the restore flow,
ensure the destination directory exists before calling move_all: locate the
match arm that uses SnapshotData::Unpacked (uses snapshot_dir.path(), target_dir
and calls move_all) and create the target directory (e.g., via
std::fs::create_dir_all or the project's mkdir helper) for target_dir before
invoking move_all, propagating errors as before so callers still receive
failures on permission/IO errors.
SnapshotData::Unpacked(snapshot_dir) => {
    // already unpacked snapshot, validate files and move to target dir
    let snapshot_dir_path = snapshot_dir.path();
    move_all(snapshot_dir_path, target_dir)?;
}
Ensure target_dir exists before move_all.
move_all asserts the destination directory exists. When restoring from SnapshotData::Unpacked, callers that pass a fresh temp path will fail unless target_dir is created first (e.g., the .tmp path in recovery flows).
✅ Suggested fix
-match snapshot_data {
+std::fs::create_dir_all(target_dir)?;
+match snapshot_data {
     SnapshotData::Packed(snapshot_path) => {
         tar_unpack_file(&snapshot_path, target_dir)?;
         snapshot_path.close()?;
     }
     SnapshotData::Unpacked(snapshot_dir) => {
         // already unpacked snapshot, validate files and move to target dir
         let snapshot_dir_path = snapshot_dir.path();
         move_all(snapshot_dir_path, target_dir)?;
     }
 }
🤖 Prompt for AI Agents
In `@lib/collection/src/collection/snapshots.rs` around lines 171 - 175, When
handling SnapshotData::Unpacked in the restore flow, ensure the destination
directory exists before calling move_all: locate the match arm that uses
SnapshotData::Unpacked (uses snapshot_dir.path(), target_dir and calls move_all)
and create the target directory (e.g., via std::fs::create_dir_all or the
project's mkdir helper) for target_dir before invoking move_all, propagating
errors as before so callers still receive failures on permission/IO errors.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@clippy.toml`:
- Line 18: The clippy suggestion uses the module name as the replacement, which
is incorrect; update the replacement value from "common::tar_unpack" to the
actual function "common::tar_unpack::tar_unpack_reader" so the Clippy suggestion
for tar::Archive::unpack points to the reader-based helper; locate the entry
referencing tar::Archive::unpack and change its replacement string accordingly.
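A sketch of what the corrected `clippy.toml` entry might look like, assuming the repo uses clippy's `disallowed-methods` table (the actual entry may use a different field layout; `path` and `reason` are standard fields of that table):

```toml
disallowed-methods = [
    { path = "tar::Archive::unpack", reason = "use common::tar_unpack::tar_unpack_reader instead" },
]
```

Pointing the lint at the concrete function, rather than the module, gives developers a directly usable replacement when clippy fires.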
Instead of downloading from storage.googleapis.com every time the test runs, put a small snapshot file into the repo.
The snapshot file is created using this command:
curl -s \
https://storage.googleapis.com/qdrant-benchmark-snapshots/test-shard.snapshot \
| tar \
--delete segments/4ea958d8-0b64-4312-9a53-0cd857e93535.tar \
--delete segments/65ac6276-8cca-4f5c-b767-9722190cee8b.tar \
> lib/storage/src/content_manager/snapshots/test-shard.snapshot
File contents:
$ tar tf lib/storage/src/content_manager/snapshots/test-shard.snapshot
wal/
wal/closed-255
newest_clocks.json
replica_state.json
shard_config.json
$ du -sh lib/storage/src/content_manager/snapshots/test-shard.snapshot
12K lib/storage/src/content_manager/snapshots/test-shard.snapshot
$ sha256sum < lib/storage/src/content_manager/snapshots/test-shard.snapshot
5d94eac5c1ede3994a28bc406120046c37370d5d45b489a0d2252531b4e3e1f2 -
* download tar
* compute sha256 for stream download
* wip: propagate unpacking into down to the logic, todo: validation
* unpacked snapshot validation
* Minor tweaks
* Fix typo
* validation during unpack
* cancellation token
* update docstring
* remove redundant dep
* Rearrange unpack functions
- Rename `safe_unpack.rs` into `tar_unpack.rs` so it would be listed
near `tar_ext.rs` in IDEs.
- Replace calls like `ar = open_snapshot_archive(…); safe_unpack(ar, …);`
with a single call to `tar_unpack_file(…)`.
- Put calls to `Archive::new(); Archive::set_overwrite(false);` inside
`tar_unpack_reader` (was `safe_unpack`). So, now it is the only place
that does `set_overwrite`.
* Let clippy complain if tar::Archive::unpack used
* Mock snapshot download URL
---------
Co-authored-by: timvisee <tim@visee.me>
Co-authored-by: xzfc <xzfcpw@gmail.com>