Conversation
generall commented Feb 2, 2026
- download tar
- compute sha256 for stream download
- wip: propagate unpacking down into the logic, todo: validation
- unpacked snapshot validation
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@lib/collection/src/shards/shard_holder/mod.rs`:
- Around line 1268-1270: The Unpacked branch calls move_all(snapshot_dir.path(),
&snapshot_temp_dir) without the cancellation check used in the Packed branch;
add the same cancellation check (the identical call used before ar.unpack() in
the Packed branch) immediately before invoking move_all in the
SnapshotData::Unpacked arm so the code aborts promptly if cancelled.
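The guard described above can be sketched with a std-only stand-in for the cancellation token. This is a minimal sketch: the real code uses qdrant's own `cancel` crate with `is_cancelled()` and a `Cancelled` error, so the `CancelToken` type and `move_snapshot` function here are illustrative.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// Illustrative stand-in for the project's cancellation token.
#[derive(Clone)]
struct CancelToken(Arc<AtomicBool>);

impl CancelToken {
    fn new() -> Self {
        CancelToken(Arc::new(AtomicBool::new(false)))
    }
    fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

// Check the token immediately before the expensive step, mirroring the
// guard the Packed branch already has before `ar.unpack()`.
fn move_snapshot(cancel: &CancelToken) -> Result<(), String> {
    if cancel.is_cancelled() {
        return Err("cancelled".to_string());
    }
    // move_all(snapshot_dir.path(), &snapshot_temp_dir)?; // expensive IO here
    Ok(())
}

fn main() {
    let token = CancelToken::new();
    assert!(move_snapshot(&token).is_ok());
    token.cancel();
    // After cancellation, the branch aborts before doing any IO.
    assert!(move_snapshot(&token).is_err());
    println!("ok");
}
```

The point of checking right before the IO call, rather than at function entry, is that cancellation may arrive while earlier steps of the restore are still running.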
🧹 Nitpick comments (3)
lib/shard/src/snapshots/snapshot_data.rs (1)
4-9: Consider adding `#[derive(Debug)]` for better diagnostics.

The `SnapshotData` enum would benefit from deriving `Debug` to aid in logging and troubleshooting during snapshot operations.

💡 Suggested improvement

```diff
+#[derive(Debug)]
 pub enum SnapshotData {
     /// Tar file containing the snapshot, needs to be unpacked
     Packed(MaybeTempPath),
     /// Directory containing the unpacked snapshot
     Unpacked(TempDir),
 }
```

lib/storage/src/content_manager/snapshots/download_result.rs (1)
3-7: Consider adding `#[derive(Debug)]` for logging and diagnostics.

As a data container that will be passed through download and recovery flows, having `Debug` would help with troubleshooting.

💡 Suggested improvement

```diff
+#[derive(Debug)]
 pub struct DownloadResult {
     pub snapshot: SnapshotData,
     /// Sha256 hash of the downloaded snapshot file, if computed
     pub hash: Option<String>,
 }
```

lib/storage/src/content_manager/snapshots/download_tar.rs (1)
133-166: Test depends on an external network resource.

This test downloads from `storage.googleapis.com`, making it dependent on network availability and external service uptime. Consider marking it with `#[ignore]` or gating it behind a feature flag to prevent CI flakiness.

💡 Suggested improvement

```diff
 #[cfg(test)]
 mod tests {
     use super::*;

+    #[ignore] // Requires network access to external service
     #[tokio::test]
     async fn test_download_and_unpack_tar() {
```
```rust
SnapshotData::Unpacked(snapshot_dir) => {
    move_all(snapshot_dir.path(), &snapshot_temp_dir)?;
}
```
Missing cancellation check before move_all in the Unpacked branch.
The Packed branch (lines 1260-1262) checks for cancellation before the expensive ar.unpack() operation, but the Unpacked branch proceeds directly to move_all() without a similar check. For consistency and proper cancel responsiveness, consider adding a cancellation check here.
Proposed fix

```diff
 SnapshotData::Unpacked(snapshot_dir) => {
+    if cancel.is_cancelled() {
+        return Err(cancel::Error::Cancelled.into());
+    }
     move_all(snapshot_dir.path(), &snapshot_temp_dir)?;
 }
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
// Before
SnapshotData::Unpacked(snapshot_dir) => {
    move_all(snapshot_dir.path(), &snapshot_temp_dir)?;
}

// After
SnapshotData::Unpacked(snapshot_dir) => {
    if cancel.is_cancelled() {
        return Err(cancel::Error::Cancelled.into());
    }
    move_all(snapshot_dir.path(), &snapshot_temp_dir)?;
}
```
🤖 Prompt for AI Agents
In `@lib/collection/src/shards/shard_holder/mod.rs` around lines 1268 - 1270, The
Unpacked branch calls move_all(snapshot_dir.path(), &snapshot_temp_dir) without
the cancellation check used in the Packed branch; add the same cancellation
check (the identical call used before ar.unpack() in the Packed branch)
immediately before invoking move_all in the SnapshotData::Unpacked arm so the
code aborts promptly if cancelled.
timvisee left a comment
Vetted and tested. It works well and I didn't see any problems.
Note that this doesn't make shard snapshot recovery fully streaming. This only streams and unpacks the outer layer. We also have an inner layer of tars (each segment is a tar), which remains non-streaming.
On the good side this removes 1 out of 3 big IO steps.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@lib/storage/src/content_manager/snapshots/download_tar.rs`:
- Around line 129-161: The test test_download_and_unpack_tar in the tests module
performs a network call to a public URL and should not run by default in unit
test runs; either mark the test ignored by default or gate it behind an
integration/feature flag. Update the test declaration for
test_download_and_unpack_tar (the async tokio test in the tests module) to
either add an ignore attribute so it only runs when explicitly included, or wrap
it with a cfg feature (e.g., cfg(feature = "integration_tests")) and document
enabling that feature in Cargo.toml so CI can opt-in when network tests are
allowed.
In `@lib/storage/src/content_manager/snapshots/download.rs`:
- Around line 94-97: The error message in the download.rs branch uses the
incorrect word "schema"; update the string passed to StorageError::bad_request
to use "scheme" instead (locate the Err(StorageError::bad_request(...)) call
that formats "URL {url} with schema {} is not supported" and change it to "URL
{url} with scheme {} is not supported") so the message correctly refers to URL
schemes; leave the surrounding call and formatting (StorageError::bad_request,
url.scheme()) unchanged.
In `@src/common/snapshots.rs`:
- Around line 207-216: The code is hashing the original snapshot_path instead of
the actual file returned by get_snapshot_file() (a MaybeTempPath), so replace
the sha_256::hash_file call that uses snapshot_path with a call that hashes the
real restored file inside snapshot_file (use the MaybeTempPath's concrete path
accessor or conversion) when computing hash in the restore flow; specifically,
update the block that sets hash (currently referencing snapshot_path) to
open/hash the path inside snapshot_file (the value returned by
collection.get_snapshots_storage_manager().get_snapshot_file(...)) so
S3-downloaded temporary files are hashed correctly.
🧹 Nitpick comments (1)
lib/storage/src/content_manager/snapshots/download_tar.rs (1)
94-104: Path traversal is protected by tar 0.4.41's built-in guards.

The project uses `tar = "0.4.41"`, which skips archive entries containing `..` and prevents writing outside the destination directory. This addresses the original concern about path traversal attacks. However, for defense-in-depth with untrusted archives, consider using `Entry::unpack_in` (iterating per entry) instead of `Archive::unpack`, as it enforces stricter per-entry containment checks.
```rust
let snapshot_file = collection
    .get_snapshots_storage_manager()?
    .get_snapshot_file(&snapshot_path, &download_dir)
    .await?;

let hash = if checksum.is_some() {
    Some(sha_256::hash_file(&snapshot_path).await?)
} else {
    None
};
```
🧩 Analysis chain

🏁 Scripts executed (duplicate searches, empty results, and output-length lines omitted):

```shell
# Full context around lines 207-216 in the restore flow
cat -n src/common/snapshots.rs | sed -n '200,225p'

# Locate the storage manager and get_snapshot_file implementations
rg -n 'get_snapshots_storage_manager' src/ -B 2 -A 5
rg -n 'get_snapshot_file' . -A 5 -B 5 | head -150
sed -n '371,385p' lib/collection/src/common/snapshots_manager.rs   # LocalFS impl
sed -n '436,460p' lib/collection/src/common/snapshots_manager.rs   # S3 impl

# Understand MaybeTempPath and how SnapshotData / DownloadResult are used
rg -n 'MaybeTempPath' src/ -B 3 -A 5 | head -80
rg -n 'SnapshotData::Packed' src/ -B 3 -A 3
rg -n 'DownloadResult' src/common/snapshots.rs -B 5 -A 5
```
Hash the snapshot file actually used for restore, not the source path.
The code calls get_snapshot_file() which returns MaybeTempPath containing the actual file to be restored, but then hashes snapshot_path instead. For LocalFS, this happens to work since the returned path equals the input; for S3, get_snapshot_file() downloads to a temporary file and returns MaybeTempPath::Temporary(temp_path), making the hash of the remote snapshot_path incorrect. Hash snapshot_file (which is MaybeTempPath and contains the correct path) instead.
🤖 Prompt for AI Agents
In `@src/common/snapshots.rs` around lines 207 - 216, The code is hashing the
original snapshot_path instead of the actual file returned by
get_snapshot_file() (a MaybeTempPath), so replace the sha_256::hash_file call
that uses snapshot_path with a call that hashes the real restored file inside
snapshot_file (use the MaybeTempPath's concrete path accessor or conversion)
when computing hash in the restore flow; specifically, update the block that
sets hash (currently referencing snapshot_path) to open/hash the path inside
snapshot_file (the value returned by
collection.get_snapshots_storage_manager().get_snapshot_file(...)) so
S3-downloaded temporary files are hashed correctly.
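The distinction the comment draws can be sketched with a minimal stand-in for `MaybeTempPath` (the real type lives in qdrant's codebase; the names and the toy hash below are illustrative): hashing must target whatever path the storage manager actually returned, not the caller-supplied source path.

```rust
use std::path::{Path, PathBuf};

// Illustrative stand-in for the project's MaybeTempPath: either the original
// on-disk file (LocalFS) or a temporary file downloaded from remote storage (S3).
enum MaybeTempPath {
    Persistent(PathBuf),
    Temporary(PathBuf),
}

impl MaybeTempPath {
    // The concrete path to hash and restore from, regardless of variant.
    fn path(&self) -> &Path {
        match self {
            MaybeTempPath::Persistent(p) | MaybeTempPath::Temporary(p) => p,
        }
    }
}

// Toy stand-in for sha_256::hash_file: returns the file length.
fn hash_file(path: &Path) -> std::io::Result<u64> {
    Ok(std::fs::metadata(path)?.len())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir();
    let downloaded = dir.join("snapshot_tmp.bin");
    std::fs::write(&downloaded, b"actual downloaded snapshot bytes")?;

    // S3 case: the storage manager returns a temp file, not the source path.
    let snapshot_file = MaybeTempPath::Temporary(downloaded.clone());

    // Correct: hash the file that will actually be restored.
    let hash = hash_file(snapshot_file.path())?;
    assert_eq!(hash, std::fs::metadata(&downloaded)?.len());
    println!("ok");
    Ok(())
}
```

For the `Persistent` variant the two paths coincide, which is why the bug only surfaces with S3-backed snapshot storage.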
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@lib/storage/src/content_manager/snapshots/download_tar.rs`:
- Around line 88-114: Replace the direct call to archive.unpack(&target_dir)
with an explicit iteration over entries so each Entry's header.entry_type() is
validated (allow only Directory, Regular, and GNUSparse as
validate_snapshot_archive() does) before extracting; specifically, inside the
spawn_blocking closure (the code creating tar::Archive and using HashingReader),
call archive.entries()? to iterate entries, for each entry check
entry.header().entry_type() and return a StorageError::service_error on any
disallowed type, and for valid entries unpack them into target_dir (e.g.,
entry.unpack_in or equivalent) instead of archive.unpack; keep the existing
trailing-drain logic and hash retrieval via the HashingReader unchanged.
In `@lib/storage/src/content_manager/snapshots/download.rs`:
- Around line 23-52: Update the doc comment for _download_snapshot to describe
that it downloads and unpacks a tar snapshot into a temporary directory
(TempDir) rather than a single file; mention that the TempDir contains the
unpacked contents and will be deleted when dropped (use TempDir::keep() to
persist), and note that the function may compute and return an optional checksum
(Option<String>) via download_and_unpack_tar; also adjust wording to reference
snapshot_prefix and the unpacking behavior instead of "file".
```rust
// Use spawn_blocking because tar::Archive is synchronous
let hash = tokio::task::spawn_blocking(move || {
    // SyncIoBridge converts an AsyncRead into a sync Read
    // It must be used within a tokio runtime context (spawn_blocking provides this)
    let sync_reader = tokio_util::io::SyncIoBridge::new(async_reader);

    // Wrap the reader with optional hashing
    let hashing_reader = HashingReader::new(sync_reader, compute_checksum);

    // We need to keep access to the hashing reader to get the hash after unpacking,
    // but tar::Archive takes ownership. Use a RefCell-like pattern with take.
    let mut archive = tar::Archive::new(hashing_reader);
    archive.set_overwrite(false);

    archive.unpack(&target_dir).map_err(|e| {
        StorageError::service_error(format!("Failed to unpack tar archive: {e}"))
    })?;

    // Drain any remaining bytes to ensure the full stream is hashed.
    // Tar files have trailing padding that Archive doesn't read.
    let mut reader = archive.into_inner();
    if reader.hasher.is_some() {
        let mut buf = [0u8; 8192];
        while reader.read(&mut buf)? > 0 {}
    }

    // Get the hash from the inner reader
```
🧩 Analysis chain

🏁 Scripts executed (repeated searches, empty results, and output-length lines omitted):

```shell
find . -type f -name "download_tar.rs" | head -5
cat -n lib/storage/src/content_manager/snapshots/download_tar.rs
rg -n "validate_snapshot_archive" lib/storage/src/content_manager/snapshots/
cat lib/storage/src/content_manager/snapshots/mod.rs
rg -n "tar::" lib/storage/src/content_manager/snapshots/ -A 2 -B 2
cat lib/storage/src/content_manager/snapshots/download.rs
cat lib/storage/src/content_manager/snapshots/recover.rs
grep -rn "\.entries()\|entry_type\|EntryType" lib/storage/src/ -A 3 -B 3
rg -n "set_overwrite|unpack" lib/storage/src/ -B2 -A2
```

🌐 Web query: tar::Archive::unpack Rust symlink security validation

💡 Result:

In the Rust tar crate, `tar::Archive::unpack(dst)` is intended to be "safe-by-default" for untrusted archives in the basic sense that it won't write files outside `dst` and skips entries whose paths contain `..` [1].

However, symlinks/hardlinks have historically been a security footgun during extraction, and the safety you get depends on the crate version:

- tar < 0.4.16: the `unpack_in` family could be abused with hardlinks/symlinks to overwrite arbitrary filesystem files (RustSec RUSTSEC-2018-0002, CVE-2018-20990). Fixed in tar >= 0.4.16. [2]
- tar < 0.4.36: `tar::Archive::unpack` could be abused so that a tarball containing a symlink causes directories to be created outside the destination (RustSec RUSTSEC-2021-0080, CVE-2021-38511). Fixed in tar >= 0.4.36. [3]

Practical guidance for symlink security validation:

- Use tar >= 0.4.36 (this includes the earlier fix from 0.4.16). [2][3]
- Prefer `Archive::unpack` (bulk) or `Entry::unpack_in` (per-entry) for untrusted archives; avoid `Entry::unpack` because it explicitly has fewer sanity checks. [4]
- If you need stricter symlink handling than the crate provides (e.g., "never create symlinks" or "reject absolute/external link targets"), you must implement the policy yourself while iterating entries, since tar archives can legitimately contain symlinks. [4]

Sources: tar `Archive::unpack` docs [1], RustSec advisories RUSTSEC-2018-0002 [2] and RUSTSEC-2021-0080 [3], tar `Entry::unpack`/`unpack_in` docs note on sanity checks [4].

Follow-up scripts confirmed `tar = "0.4.41"` in Cargo.toml and located the existing validator:

```shell
cat Cargo.toml | grep -A 2 -B 2 "tar"
rg -n "tar::EntryType" lib/
cat lib/segment/src/common/validate_snapshot_archive.rs
```
Validate tar entries before unpacking to match snapshot safety checks.
The streaming extraction path uses archive.unpack() without entry-type validation, whereas the codebase has an established validate_snapshot_archive() function (in lib/segment/src/common/validate_snapshot_archive.rs) that rejects symlinks and special file types. The current code allows these entries to be extracted before post-extraction filesystem validation occurs in recover.rs, creating a window of opportunity for unsafe extraction.
Validate entries while iterating instead of extracting first. The suggested approach mirrors the existing validate_snapshot_archive() logic that only accepts Directory, Regular, and GNUSparse entry types.
🤖 Prompt for AI Agents
In `@lib/storage/src/content_manager/snapshots/download_tar.rs` around lines 88 -
114, Replace the direct call to archive.unpack(&target_dir) with an explicit
iteration over entries so each Entry's header.entry_type() is validated (allow
only Directory, Regular, and GNUSparse as validate_snapshot_archive() does)
before extracting; specifically, inside the spawn_blocking closure (the code
creating tar::Archive and using HashingReader), call archive.entries()? to
iterate entries, for each entry check entry.header().entry_type() and return a
StorageError::service_error on any disallowed type, and for valid entries unpack
them into target_dir (e.g., entry.unpack_in or equivalent) instead of
archive.unpack; keep the existing trailing-drain logic and hash retrieval via
the HashingReader unchanged.
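The allow-list the reviewer refers to can be sketched without the tar crate. This is a minimal std-only sketch: `EntryType` below is a local mirror of the tar crate's enum, and the real `validate_snapshot_archive()` matches on `tar::EntryType` inside an `archive.entries()?` loop.

```rust
// Illustrative mirror of tar::EntryType; the real code matches on the
// tar crate's enum while iterating archive entries.
#[derive(Debug, PartialEq)]
enum EntryType {
    Regular,
    Directory,
    GnuSparse,
    Symlink,
    Hardlink,
    CharDevice,
}

// Allow-list check mirroring validate_snapshot_archive(): only directories,
// regular files, and GNU sparse files may be extracted; everything else
// (symlinks, hardlinks, device nodes, ...) is rejected before unpacking.
fn validate_entry_type(ty: &EntryType) -> Result<(), String> {
    match ty {
        EntryType::Directory | EntryType::Regular | EntryType::GnuSparse => Ok(()),
        other => Err(format!("Malformed snapshot, disallowed entry type: {other:?}")),
    }
}

fn main() {
    assert!(validate_entry_type(&EntryType::Regular).is_ok());
    assert!(validate_entry_type(&EntryType::Directory).is_ok());
    assert!(validate_entry_type(&EntryType::GnuSparse).is_ok());
    assert!(validate_entry_type(&EntryType::Symlink).is_err());
    assert!(validate_entry_type(&EntryType::Hardlink).is_err());
    println!("ok");
}
```

Running this check per entry, before `unpack_in`, closes the window in which a symlink entry could be materialized on disk ahead of the post-extraction validation in recover.rs.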
```diff
 /// Download a remote file from `url` to `path`
 ///
-/// Returns a `TempPath` that will delete the downloaded file once it is dropped.
-/// To persist the file, use `download_file(...).keep()`.
-#[must_use = "returns a TempPath, if dropped the downloaded file is deleted"]
-async fn download_file(
+/// Returns a `TempDir` that will delete the downloaded file once it is dropped.
+/// To persist the file, use [`keep()`](TempDir::keep).
+#[must_use = "returns a TempDir, if dropped the downloaded file is deleted"]
+async fn _download_snapshot(
     client: &reqwest::Client,
     url: &Url,
     dir_path: &Path,
-) -> Result<TempPath, StorageError> {
+    compute_checksum: bool,
+) -> Result<(TempDir, Option<String>), StorageError> {
     let download_start_time = tokio::time::Instant::now();
-    let (file, temp_path) = tempfile::Builder::new()
-        .prefix(&snapshot_prefix(url))
-        .suffix(".download")
-        .tempfile_in(dir_path)?
-        .into_parts();
-    let file = fs::File::from_parts::<&Path>(file, temp_path.as_ref());
-
-    log::debug!("Downloading snapshot from {url} to {temp_path:?}");
-
-    let mut file = tokio_fs::File::from_std(file);
-
-    let response = client.get(url.clone()).send().await?;
-
-    if !response.status().is_success() {
-        return Err(StorageError::bad_input(format!(
-            "Failed to download snapshot from {}: status - {}",
-            url,
-            response.status()
-        )));
-    }
-
-    let mut stream = response.bytes_stream();
-    let mut total_bytes_downloaded = 0u64;
+    let snapshot_name = snapshot_prefix(url);
 
-    while let Some(chunk_result) = stream.next().await {
-        let chunk = chunk_result?;
-        total_bytes_downloaded += chunk.len() as u64;
-        file.write_all(&chunk).await?;
-    }
+    let tempdir = tempfile::Builder::new()
+        .prefix(&snapshot_name)
+        .suffix(".download")
+        .tempdir_in(dir_path)?;
 
-    file.flush().await?;
+    let hash = download_and_unpack_tar(client, url, tempdir.path(), compute_checksum).await?;
 
     let download_duration = download_start_time.elapsed();
-    let total_size_mb = total_bytes_downloaded as f64 / BYTES_IN_MB as f64;
-    let download_speed_mbps = total_size_mb / download_duration.as_secs_f64();
     log::debug!(
-        "Snapshot download completed: path={}, size={:.2} MB, duration={:.2}s, speed={:.2} MB/s",
-        temp_path.display(),
-        total_size_mb,
+        "Snapshot download completed: path={tempdir:?}, duration={:.2}s",
         download_duration.as_secs_f64(),
-        download_speed_mbps
     );
 
-    Ok(temp_path)
+    Ok((tempdir, hash))
 }
```
🧩 Analysis chain

🏁 Scripts executed (output-length lines omitted):

```shell
fd "download.rs" lib/storage/src/content_manager/snapshots/
cat -n lib/storage/src/content_manager/snapshots/download.rs | head -60
fd "download_tar.rs" lib/storage/src/content_manager/snapshots/
cat -n lib/storage/src/content_manager/snapshots/download_tar.rs | head -100
cat -n lib/storage/src/content_manager/snapshots/download_tar.rs | sed -n '100,130p'
```
Update the doc comment to reflect directory-based downloads and unpacking behavior.
The function downloads and unpacks a tar snapshot into a TempDir, but the doc comment uses "file" terminology which is misleading. The suggested documentation updates properly reflect that the function unpacks contents into a directory.
📝 Suggested doc update

```diff
-/// Download a remote file from `url` to `path`
+/// Download and unpack a snapshot from `url` into a temporary directory.
 ///
-/// Returns a `TempDir` that will delete the downloaded file once it is dropped.
-/// To persist the file, use [`keep()`](TempDir::keep).
+/// Returns a `TempDir` that will delete the unpacked snapshot directory when dropped.
```

🤖 Prompt for AI Agents
In `@lib/storage/src/content_manager/snapshots/download.rs` around lines 23 - 52,
Update the doc comment for _download_snapshot to describe that it downloads and
unpacks a tar snapshot into a temporary directory (TempDir) rather than a single
file; mention that the TempDir contains the unpacked contents and will be
deleted when dropped (use TempDir::keep() to persist), and note that the
function may compute and return an optional checksum (Option<String>) via
download_and_unpack_tar; also adjust wording to reference snapshot_prefix and
the unpacking behavior instead of "file".
📝 Walkthrough

Replaces path-centric snapshot handling with a SnapshotData enum (Packed | Unpacked) and propagates it through collection, shard, storage, and API layers. Adds tar_unpack utilities and a download-and-unpack streaming path (with optional SHA-256) that returns a DownloadResult containing SnapshotData and an optional hash. Restores now accept SnapshotData instead of MaybeTempPath/Path; validate_snapshot_archive and related helpers were removed; tests and imports updated to construct/forward SnapshotData. Several public method signatures were changed to accept SnapshotData.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks: ✅ 3 passed

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@lib/common/common/src/safe_unpack.rs`:
- Around line 8-27: The function safe_unpack silently ignores canonicalize
failures by using unwrap_or; change it to propagate the error: call
fs_err::canonicalize(target_dir)? and assign the resulting PathBuf (e.g., dst)
and use that for entry.unpack_in(&dst). Update any bindings that used dst to
accept the canonicalized PathBuf instead of falling back to target_dir; ensure
the canonicalize error is returned to the caller rather than being swallowed.
In `@lib/shard/Cargo.toml`:
- Line 46: Remove the redundant dev-dependency entry for "tempfile" from the
Cargo.toml dev-dependencies section because "tempfile" is already declared as a
regular dependency (used by snapshot_data.rs); locate the dev-dependencies block
and delete the "tempfile" entry so tests use the existing runtime dependency and
avoid duplication.
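For illustration, the duplication looks roughly like this in lib/shard/Cargo.toml (a sketch; the actual version specifiers in the repository may differ):

```toml
[dependencies]
tempfile = "3"        # used at runtime by snapshot_data.rs

[dev-dependencies]
# tempfile = "3"      # redundant: regular dependencies are already visible to tests
```

Cargo makes `[dependencies]` entries available to unit and integration tests, so the `[dev-dependencies]` entry adds nothing once the crate depends on tempfile at runtime.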
- Rename `safe_unpack.rs` into `tar_unpack.rs` so it would be listed near `tar_ext.rs` in IDEs.
- Replace calls like `ar = open_snapshot_archive(…); safe_unpack(ar, …);` with a single call to `tar_unpack_file(…)`.
- Put calls to `Archive::new(); Archive::set_overwrite(false);` inside `tar_unpack_reader` (was `safe_unpack`). So now it is the only place that does `set_overwrite`.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@lib/collection/src/collection/snapshots.rs`:
- Around line 171-175: When handling SnapshotData::Unpacked in the restore flow,
ensure the destination directory exists before calling move_all: locate the
match arm that uses SnapshotData::Unpacked (uses snapshot_dir.path(), target_dir
and calls move_all) and create the target directory (e.g., via
std::fs::create_dir_all or the project's mkdir helper) for target_dir before
invoking move_all, propagating errors as before so callers still receive
failures on permission/IO errors.
SnapshotData::Unpacked(snapshot_dir) => {
    // already unpacked snapshot, validate files and move to target dir
    let snapshot_dir_path = snapshot_dir.path();
    move_all(snapshot_dir_path, target_dir)?;
}
Ensure target_dir exists before move_all.
move_all asserts the destination directory exists. When restoring from SnapshotData::Unpacked, callers that pass a fresh temp path will fail unless target_dir is created first (e.g., the .tmp path in recovery flows).
✅ Suggested fix
-match snapshot_data {
+std::fs::create_dir_all(target_dir)?;
+match snapshot_data {
     SnapshotData::Packed(snapshot_path) => {
         tar_unpack_file(&snapshot_path, target_dir)?;
         snapshot_path.close()?;
     }
     SnapshotData::Unpacked(snapshot_dir) => {
         // already unpacked snapshot, validate files and move to target dir
         let snapshot_dir_path = snapshot_dir.path();
         move_all(snapshot_dir_path, target_dir)?;
     }
 }
🤖 Prompt for AI Agents
In `@lib/collection/src/collection/snapshots.rs` around lines 171 - 175, When
handling SnapshotData::Unpacked in the restore flow, ensure the destination
directory exists before calling move_all: locate the match arm that uses
SnapshotData::Unpacked (uses snapshot_dir.path(), target_dir and calls move_all)
and create the target directory (e.g., via std::fs::create_dir_all or the
project's mkdir helper) for target_dir before invoking move_all, propagating
errors as before so callers still receive failures on permission/IO errors.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@clippy.toml`:
- Line 18: The clippy suggestion uses the module name as the replacement, which
is incorrect; update the replacement value from "common::tar_unpack" to the
actual function "common::tar_unpack::tar_unpack_reader" so the Clippy suggestion
for tar::Archive::unpack points to the reader-based helper; locate the entry
referencing tar::Archive::unpack and change its replacement string accordingly.
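A sketch of what the corrected `clippy.toml` entry might look like, assuming the repo uses clippy's `disallowed-methods` table (the actual entry may use a different field layout; `path` and `reason` are standard fields of that table):

```toml
disallowed-methods = [
    { path = "tar::Archive::unpack", reason = "use common::tar_unpack::tar_unpack_reader instead" },
]
```

Pointing the lint at the concrete function, rather than the module, gives developers a directly usable replacement when clippy fires.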
Instead of downloading from storage.googleapis.com every time the test runs, put a small snapshot file into the repo.
The snapshot file is created using this command:
curl -s \
https://storage.googleapis.com/qdrant-benchmark-snapshots/test-shard.snapshot \
| tar \
--delete segments/4ea958d8-0b64-4312-9a53-0cd857e93535.tar \
--delete segments/65ac6276-8cca-4f5c-b767-9722190cee8b.tar \
> lib/storage/src/content_manager/snapshots/test-shard.snapshot
File contents:
$ tar tf lib/storage/src/content_manager/snapshots/test-shard.snapshot
wal/
wal/closed-255
newest_clocks.json
replica_state.json
shard_config.json
$ du -sh lib/storage/src/content_manager/snapshots/test-shard.snapshot
12K lib/storage/src/content_manager/snapshots/test-shard.snapshot
$ sha256sum < lib/storage/src/content_manager/snapshots/test-shard.snapshot
5d94eac5c1ede3994a28bc406120046c37370d5d45b489a0d2252531b4e3e1f2 -
* download tar
* compute sha256 for stream download
* wip: propagate unpacking into down to the logic, todo: validation
* unpacked snapshot validation
* Minor tweaks
* Fix typo
* validation during unpack
* cancellation token
* update docstring
* remove redundant dep
* Rearrange unpack functions
- Rename `safe_unpack.rs` into `tar_unpack.rs` so it would be listed
near `tar_ext.rs` in IDEs.
- Replace calls like `ar = open_snapshot_archive(…); safe_unpack(ar, …);`
with a single call to `tar_unpack_file(…)`.
- Put calls to `Archive::new(); Archive::set_overwrite(false);` inside
`tar_unpack_reader` (was `safe_unpack`). So, now it is the only place
that does `set_overwrite`.
* Let clippy complain if tar::Archive::unpack used
* Mock snapshot download URL
---------
Co-authored-by: timvisee <tim@visee.me>
Co-authored-by: xzfc <xzfcpw@gmail.com>