Skip to content

Conversation

@hanabi1224
Copy link
Contributor

@hanabi1224 hanabi1224 commented Sep 19, 2025

Summary of changes

This PR fixes an issue in transcoding a v2 plain CAR (generated by Lotus) into ForestCAR by increasing max_len of UviBytes (default is 128MiB)

Changes introduced in this pull request:

Reference issue to close (if applicable)

Closes

Other information and links

Change checklist

  • I have performed a self-review of my own code,
  • I have made corresponding changes to the documentation. All new code adheres to the team's documentation standards,
  • I have added tests that prove my fix is effective or that my feature works (if possible),
  • I have made sure the CHANGELOG is up-to-date. All user-facing changes should be reflected in this document.

Summary by CodeRabbit

  • New Features

    • Local snapshot imports now auto-convert unsupported formats into the required format.
    • CAR framing/encoding updated to support much larger files (up to 8 GiB).
  • Bug Fixes

    • Fixed failures when importing non-conforming snapshot files by falling back to conversion.
    • Improved reliability after downloads: files are validated and either persisted or converted as needed.
    • Clearer logging and error messages during transcoding and streaming to aid troubleshooting.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 19, 2025

Walkthrough

Adds on-the-fly transcoding for local non-forest CAR sources during import and conditional post-download transcoding; replaces direct UviBytes::default() usage with a new public uvi_bytes() (8 GiB max) used for CAR framing and header encoding; improves stream error messages and adds a transcode start log.

Changes

Cohort / File(s) Summary of Changes
Snapshot import control-flow & logging
src/daemon/db_util.rs
For local (non-URL) sources, check ForestCar::is_valid; if invalid, call transcode_into_forest_car and remove original when import mode is Move. Post-download: validate downloaded CAR and transcode before persisting when needed. Added tracing/info at transcode start.
CAR codec refactor & stream error context
src/db/car/forest.rs
Replaced UviBytes::default() uses with uvi_bytes() for header/block encode/decode and decode_eof. Changed stream polling error branch to return anyhow::anyhow! with contextual message. No public API changes.
CAR stream helper
src/utils/db/car_stream.rs
Added pub fn uvi_bytes() -> UviBytes which configures UviBytes with max_len = 8 * 1024 * 1024 * 1024 (8 GiB). Switched CAR framing and header encoding to use this helper.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Caller
  participant Import as "import_chain_as_forest_car"
  participant FS as "Filesystem/DB"
  participant DL as "Downloader"
  participant Transcode as "transcode_into_forest_car"

  rect rgb(239,246,255)
  note over Caller,Import: Local source flow
  Caller->>Import: import_chain_as_forest_car(from_path)
  Import->>FS: ForestCar::is_valid(from_path)
  alt valid
    Import->>FS: move_or_copy_file -> forest_car_db_path
  else invalid
    Import->>Transcode: transcode(from_path -> temp_forest_car)
    Transcode-->>Import: temp_forest_car
    Import->>FS: persist(temp_forest_car -> forest_car_db_path)
  end
  end

  rect rgb(245,255,245)
  note over Caller,Import: Remote URL flow (downloaded)
  Caller->>Import: import_chain_as_forest_car(url)
  Import->>DL: download(url -> downloaded_car_temp_path)
  DL-->>Import: downloaded_car_temp_path
  Import->>FS: ForestCar::is_valid(downloaded_car_temp_path)
  alt valid
    Import->>FS: persist(downloaded_car_temp_path -> forest_car_db_path)
  else invalid
    Import->>Transcode: transcode(downloaded_car_temp_path -> forest_car_db_temp_path)
    Transcode-->>Import: forest_car_db_temp_path
    Import->>FS: persist(forest_car_db_temp_path -> forest_car_db_path)
  end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • LesnyRumcajs
  • akaladarshi
  • elmattic

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly and accurately captures the primary change: making CarStream support large data blocks (e.g., F3 snap). It directly reflects the implemented changes (introducing uvi_bytes() and increasing the UviBytes max_len) and is concise and specific enough for a teammate scanning PR history.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch hm/fix-car-stream

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5915703 and c64da20.

📒 Files selected for processing (1)
  • src/utils/db/car_stream.rs (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/utils/db/car_stream.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: tests
  • GitHub Check: tests-release
  • GitHub Check: Build Ubuntu
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: Build MacOS
  • GitHub Check: All lint checks

Comment @coderabbitai help to get the list of available commands and usage tips.

}

pub fn uvi_bytes() -> UviBytes {
let mut decoder = UviBytes::default();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UviBytes::default cannot be banned in .clippy.toml because it does not belong to UviBytes

@hanabi1224 hanabi1224 marked this pull request as ready for review September 19, 2025 07:59
@hanabi1224 hanabi1224 requested a review from a team as a code owner September 19, 2025 07:59
@hanabi1224 hanabi1224 requested review from LesnyRumcajs and akaladarshi and removed request for a team September 19, 2025 07:59
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/utils/db/car_stream.rs (1)

166-176: Validate CARv2 header fields and bound UviBytes (prevent underflow/DoS)

Unchecked casts from header_v2 fields and an unbounded UviBytes allow negative/huge values to produce underflows or massive allocations — validate non‑negative offsets/sizes, bound UviBytes to the CARv1 byte window, and reject malformed headers.

  • src/utils/db/car_stream.rs (around lines 165–176 and 331–336): apply the patch below to validate data_offset/data_size (use try_into/try_from), compute a safe uvi_limit from the validated max_car_v1_bytes, and use uvi_bytes_with_limit(uvi_limit).
  • src/db/car/plain.rs (around lines 129–136): header_v2.data_offset and header_v2.data_size are cast with as to u64; validate they are non‑negative (try_into/try_from) before changing cursor positions or computing limits.

Applicable patch (use as-is for car_stream.rs):

@@
-        // Skip v2 header bytes
-        if let Some(header_v2) = &header_v2 {
-            let mut to_skip = vec![0; header_v2.data_offset as usize];
-            reader.read_exact(&mut to_skip).await?;
-        }
+        // Skip v2 header bytes (validated, non-negative)
+        if let Some(header_v2) = &header_v2 {
+            let data_offset: usize = header_v2
+                .data_offset
+                .try_into()
+                .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "negative data_offset"))?;
+            let mut to_skip = vec![0; data_offset];
+            reader.read_exact(&mut to_skip).await?;
+        }
@@
-        let max_car_v1_bytes = header_v2
-            .as_ref()
-            .map(|h| h.data_size as u64)
-            .unwrap_or(u64::MAX);
-        let mut reader = FramedRead::new(reader.take(max_car_v1_bytes), uvi_bytes());
+        let max_car_v1_bytes: u64 = header_v2
+            .as_ref()
+            .map(|h| {
+                u64::try_from(h.data_size)
+                    .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "negative data_size"))
+            })
+            .transpose()?
+            .unwrap_or(u64::MAX);
+        let uvi_limit: usize = max_car_v1_bytes.try_into().unwrap_or(usize::MAX);
+        let mut reader = FramedRead::new(reader.take(max_car_v1_bytes), uvi_bytes_with_limit(uvi_limit));
@@
-pub fn uvi_bytes() -> UviBytes {
-    let mut decoder = UviBytes::default();
-    decoder.set_max_len(usize::MAX);
-    decoder
-}
+/// Returns a UviBytes codec with a caller‑provided maximum frame length.
+pub fn uvi_bytes_with_limit(limit: usize) -> UviBytes {
+    let mut c = UviBytes::default();
+    c.set_max_len(limit);
+    c
+}
+
+/// Returns an unbounded UviBytes codec (intended for encoding or fully‑buffered inputs).
+pub fn uvi_bytes() -> UviBytes {
+    uvi_bytes_with_limit(usize::MAX)
+}
🧹 Nitpick comments (2)
src/db/car/forest.rs (1)

267-270: Reuse the decoder in the loop to avoid per‑iteration reinit.

Tiny win and clearer intent.

-                    while let Some(block_frame) = uvi_bytes().decode_eof(&mut zstd_frame)? {
+                    let mut uv = uvi_bytes();
+                    while let Some(block_frame) = uv.decode_eof(&mut zstd_frame)? {
                         let CarBlock { cid, data } = CarBlock::from_bytes(block_frame)?;
                         block_map.insert(cid.into(), data);
                     }
src/daemon/db_util.rs (1)

203-207: Fix log message: it says “fallback to copy” but we may transcode.

Message can mislead when a local raw CAR is transcoded in move_or_copy.

-                tracing::warn!(
-                    "Snapshot file is not a valid forest.car.zst file, fallback to copy"
-                );
+                tracing::warn!(
+                    "Snapshot is not a valid .forest.car.zst; will copy/transcode as needed"
+                );
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between caaba3d and 33de38e.

📒 Files selected for processing (3)
  • src/daemon/db_util.rs (2 hunks)
  • src/db/car/forest.rs (5 hunks)
  • src/utils/db/car_stream.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: hanabi1224
PR: ChainSafe/forest#5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
📚 Learning: 2025-08-11T14:00:47.060Z
Learnt from: hanabi1224
PR: ChainSafe/forest#5886
File: src/daemon/db_util.rs:236-259
Timestamp: 2025-08-11T14:00:47.060Z
Learning: In Forest's snapshot import (`src/daemon/db_util.rs`), when F3 CID is present in snapshot metadata but the actual F3 data is missing, this should cause a hard error as it indicates snapshot corruption. Only errors during the F3 import process itself (after data is successfully retrieved) should be non-fatal and logged.

Applied to files:

  • src/daemon/db_util.rs
📚 Learning: 2025-08-25T14:17:09.129Z
Learnt from: hanabi1224
PR: ChainSafe/forest#5978
File: .github/workflows/unit-tests.yml:0-0
Timestamp: 2025-08-25T14:17:09.129Z
Learning: hanabi1224's download_file_with_cache function in src/utils/net/download_file.rs preserves URL path structure in local cache directories by using cache_dir.join(url.path().strip_prefix('/').unwrap_or_else(|| url.path())), so snapshots from https://forest-snapshots.fra1.cdn.digitaloceanspaces.com/rpc_test/ are cached locally at ~/.cache/forest/test/rpc-snapshots/rpc_test/ (including the rpc_test subdirectory from the URL path).

Applied to files:

  • src/daemon/db_util.rs
📚 Learning: 2025-08-04T13:36:22.993Z
Learnt from: hanabi1224
PR: ChainSafe/forest#5835
File: src/chain/tests.rs:58-76
Timestamp: 2025-08-04T13:36:22.993Z
Learning: In projects using tokio with "full" features enabled, `Vec<u8>` implements `AsyncWrite` through blanket implementations provided by Tokio. This allows `&mut Vec<u8>` to be used directly with async write operations without needing additional wrapper types like `SyncIoBridge`.

Applied to files:

  • src/db/car/forest.rs
📚 Learning: 2025-08-04T13:36:22.993Z
Learnt from: hanabi1224
PR: ChainSafe/forest#5835
File: src/chain/tests.rs:58-76
Timestamp: 2025-08-04T13:36:22.993Z
Learning: In the Forest codebase, `Vec<u8>` can be used as an `AsyncWrite` implementation in test contexts. The user confirmed that tests using `&mut Vec<u8>` with `export` and `export_v2` functions compile and pass both locally and on CI.

Applied to files:

  • src/db/car/forest.rs
🧬 Code graph analysis (2)
src/daemon/db_util.rs (2)
src/db/car/forest.rs (1)
  • is_valid (140-142)
src/utils/io/mmap.rs (1)
  • open (56-62)
src/db/car/forest.rs (1)
src/utils/db/car_stream.rs (1)
  • uvi_bytes (332-336)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: tests
  • GitHub Check: tests-release
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: Build Ubuntu
  • GitHub Check: Build MacOS
  • GitHub Check: All lint checks
🔇 Additional comments (6)
src/utils/db/car_stream.rs (1)

286-286: Header encode switch to helper looks good.

Safe use; encode path is unaffected by max_len.

src/db/car/forest.rs (3)

159-163: LGTM: bounded header decode via helper.

Header frame is small; using the shared helper improves consistency.


313-316: LGTM: header encode via helper.

No behavior change; centralized configuration.


402-406: LGTM: error context improved.

Clearer tracing without altering control flow.

src/daemon/db_util.rs (2)

164-169: Good: direct transcode for local non‑Forest CAR avoids wasted copy.

Reduces I/O and temp space.


300-304: LGTM: added structured tracing for transcode.

Helpful for support/telemetry.

@LesnyRumcajs
Copy link
Member

no green checkmark!

@hanabi1224
Copy link
Contributor Author

no green checkmark!

@LesnyRumcajs fixed!


pub fn uvi_bytes() -> UviBytes {
let mut decoder = UviBytes::default();
decoder.set_max_len(usize::MAX);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this reasoanble? Shouldn't we still put some bounds on it, like at most 4 GiB?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Made it 8GiB which is reasonably large

LesnyRumcajs
LesnyRumcajs previously approved these changes Sep 19, 2025

pub fn uvi_bytes() -> UviBytes {
let mut decoder = UviBytes::default();
decoder.set_max_len(8 * 1024 * 1024 * 1024); // 8GiB
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make a constant out of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

akaladarshi
akaladarshi previously approved these changes Sep 19, 2025
@hanabi1224 hanabi1224 added this pull request to the merge queue Sep 19, 2025
@akaladarshi akaladarshi removed this pull request from the merge queue due to a manual request Sep 19, 2025
@hanabi1224 hanabi1224 added this pull request to the merge queue Sep 19, 2025
Merged via the queue into main with commit 820d628 Sep 19, 2025
62 of 63 checks passed
@hanabi1224 hanabi1224 deleted the hm/fix-car-stream branch September 19, 2025 13:39
@coderabbitai coderabbitai bot mentioned this pull request Oct 14, 2025
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants