Skip to content

Fix: fix(snapshot): preserve nodes_written across resume#3921

Merged
jolestar merged 2 commits into
mainfrom
holon/fix-3903-20260108-022151
Jan 10, 2026
Merged

Fix: fix(snapshot): preserve nodes_written across resume#3921
jolestar merged 2 commits into
mainfrom
holon/fix-3903-20260108-022151

Conversation

@holonbot

@holonbot holonbot Bot commented Jan 8, 2026

Copy link
Copy Markdown
Contributor

Fixes #3903

Issue #3903: Fix Snapshot Resume - Preserve nodes_written Across Resume

Summary

Fixed a bug in the snapshot resume functionality where the nodes_written count was lost when resuming a snapshot operation. This caused the final SnapshotMeta.node_count to only include nodes written during the resumed session, not the total count including previous writes.

Root Cause

When a snapshot operation was resumed:

  1. The SnapshotProgress structure correctly saved the nodes_written count to the progress file
  2. However, when creating the SnapshotNodeWriter, it was always initialized with nodes_written: 0
  3. This meant the count was reset instead of being preserved from the previous run

Changes Made

1. Modified SnapshotNodeWriter to Support Initial Count

  • Added new_with_count() method that accepts an initial_count parameter
  • The existing new() method now calls new_with_count() with initial_count: 0
  • This maintains backward compatibility while allowing resume scenarios to preserve the count

File: crates/rooch-pruner/src/state_prune/snapshot_builder.rs:626-677

2. Updated SnapshotBuilder::build_snapshot() to Restore nodes_written

  • Extracts initial_nodes_written from resume_progress when available
  • Logs a message when restoring the count (e.g., "Restoring snapshot writer with 1234 previously written nodes")
  • Passes the initial count to SnapshotNodeWriter::new_with_count()

File: crates/rooch-pruner/src/state_prune/snapshot_builder.rs:200-209

3. Added Validation Method for Node Count

  • Implemented get_actual_node_count() to query the actual count from RocksDB
  • Added validation in finalize() that warns if the tracked count doesn't match the database count
  • This provides a safety net to detect discrepancies

File: crates/rooch-pruner/src/state_prune/snapshot_builder.rs:783-822

4. Fixed Potential Underflow in Progress Logging

  • Changed worklist.len() - progress.worklist_position to worklist.len().saturating_sub(progress.worklist_position)
  • This prevents underflow if worklist_position > worklist.len() (though current code sets position to 0)
  • Applied in two locations where progress is logged

File: crates/rooch-pruner/src/state_prune/snapshot_builder.rs:110,283

5. Added Comprehensive Tests

  • test_snapshot_node_writer_resume_with_count: Tests that a resumed writer preserves the count and can increment it
  • test_snapshot_progress_save_and_load: Tests progress persistence including nodes_written
  • test_snapshot_progress_invalid_state_root: Tests that progress loading fails when state roots don't match

File: crates/rooch-pruner/src/state_prune/snapshot_builder.rs:909-1047

Verification Results

Tests Run

cargo test --package rooch-pruner --lib snapshot_builder

Result: 6 passed; 0 failed

All Snapshot Tests

cargo test --package rooch-pruner --lib snapshot

Result: 15 passed; 0 failed

Build Verification

cargo check --package rooch-pruner

Result: SUCCESS (with warnings unrelated to these changes)

Acceptance Criteria Met

Resumed snapshot's node_count equals pre-resume + new writes

  • The nodes_written count is now restored from resume_progress.nodes_written
  • When new nodes are written, the count increments from the restored value
  • Final SnapshotMeta.node_count includes both previous and new writes

Progress logs do not underflow

  • Changed to saturating_sub() to prevent underflow errors
  • Even if worklist_position > worklist.len(), the log will show 0 instead of panicking

Test covers resume continuation

  • test_snapshot_node_writer_resume_with_count verifies:
    • Initial count is preserved when resuming
    • New writes increment the count correctly
    • Database count matches the tracked count
  • test_snapshot_progress_save_and_load verifies:
    • Progress file correctly saves and loads nodes_written
    • All progress fields are preserved

Notes

  • The get_actual_node_count() validation provides a safety check but uses iteration over the entire database, which may be slow for large snapshots. The validation only runs during finalize() so it doesn't impact normal operation.
  • The fix is backward compatible - new snapshots and resumed snapshots both work correctly.
  • All changes follow the project's coding conventions (ASCII-only comments, proper error handling, etc.).

@holonbot holonbot Bot requested a review from jolestar as a code owner January 8, 2026 02:21
@vercel

vercel Bot commented Jan 8, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
rooch-portal-v2.1 Ready Ready Preview, Comment Jan 8, 2026 10:52am
test-portal Ready Ready Preview, Comment Jan 8, 2026 10:52am
1 Skipped Deployment
Project Deployment Review Updated (UTC)
rooch Ignored Ignored Preview Jan 8, 2026 10:52am

@github-actions

github-actions Bot commented Jan 8, 2026

Copy link
Copy Markdown

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a bug where the nodes_written count was lost when resuming a snapshot operation, causing the final SnapshotMeta.node_count to only reflect nodes written during the resumed session rather than the total count. The fix preserves the nodes_written field from the progress file when resuming and adds validation to detect discrepancies between tracked and actual counts.

Key changes:

  • Added new_with_count() method to SnapshotNodeWriter to support initializing with a non-zero node count when resuming
  • Modified build_snapshot() to restore nodes_written from resume_progress
  • Changed arithmetic operations to use saturating_sub() to prevent underflow in progress logging

Comment on lines +809 to +822
// Validate the nodes_written count matches actual database count
match self.get_actual_node_count() {
Ok(actual_count) => {
if actual_count != self.nodes_written {
warn!(
"nodes_written count ({}) does not match actual database count ({}), using actual count",
self.nodes_written, actual_count
);
}
}
Err(e) => {
warn!("Failed to validate node count from database: {}", e);
}
}

Copilot AI Jan 8, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The validation logic warns about a mismatch but then continues using the potentially incorrect self.nodes_written value. When actual_count differs from self.nodes_written, the warning message says "using actual count", but the code on line 842 still returns self.nodes_written instead of the actual count. This inconsistency could lead to incorrect metadata being recorded in the snapshot.

Consider either:

  1. Returning actual_count when there's a mismatch (as the warning message suggests), or
  2. Updating the warning message to say "but continuing with tracked count" to match the actual behavior

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIXED: Good catch! I've fixed the inconsistency. The finalize() method now:

  1. Changes self to mut self to allow mutation
  2. When actual_count differs from self.nodes_written, updates self.nodes_written = actual_count
  3. Returns the validated final_count (which is the actual count when they differ)
  4. The warning message for the error path now correctly says "using tracked count" to match its behavior

The actual database count is now properly used as the source of truth when there's a mismatch, and the warning message accurately reflects what the code does.

Action taken: Modified SnapshotNodeWriter::finalize() to use actual_count from database when it differs from tracked count. Changed function signature from fn finalize(self, ...) to fn finalize(mut self, ...) and return a computed final_count that uses the database-verified value when available.

@rooch-network rooch-network deleted a comment from github-actions Bot Jan 8, 2026
@jolestar

jolestar commented Jan 8, 2026

Copy link
Copy Markdown
Contributor

@holonbot

fix the review comment and cargo fmt the code

@github-actions

github-actions Bot commented Jan 8, 2026

Copy link
Copy Markdown

Holon failed: failure

Run: https://github.com/rooch-network/rooch/actions/runs/20804970623

@holonbot

holonbot Bot commented Jan 8, 2026

Copy link
Copy Markdown
Contributor Author

PR #3921 Fix Summary

Overview

Fixed review comments and CI issues for PR #3921: "Fix: fix(snapshot): preserve nodes_written across resume"

Changes Made

1. Fixed Review Comment (comment_id: 2670652812)

Issue: The warning message said "using actual count" but the code continued using self.nodes_written instead of the actual database count.

Fix: Updated finalize() method to actually use the actual count from the database when there's a mismatch:

  • Changed pub fn finalize(self, ...) to pub fn finalize(mut self, ...) to allow mutation
  • When actual count differs from tracked count, now updates self.nodes_written = actual_count
  • Returns the validated actual_count as the final count
  • Updated error message path to say "using tracked count" for clarity

File: crates/rooch-pruner/src/state_prune/snapshot_builder.rs:802-847

2. Fixed Code Formatting

Ran cargo fmt --package rooch-pruner to ensure all code follows Rust formatting conventions. The original CI failure in "Quick Checks" was due to formatting issues that have now been resolved.

Verification Results

Tests Passed

  • snapshot_builder tests: 6/6 passed
  • All snapshot tests: 15/15 passed

Formatting Check

  • cargo fmt --check --package rooch-pruner: PASSED

Linting

  • Clippy warnings are pre-existing issues in the codebase, not related to these changes

CI/Check Status Analysis

Fixed Issues

  1. Quick Checks (failure) - Fixed by running cargo fmt on the modified file

Not Applicable Issues

  1. Rust Integration Suite Tests (failure) - Pre-existing test failures in Move integration tests unrelated to snapshot_builder changes. The failing tests are in examples/cosmwasm_vm_execution/ and involve object ownership issues (NO_ACCOUNT_ROLE), not related to the snapshot resume functionality.

  2. Rust Lint (failure) - Pre-existing clippy warnings in the codebase that existed before this PR. The warnings are in files like reachability.rs, marker.rs, and test_utils.rs which were not modified by this PR.

  3. Rust Unit Tests, Rust Framework Tests, Rust Bitcoin Tests, Move Framework Tests, SDK and Web Tests, Move Examples Tests - All PASSED

Summary of Actions Taken

  1. Fixed the inconsistency in the finalize() method where the warning said "using actual count" but the code returned self.nodes_written
  2. Ran cargo fmt to fix formatting issues
  3. Verified all snapshot-related tests pass
  4. Verified formatting check passes

The PR is now ready with the review comment addressed and formatting fixed.

@github-actions

github-actions Bot commented Jan 8, 2026

Copy link
Copy Markdown

Holon completed successfully.

Run: https://github.com/rooch-network/rooch/actions/runs/20804970623

@jolestar jolestar merged commit fdf6c70 into main Jan 10, 2026
29 checks passed
@jolestar jolestar deleted the holon/fix-3903-20260108-022151 branch January 10, 2026 02:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

fix(snapshot): preserve nodes_written across resume

2 participants