Fix: fix(snapshot): preserve nodes_written across resume#3921
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
There was a problem hiding this comment.
Pull request overview
This PR fixes a bug where the nodes_written count was lost when resuming a snapshot operation, causing the final SnapshotMeta.node_count to only reflect nodes written during the resumed session rather than the total count. The fix preserves the nodes_written field from the progress file when resuming and adds validation to detect discrepancies between tracked and actual counts.
Key changes:
- Added
new_with_count()method toSnapshotNodeWriterto support initializing with a non-zero node count when resuming - Modified
build_snapshot()to restorenodes_writtenfromresume_progress - Changed arithmetic operations to use
saturating_sub()to prevent underflow in progress logging
| // Validate the nodes_written count matches actual database count | ||
| match self.get_actual_node_count() { | ||
| Ok(actual_count) => { | ||
| if actual_count != self.nodes_written { | ||
| warn!( | ||
| "nodes_written count ({}) does not match actual database count ({}), using actual count", | ||
| self.nodes_written, actual_count | ||
| ); | ||
| } | ||
| } | ||
| Err(e) => { | ||
| warn!("Failed to validate node count from database: {}", e); | ||
| } | ||
| } |
There was a problem hiding this comment.
The validation logic warns about a mismatch but then continues using the potentially incorrect self.nodes_written value. When actual_count differs from self.nodes_written, the warning message says "using actual count", but the code on line 842 still returns self.nodes_written instead of the actual count. This inconsistency could lead to incorrect metadata being recorded in the snapshot.
Consider either:
- Returning
actual_countwhen there's a mismatch (as the warning message suggests), or - Updating the warning message to say "but continuing with tracked count" to match the actual behavior
There was a problem hiding this comment.
✅ FIXED: Good catch! I've fixed the inconsistency. The finalize() method now:
- Changes
selftomut selfto allow mutation - When
actual_countdiffers fromself.nodes_written, updatesself.nodes_written = actual_count - Returns the validated
final_count(which is the actual count when they differ) - The warning message for the error path now correctly says "using tracked count" to match its behavior
The actual database count is now properly used as the source of truth when there's a mismatch, and the warning message accurately reflects what the code does.
Action taken: Modified SnapshotNodeWriter::finalize() to use actual_count from database when it differs from tracked count. Changed function signature from fn finalize(self, ...) to fn finalize(mut self, ...) and return a computed final_count that uses the database-verified value when available.
|
@holonbot fix the review comment and cargo fmt the code |
|
Holon failed: failure Run: https://github.com/rooch-network/rooch/actions/runs/20804970623 |
PR #3921 Fix SummaryOverviewFixed review comments and CI issues for PR #3921: "Fix: fix(snapshot): preserve nodes_written across resume" Changes Made1. Fixed Review Comment (comment_id: 2670652812)Issue: The warning message said "using actual count" but the code continued using Fix: Updated
File: 2. Fixed Code FormattingRan Verification ResultsTests Passed
Formatting Check
Linting
CI/Check Status AnalysisFixed Issues
Not Applicable Issues
Summary of Actions Taken
The PR is now ready with the review comment addressed and formatting fixed. |
|
Holon completed successfully. Run: https://github.com/rooch-network/rooch/actions/runs/20804970623 |
Fixes #3903
Issue #3903: Fix Snapshot Resume - Preserve nodes_written Across Resume
Summary
Fixed a bug in the snapshot resume functionality where the
nodes_writtencount was lost when resuming a snapshot operation. This caused the finalSnapshotMeta.node_countto only include nodes written during the resumed session, not the total count including previous writes.Root Cause
When a snapshot operation was resumed:
SnapshotProgressstructure correctly saved thenodes_writtencount to the progress fileSnapshotNodeWriter, it was always initialized withnodes_written: 0Changes Made
1. Modified
SnapshotNodeWriterto Support Initial Countnew_with_count()method that accepts aninitial_countparameternew()method now callsnew_with_count()withinitial_count: 0File:
crates/rooch-pruner/src/state_prune/snapshot_builder.rs:626-6772. Updated
SnapshotBuilder::build_snapshot()to Restore nodes_writteninitial_nodes_writtenfromresume_progresswhen availableSnapshotNodeWriter::new_with_count()File:
crates/rooch-pruner/src/state_prune/snapshot_builder.rs:200-2093. Added Validation Method for Node Count
get_actual_node_count()to query the actual count from RocksDBfinalize()that warns if the tracked count doesn't match the database countFile:
crates/rooch-pruner/src/state_prune/snapshot_builder.rs:783-8224. Fixed Potential Underflow in Progress Logging
worklist.len() - progress.worklist_positiontoworklist.len().saturating_sub(progress.worklist_position)worklist_position > worklist.len()(though current code sets position to 0)File:
crates/rooch-pruner/src/state_prune/snapshot_builder.rs:110,2835. Added Comprehensive Tests
test_snapshot_node_writer_resume_with_count: Tests that a resumed writer preserves the count and can increment ittest_snapshot_progress_save_and_load: Tests progress persistence includingnodes_writtentest_snapshot_progress_invalid_state_root: Tests that progress loading fails when state roots don't matchFile:
crates/rooch-pruner/src/state_prune/snapshot_builder.rs:909-1047Verification Results
Tests Run
cargo test --package rooch-pruner --lib snapshot_builderResult: 6 passed; 0 failed
All Snapshot Tests
cargo test --package rooch-pruner --lib snapshotResult: 15 passed; 0 failed
Build Verification
Result: SUCCESS (with warnings unrelated to these changes)
Acceptance Criteria Met
✅ Resumed snapshot's node_count equals pre-resume + new writes
nodes_writtencount is now restored fromresume_progress.nodes_writtenSnapshotMeta.node_countincludes both previous and new writes✅ Progress logs do not underflow
saturating_sub()to prevent underflow errorsworklist_position > worklist.len(), the log will show 0 instead of panicking✅ Test covers resume continuation
test_snapshot_node_writer_resume_with_countverifies:test_snapshot_progress_save_and_loadverifies:nodes_writtenNotes
get_actual_node_count()validation provides a safety check but uses iteration over the entire database, which may be slow for large snapshots. The validation only runs duringfinalize()so it doesn't impact normal operation.