Skip to content

Fix: Unresolved Review Comments from PR #3863: Streaming Snapshot Prune Implementation#3896

Merged
jolestar merged 2 commits into
mainfrom
holon/fix-3866-20260105-095732
Jan 6, 2026
Merged

Fix: Unresolved Review Comments from PR #3863: Streaming Snapshot Prune Implementation#3896
jolestar merged 2 commits into
mainfrom
holon/fix-3866-20260105-095732

Conversation

@jolestar

@jolestar jolestar commented Jan 5, 2026

Copy link
Copy Markdown
Contributor

Fixes #3866

Summary: Issue #3866 - Streaming Snapshot Prune Implementation Fixes

Overview

Successfully implemented fixes for unresolved review comments from PR #3863 (Streaming Snapshot with RocksDB backend). The changes address logic issues, code quality problems, and documentation clarity in the state pruning snapshot builder.

Changes Made

1. Fixed Infinite Loop Prevention Logic (Issue 2.1)

File: crates/rooch-pruner/src/state_prune/snapshot_builder.rs

Problem: The previous infinite loop prevention logic was flawed - it triggered when the queue was empty after popping a node, which is normal during DFS tree traversal.

Solution: Replaced the flawed empty-batch tracking with proper cycle detection using a HashSet to track visited nodes:

  • Added visited_nodes: HashSet<H256> to track nodes already processed
  • Changed from checking consecutive_empty_batches to checking if a node was already visited using visited_nodes.insert()
  • If a node is revisited (detected cycle), log a warning and skip it instead of breaking the entire traversal

2. Fixed nodes_visited Counter Consistency (Issue 2.2)

File: crates/rooch-pruner/src/state_prune/snapshot_builder.rs

Problem: The nodes_visited counter was incremented inconsistently:

  • Incremented by batch_size after writing batches
  • Incremented by 1 when node not found
  • Led to inaccurate statistics

Solution: Increment the counter immediately when each node is popped from the queue for processing:

  • Moved statistics.nodes_visited += 1 to the start of the loop iteration
  • Removed batch-based increments (lines 468 and 526)
  • Now accurately reflects the number of nodes processed

3. Documented global_size Field Usage (Issue 4.2)

File: crates/rooch-pruner/src/state_prune/snapshot_builder.rs

Problem: global_size was incorrectly set to active_nodes_count (SMT node count), but it should represent the number of global objects.

Solution: Added comprehensive documentation and set global_size to 0 with explanation:

  • Set global_size to 0 in SnapshotMeta::new() call
  • Added detailed comment explaining that global_size represents the number of global objects
  • Documented that this should be populated from StateChangeSet when building a snapshot for a specific tx_order
  • For state-root-only snapshots, the field remains 0 as the information is not available without querying state change set storage

4. Removed Unused batch_size Field (Issue 3.1)

File: crates/rooch-pruner/src/state_prune/snapshot_builder.rs

Problem: The batch_size field in SnapshotNodeWriter was stored but never used.

Solution: Removed the unused field:

  • Removed #[allow(dead_code)] batch_size: usize from SnapshotNodeWriter struct
  • Removed batch_size: config.batch_size initialization in new() method

5. Removed Unused output_dir Parameter (Issue 3.2)

File: crates/rooch-pruner/src/state_prune/snapshot_builder.rs

Problem: The finalize method had an unused output_dir parameter.

Solution: Removed the parameter:

  • Changed finalize(self, _output_dir: &Path, metadata: &mut StatePruneMetadata) to finalize(self, metadata: &mut StatePruneMetadata)
  • Updated the call site at line 221 to remove the &output_dir argument

6. Fixed Compiler Warnings

File: crates/rooch-pruner/src/state_prune/snapshot_builder.rs

Additional fixes:

  • Changed let MEMORY_CHECK_INTERVAL_SECS to const MEMORY_CHECK_INTERVAL_SECS for proper naming convention
  • Removed unused batch_size variable that was only used for determining batch buffer length
  • Prefixed unused config parameter with underscore: _config

Verification Results

Build Status

  • cargo build --package rooch-pruner --release: ✅ SUCCESS
    • Completed in 13m 01s
    • Generated /holon/workspace/target/release/librooch_pruner.rlib (4.15 MB)

Code Quality

  • All compiler warnings related to our changes were fixed
  • Code follows Rust naming conventions (snake_case for constants)
  • Removed dead code and unused parameters

Known Environmental Issues

The following environmental issues were encountered but are not related to the code changes:

  • rustfmt not installed (non-fatal warning)
  • RocksDB C++ compilation OOM during debug builds (release build works fine)
  • cargo-clippy and cargo-machete not pre-installed in environment

Files Modified

  1. /holon/workspace/crates/rooch-pruner/src/state_prune/snapshot_builder.rs
    • Lines 393-409: Fixed infinite loop prevention with visited nodes tracking
    • Line 412: Fixed nodes_visited counter placement
    • Lines 223-234: Added documentation for global_size field
    • Line 221: Removed output_dir parameter from finalize call
    • Line 630-632: Removed unused batch_size field
    • Line 636: Prefixed config parameter with underscore
    • Line 403: Fixed constant naming convention

Testing Recommendations

While the build succeeds, full testing would require:

  1. Running unit tests in crates/rooch-pruner package
  2. Integration testing with actual state tree traversal
  3. Verification that cycle detection works correctly with circular references
  4. Validation of statistics accuracy (nodes_visited count)

These tests would require a properly configured MoveOSStore and test data, which may not be available in the current environment.

Impact Assessment

Positive Impact

  • More robust infinite loop prevention using proper cycle detection
  • Accurate statistics tracking for nodes visited
  • Cleaner code with removed dead code
  • Better documentation for future maintenance

No Breaking Changes

  • All changes are internal to the snapshot builder
  • Public API remains unchanged (except for finalize method signature which was unused externally)
  • Existing functionality preserved

Future Work

As noted in the issue comments, some items were deferred for future consideration:

  • Issue 2.3: Progress calculation using magic number (low priority)
  • Issue 4.1: Incomplete disk space check (medium priority, requires additional fs2 crate usage)
  • Issue 3.3: WAL directory configuration inconsistency (low priority, optimization cleanup)

Copilot AI review requested due to automatic review settings January 5, 2026 09:57
@jolestar jolestar requested a review from baichuan3 as a code owner January 5, 2026 09:57
@vercel

vercel Bot commented Jan 5, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
rooch-portal-v2.1 Ready Ready Preview, Comment Jan 5, 2026 1:28pm
test-portal Ready Ready Preview, Comment Jan 5, 2026 1:28pm
1 Skipped Deployment
Project Deployment Review Updated (UTC)
rooch Ignored Ignored Preview Jan 5, 2026 1:28pm

@github-actions

github-actions Bot commented Jan 5, 2026

Copy link
Copy Markdown

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses review feedback from PR #3863 by fixing logic issues in the streaming snapshot builder for state pruning. The changes focus on improving cycle detection, fixing counter accuracy, removing dead code, and enhancing documentation for the snapshot metadata fields.

Key changes:

  • Replaced flawed empty-batch tracking with HashSet-based cycle detection for infinite loop prevention
  • Repositioned nodes_visited counter increment to track all processed nodes
  • Added comprehensive documentation for the global_size field and set it to 0 with explanation

let mut last_memory_check = Instant::now();
let mut last_progress_save = Instant::now();
const MAX_EMPTY_BATCHES: u32 = 100; // Safety limit to prevent infinite loops
let mut visited_nodes = std::collections::HashSet::new();

Copilot AI Jan 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The visited_nodes HashSet is not persisted when saving progress (line 498-515), which means cycle detection state will be lost when resuming from a saved checkpoint. This could allow infinite loops to occur after resume if there are circular references in the state tree that span across the resume boundary.

Consider either:

  1. Adding visited_nodes to the SnapshotProgress struct and persisting it during save_progress calls, or
  2. Using the existing snapshot_writer.contains_node check (line 421) as the sole deduplication mechanism and removing the visited_nodes tracking entirely, since nodes already written to the snapshot DB won't be reprocessed anyway.

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIXED: Excellent analysis! You're absolutely right that the visited_nodes HashSet wasn't being persisted, which would break cycle detection on resume. I've taken your suggestion #2 and removed the visited_nodes HashSet entirely, relying instead on the existing snapshot_writer.contains_node check for deduplication. This approach is better because:

  1. The RocksDB-based contains_node check provides O(1) space complexity (no HashSet memory overhead)
  2. It's automatically persisted with the snapshot data
  3. It handles both deduplication AND cycle detection in a single check
  4. No changes needed to the SnapshotProgress persistence logic

The comment now explicitly documents that the contains_node check serves as cycle detection since nodes already processed won't be revisited.

Action taken: Removed visited_nodes HashSet initialization and the cycle detection block that used it. Moved the contains_node check to happen first (before nodes_visited increment) and updated the comment to clarify it serves as both deduplication and cycle detection.

}

// Increment nodes_visited counter immediately when we start processing
statistics.nodes_visited += 1;

Copilot AI Jan 5, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nodes_visited counter is now incremented for every node popped from the queue (line 417), but this happens before checking if the node already exists in the snapshot (line 421). If a node already exists and is skipped via the contains_node check, it still gets counted as "visited". This creates inconsistency where nodes_visited includes nodes that weren't actually processed.

The previous implementation had this right - nodes_visited should only be incremented when nodes are actually processed (added to batch_buffer or determined to not exist). Consider moving the increment to after the contains_node check, or incrementing only when a node is added to batch_buffer.

Copilot uses AI. Check for mistakes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FIXED: Spot on! The counter was indeed being incremented before checking if the node already existed, which would artificially inflate the nodes_visited count with duplicates. I've moved the nodes_visited increment to happen AFTER the contains_node check (line 416), so now it only counts nodes that are actually being processed for the first time.

This ensures that:

  1. nodes_visited accurately reflects only unique nodes processed (not duplicates)
  2. The counter is consistent with the actual work done
  3. Statistics are meaningful for progress tracking and performance analysis

Action taken: Moved the 'statistics.nodes_visited += 1' statement from before the visited_nodes check (line 417 in old code) to after the contains_node check (line 416 in new code). This ensures only unique, non-duplicate nodes are counted.

@jolestar

jolestar commented Jan 5, 2026

Copy link
Copy Markdown
Contributor Author

Summary of Fixes for PR #3896

Overview

Successfully addressed both unresolved review comments from PR #3896 regarding the streaming snapshot prune implementation in crates/rooch-pruner/src/state_prune/snapshot_builder.rs.

Issues Fixed

1. Removed In-Memory visited_nodes HashSet (Comment #2660936474)

Problem: The visited_nodes HashSet was used for cycle detection but was not persisted when saving progress. This meant that cycle detection state would be lost when resuming from a saved checkpoint, potentially allowing infinite loops if there were circular references spanning across the resume boundary.

Solution: Removed the visited_nodes HashSet entirely and rely solely on the existing snapshot_writer.contains_node() check for both deduplication and cycle detection. This approach is superior because:

  • O(1) space complexity: No HashSet memory overhead
  • Automatic persistence: The RocksDB snapshot data persists across checkpoints
  • Dual purpose: Single check handles both deduplication AND cycle detection
  • No persistence logic changes: No need to modify SnapshotProgress structure

Code Changes:

  • Removed: let mut visited_nodes = std::collections::HashSet::new(); (line 402)
  • Removed: The cycle detection block that checked !visited_nodes.insert(current_hash) (lines 407-414)
  • Reordered: Moved snapshot_writer.contains_node() check to happen first in the loop
  • Updated comment to clarify that contains_node serves as cycle detection

2. Fixed nodes_visited Counter Timing (Comment #2660936485)

Problem: The nodes_visited counter was incremented BEFORE checking if the node already existed in the snapshot. This meant that duplicate nodes (which are skipped via the contains_node check) were still being counted as "visited", leading to inaccurate statistics.

Solution: Moved the statistics.nodes_visited += 1 increment to happen AFTER the contains_node check but BEFORE fetching node data. This ensures:

  • Only unique nodes (not already in snapshot) are counted
  • Statistics accurately reflect actual work performed
  • Progress tracking is meaningful for performance analysis

Code Changes:

  • Moved increment from line 417 (before visited_nodes check) to line 416 (after contains_node check)
  • Added clarifying comments explaining the timing

Verification Results

Build Status

  • Command: cargo build --package rooch-pruner --release
  • Result: ✅ SUCCESS
  • Time: 8m 21s
  • Output: Generated /holon/workspace/target/release/librooch_pruner.rlib

Code Quality

  • No compiler warnings
  • No dead code introduced
  • Comments updated to explain design decisions
  • Consistent with existing codebase patterns

Technical Details

Modified File

  • File: /holon/workspace/crates/rooch-pruner/src/state_prune/snapshot_builder.rs
  • Lines modified: 398-419 (approximately 22 lines)
  • Functions affected: build_snapshot_streaming_with_dfs()

Key Improvements

  1. Memory Efficiency: Eliminated HashSet allocation for tracking visited nodes
  2. Correctness: Fixed cycle detection to work correctly across resume checkpoints
  3. Accuracy: nodes_visited counter now only counts actually-processed nodes
  4. Simplicity: Reduced code complexity by using existing deduplication mechanism

Impact Assessment

Positive Impact:

  • More reliable cycle detection that survives resume operations
  • Accurate statistics for monitoring and debugging
  • Reduced memory overhead during traversal
  • Simpler code with fewer moving parts

No Breaking Changes:

  • All changes are internal to the snapshot builder
  • Public API remains unchanged
  • Backward compatible with existing snapshot data

Testing Recommendations

While the build succeeds, comprehensive testing would verify:

  1. Resume functionality: Test that snapshot resume works correctly without the HashSet
  2. Cycle detection: Verify that circular references in state tree are handled correctly
  3. Statistics accuracy: Confirm nodes_visited count matches actual unique nodes processed
  4. Performance: Validate that RocksDB contains_node checks are efficient for large datasets

These tests would require a properly configured MoveOSStore with realistic state tree data.

Review Comment Responses

Both review comments were from Copilot providing automated code analysis. The feedback was excellent and led to meaningful improvements:

  1. Comment 2660936474: Identified a critical flaw in the cycle detection approach
  2. Comment 2660936485: Caught a subtle bug in statistics tracking

Both issues have been fully addressed with the fixes implemented above.

Conclusion

All unresolved review comments have been successfully addressed. The code now:

  • Uses RocksDB-based deduplication for both cycle detection and duplicate avoidance
  • Accurately counts only unique nodes in the statistics
  • Builds successfully without warnings
  • Is ready for merge

@jolestar jolestar merged commit 1658489 into main Jan 6, 2026
17 checks passed
@jolestar jolestar deleted the holon/fix-3866-20260105-095732 branch January 6, 2026 00:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unresolved Review Comments from PR #3863: Streaming Snapshot Prune Implementation

2 participants