-
Notifications
You must be signed in to change notification settings - Fork 182
fix: do not mark time traveling blocks as bad #6241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe changes implement a fix for sync issues caused by time-traveling blocks by propagating a shared bad block cache through the chain synchronization pipeline. The cache is now passed through Changes
Sequence DiagramsequenceDiagram
participant CF as chain_follower
participant SSM as SyncStateMachine
participant ST as SyncTask
participant TV as validate_tipset
participant BBC as BadBlockCache
CF->>CF: Initialize with bad_block_cache
CF->>SSM: Pass bad_block_cache.clone()
SSM->>ST: Spawn SyncTask with bad_block_cache
ST->>TV: execute() with bad_block_cache param
TV->>TV: Validate tipset blocks
alt Validation Success
TV-->>ST: Ok(())
else Validation Failure (non-time-travel)
TV->>BBC: push(failed_block_cid)
BBC-->>TV: Logged warning
TV-->>ST: Err(TipsetSyncerError)
else Time-Travel Error
TV-->>ST: Err(TimeTravellingBlock)
end
ST-->>SSM: SyncEvent
SSM-->>CF: State transition
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20–25 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/chain_sync/bad_block_cache.rs (1)
32-35: Use structured fields for tracing and avoid string formatting.Prefer structured, queryable logs and avoid allocs in the hot path.
Apply:
- tracing::warn!("Marked bad block: {c}"); + tracing::warn!(cid = %c, "Marked bad block");CHANGELOG.md (1)
36-39: Clarify the fix and align wording.Be explicit that future-timestamp blocks aren’t marked bad and are retried later.
[respond_minimally]
-- [#6241](https://github.com/ChainSafe/forest/pull/6241) Fixed a sync issue that is caused by time traveling block(s). +- [#6241](https://github.com/ChainSafe/forest/pull/6241) Do not mark time‑travelling (future‑timestamp) blocks as bad; return a temporal error and retry later (Lotus parity).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
CHANGELOG.md(1 hunks)src/chain_sync/bad_block_cache.rs(1 hunks)src/chain_sync/chain_follower.rs(4 hunks)src/chain_sync/tipset_syncer.rs(3 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 5944
File: src/chain/store/index.rs:0-0
Timestamp: 2025-08-18T03:09:47.932Z
Learning: In Forest's tipset_by_height caching implementation, hanabi1224 prefers performance-conscious solutions that leverage finality guarantees rather than expensive chain walking for fork detection. The approach of constraining cache lookups to finalized epochs (using CHECKPOINT_INTERVAL >= CHAIN_FINALITY) provides fork safety without the performance cost of ancestry verification.
Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.
Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 6057
File: src/cli/subcommands/f3_cmd.rs:0-0
Timestamp: 2025-09-09T10:37:17.947Z
Learning: hanabi1224 prefers having default timeouts (like 10m for --no-progress-timeout) to prevent commands from hanging indefinitely, even when the timeout flag isn't explicitly provided by users. This fail-fast approach is preferred over requiring explicit flag usage.
📚 Learning: 2025-08-11T14:00:47.060Z
Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 5886
File: src/daemon/db_util.rs:236-259
Timestamp: 2025-08-11T14:00:47.060Z
Learning: In Forest's snapshot import (`src/daemon/db_util.rs`), when F3 CID is present in snapshot metadata but the actual F3 data is missing, this should cause a hard error as it indicates snapshot corruption. Only errors during the F3 import process itself (after data is successfully retrieved) should be non-fatal and logged.
Applied to files:
src/chain_sync/tipset_syncer.rs
📚 Learning: 2025-08-08T12:11:55.266Z
Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 5867
File: src/ipld/util.rs:461-487
Timestamp: 2025-08-08T12:11:55.266Z
Learning: Forest (src/ipld/util.rs, Rust): In UnorderedChainStream::poll_next, dropping `extract_sender` (when no more tipsets and the extract queue is empty) is the intended shutdown signal for workers. Any subsequent attempt to enqueue work after this drop is a logic error and should be treated as an error; do not change `send()` to ignore a missing sender.
Applied to files:
src/chain_sync/tipset_syncer.rs
🧬 Code graph analysis (1)
src/chain_sync/chain_follower.rs (1)
src/chain_sync/tipset_syncer.rs (1)
validate_tipset(100-147)
🪛 GitHub Actions: Link Checker
src/chain_sync/bad_block_cache.rs
[warning] 1-1: Lychee: Error creating request: InvalidPathToUri("/")
src/chain_sync/tipset_syncer.rs
[warning] 1-1: Lychee: Error creating request: InvalidPathToUri("/")
src/chain_sync/chain_follower.rs
[warning] 1-1: Lychee: Error creating request: InvalidPathToUri("/")
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: Build MacOS
- GitHub Check: cargo-publish-dry-run
- GitHub Check: Build Ubuntu
- GitHub Check: All lint checks
- GitHub Check: tests-release
- GitHub Check: tests
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: Check
🔇 Additional comments (2)
src/chain_sync/tipset_syncer.rs (1)
105-106: Don’t mark transient validation errors as permanently bad (exclude TipsetParentNotFound).Current logic excludes only TimeTravellingBlock; TipsetParentNotFound is also transient (see Lines 196–205), but will be pushed to bad_block_cache, causing permanent rejection. Align with Lotus’ isPermanent(err) idea.
Apply within validate_tipset:
- warn!("Validating block [CID = {cid}] in EPOCH = {epoch} failed: {why}"); - match &why { - TipsetSyncerError::TimeTravellingBlock(_, _) => { - // Do not mark a block as bad for temporary errors. - // See <https://github.com/filecoin-project/lotus/blob/v1.34.1/chain/sync.go#L602> in Lotus - } - _ => { - if let Some(bad_block_cache) = bad_block_cache { - bad_block_cache.push(cid); - } - } - }; + warn!(epoch, %cid, error = %why, "block validation failed"); + let is_temp = matches!( + &why, + TipsetSyncerError::TimeTravellingBlock(_, _) + | TipsetSyncerError::TipsetParentNotFound(_) + ); + if !is_temp { + if let Some(ref bad_block_cache) = bad_block_cache { + bad_block_cache.push(cid); + } + }Optionally extract the predicate:
fn is_temporary_sync_error(e: &TipsetSyncerError) -> bool { matches!(e, TipsetSyncerError::TimeTravellingBlock(_, _) | TipsetSyncerError::TipsetParentNotFound(_) ) }Then use
if !is_temporary_sync_error(&why) { … }.Also applies to: 129-140
⛔ Skipped due to learnings
Learnt from: hanabi1224 Repo: ChainSafe/forest PR: 5886 File: src/daemon/db_util.rs:236-259 Timestamp: 2025-08-11T14:00:47.060Z Learning: In Forest's snapshot import (`src/daemon/db_util.rs`), when F3 CID is present in snapshot metadata but the actual F3 data is missing, this should cause a hard error as it indicates snapshot corruption. Only errors during the F3 import process itself (after data is successfully retrieved) should be non-fatal and logged.src/chain_sync/chain_follower.rs (1)
233-234: Ripgrep found the validate_tipset call site as expected, but cargo check failed due to environment limitations.The
rgsearch located the call tovalidate_tipsetat line 851, which falls within the changed range (852–859). However, the sandbox environment lacks a C compiler, preventingcargo checkfrom completing. The structural changes appear correct, but manual verification in your local environment is required to confirm compilation:cargo check -qAlso applies to: 260-261, 829-835, 852-859
Summary of changes
To match the logic in Lotus: filecoin-project/lotus#535
Changes introduced in this pull request:
Reference issue to close (if applicable)
Closes
Other information and links
Change checklist
Summary by CodeRabbit