fix: do not mark time traveling blocks as bad #6241

hanabi1224 · 2025-11-10T15:36:28Z

Summary of changes

To match the logic in Lotus: filecoin-project/lotus#535

	if h.Timestamp > now+buildconstants.AllowableClockDriftSecs {
		return xerrors.Errorf("block was from the future (now=%d, blk=%d): %w", now, h.Timestamp, consensus.ErrTemporal)
	}

        func isPermanent(err error) bool {
	        return !errors.Is(err, consensus.ErrTemporal)
        }

        if isPermanent(err) {
	        syncer.bad.Add(b.Cid(), NewBadBlockReason([]cid.Cid{b.Cid()}, "%s", err.Error()))
        }

Changes introduced in this pull request:

Reference issue to close (if applicable)

Closes

Change checklist

I have performed a self-review of my own code,
I have made corresponding changes to the documentation. All new code adheres to the team's documentation standards,
I have added tests that prove my fix is effective or that my feature works (if possible),
I have made sure the CHANGELOG is up-to-date. All user-facing changes should be reflected in this document.

Summary by CodeRabbit

Bug Fixes
- Fixed a synchronization issue that occurred when processing time-traveling blocks, preventing sync disruptions.
- Enhanced bad block tracking throughout the synchronization workflow to improve chain sync stability and reliability.

coderabbitai · 2025-11-10T15:36:38Z

Walkthrough

The changes implement a fix for sync issues caused by time-traveling blocks by propagating a shared bad block cache through the chain synchronization pipeline. The cache is now passed through chain_follower into SyncTask execution and validate_tipset, enabling validation steps to track and reject previously failed blocks during tipset validation.

Changes

Cohort / File(s)	Summary
Documentation `CHANGELOG.md`	Added entry documenting fix for sync issue caused by time-traveling blocks (PR `#6241`)
Bad Block Cache Logging `src/chain_sync/bad_block_cache.rs`	Added warning log statement when pushing CID to cache
Chain Synchronization Pipeline `src/chain_sync/chain_follower.rs`, `src/chain_sync/tipset_syncer.rs`	Propagated `bad_block_cache` parameter through chain follower and task execution paths; updated `validate_tipset` to accept optional cache and push failed block CIDs (except time-travel errors) into cache; removed previous per-block iteration logic from `mark_bad_tipset`

Sequence Diagram

sequenceDiagram
    participant CF as chain_follower
    participant SSM as SyncStateMachine
    participant ST as SyncTask
    participant TV as validate_tipset
    participant BBC as BadBlockCache

    CF->>CF: Initialize with bad_block_cache
    CF->>SSM: Pass bad_block_cache.clone()
    
    SSM->>ST: Spawn SyncTask with bad_block_cache
    ST->>TV: execute() with bad_block_cache param
    
    TV->>TV: Validate tipset blocks
    alt Validation Success
        TV-->>ST: Ok(())
    else Validation Failure (non-time-travel)
        TV->>BBC: push(failed_block_cid)
        BBC-->>TV: Logged warning
        TV-->>ST: Err(TipsetSyncerError)
    else Time-Travel Error
        TV-->>ST: Err(TimeTravellingBlock)
    end
    
    ST-->>SSM: SyncEvent
    SSM-->>CF: State transition

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

Signature changes across multiple public APIs: chain_follower and SyncTask::execute both gain new parameters requiring verification of all call sites.
Parameter propagation complexity: bad_block_cache flows through multiple layers (chain_follower → SyncStateMachine → SyncTask → validate_tipset), requiring careful trace-through of each pass and clone operation.
Logic removal verification: Confirm that removing per-block iteration logic from mark_bad_tipset doesn't inadvertently lose important caching behavior in edge cases.
Error handling changes: Review conditional caching logic in validate_tipset to ensure only appropriate error types trigger cache population, especially distinguishing time-travel errors.

Possibly related PRs

ChainSafe/forest#5841: Replaces bad block cache implementation (introduces SizeTrackingLruCache and renames put→push), directly related to cache call sites being updated here.
ChainSafe/forest#5842: Modifies bad-block caching and mark_bad_tipset behavior for handling bad tipsets, overlaps with removal of per-block caching logic in this PR.
ChainSafe/forest#6165: Modifies chain_follower function signature and related chain synchronization APIs, involves similar parameter propagation concerns.

Suggested reviewers

LesnyRumcajs
akaladarshi
sudo-shashank

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'fix: do not mark time traveling blocks as bad' directly and specifically describes the main change in the changeset, which involves modifying validation logic to exclude time-traveling blocks from being cached as bad blocks.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch hm/do-not-mark-time-traveling-blocks-as-bad

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

src/chain_sync/bad_block_cache.rs (1)
32-35: Use structured fields for tracing and avoid string formatting.

Prefer structured, queryable logs and avoid allocs in the hot path.

Apply:
-        tracing::warn!("Marked bad block: {c}");
+        tracing::warn!(cid = %c, "Marked bad block");
CHANGELOG.md (1)
36-39: Clarify the fix and align wording.

Be explicit that future-timestamp blocks aren’t marked bad and are retried later.

[respond_minimally]
-- [#6241](https://github.com/ChainSafe/forest/pull/6241) Fixed a sync issue that is caused by time traveling block(s).
+- [#6241](https://github.com/ChainSafe/forest/pull/6241) Do not mark time‑travelling (future‑timestamp) blocks as bad; return a temporal error and retry later (Lotus parity).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 154bf9c and 19eda8e.

📒 Files selected for processing (4)

CHANGELOG.md (1 hunks)
src/chain_sync/bad_block_cache.rs (1 hunks)
src/chain_sync/chain_follower.rs (4 hunks)
src/chain_sync/tipset_syncer.rs (3 hunks)

🧰 Additional context used

🧠 Learnings (3)

📓 Common learnings

Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 5944
File: src/chain/store/index.rs:0-0
Timestamp: 2025-08-18T03:09:47.932Z
Learning: In Forest's tipset_by_height caching implementation, hanabi1224 prefers performance-conscious solutions that leverage finality guarantees rather than expensive chain walking for fork detection. The approach of constraining cache lookups to finalized epochs (using CHECKPOINT_INTERVAL >= CHAIN_FINALITY) provides fork safety without the performance cost of ancestry verification.

Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 5930
File: build.rs:64-77
Timestamp: 2025-08-13T09:43:20.301Z
Learning: hanabi1224 prefers hard compile-time errors in build scripts rather than runtime safeguards or collision detection, believing it's better to fail fast and fix root causes of issues like malformed snapshot names.

Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 6057
File: src/cli/subcommands/f3_cmd.rs:0-0
Timestamp: 2025-09-09T10:37:17.947Z
Learning: hanabi1224 prefers having default timeouts (like 10m for --no-progress-timeout) to prevent commands from hanging indefinitely, even when the timeout flag isn't explicitly provided by users. This fail-fast approach is preferred over requiring explicit flag usage.

📚 Learning: 2025-08-11T14:00:47.060Z

Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 5886
File: src/daemon/db_util.rs:236-259
Timestamp: 2025-08-11T14:00:47.060Z
Learning: In Forest's snapshot import (`src/daemon/db_util.rs`), when F3 CID is present in snapshot metadata but the actual F3 data is missing, this should cause a hard error as it indicates snapshot corruption. Only errors during the F3 import process itself (after data is successfully retrieved) should be non-fatal and logged.

Applied to files:

src/chain_sync/tipset_syncer.rs

📚 Learning: 2025-08-08T12:11:55.266Z

Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 5867
File: src/ipld/util.rs:461-487
Timestamp: 2025-08-08T12:11:55.266Z
Learning: Forest (src/ipld/util.rs, Rust): In UnorderedChainStream::poll_next, dropping `extract_sender` (when no more tipsets and the extract queue is empty) is the intended shutdown signal for workers. Any subsequent attempt to enqueue work after this drop is a logic error and should be treated as an error; do not change `send()` to ignore a missing sender.

Applied to files:

src/chain_sync/tipset_syncer.rs

🧬 Code graph analysis (1)

src/chain_sync/chain_follower.rs (1)

src/chain_sync/tipset_syncer.rs (1)

validate_tipset (100-147)

🪛 GitHub Actions: Link Checker

src/chain_sync/bad_block_cache.rs

[warning] 1-1: Lychee: Error creating request: InvalidPathToUri("/")

src/chain_sync/tipset_syncer.rs

[warning] 1-1: Lychee: Error creating request: InvalidPathToUri("/")

src/chain_sync/chain_follower.rs

[warning] 1-1: Lychee: Error creating request: InvalidPathToUri("/")

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)

GitHub Check: Build MacOS
GitHub Check: cargo-publish-dry-run
GitHub Check: Build Ubuntu
GitHub Check: All lint checks
GitHub Check: tests-release
GitHub Check: tests
GitHub Check: Build forest binaries on Linux AMD64
GitHub Check: Check

🔇 Additional comments (2)

src/chain_sync/tipset_syncer.rs (1)

105-106: Don’t mark transient validation errors as permanently bad (exclude TipsetParentNotFound).

Current logic excludes only TimeTravellingBlock; TipsetParentNotFound is also transient (see Lines 196–205), but will be pushed to bad_block_cache, causing permanent rejection. Align with Lotus’ isPermanent(err) idea.

Apply within validate_tipset:

-                warn!("Validating block [CID = {cid}] in EPOCH = {epoch} failed: {why}");
-                match &why {
-                    TipsetSyncerError::TimeTravellingBlock(_, _) => {
-                        // Do not mark a block as bad for temporary errors.
-                        // See <https://github.com/filecoin-project/lotus/blob/v1.34.1/chain/sync.go#L602> in Lotus
-                    }
-                    _ => {
-                        if let Some(bad_block_cache) = bad_block_cache {
-                            bad_block_cache.push(cid);
-                        }
-                    }
-                };
+                warn!(epoch, %cid, error = %why, "block validation failed");
+                let is_temp = matches!(
+                    &why,
+                    TipsetSyncerError::TimeTravellingBlock(_, _)
+                        | TipsetSyncerError::TipsetParentNotFound(_)
+                );
+                if !is_temp {
+                    if let Some(ref bad_block_cache) = bad_block_cache {
+                        bad_block_cache.push(cid);
+                    }
+                }

Optionally extract the predicate:

fn is_temporary_sync_error(e: &TipsetSyncerError) -> bool {
    matches!(e,
        TipsetSyncerError::TimeTravellingBlock(_, _)
        | TipsetSyncerError::TipsetParentNotFound(_)
    )
}

Then use if !is_temporary_sync_error(&why) { … }.

Also applies to: 129-140

⛔ Skipped due to learnings

Learnt from: hanabi1224
Repo: ChainSafe/forest PR: 5886
File: src/daemon/db_util.rs:236-259
Timestamp: 2025-08-11T14:00:47.060Z
Learning: In Forest's snapshot import (`src/daemon/db_util.rs`), when F3 CID is present in snapshot metadata but the actual F3 data is missing, this should cause a hard error as it indicates snapshot corruption. Only errors during the F3 import process itself (after data is successfully retrieved) should be non-fatal and logged.

src/chain_sync/chain_follower.rs (1)

233-234: Ripgrep found the validate_tipset call site as expected, but cargo check failed due to environment limitations.

The rg search located the call to validate_tipset at line 851, which falls within the changed range (852–859). However, the sandbox environment lacks a C compiler, preventing cargo check from completing. The structural changes appear correct, but manual verification in your local environment is required to confirm compilation:
cargo check -q
Also applies to: 260-261, 829-835, 852-859

fix: do not mark time traveling blocks as bad

f01944b

changelog

19eda8e

hanabi1224 marked this pull request as ready for review November 10, 2025 15:38

hanabi1224 requested a review from a team as a code owner November 10, 2025 15:38

hanabi1224 requested review from LesnyRumcajs and sudo-shashank and removed request for a team November 10, 2025 15:38

LesnyRumcajs approved these changes Nov 10, 2025

View reviewed changes

coderabbitai bot reviewed Nov 10, 2025

View reviewed changes

hanabi1224 mentioned this pull request Nov 10, 2025

Synchronization backwards cannot be restored #6042

Closed

Merge branch 'main' into hm/do-not-mark-time-traveling-blocks-as-bad

a2b7fc6

hanabi1224 enabled auto-merge November 10, 2025 16:15

hanabi1224 added this pull request to the merge queue Nov 10, 2025

Merged via the queue into main with commit 21cc6c6 Nov 10, 2025
51 checks passed

hanabi1224 deleted the hm/do-not-mark-time-traveling-blocks-as-bad branch November 10, 2025 17:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: do not mark time traveling blocks as bad #6241

fix: do not mark time traveling blocks as bad #6241

Uh oh!

hanabi1224 commented Nov 10, 2025 •

edited

Loading

Uh oh!

coderabbitai bot commented Nov 10, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fix: do not mark time traveling blocks as bad #6241

fix: do not mark time traveling blocks as bad #6241

Uh oh!

Conversation

hanabi1224 commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary of changes

Reference issue to close (if applicable)

Other information and links

Change checklist

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Nov 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hanabi1224 commented Nov 10, 2025 •

edited

Loading

coderabbitai bot commented Nov 10, 2025 •

edited

Loading