Conversation

Contributor

@hanabi1224 hanabi1224 commented Oct 15, 2025

Summary of changes

This fixes a rare condition where GC kicks in during node catchup and breaks block validation.

Changes introduced in this pull request:

  • Also run cargo update -p half to update the yanked half@2.7.0 dependency.

Reference issue to close (if applicable)

Closes

Other information and links

Change checklist

  • I have performed a self-review of my own code,
  • I have made corresponding changes to the documentation. All new code adheres to the team's documentation standards,
  • I have added tests that prove my fix is effective or that my feature works (if possible),
  • I have made sure the CHANGELOG is up-to-date. All user-facing changes should be reflected in this document.

Summary by CodeRabbit

  • New Features

    • Live sync status is now forwarded at startup so background services act on current node sync state.
  • Improvements

    • Snapshot garbage collector only runs when the node is near the network head.
    • Scheduling logs now include network head context.
    • Snapshot export logs report total duration for monitoring.
  • Bug Fixes

    • Export failures now reset state to allow re-triggering of export attempts.

Contributor

coderabbitai bot commented Oct 15, 2025

Walkthrough

Threads a new SyncStatus (Arc<RwLock<SyncStatusReport>>) type through the codebase: exported from chain_sync, used in ChainFollower, passed into daemon start_services, stored in SnapshotGarbageCollector, and used to gate GC scheduling and improve logging/timing.

Changes

Cohort / File(s) Summary of changes
Daemon startup wiring
src/daemon/mod.rs
- Import SyncStatus and ChainFollower from crate::chain_sync.
- Change start_services callback signature to accept SyncStatus (impl FnOnce(&AppContext, SyncStatus)).
- Pass the chain follower's sync_status into start_services and call snap_gc.set_sync_status(sync_status).
Snapshot GC scheduler & logging
src/db/gc/snapshot.rs
- Add sync_status: RwLock<Option<crate::chain_sync::SyncStatus>> field and pub fn set_sync_status(&self, sync_status: crate::chain_sync::SyncStatus).
- Scheduler now reads network_head_epoch and current_head_epoch from SyncStatus, gates exports on sync flags, fork state, and epoch progression relative to CAR DB head.
- Add timing instrumentation and duration logs around lite snapshot and full export; adjust event loop error handling (reset running on export error).
Chain sync types & API
src/chain_sync/sync_status.rs, src/chain_sync/mod.rs, src/chain_sync/chain_follower.rs
- Add pub type SyncStatus = Arc<RwLock<SyncStatusReport>> and import parking_lot::RwLock.
- Re-export SyncStatus from chain_sync module.
- Change ChainFollower::sync_status field and chain_follower(...) parameter types to SyncStatus.
Health & RPC state updates
src/health/mod.rs, src/rpc/mod.rs
- Replace stored sync status types with crate::chain_sync::SyncStatus (ForestState.sync_status, RPCState.sync_status).
- Update imports and tests to use the new SyncStatus alias.
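
A minimal sketch of the alias and re-export described in the "Chain sync types & API" row above (names taken from the summary; exact module layout assumed from the file list):

// src/chain_sync/sync_status.rs
use std::sync::Arc;
use parking_lot::RwLock;

/// Shared handle to the latest sync status report.
pub type SyncStatus = Arc<RwLock<SyncStatusReport>>;

// src/chain_sync/mod.rs
pub use sync_status::SyncStatus;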

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Daemon
  participant ChainFollower
  participant SyncStatus as "SyncStatus\n(Arc<RwLock<SyncStatusReport>>)"
  participant SnapGC as SnapshotGarbageCollector
  participant Services as "start_services callback"

  Daemon->>ChainFollower: start()
  ChainFollower-->>Daemon: chain_follower.sync_status (SyncStatus)
  Daemon->>SnapGC: set_sync_status(sync_status)
  Note right of SnapGC #e8f5e9: stores SyncStatus handle
  Daemon->>Services: start_services(ctx, sync_status)
sequenceDiagram
  autonumber
  participant SnapGC as SnapshotGarbageCollector
  participant SyncStatus as "SyncStatus\n(Arc<RwLock<SyncStatusReport>>)"
  participant CARDB as "CAR DB"
  participant Export as "Snapshot Export"

  loop Scheduler tick
    SnapGC->>SyncStatus: read current_head_epoch, network_head_epoch, sync flags
    SnapGC->>CARDB: read car_db_head_epoch
    alt in-sync && no forks && epoch progressed
      SnapGC->>Export: run export/GC
      Export-->>SnapGC: completed
      SnapGC->>SnapGC: compute & log duration
    else not scheduled
      SnapGC->>SnapGC: log "not scheduled (network_head_epoch=...)" 
    end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • elmattic
  • akaladarshi
  • LesnyRumcajs

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 11.11%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check ✅ Passed: The title clearly and accurately summarizes the primary change by stating that automatic garbage collection will only run when the node is in sync, matching the PR’s objective to prevent GC during catchup. It is concise, specific, and follows the project’s conventional commit style. This phrasing allows teammates to quickly understand the fix’s purpose when scanning the history.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch hm/schedule-gc-only-when-chain-in-sync

Comment @coderabbitai help to get the list of available commands and usage tips.

@hanabi1224 hanabi1224 marked this pull request as ready for review October 15, 2025 11:18
@hanabi1224 hanabi1224 requested a review from a team as a code owner October 15, 2025 11:18
@hanabi1224 hanabi1224 requested review from akaladarshi and elmattic and removed request for a team October 15, 2025 11:18
{
let head_epoch = head.epoch();
if head_epoch - car_db_head_epoch >= snap_gc_interval_epochs
const IN_SYNC_EPOCH_THRESHOLD: i64 = 2;
Contributor Author

@hanabi1224 hanabi1224 Oct 15, 2025

I notice sync_status.status does not reflect reality after a long catchup (potential bug?), so I'm checking head_epoch + IN_SYNC_EPOCH_THRESHOLD >= network_head_epoch manually here.

Collaborator

@akaladarshi akaladarshi Oct 15, 2025

This is a potential bug in SyncStatusReport; I will file an issue for it.

Collaborator

This is because we are only using the time diff there:

let time_diff = now_ts.saturating_sub(heaviest.min_timestamp());

if time_diff < seconds_per_epoch as u64 * SYNCED_EPOCH_THRESHOLD {
    NodeSyncStatus::Synced
} else {
    NodeSyncStatus::Syncing
}

But you can use

if sync_status.is_synced() && sync_status.active_forks.is_empty()

Contributor Author

Fixed!

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/db/gc/snapshot.rs (1)

141-151: Critical: running flag stays true on export failure, blocking future GC

If export_snapshot() returns Err, running is never reset, so the scheduler will never trigger again.

Proposed fix:

             } else {
                 self.running.store(true, Ordering::Relaxed);
-                if let Err(e) = self.export_snapshot().await {
-                    tracing::warn!("{e}");
-                }
+                match self.export_snapshot().await {
+                    Ok(_) => {}
+                    Err(e) => {
+                        tracing::warn!("{e}");
+                        // Ensure scheduler can re-trigger after a failed run
+                        self.running.store(false, Ordering::Relaxed);
+                    }
+                }
             }
🧹 Nitpick comments (1)
src/db/gc/snapshot.rs (1)

178-196: Scheduling gates look right; make threshold configurable

Logic correctly prevents GC unless in-sync and interval elapsed. Consider making the in-sync threshold tunable via env/config for different nets.

Apply within this block:

-                const IN_SYNC_EPOCH_THRESHOLD: i64 = 2;
-                let sync_status = &*sync_status.read();
+                let in_sync_epoch_threshold: i64 = std::env::var("FOREST_SNAPSHOT_GC_IN_SYNC_EPOCH_THRESHOLD")
+                    .ok()
+                    .and_then(|v| v.parse().ok())
+                    .unwrap_or(2);
+                let sync_status = &*sync_status.read();
                 let network_head_epoch = sync_status.network_head_epoch;
                 let head_epoch = sync_status.current_head_epoch;
                 if head_epoch > 0
-                    && head_epoch <= network_head_epoch
-                    && head_epoch + IN_SYNC_EPOCH_THRESHOLD >= network_head_epoch
+                    && head_epoch <= network_head_epoch
+                    && head_epoch + in_sync_epoch_threshold >= network_head_epoch
                     && head_epoch - car_db_head_epoch >= snap_gc_interval_epochs
                     && self.trigger_tx.try_send(()).is_ok()
                 {
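
If adopted, this suggestion would let operators tune the in-sync threshold per network by setting the FOREST_SNAPSHOT_GC_IN_SYNC_EPOCH_THRESHOLD environment variable (the name proposed in the diff above) before starting the daemon; otherwise the hard-coded default of 2 epochs applies.
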
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2f2b0d9 and 863ddf8.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • src/daemon/mod.rs (5 hunks)
  • src/db/gc/snapshot.rs (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/db/gc/snapshot.rs (1)
src/daemon/mod.rs (1)
  • start (532-574)
src/daemon/mod.rs (2)
src/db/gc/snapshot.rs (2)
  • set_db (129-131)
  • set_sync_status (137-139)
src/chain_sync/chain_follower.rs (1)
  • chain_follower (134-331)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: tests
  • GitHub Check: tests-release
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: Build Ubuntu
  • GitHub Check: Build MacOS
  • GitHub Check: All lint checks
🔇 Additional comments (4)
src/db/gc/snapshot.rs (2)

77-77: Sync status plumbing: LGTM

Field, init, and setter are correct and align with daemon propagation.

Also applies to: 115-116, 137-139


249-253: Export duration logging: LGTM

Clear, human-readable timing; consistent with other logs.

src/daemon/mod.rs (2)

565-569: Wiring sync_status into GC: LGTM

Correctly sets DB, sync_status, and CAR DB head epoch in the init callback.


581-581: Use FnOnce for the init callback
The callback runs once; FnOnce is more flexible (allows move captures) with no downside here.

-pub(super) async fn start_services(
+pub(super) async fn start_services(
     start_time: chrono::DateTime<chrono::Utc>,
     opts: &CliOpts,
     mut config: Config,
     shutdown_send: mpsc::Sender<()>,
-    on_app_context_and_db_initialized: impl Fn(&AppContext, Arc<RwLock<SyncStatusReport>>),
+    on_app_context_and_db_initialized: impl FnOnce(&AppContext, Arc<RwLock<SyncStatusReport>>),
) -> anyhow::Result<()> {

All call sites already pass closures compatible with FnOnce.

running: AtomicBool,
blessed_lite_snapshot: RwLock<Option<PathBuf>>,
db: RwLock<Option<Arc<DB>>>,
sync_status: RwLock<Option<Arc<RwLock<crate::chain_sync::SyncStatusReport>>>>,
Collaborator

Do you think this will be better here?

RwLock<Option<Arc<RwLock<crate::chain_sync::SyncStatusReport>>>> -> OnceLock<Arc<RwLock<crate::chain_sync::SyncStatusReport>>>

This way we can avoid doing: &*sync_status.read()
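
For illustration, a hypothetical sketch of the suggested OnceLock shape (field and type names follow the diff and discussion above; not what was merged):

use std::sync::{Arc, OnceLock};
use parking_lot::RwLock;
use crate::chain_sync::SyncStatusReport;

struct SnapshotGarbageCollector {
    // Suggested: set once at startup, then read without unwrapping a nested Option.
    sync_status: OnceLock<Arc<RwLock<SyncStatusReport>>>,
    // other fields elided
}

impl SnapshotGarbageCollector {
    fn set_sync_status(&self, s: Arc<RwLock<SyncStatusReport>>) {
        // OnceLock::set returns Err if the value is already initialized,
        // so this only works if the handle is set a single time.
        let _ = self.sync_status.set(s);
    }

    fn current_head_epoch(&self) -> Option<i64> {
        self.sync_status.get().map(|s| s.read().current_head_epoch)
    }
}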

Contributor Author

It's set more than once; it gets set again after every GC.

Collaborator

Yeah, I missed that this GC restarts the node.

Contributor

Why do we need two RwLocks here?

Contributor

Could we simplify this type?

Contributor Author

@elmattic simplified the type by adding pub type SyncStatus = Arc<RwLock<SyncStatusReport>>;
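
With the alias in place, the nested GC field from the diff above reduces to the form described in the walkthrough (a sketch):

sync_status: RwLock<Option<crate::chain_sync::SyncStatus>>, // SyncStatus = Arc<RwLock<SyncStatusReport>>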

@hanabi1224 hanabi1224 force-pushed the hm/schedule-gc-only-when-chain-in-sync branch from 794c82a to 94f7193 on October 15, 2025 13:11
@hanabi1224 hanabi1224 force-pushed the hm/schedule-gc-only-when-chain-in-sync branch from 94f7193 to da3f9f6 on October 15, 2025 13:13
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/db/gc/snapshot.rs (1)

186-189: Consider adding active forks check.

Based on past review discussions, it was suggested to also check sync_status.active_forks.is_empty() in addition to verifying the sync status. Running GC during active fork resolution could potentially interfere with block validation.

Apply this diff to add the active forks check:

             if head_epoch > 0
                 && head_epoch <= network_head_epoch
                 && sync_status.status == NodeSyncStatus::Synced
+                && sync_status.active_forks.is_empty()
                 && head_epoch - car_db_head_epoch >= snap_gc_interval_epochs
                 && self.trigger_tx.try_send(()).is_ok()

Based on past review comments.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 863ddf8 and 794c82a.

📒 Files selected for processing (2)
  • src/daemon/mod.rs (5 hunks)
  • src/db/gc/snapshot.rs (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/daemon/mod.rs
🧰 Additional context used
🧬 Code graph analysis (1)
src/db/gc/snapshot.rs (1)
src/daemon/mod.rs (1)
  • start (532-574)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: cargo-publish-dry-run
  • GitHub Check: All lint checks
  • GitHub Check: tests-release
  • GitHub Check: Build Ubuntu
  • GitHub Check: tests
  • GitHub Check: Build MacOS
🔇 Additional comments (1)
src/db/gc/snapshot.rs (1)

249-253: LGTM! Good observability improvement.

The addition of export duration logging is helpful for monitoring GC performance and diagnosing potential issues.

let sync_status = &*sync_status.read();
let network_head_epoch = sync_status.network_head_epoch;
let head_epoch = sync_status.current_head_epoch;
if head_epoch > 0
Contributor

@elmattic elmattic Oct 15, 2025

nit: this test condition is quite a beast, maybe it's time to break it down into functions and add some comments.

Contributor Author

I've added a comment to each condition.

elmattic
elmattic previously approved these changes Oct 15, 2025
@hanabi1224 hanabi1224 force-pushed the hm/schedule-gc-only-when-chain-in-sync branch from a593c5c to 710f042 on October 15, 2025 17:41
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/db/gc/snapshot.rs (1)

148-149: Resetting running flag on export error — nice catch

This addresses the earlier concern about being unable to re-trigger GC after a failure.
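
The merged behaviour is presumably similar in shape to the fix proposed earlier in this review (a sketch, not the exact merged code):

if let Err(e) = self.export_snapshot().await {
    tracing::warn!("{e}");
    // Reset the flag so the scheduler can trigger another export later.
    self.running.store(false, Ordering::Relaxed);
}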

🧹 Nitpick comments (1)
src/db/gc/snapshot.rs (1)

181-196: Make the scheduling predicate clearer and reusable

Logic is correct. For readability and testability, consider extracting the predicate and making the non-negative gap explicit.

Apply this diff to factor the check and clarify the gap:

@@
-        loop {
-            if !self.running.load(Ordering::Relaxed)
-                && let Some(car_db_head_epoch) = *self.car_db_head_epoch.read()
-                && let Some(sync_status) = &*self.sync_status.read()
-            {
-                let sync_status = &*sync_status.read();
-                let network_head_epoch = sync_status.network_head_epoch;
-                let head_epoch = sync_status.current_head_epoch;
-                if head_epoch > 0 // sync_status has been initialized
-                    && head_epoch <= network_head_epoch // head epoch is within a sane range
-                    && sync_status.is_synced() // chain is in sync
-                    && sync_status.active_forks.is_empty() // no active fork
-                    && head_epoch - car_db_head_epoch >= snap_gc_interval_epochs // the gap between chain head and car_db head is above threshold
-                    && self.trigger_tx.try_send(()).is_ok()
-                {
-                    tracing::info!(%car_db_head_epoch, %head_epoch, %network_head_epoch, %snap_gc_interval_epochs, "Snap GC scheduled");
-                } else {
-                    tracing::debug!(%car_db_head_epoch, %head_epoch, %network_head_epoch, %snap_gc_interval_epochs, "Snap GC not scheduled");
-                }
-            }
+        loop {
+            if !self.running.load(Ordering::Relaxed)
+                && let Some(car_db_head_epoch) = *self.car_db_head_epoch.read()
+                && let Some(sync_status) = &*self.sync_status.read()
+            {
+                let report = &*sync_status.read();
+                let network_head_epoch = report.network_head_epoch;
+                let head_epoch = report.current_head_epoch;
+                let gap_ok = head_epoch >= car_db_head_epoch
+                    && (head_epoch - car_db_head_epoch) >= snap_gc_interval_epochs;
+                if should_schedule_gc(head_epoch, network_head_epoch, report, gap_ok)
+                    && self.trigger_tx.try_send(()).is_ok()
+                {
+                    tracing::info!(%car_db_head_epoch, %head_epoch, %network_head_epoch, %snap_gc_interval_epochs, "Snap GC scheduled");
+                } else {
+                    tracing::debug!(%car_db_head_epoch, %head_epoch, %network_head_epoch, %snap_gc_interval_epochs, "Snap GC not scheduled");
+                }
+            }
             tokio::time::sleep(snap_gc_check_interval).await;
         }

Add this helper inside impl:

fn should_schedule_gc(
    head_epoch: ChainEpoch,
    network_head_epoch: ChainEpoch,
    report: &crate::chain_sync::SyncStatusReport,
    gap_ok: bool,
) -> bool {
    head_epoch > 0
        && head_epoch <= network_head_epoch
        && report.is_synced()
        && report.active_forks.is_empty()
        && gap_ok
}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4000863 and 710f042.

📒 Files selected for processing (7)
  • src/chain_sync/chain_follower.rs (2 hunks)
  • src/chain_sync/mod.rs (1 hunks)
  • src/chain_sync/sync_status.rs (2 hunks)
  • src/daemon/mod.rs (4 hunks)
  • src/db/gc/snapshot.rs (6 hunks)
  • src/health/mod.rs (3 hunks)
  • src/rpc/mod.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/rpc/mod.rs
  • src/daemon/mod.rs
  • src/health/mod.rs
🧰 Additional context used
🧬 Code graph analysis (2)
src/chain_sync/chain_follower.rs (1)
src/chain_sync/tipset_syncer.rs (1)
  • validate_tipset (99-139)
src/db/gc/snapshot.rs (2)
src/chain_sync/chain_follower.rs (2)
  • new (83-110)
  • new (545-556)
src/daemon/mod.rs (1)
  • start (531-573)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
  • GitHub Check: Devnet checks
  • GitHub Check: V1 snapshot export checks
  • GitHub Check: Calibnet no discovery checks
  • GitHub Check: db-migration-checks
  • GitHub Check: State migrations
  • GitHub Check: Wallet tests
  • GitHub Check: Bootstrap checks - Forest
  • GitHub Check: V2 snapshot export checks
  • GitHub Check: Bootstrap checks - Lotus
  • GitHub Check: Calibnet eth mapping check
  • GitHub Check: Calibnet api test-stateful check
  • GitHub Check: Diff snapshot export checks
  • GitHub Check: Calibnet kademlia checks
  • GitHub Check: Calibnet stateless RPC check
  • GitHub Check: Calibnet stateless mode check
  • GitHub Check: Forest CLI checks
  • GitHub Check: Calibnet check
  • GitHub Check: Build forest binaries on Linux AMD64
  • GitHub Check: All lint checks
  • GitHub Check: tests-release
  • GitHub Check: tests
🔇 Additional comments (5)
src/chain_sync/sync_status.rs (1)

10-10: Good aliasing of sync status with parking_lot RwLock

The SyncStatus alias and the parking_lot::RwLock import simplify signatures and keep locking lightweight. LGTM.

Based on learnings

Also applies to: 105-106

src/db/gc/snapshot.rs (2)

77-77: Plumbing SyncStatus into GC is clean and consistent

Storing and setting the SyncStatus handle looks good and aligns with the new alias across the codebase.

Also applies to: 115-116, 137-140


233-254: Nice timing instrumentation for snapshot export

Measuring and logging export duration improves observability. LGTM.

src/chain_sync/mod.rs (1)

19-21: Re-exporting SyncStatus is the right API polish

This avoids leaking concrete lock types across modules and simplifies imports.

src/chain_sync/chain_follower.rs (1)

47-48: Adopting SyncStatus alias throughout ChainFollower looks correct

Field, constructor, function signature, and update site are consistently switched to the alias; lock usage is appropriate and scoped.

Also applies to: 94-95, 137-138, 242-248

Collaborator

@akaladarshi akaladarshi left a comment

LGTM. Nice Work!

@hanabi1224 hanabi1224 enabled auto-merge October 16, 2025 10:02
@hanabi1224 hanabi1224 added this pull request to the merge queue Oct 16, 2025
Merged via the queue into main with commit e0eefb0 Oct 16, 2025
40 checks passed
@hanabi1224 hanabi1224 deleted the hm/schedule-gc-only-when-chain-in-sync branch October 16, 2025 10:27