fix(gc): schedule automatic GC only when the node is in sync #6165
Conversation
Walkthrough
Threads a new SyncStatus handle (Arc<RwLock<SyncStatusReport>>) from the chain follower into the snapshot garbage collector so that automatic GC is scheduled only when the node is in sync.

Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Daemon
participant ChainFollower
participant SyncStatus as "SyncStatus\n(Arc<RwLock<SyncStatusReport>>)"
participant SnapGC as SnapshotGarbageCollector
participant Services as "start_services callback"
Daemon->>ChainFollower: start()
ChainFollower-->>Daemon: chain_follower.sync_status (SyncStatus)
Daemon->>SnapGC: set_sync_status(sync_status)
Note right of SnapGC #e8f5e9: stores SyncStatus handle
Daemon->>Services: start_services(ctx, sync_status)
sequenceDiagram
autonumber
participant SnapGC as SnapshotGarbageCollector
participant SyncStatus as "SyncStatus\n(Arc<RwLock<SyncStatusReport>>)"
participant CARDB as "CAR DB"
participant Export as "Snapshot Export"
loop Scheduler tick
SnapGC->>SyncStatus: read current_head_epoch, network_head_epoch, sync flags
SnapGC->>CARDB: read car_db_head_epoch
alt in-sync && no forks && epoch progressed
SnapGC->>Export: run export/GC
Export-->>SnapGC: completed
SnapGC->>SnapGC: compute & log duration
else not scheduled
SnapGC->>SnapGC: log "not scheduled (network_head_epoch=...)"
end
end
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
src/db/gc/snapshot.rs
Outdated
{
    let head_epoch = head.epoch();
    if head_epoch - car_db_head_epoch >= snap_gc_interval_epochs
const IN_SYNC_EPOCH_THRESHOLD: i64 = 2;
I notice sync_status.status does not reflect the reality after a long catchup (potential bug?), so I'm checking head_epoch + IN_SYNC_EPOCH_THRESHOLD >= network_head_epoch manually here.
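For illustration only, a minimal sketch of the manual check being described, assuming both epochs are plain i64 values; the helper name is made up and is not part of the PR:

```rust
// Treat the node as effectively in sync when its head is at most a couple of
// epochs behind the network head, regardless of what sync_status.status says.
const IN_SYNC_EPOCH_THRESHOLD: i64 = 2;

fn is_effectively_in_sync(head_epoch: i64, network_head_epoch: i64) -> bool {
    head_epoch > 0 && head_epoch + IN_SYNC_EPOCH_THRESHOLD >= network_head_epoch
}
```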
This is a potential bug in SyncStatusReport; I will file an issue for it.
This is because we are only using the time diff there:

let time_diff = now_ts.saturating_sub(heaviest.min_timestamp());
if time_diff < seconds_per_epoch as u64 * SYNCED_EPOCH_THRESHOLD {
    NodeSyncStatus::Synced
} else {
    NodeSyncStatus::Syncing
}

But you can use:

if sync_status.is_synced() && sync_status.active_forks.is_empty()
Fixed!
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/db/gc/snapshot.rs (1)
141-151: Critical: running flag stays true on export failure, blocking future GC

If export_snapshot() returns Err, running is never reset, so the scheduler will never trigger again.
Proposed fix:
 } else {
     self.running.store(true, Ordering::Relaxed);
-    if let Err(e) = self.export_snapshot().await {
-        tracing::warn!("{e}");
-    }
+    match self.export_snapshot().await {
+        Ok(_) => {}
+        Err(e) => {
+            tracing::warn!("{e}");
+            // Ensure scheduler can re-trigger after a failed run
+            self.running.store(false, Ordering::Relaxed);
+        }
+    }
 }
🧹 Nitpick comments (1)
src/db/gc/snapshot.rs (1)
178-196: Scheduling gates look right; make threshold configurable

Logic correctly prevents GC unless in-sync and interval elapsed. Consider making the in-sync threshold tunable via env/config for different nets.
Apply within this block:
-    const IN_SYNC_EPOCH_THRESHOLD: i64 = 2;
-    let sync_status = &*sync_status.read();
+    let in_sync_epoch_threshold: i64 = std::env::var("FOREST_SNAPSHOT_GC_IN_SYNC_EPOCH_THRESHOLD")
+        .ok()
+        .and_then(|v| v.parse().ok())
+        .unwrap_or(2);
+    let sync_status = &*sync_status.read();
     let network_head_epoch = sync_status.network_head_epoch;
     let head_epoch = sync_status.current_head_epoch;
     if head_epoch > 0
-        && head_epoch <= network_head_epoch
-        && head_epoch + IN_SYNC_EPOCH_THRESHOLD >= network_head_epoch
+        && head_epoch <= network_head_epoch
+        && head_epoch + in_sync_epoch_threshold >= network_head_epoch
         && head_epoch - car_db_head_epoch >= snap_gc_interval_epochs
         && self.trigger_tx.try_send(()).is_ok()
     {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
src/daemon/mod.rs (5 hunks)
src/db/gc/snapshot.rs (6 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/db/gc/snapshot.rs (1)
src/daemon/mod.rs (1)
start(532-574)
src/daemon/mod.rs (2)
src/db/gc/snapshot.rs (2)
set_db (129-131), set_sync_status (137-139)
src/chain_sync/chain_follower.rs (1)
chain_follower(134-331)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: tests
- GitHub Check: tests-release
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: cargo-publish-dry-run
- GitHub Check: Build Ubuntu
- GitHub Check: Build MacOS
- GitHub Check: All lint checks
🔇 Additional comments (4)
src/db/gc/snapshot.rs (2)
77-77: Sync status plumbing: LGTM

Field, init, and setter are correct and align with daemon propagation.
Also applies to: 115-116, 137-139
249-253: Export duration logging: LGTM

Clear, human-readable timing; consistent with other logs.
src/daemon/mod.rs (2)
565-569: Wiring sync_status into GC: LGTM

Correctly sets DB, sync_status, and CAR DB head epoch in the init callback.
581-581: Use FnOnce for the init callback
The callback runs once; FnOnce is more flexible (allows move captures) with no downside here.

-pub(super) async fn start_services(
+pub(super) async fn start_services(
     start_time: chrono::DateTime<chrono::Utc>,
     opts: &CliOpts,
     mut config: Config,
     shutdown_send: mpsc::Sender<()>,
-    on_app_context_and_db_initialized: impl Fn(&AppContext, Arc<RwLock<SyncStatusReport>>),
+    on_app_context_and_db_initialized: impl FnOnce(&AppContext, Arc<RwLock<SyncStatusReport>>),
 ) -> anyhow::Result<()> {

All call sites already pass closures compatible with FnOnce.
src/db/gc/snapshot.rs
Outdated
running: AtomicBool,
blessed_lite_snapshot: RwLock<Option<PathBuf>>,
db: RwLock<Option<Arc<DB>>>,
sync_status: RwLock<Option<Arc<RwLock<crate::chain_sync::SyncStatusReport>>>>,
Do you think this will be better here?
RwLock<Option<Arc<RwLock<crate::chain_sync::SyncStatusReport>>>> -> OnceLock<Arc<RwLock<crate::chain_sync::SyncStatusReport>>>
This way we can avoid doing: &*sync_status.read()
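A rough sketch of what the suggested OnceLock shape could look like; the struct and report fields here are simplified stand-ins rather than the actual Forest types, and parking_lot is assumed for the inner lock:

```rust
use std::sync::{Arc, OnceLock};

use parking_lot::RwLock;

// Simplified stand-in for crate::chain_sync::SyncStatusReport.
#[derive(Default)]
struct SyncStatusReport {
    current_head_epoch: i64,
}

struct SnapshotGarbageCollector {
    // OnceLock removes both the outer RwLock and the Option wrapper.
    sync_status: OnceLock<Arc<RwLock<SyncStatusReport>>>,
}

impl SnapshotGarbageCollector {
    fn set_sync_status(&self, status: Arc<RwLock<SyncStatusReport>>) {
        // `set` only succeeds on the first call; later calls are ignored here,
        // which is why this shape does not fit if the handle must be replaced
        // after every GC run (see the reply below).
        let _ = self.sync_status.set(status);
    }

    fn current_head_epoch(&self) -> Option<i64> {
        // No `&*sync_status.read()` on the wrapper: just `get()` the handle.
        self.sync_status
            .get()
            .map(|status| status.read().current_head_epoch)
    }
}
```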
It's set more than once; it gets set again after every GC.
Yeah, I missed that this GC restarts the node.
Why do we need two RwLocks here?
Could we simplify this type?
@elmattic simplified the type by adding pub type SyncStatus = Arc<RwLock<SyncStatusReport>>;
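For reference, a minimal sketch of how the alias shortens the signatures; the stand-in report struct and the free function are illustrative only, and parking_lot is assumed per the review note further down:

```rust
use std::sync::Arc;

use parking_lot::RwLock;

// Simplified stand-in for crate::chain_sync::SyncStatusReport.
#[derive(Default)]
pub struct SyncStatusReport {
    pub current_head_epoch: i64,
}

/// Shared, cheaply clonable handle to the node's sync status report.
pub type SyncStatus = Arc<RwLock<SyncStatusReport>>;

// The GC field then reads as RwLock<Option<SyncStatus>> instead of
// RwLock<Option<Arc<RwLock<SyncStatusReport>>>>, and setters take SyncStatus.
pub fn set_sync_status(slot: &RwLock<Option<SyncStatus>>, sync_status: SyncStatus) {
    *slot.write() = Some(sync_status);
}
```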
794c82a to 94f7193
94f7193 to da3f9f6
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/db/gc/snapshot.rs (1)
186-189: Consider adding active forks check.

Based on past review discussions, it was suggested to also check sync_status.active_forks.is_empty() in addition to verifying the sync status. Running GC during active fork resolution could potentially interfere with block validation.

Apply this diff to add the active forks check:
 if head_epoch > 0
     && head_epoch <= network_head_epoch
     && sync_status.status == NodeSyncStatus::Synced
+    && sync_status.active_forks.is_empty()
     && head_epoch - car_db_head_epoch >= snap_gc_interval_epochs
     && self.trigger_tx.try_send(()).is_ok()
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/daemon/mod.rs (5 hunks)
src/db/gc/snapshot.rs (7 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/daemon/mod.rs
🧰 Additional context used
🧬 Code graph analysis (1)
src/db/gc/snapshot.rs (1)
src/daemon/mod.rs (1)
start(532-574)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: cargo-publish-dry-run
- GitHub Check: All lint checks
- GitHub Check: tests-release
- GitHub Check: Build Ubuntu
- GitHub Check: tests
- GitHub Check: Build MacOS
🔇 Additional comments (1)
src/db/gc/snapshot.rs (1)
249-253: LGTM! Good observability improvement.

The addition of export duration logging is helpful for monitoring GC performance and diagnosing potential issues.
src/db/gc/snapshot.rs
Outdated
let sync_status = &*sync_status.read();
let network_head_epoch = sync_status.network_head_epoch;
let head_epoch = sync_status.current_head_epoch;
if head_epoch > 0
nit: this test condition is quite a beast, maybe it's time to break it down into functions and add some comments.
I've added a comment to each condition.
a593c5c to 710f042
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/db/gc/snapshot.rs (1)
148-149: Resetting running flag on export error: nice catch

This addresses the earlier concern about being unable to re-trigger GC after a failure.
🧹 Nitpick comments (1)
src/db/gc/snapshot.rs (1)
181-196: Make the scheduling predicate clearer and reusable

Logic is correct. For readability and testability, consider extracting the predicate and making the non-negative gap explicit.
Apply this diff to factor the check and clarify the gap:
@@
-loop {
-    if !self.running.load(Ordering::Relaxed)
-        && let Some(car_db_head_epoch) = *self.car_db_head_epoch.read()
-        && let Some(sync_status) = &*self.sync_status.read()
-    {
-        let sync_status = &*sync_status.read();
-        let network_head_epoch = sync_status.network_head_epoch;
-        let head_epoch = sync_status.current_head_epoch;
-        if head_epoch > 0 // sync_status has been initialized
-            && head_epoch <= network_head_epoch // head epoch is within a sane range
-            && sync_status.is_synced() // chain is in sync
-            && sync_status.active_forks.is_empty() // no active fork
-            && head_epoch - car_db_head_epoch >= snap_gc_interval_epochs // the gap between chain head and car_db head is above threshold
-            && self.trigger_tx.try_send(()).is_ok()
-        {
-            tracing::info!(%car_db_head_epoch, %head_epoch, %network_head_epoch, %snap_gc_interval_epochs, "Snap GC scheduled");
-        } else {
-            tracing::debug!(%car_db_head_epoch, %head_epoch, %network_head_epoch, %snap_gc_interval_epochs, "Snap GC not scheduled");
-        }
-    }
+loop {
+    if !self.running.load(Ordering::Relaxed)
+        && let Some(car_db_head_epoch) = *self.car_db_head_epoch.read()
+        && let Some(sync_status) = &*self.sync_status.read()
+    {
+        let report = &*sync_status.read();
+        let network_head_epoch = report.network_head_epoch;
+        let head_epoch = report.current_head_epoch;
+        let gap_ok = head_epoch >= car_db_head_epoch
+            && (head_epoch - car_db_head_epoch) >= snap_gc_interval_epochs;
+        if should_schedule_gc(head_epoch, network_head_epoch, report, gap_ok)
+            && self.trigger_tx.try_send(()).is_ok()
+        {
+            tracing::info!(%car_db_head_epoch, %head_epoch, %network_head_epoch, %snap_gc_interval_epochs, "Snap GC scheduled");
+        } else {
+            tracing::debug!(%car_db_head_epoch, %head_epoch, %network_head_epoch, %snap_gc_interval_epochs, "Snap GC not scheduled");
+        }
+    }
     tokio::time::sleep(snap_gc_check_interval).await;
 }

Add this helper inside impl:
fn should_schedule_gc(
    head_epoch: ChainEpoch,
    network_head_epoch: ChainEpoch,
    report: &crate::chain_sync::SyncStatusReport,
    gap_ok: bool,
) -> bool {
    head_epoch > 0
        && head_epoch <= network_head_epoch
        && report.is_synced()
        && report.active_forks.is_empty()
        && gap_ok
}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
src/chain_sync/chain_follower.rs (2 hunks)
src/chain_sync/mod.rs (1 hunks)
src/chain_sync/sync_status.rs (2 hunks)
src/daemon/mod.rs (4 hunks)
src/db/gc/snapshot.rs (6 hunks)
src/health/mod.rs (3 hunks)
src/rpc/mod.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- src/rpc/mod.rs
- src/daemon/mod.rs
- src/health/mod.rs
🧰 Additional context used
🧬 Code graph analysis (2)
src/chain_sync/chain_follower.rs (1)
src/chain_sync/tipset_syncer.rs (1)
validate_tipset(99-139)
src/db/gc/snapshot.rs (2)
src/chain_sync/chain_follower.rs (2)
new (83-110), new (545-556)
src/daemon/mod.rs (1)
start(531-573)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
- GitHub Check: Devnet checks
- GitHub Check: V1 snapshot export checks
- GitHub Check: Calibnet no discovery checks
- GitHub Check: db-migration-checks
- GitHub Check: State migrations
- GitHub Check: Wallet tests
- GitHub Check: Bootstrap checks - Forest
- GitHub Check: V2 snapshot export checks
- GitHub Check: Bootstrap checks - Lotus
- GitHub Check: Calibnet eth mapping check
- GitHub Check: Calibnet api test-stateful check
- GitHub Check: Diff snapshot export checks
- GitHub Check: Calibnet kademlia checks
- GitHub Check: Calibnet stateless RPC check
- GitHub Check: Calibnet stateless mode check
- GitHub Check: Forest CLI checks
- GitHub Check: Calibnet check
- GitHub Check: Build forest binaries on Linux AMD64
- GitHub Check: All lint checks
- GitHub Check: tests-release
- GitHub Check: tests
🔇 Additional comments (5)
src/chain_sync/sync_status.rs (1)
10-10: Good aliasing of sync status with parking_lot RwLock

The SyncStatus alias and the parking_lot::RwLock import simplify signatures and keep locking lightweight. LGTM.

Based on learnings
Also applies to: 105-106
src/db/gc/snapshot.rs (2)
77-77: Plumbing SyncStatus into GC is clean and consistent

Storing and setting the SyncStatus handle looks good and aligns with the new alias across the codebase.

Also applies to: 115-116, 137-140
233-254: Nice timing instrumentation for snapshot export

Measuring and logging export duration improves observability. LGTM.
src/chain_sync/mod.rs (1)
19-21: Re-exporting SyncStatus is the right API polish

This avoids leaking concrete lock types across modules and simplifies imports.
src/chain_sync/chain_follower.rs (1)
47-48: Adopting SyncStatus alias throughout ChainFollower looks correct

Field, constructor, function signature, and update site are consistently switched to the alias; lock usage is appropriate and scoped.
Also applies to: 94-95, 137-138, 242-248
akaladarshi
left a comment
LGTM. Nice Work!
Summary of changes
This fixes a rare condition where GC kicks in during node catchup and breaks block validation.
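As a rough sketch (not copied verbatim from the PR), the scheduling gate that results from the review discussion above looks like this, with plain i64 epochs standing in for the crate's epoch type:

```rust
/// Minimal sketch of the snapshot GC scheduling gate; mirrors the conditions
/// quoted in the review comments rather than the exact Forest code.
fn should_schedule_snap_gc(
    head_epoch: i64,
    network_head_epoch: i64,
    car_db_head_epoch: i64,
    is_synced: bool,
    has_active_forks: bool,
    snap_gc_interval_epochs: i64,
) -> bool {
    head_epoch > 0                          // sync status has been initialized
        && head_epoch <= network_head_epoch // head epoch is within a sane range
        && is_synced                        // chain is in sync
        && !has_active_forks                // no fork is being resolved
        && head_epoch - car_db_head_epoch >= snap_gc_interval_epochs // enough epochs since the last snapshot GC
}
```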
Changes introduced in this pull request:
- cargo update -p half to update the yanked half@2.7.0

Reference issue to close (if applicable)
Closes
Other information and links
Change checklist
Summary by CodeRabbit
New Features
Improvements
Bug Fixes