perf(l1): download bytecodes concurrently with storage during snap sync#6205
perf(l1): download bytecodes concurrently with storage during snap sync#6205ilitteri wants to merge 2 commits into
Conversation
…after healing Instead of downloading bytecodes as a sequential phase after healing (15 min on mainnet), spawn the download right after account insertion. Bytecodes are content-addressed (hash=key) so they're safe to download while storage download, insertion, and healing proceed in parallel. Any additional bytecodes discovered during healing are downloaded separately after healing completes. Both geth and nethermind download bytecodes concurrently with storage.
fix metrics conflict, and add progress logging - Filter out already-stored bytecodes before downloading to avoid redundant work on retries or when healing already fetched codes. - Start background bytecode download during account trie building (pass 2) instead of after all insertion completes, using a Notify to coordinate between the code hash collection pass and the trie building pass in insert_accounts. - Add update_metrics parameter to request_bytecodes so the background download doesn't overwrite current_step during concurrent storage operations. - Add cumulative progress logging to download_bytecodes showing downloaded/total with percentage after each chunk.
Greptile OverviewGreptile SummaryMoves bytecode download from sequential execution after healing to concurrent execution during account insertion, reducing snap sync time by ~3 minutes (-3.3%). The implementation spawns a background task that receives code hashes via oneshot channel after the account collection pass, then downloads bytecodes while trie building proceeds. This is safe because bytecodes are content-addressed. Key changes:
Architecture: Confidence Score: 4/5
|
| Filename | Overview |
|---|---|
| crates/networking/p2p/snap/client.rs | Added update_metrics parameter to prevent metric conflicts during concurrent bytecode downloads |
| crates/networking/p2p/sync/code_collector.rs | Added flush_all() method and dir() accessor to support mid-insertion flushing for concurrent bytecode download |
| crates/networking/p2p/sync/snap_sync.rs | Refactored to spawn background bytecode download task during account insertion, with new helper functions for reading and downloading bytecodes |
Sequence Diagram
sequenceDiagram
participant Main as Main Thread
participant BG as Background Task
participant Peers as Peer Network
participant Store as Storage
Main->>Main: Download account ranges
Main->>Main: Start account insertion
Main->>Main: Collect code hashes (pass 1)
Main->>Main: flush_all() code hashes
Main->>Main: read_code_hashes_from_dir()
Main->>BG: Send code hashes via channel
Note over BG: Background task starts
par Concurrent Operations
BG->>Peers: Request bytecodes (chunks)
Peers-->>BG: Return bytecodes
BG->>Store: Write bytecodes
Note over BG: Continues downloading
and
Main->>Main: Build account trie (pass 2)
Main->>Main: Download storage ranges
Main->>Main: Insert storage tries
Main->>Main: Heal state & storage tries
end
Main->>Main: Finish healing, new collector
Main->>Main: Download healing bytecodes
Main->>BG: Await background task
BG-->>Main: Complete
Main->>Main: Snap sync complete
Last reviewed commit: 6582a57
There was a problem hiding this comment.
Pull request overview
This PR optimizes snap sync by downloading bytecodes concurrently with storage ranges instead of sequentially after healing, reducing total sync time by approximately 3 minutes (~3.3%) on Ethereum mainnet. Since bytecodes are content-addressed (hash=key), they can be safely downloaded regardless of pivot staleness, making concurrent download safe.
Changes:
- Spawns a background task to download bytecodes during account insertion that runs concurrently with storage download/insertion
- Adds ability to flush code hash collector without consuming it, enabling code hash transmission to background task mid-insertion
- Adds
update_metricsparameter torequest_bytecodes()to prevent metric conflicts during concurrent execution
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| crates/networking/p2p/sync/snap_sync.rs | Spawns background bytecode download task, adds helper functions for reading/downloading bytecodes, modifies insert_accounts to send code hashes via oneshot channel, handles healing-discovered bytecodes separately |
| crates/networking/p2p/sync/code_collector.rs | Adds flush_all() for non-consuming flush and dir() accessor to support code hash transmission |
| crates/networking/p2p/snap/client.rs | Adds update_metrics parameter to request_bytecodes() to conditionally update metrics, preventing concurrent overwrites of current_step |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Wait for code hashes to be sent from insert_accounts (between passes) | ||
| let code_hashes = code_hashes_rx.await.map_err(|_| SyncError::BytecodesNotFound)?; |
There was a problem hiding this comment.
The error mapping here is misleading. A oneshot receiver error (RecvError) means the sender was dropped without sending a value, which indicates a programming error or unexpected panic in insert_accounts, not that bytecodes weren't found from peers. Consider creating a more specific error variant like SyncError::CodeHashChannelClosed or SyncError::InternalChannelError to better represent this failure mode.
| // Wait for code hashes to be sent from insert_accounts (between passes) | |
| let code_hashes = code_hashes_rx.await.map_err(|_| SyncError::BytecodesNotFound)?; | |
| // Wait for code hashes to be sent from insert_accounts (between passes). | |
| // If the sender is dropped without sending, this indicates a logic error or panic | |
| // in the producer, so we treat it as a hard failure rather than a "bytecodes | |
| // not found from peers" error. | |
| let code_hashes = code_hashes_rx | |
| .await | |
| .expect("code_hashes sender dropped before sending; this indicates a bug or panic in insert_accounts"); |
| if let Some(tx) = code_hashes_tx { | ||
| code_hash_collector.flush_all().await?; | ||
| let code_hashes = read_code_hashes_from_dir(code_hash_collector.dir())?; | ||
| let _ = tx.send(code_hashes); |
There was a problem hiding this comment.
let _ = tx.send(code_hashes) silently drops a send failure. If the background task panicked or its receiver was dropped, the code hashes are lost — and since the snapshot dir is cleaned up and recreated immediately after insert_accounts returns (line 368), they can't be re-read. Bytecodes would silently never be downloaded for this sync attempt.
Suggestion: at minimum, log a warning if the send fails. Better yet, propagate the error:
tx.send(code_hashes).map_err(|_| SyncError::BytecodesNotFound)?;(Same issue in the non-rocksdb path at line 860.)
| // Filter out already-stored bytecodes to avoid re-downloading on retry | ||
| let code_hashes: Vec<H256> = code_hashes | ||
| .into_iter() | ||
| .filter(|hash| !store.get_account_code(*hash).is_ok_and(|code| code.is_some())) |
There was a problem hiding this comment.
This filter calls store.get_account_code(*hash) individually for each of the ~2.1M code hashes. Each call acquires a Mutex lock on account_code_cache and potentially does a RocksDB read — that's a lot of synchronous blocking I/O in an async context.
For the common case (fresh sync, no codes stored yet), every single check will miss the cache and hit the DB, only to return None. Consider short-circuiting: skip the filter entirely when is_fresh_sync or similar flag is set, and only enable it for retry scenarios where some codes may already exist.
| let bytecode_store = store.clone(); | ||
| tokio::spawn(async move { | ||
| // Wait for code hashes to be sent from insert_accounts (between passes) | ||
| let code_hashes = code_hashes_rx.await.map_err(|_| SyncError::BytecodesNotFound)?; |
There was a problem hiding this comment.
If the oneshot sender is dropped (e.g., insert_accounts returns early with an error before reaching the tx.send), this converts the RecvError to SyncError::BytecodesNotFound, which is misleading — the bytecodes weren't "not found", the code hash channel was closed. This could confuse debugging since the real error would be the one from insert_accounts, but the background task's error might surface first (or be logged) as a confusing "BytecodesNotFound".
nit: Consider a more descriptive error variant, e.g. SyncError::BytecodeChannelClosed.
- Add §1.18 observability tooling (PR #6470) - Add §1.19 pivot update reliability (PR #6475, issue #6474) - Add §1.20 big-account within-trie parallelization (issue #6477) - Add §1.21 small-account batching (issue #6476) - Add §1.22 decoded TrieLayerCache (PR #6348) - Add §1.23 bloom filter for non-existent storage (PR #6288) - Add §1.24 adaptive request sizing + bisection (PR #6181) - Add §1.25 concurrent bytecode + storage (PR #6205) - Add §1.26 phase completion markers (PR #6189) - Add §2.18 StorageTrieTracker refactor (PR #6171) - Update current-state bottleneck table with small-account and pivot-update findings - Reprioritize timeline: pivot-update crash fix is now priority 0 - Add two risks (pivot crash masks perf work, DB corruption on every crash) - Bump doc version to 1.3
Summary
Description
During snap sync, bytecodes are content-addressed (hash = key), so they're safe to download regardless of pivot staleness. Previously, bytecodes were downloaded sequentially after all healing completed (~6 min on mainnet). This PR spawns a background task that starts downloading bytecodes during account insertion and runs concurrently with storage download/insertion.
Changes:
snap_sync.rs: Spawn background bytecode download task usingoneshotchannel to receive code hashes frominsert_accounts. Extractdownload_bytecodes()andread_code_hashes_from_dir()helpers. Handle healing-discovered bytecodes separately.code_collector.rs: Addflush_all()(non-consuming flush for mid-insertion use) anddir()accessor.client.rs: Addupdate_metricsparameter torequest_bytecodes()to prevent overwritingcurrent_stepduring concurrent download.Benchmark (Ethereum mainnet, ethrex-mainnet-4):
Bytecodes (2.1M hashes, 8 min download) finished 46 minutes before storage insertion ended.
How to Test