Conversation


@pawanjay176 pawanjay176 commented Jul 10, 2025

Issue Addressed

Which issue # does this PR address?

Closes #7604

Proposed Changes

Improvements to range sync including:

  1. Contain column requests only to peers that are part of the SyncingChain
  2. Attribute the fault to the correct peer and downscore them if they don't return the data columns for the request
  3. Improve sync performance by retrying only the failed columns from other peers instead of failing the entire batch
  4. Use the earliest_available_slot to make requests to peers that claim to have the epoch. Note: if no earliest_available_slot info is available, fall back to the previous logic, i.e. assume the peer has everything backfilled up to the WS checkpoint/DA boundary (see the sketch after this list)
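
For illustration, a minimal self-contained sketch of the point-4 selection logic (the types and helper names here are stand-ins, not the actual Lighthouse code):

use std::collections::HashMap;

type Slot = u64;
type Epoch = u64;
type PeerId = u64; // stand-in for the libp2p PeerId

struct PeerSyncInfo {
    // Earliest slot the peer claims to have data for, if it advertised one.
    earliest_available_slot: Option<Slot>,
}

// Keep only peers that can serve a batch starting at `epoch`; peers that advertise
// nothing are assumed to be backfilled to the WS checkpoint / DA boundary, as before.
fn peers_for_epoch(
    peers: &HashMap<PeerId, PeerSyncInfo>,
    epoch: Epoch,
    slots_per_epoch: u64,
) -> impl Iterator<Item = &PeerId> {
    let batch_start_slot = epoch * slots_per_epoch;
    peers.iter().filter_map(move |(peer_id, info)| match info.earliest_available_slot {
        Some(earliest) if earliest <= batch_start_slot => Some(peer_id),
        Some(_) => None,       // peer has pruned past the batch start
        None => Some(peer_id), // no info: fall back to the previous assumption
    })
}

fn main() {
    let peers = HashMap::from([
        (1, PeerSyncInfo { earliest_available_slot: Some(0) }),
        (2, PeerSyncInfo { earliest_available_slot: Some(10_000) }),
        (3, PeerSyncInfo { earliest_available_slot: None }),
    ]);
    // Batch for epoch 100 (slot 3200): peers 1 and 3 qualify, peer 2 does not.
    println!("{:?}", peers_for_epoch(&peers, 100, 32).collect::<Vec<_>>());
}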

Tested this on fusaka-devnet-2 with a full node and a supernode, and the recovery logic seems to work well.
Also tested this a little on mainnet.

Need to do more testing and possibly add some unit tests.

@pawanjay176 pawanjay176 requested a review from jxs as a code owner July 10, 2025 22:41
@jimmygchen jimmygchen added the syncing, das (Data Availability Sampling) and fulu (Required for the upcoming Fulu hard fork) labels Jul 11, 2025
@jimmygchen
Member

I wonder if we can remove this check now that we have column retries?

// don't send batch requests until we have peers on sampling subnets
// TODO(das): this is a workaround to avoid sending out excessive block requests because
// block and data column requests are currently coupled. This can be removed once we find a
// way to decouple the requests and do retries individually, see issue #6258.
if !self.good_peers_on_sampling_subnets(self.to_be_downloaded, network) {
    debug!("Waiting for peers to be available on custody column subnets");
    return None;
}

) -> impl Iterator<Item = &'a PeerId> {
    self.peers
        .iter()
        .filter(move |(peer_id, info)| {

Member

We no longer use the return value here; maybe we could optimise this function with any instead of filter, and return a bool instead?

Member Author

I didn't get this. You mean something like

fn is_good_range_sync_custody_subnet_peer(&self, peer_id: PeerId, subnet: DataColumnSubnetId) -> bool

and pass all peers from the peer list and just filter them at the caller?

Member

This function is only used here to determine whether there are any peers on a particular custody subnet; we don't really use the return value other than checking that the count > 0, so perhaps we could change this function to has_good_range_sync_peer_on_custody_subnet and short-circuit it (using any) without iterating through all peers (128 times for supernodes):

let peers_on_all_custody_subnets = network
    .network_globals()
    .sampling_subnets()
    .iter()
    .all(|subnet_id| {
        let peer_count = network
            .network_globals()
            .peers
            .read()
            .good_range_sync_custody_subnet_peer(*subnet_id, &self.peers)
            .count();
        peer_count > 0
    });
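
For illustration, a standalone version of the suggested any-based short-circuit with stand-in types (not the actual Lighthouse peer DB):

use std::collections::{HashMap, HashSet};

type PeerId = u64;
type DataColumnSubnetId = u64;

struct PeerDb {
    // Subnets each connected peer custodies (illustrative stand-in for the peer store).
    custody: HashMap<PeerId, HashSet<DataColumnSubnetId>>,
}

impl PeerDb {
    // Returns as soon as one suitable peer is found instead of counting every match,
    // which matters for supernodes sampling all 128 column subnets.
    fn has_good_range_sync_peer_on_custody_subnet(
        &self,
        subnet: DataColumnSubnetId,
        chain_peers: &HashSet<PeerId>,
    ) -> bool {
        self.custody
            .iter()
            .any(|(peer, subnets)| chain_peers.contains(peer) && subnets.contains(&subnet))
    }
}

fn main() {
    let db = PeerDb {
        custody: HashMap::from([(1, HashSet::from([0, 1, 2])), (2, HashSet::from([3]))]),
    };
    let chain_peers = HashSet::from([1]);
    let covered = [0u64, 1, 2]
        .iter()
        .all(|subnet| db.has_good_range_sync_peer_on_custody_subnet(*subnet, &chain_peers));
    println!("all sampled subnets covered: {covered}");
}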

Member

(Also, this is likely a micro-optimisation; I wouldn’t be too worried about it for this PR)

@jimmygchen jimmygchen left a comment

Nice, the fixes look simple enough and seem to work really well on devnet.
My comments are mostly minor - let me know what you think.

@jimmygchen jimmygchen added the waiting-on-author (The reviewer has suggested changes and awaits their implementation.) label Jul 11, 2025
jimmygchen added a commit that referenced this pull request Jul 11, 2025
Squashed commit of the following:

commit aa63589
Author: Jimmy Chen <jchen.tc@gmail.com>
Date:   Fri Jul 11 15:22:48 2025 +1000

    Fix fmt and range sync tests.

commit eb5fa36
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Thu Jul 10 15:31:15 2025 -0700

    lint

commit 3182646
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Wed Jul 9 19:57:12 2025 -0700

    More cleanup

commit c098796
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Wed Jul 9 19:40:50 2025 -0700

    Downscore unique peers

commit 32722eb
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Wed Jul 9 18:54:05 2025 -0700

    Cleanup

commit b257039
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Wed Jul 9 13:08:03 2025 -0700

    Add a log

commit 9c8d868
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Wed Jul 9 13:04:37 2025 -0700

    Remove debugging logs

commit 6dbead8
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Wed Jul 9 12:16:12 2025 -0700

    Remove unsafe optimisations

commit e08b260
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Tue Jul 8 18:59:12 2025 -0700

    Fix issues

commit b14bb9a
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Fri Jul 4 15:19:21 2025 -0700

    Request batches from peers that have the earliest available slot

commit a7e88fc
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Fri Jul 4 01:04:44 2025 -0700

    Fix some issues

commit 3fd5a44
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Thu Jul 3 19:25:53 2025 -0700

    compiles

commit 8f1cfb9
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Thu Jul 3 16:11:04 2025 -0700

    Initial commit

@jimmygchen jimmygchen left a comment

Looking really good and able to sync on devnet 👍
Happy to address backfill issues in a separate PR.

@jimmygchen jimmygchen added the ready-for-merge (This PR is ready to merge.) label and removed the waiting-on-author (The reviewer has suggested changes and awaits their implementation.) label Jul 12, 2025
@mergify mergify bot merged commit 90ff643 into sigp:unstable Jul 12, 2025
36 of 37 checks passed

@dapplion dapplion left a comment

Sorry I now realized I had a pending review in the other PR you closed. The main concern I have is the lack of a retry counter, where you can retry to infinity

    .cloned()
    .collect::<HashSet<_>>();

match network.retry_columns_by_range(

Collaborator

This retry logic only applies to forward sync and not backfill sync?

Member Author

Yeah, I started working on it but it got merged before I could complete it lol. Will open another PR.

///
/// This is useful to penalize the peer at a later point if it returned data columns that
/// did not match with the verified block.
pub peer: PeerId,

Collaborator

Adding the peer here is more for the convenience of not having to pass around another variable that maps DataColumnsByRangeRequestId -> PeerId?

Member Author

Pretty much yeah.

// For now, we always assume that the block peer is right.
// This is potentially dangerous as we can get isolated on a chain with a
// malicious block peer.
// TODO: fix this by checking the proposer signature before downloading columns.

Collaborator

Even if we check the signature it doesn't defend against malicious proposers. The long-term solution is tree-sync.

Member Author

True, but it's still good to have protection against malicious peers at least.

Collaborator

Unless you want to check BLS signatures on the sync thread, that would require you to send a beacon processor Work job. I suspect it will be a lot of code and complexity for little gain.

let responsible_peers = column_to_peer.iter().map(|c| (*c.0, *c.1)).collect();
return Err(CouplingError {
    msg: format!("No columns for block {block_root:?} with data"),
    column_and_peer: Some((responsible_peers, PeerAction::LowToleranceError)),

Collaborator

LowToleranceError is very harsh; we can disconnect everyone very quickly in case of bugs.

Member

Yeah I think this might be the case we're seeing on devnet-2.

Member Author

Why though? Not sending an error if you don't have something is malicious behaviour imo. Right now, it's a bug in the majority of the clients that needs to be fixed.

if !naughty_peers.is_empty() {
    return Err(CouplingError {
        msg: format!("Peers did not return column for block_root {block_root:?} {naughty_peers:?}"),
        column_and_peer: Some((naughty_peers, PeerAction::LowToleranceError)),

Collaborator

Same as above

debug!(
    ?failed_columns,
    "Retrying only failed column requests from other peers"
);

Collaborator

You should log the components_by_range ID here
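
For example, something along these lines, assuming the tracing-style debug! macro used above and an illustrative components_by_range_id variable in scope:

use tracing::debug;

fn log_retry(components_by_range_id: u64, failed_columns: &[u64]) {
    // Including the parent request id lets retries be correlated with the original
    // components_by_range request in the logs.
    debug!(
        components_by_range_id,
        ?failed_columns,
        "Retrying only failed column requests from other peers"
    );
}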

/// the batch.
pub fn retry_columns_by_range(
    &mut self,
    request_id: Id,

Collaborator

I tried never to pass a bare Id around to prevent mismatches and instead use the typed *RequestId types. Since there are so many Ids floating around, a mismatch will cause this function to do nothing.

Member Author

Okay I'll look into this. Tbh, I am pretty confused by all the different id types, but this is helpful feedback.
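
For reference, a minimal sketch of the typed-id pattern being suggested (illustrative definitions; the real Lighthouse request-id types carry more context): a newtype per request kind turns a mixed-up id into a compile error instead of a silent no-op.

type Id = u32;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct DataColumnsByRangeRequestId(Id);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct BlocksByRangeRequestId(Id);

fn retry_columns_by_range(request_id: DataColumnsByRangeRequestId) {
    // Only a DataColumnsByRangeRequestId can be passed here; a bare Id or a
    // BlocksByRangeRequestId at the call site would fail to compile.
    println!("retrying data column request {request_id:?}");
}

fn main() {
    retry_columns_by_range(DataColumnsByRangeRequestId(42));
}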

if blocks_result.is_ok() {
    // remove the entry only if it coupled successfully with
    // no errors
    entry.remove();

Collaborator

Is there a retry counter somewhere? A peer should not be able to cause a request to hang here forever in an infinite retry loop.

Member Author

In this version, I am assuming that there are some good peers that will eventually move the chain forward or all peers will get kicked and the chain will get removed.
Can add a retry counter in another PR.

Collaborator

Not all errors result in a downscore, so it's possible to have infinite loops for chains with a single peer.
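
A minimal sketch of the kind of cap being described (constant and names are illustrative, not the actual implementation): stop retrying once a batch has exhausted a fixed budget, so a single-peer chain whose failures never trigger a downscore cannot loop forever.

const MAX_COLUMN_RETRIES: u32 = 5; // illustrative bound

struct BatchRetryState {
    attempts: u32,
}

impl BatchRetryState {
    // Returns false once the retry budget is spent, at which point the caller would
    // fail the batch and let the usual chain removal/penalty logic take over.
    fn try_again(&mut self) -> bool {
        self.attempts += 1;
        self.attempts <= MAX_COLUMN_RETRIES
    }
}

fn main() {
    let mut state = BatchRetryState { attempts: 0 };
    while state.try_again() {
        println!("retrying failed columns (attempt {})", state.attempts);
    }
    println!("giving up after {} attempts", state.attempts - 1);
}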

let mut failed_peers = HashSet::new();
for (column, peer) in column_and_peer {
    failed_columns.insert(*column);
    failed_peers.insert(*peer);

Collaborator

Are these failed peers persisted between retries? Or do only the failed peers of retry N apply to retry N+1?

Member Author

No they aren't. I tried to keep it simple here, but will try to add more retrying logic that holds some state in a subsequent PR.
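
For what it's worth, a rough sketch of what that persistent state might look like, with stand-in types (not the actual sync code): each retry adds to an exclusion set kept for the lifetime of the request, rather than only excluding the peers that failed the most recent attempt.

use std::collections::HashSet;

type PeerId = u64;

#[derive(Default)]
struct ColumnRetryState {
    // Peers that have failed this request on any previous attempt.
    failed_peers: HashSet<PeerId>,
}

impl ColumnRetryState {
    fn record_failures(&mut self, peers: impl IntoIterator<Item = PeerId>) {
        self.failed_peers.extend(peers);
    }

    // Chain peers that have not failed this request yet and are still worth trying.
    fn eligible<'a>(&'a self, chain_peers: &'a HashSet<PeerId>) -> impl Iterator<Item = &'a PeerId> {
        chain_peers.iter().filter(move |p| !self.failed_peers.contains(*p))
    }
}

fn main() {
    let chain_peers = HashSet::from([1, 2, 3]);
    let mut state = ColumnRetryState::default();
    state.record_failures([2]); // retry 1: peer 2 failed
    state.record_failures([3]); // retry 2: peer 3 failed
    // Both 2 and 3 stay excluded, not just the most recent failure.
    println!("{:?}", state.eligible(&chain_peers).collect::<Vec<_>>());
}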

    .peers
    .read()
-   .synced_peers()
+   .synced_peers_for_epoch(batch_id, &self.peers)

Collaborator

Instead, you can iterate self.peers here and filter by peers that have epoch batch_id. All peers in the batch are synced connected peers

Member Author

Ohh yeah, good point, I'll make the changes.

@jimmygchen
Member

Sorry I now realized I had a pending review in the other PR you closed. The main concern I have is the lack of a retry counter, where you can retry to infinity

Thanks for the review @dapplion. I've created an issue to keep track of these so we don't lose them: #7744.

danielrachi1 pushed a commit to danielrachi1/lighthouse that referenced this pull request Jul 18, 2025
mergify bot pushed a commit that referenced this pull request Aug 12, 2025
Partly addresses #7744


  Implement similar peer sync attribution like in #7733 for backfill sync.