Sync peer attribution #7733
Conversation
Two resolved (outdated) review threads on beacon_node/lighthouse_network/src/peer_manager/peerdb/sync_status.rs.
I wonder if we can now remove this check given we now have column retries?
lighthouse/beacon_node/network/src/sync/range_sync/chain.rs, Lines 1149 to 1156 in 3182646
) -> impl Iterator<Item = &'a PeerId> {
    self.peers
        .iter()
        .filter(move |(peer_id, info)| {
We no longer use the return value here; maybe we could optimise this function with `any` instead of `filter` and return a `bool` instead?
I didn't get this. You mean something like
`fn is_good_range_sync_custody_subnet_peer(&self, peer_id: PeerId, subnet: DataColumnSubnetId) -> bool` and pass all peers from the peer list and just filter them at the caller?
This function is only used here to determine if there are any peers on a particular custody subnet; we don't really use the return value other than checking that the count is > 0. So perhaps we could change this function to `has_good_range_sync_peer_on_custody_subnet` and short-circuit it (using `any`) without iterating through all peers (128 times for supernodes):
lighthouse/beacon_node/network/src/sync/range_sync/chain.rs
Lines 1098 to 1110 in 3182646
let peers_on_all_custody_subnets = network
    .network_globals()
    .sampling_subnets()
    .iter()
    .all(|subnet_id| {
        let peer_count = network
            .network_globals()
            .peers
            .read()
            .good_range_sync_custody_subnet_peer(*subnet_id, &self.peers)
            .count();
        peer_count > 0
    });
(Also this is likely a micro optimisation I wouldn’t be too worried about for this PR)
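For illustration, a minimal sketch of the suggested `any`-based shape, using stand-in types and a made-up `PeerInfo` layout rather than the real PeerDB API:

```rust
use std::collections::HashSet;

// Stand-ins for the real types so this sketch compiles on its own.
type PeerId = u64;
type DataColumnSubnetId = u64;

struct PeerInfo {
    is_good_range_sync_peer: bool,
    custody_subnets: HashSet<DataColumnSubnetId>,
}

// `any` short-circuits on the first good peer instead of materialising and
// counting a full filtered iterator, which matters when a supernode checks
// all 128 subnets.
fn has_good_range_sync_peer_on_custody_subnet(
    peers: &[(PeerId, PeerInfo)],
    subnet: DataColumnSubnetId,
    chain_peers: &HashSet<PeerId>,
) -> bool {
    peers.iter().any(|(peer_id, info)| {
        chain_peers.contains(peer_id)
            && info.is_good_range_sync_peer
            && info.custody_subnets.contains(&subnet)
    })
}

fn main() {
    let peers = vec![(
        1,
        PeerInfo {
            is_good_range_sync_peer: true,
            custody_subnets: HashSet::from([7]),
        },
    )];
    let chain_peers = HashSet::from([1]);
    assert!(has_good_range_sync_peer_on_custody_subnet(&peers, 7, &chain_peers));
}
```

The only point is the short-circuit: the caller's per-subnet `all(...)` check can stop scanning the peer table at the first match instead of counting a full filtered iterator.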
jimmygchen left a comment:
Nice, the fixes look simple enough and seem to work really well on devnet.
My comments are mostly minor - let me know what you think.
Squashed commit of the following:
commit aa63589, Jimmy Chen <jchen.tc@gmail.com>, Fri Jul 11 15:22:48 2025 +1000: Fix fmt and range sync tests.
commit eb5fa36, Pawan Dhananjay <pawandhananjay@gmail.com>, Thu Jul 10 15:31:15 2025 -0700: lint
commit 3182646, Pawan Dhananjay <pawandhananjay@gmail.com>, Wed Jul 9 19:57:12 2025 -0700: More cleanup
commit c098796, Pawan Dhananjay <pawandhananjay@gmail.com>, Wed Jul 9 19:40:50 2025 -0700: Downscore unique peers
commit 32722eb, Pawan Dhananjay <pawandhananjay@gmail.com>, Wed Jul 9 18:54:05 2025 -0700: Cleanup
commit b257039, Pawan Dhananjay <pawandhananjay@gmail.com>, Wed Jul 9 13:08:03 2025 -0700: Add a log
commit 9c8d868, Pawan Dhananjay <pawandhananjay@gmail.com>, Wed Jul 9 13:04:37 2025 -0700: Remove debugging logs
commit 6dbead8, Pawan Dhananjay <pawandhananjay@gmail.com>, Wed Jul 9 12:16:12 2025 -0700: Remove unsafe optimisations
commit e08b260, Pawan Dhananjay <pawandhananjay@gmail.com>, Tue Jul 8 18:59:12 2025 -0700: Fix issues
commit b14bb9a, Pawan Dhananjay <pawandhananjay@gmail.com>, Fri Jul 4 15:19:21 2025 -0700: Request batches from peers that have the earliest available slot
commit a7e88fc, Pawan Dhananjay <pawandhananjay@gmail.com>, Fri Jul 4 01:04:44 2025 -0700: Fix some issues
commit 3fd5a44, Pawan Dhananjay <pawandhananjay@gmail.com>, Thu Jul 3 19:25:53 2025 -0700: compiles
commit 8f1cfb9, Pawan Dhananjay <pawandhananjay@gmail.com>, Thu Jul 3 16:11:04 2025 -0700: Initial commit
jimmygchen left a comment:
Looking really good and able to sync on devnet 👍
Happy to address backfill issues in a separate PR.
dapplion left a comment:
Sorry, I now realized I had a pending review in the other PR you closed. The main concern I have is the lack of a retry counter, where you can retry to infinity.
    .cloned()
    .collect::<HashSet<_>>();
[...]
    match network.retry_columns_by_range(
This retry logic only applies to forward sync and not backfill sync?
Yeah, I started working on it but it got merged before I could complete it lol. Will open another PR.
///
/// This is useful to penalize the peer at a later point if it returned data columns that
/// did not match with the verified block.
pub peer: PeerId,
Adding the peer here is more for convenience of not having to pass another variable that maps DataColumnsByRangeRequestId -> PeerId?
Pretty much yeah.
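For context, a rough sketch of the two alternatives being weighed here; the struct names are made up and only illustrate the trade-off:

```rust
use std::collections::HashMap;

// Stand-ins for the real types; the struct names are made up for illustration.
type PeerId = u64;
type DataColumnsByRangeRequestId = u32;

// What the PR does: carry the responsible peer inside the request state so it
// is at hand when a bad response needs to be attributed later.
#[allow(dead_code)]
struct ActiveColumnsRequest {
    peer: PeerId,
    columns: Vec<u64>,
}

// The alternative: keep the request state peer-free and track attribution in a
// separate side map, which must then be kept in sync with request lifetimes.
#[allow(dead_code)]
struct SyncSideTable {
    request_peers: HashMap<DataColumnsByRangeRequestId, PeerId>,
}
```

Carrying the peer inside the request avoids keeping a side map in sync with request lifetimes, at the cost of one extra field per request.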
// For now, we always assume that the block peer is right.
// This is potentially dangerous as we can get isolated on a chain with a
// malicious block peer.
// TODO: fix this by checking the proposer signature before downloading columns.
Even if we check the signature, it doesn't defend against malicious proposers. The long-term solution is tree-sync.
True, but it's still good to have protection against malicious peers at least.
Unless you want to check BLS signatures on the sync thread, that would require you to send a beacon processor Work job. I suspect it would be a lot of code and complexity for little gain.
let responsible_peers = column_to_peer.iter().map(|c| (*c.0, *c.1)).collect();
return Err(CouplingError {
    msg: format!("No columns for block {block_root:?} with data"),
    column_and_peer: Some((responsible_peers, PeerAction::LowToleranceError)),
LowToleranceError is very harsh; we can disconnect everyone very quickly in case of bugs.
Yeah, I think this might be the case we're seeing on devnet-2.
Why though? Not sending an error if you don't have something is malicious behaviour imo. Right now, it's a bug on the majority of the clients that need to be fixed.
if !naughty_peers.is_empty() {
    return Err(CouplingError {
        msg: format!("Peers did not return column for block_root {block_root:?} {naughty_peers:?}"),
        column_and_peer: Some((naughty_peers, PeerAction::LowToleranceError)),
Same as above
debug!(
    ?failed_columns,
    "Retrying only failed column requests from other peers"
);
You should log the `components_by_range` ID here.
/// the batch.
pub fn retry_columns_by_range(
    &mut self,
    request_id: Id,
I tried never to pass a bare `Id` around to prevent mismatches, and instead use the typed `*RequestId` types. Since there are so many `Id`s floating around, a mismatch will cause this function to do nothing.
Okay, I'll look into this. Tbh, I am pretty confused by all the different id types, but this is helpful feedback.
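As a small illustration of the typed-id pattern being suggested here (the type names are hypothetical, not the actual sync types):

```rust
// Hypothetical newtype ids; the real sync code has several distinct
// *RequestId types, and these are only stand-ins to show the pattern.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct ColumnsByRangeRequestId(u32);

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct BlocksByRangeRequestId(u32);

fn retry_columns_by_range(request_id: ColumnsByRangeRequestId) {
    // Passing a BlocksByRangeRequestId (or a bare u32) here is now a compile
    // error rather than a silent lookup miss that turns the retry into a no-op.
    println!("retrying columns for request {request_id:?}");
}

fn main() {
    retry_columns_by_range(ColumnsByRangeRequestId(42));
    // retry_columns_by_range(BlocksByRangeRequestId(42)); // does not compile
}
```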
if blocks_result.is_ok() {
    // remove the entry only if it coupled successfully with
    // no errors
    entry.remove();
Is there a retry counter somewhere? A peer should not be able to cause a request to hang here forever in an infinite retry loop.
In this version, I am assuming that there are some good peers that will eventually move the chain forward, or all peers will get kicked and the chain will get removed.
I can add a retry counter in another PR.
Not all errors result in a downscore, so it's possible to have infinite loops for chains with a single peer.
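A minimal sketch of the kind of bound being asked for, with a made-up cap and names rather than the PR's actual code:

```rust
// Hypothetical retry cap per batch; the constant and names are illustrative.
const MAX_COLUMN_RETRIES: u8 = 5;

#[derive(Default)]
struct BatchColumnRetries {
    attempts: u8,
}

impl BatchColumnRetries {
    /// Returns true and bumps the counter if another retry is allowed; returns
    /// false once the cap is hit, so the batch fails instead of looping forever
    /// on a chain with a single bad peer.
    fn allow_retry(&mut self) -> bool {
        if self.attempts >= MAX_COLUMN_RETRIES {
            return false;
        }
        self.attempts += 1;
        true
    }
}

fn main() {
    let mut retries = BatchColumnRetries::default();
    let mut attempts_made = 0;
    while retries.allow_retry() {
        attempts_made += 1;
    }
    assert_eq!(attempts_made, MAX_COLUMN_RETRIES);
}
```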
let mut failed_peers = HashSet::new();
for (column, peer) in column_and_peer {
    failed_columns.insert(*column);
    failed_peers.insert(*peer);
Are these failed peers persisted between retries? Or do only the failed peers of retry N apply to retry N+1?
No, they aren't. I tried to keep it simple here, but will try to add more retrying logic that holds some state in a subsequent PR.
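A rough sketch of what that follow-up state could look like, with hypothetical names and a stand-in peer id type:

```rust
use std::collections::HashSet;

// Stand-in for the real peer id type.
type PeerId = u64;

// Hypothetical per-batch retry state: peers that already failed this batch are
// remembered across all retries, not just from retry N to retry N+1.
#[derive(Default)]
struct ColumnRetryState {
    failed_peers: HashSet<PeerId>,
}

impl ColumnRetryState {
    fn record_failure(&mut self, peer: PeerId) {
        self.failed_peers.insert(peer);
    }

    /// Candidate peers for the next retry, excluding every peer that has
    /// failed this batch at any earlier attempt.
    fn retry_candidates<'a>(
        &'a self,
        chain_peers: &'a HashSet<PeerId>,
    ) -> impl Iterator<Item = &'a PeerId> + 'a {
        chain_peers
            .iter()
            .filter(move |peer| !self.failed_peers.contains(*peer))
    }
}

fn main() {
    let chain_peers = HashSet::from([1, 2, 3]);
    let mut state = ColumnRetryState::default();
    state.record_failure(2);
    assert_eq!(state.retry_candidates(&chain_peers).count(), 2);
}
```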
.peers
.read()
.synced_peers()
.synced_peers_for_epoch(batch_id, &self.peers)
Instead, you can iterate `self.peers` here and filter by peers that have epoch `batch_id`. All peers in the batch are synced, connected peers.
Ohh yeah, good point, I'll make the changes.
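A minimal sketch of that suggestion, assuming the chain's peer set and a per-peer earliest-available-slot map are at hand (all names and types here are stand-ins, not the actual PeerDB API):

```rust
use std::collections::{HashMap, HashSet};

// Stand-ins for the real PeerId/Slot types so the sketch compiles on its own.
type PeerId = u64;
type Slot = u64;

// Hypothetical: pick request candidates directly from the chain's own peer set
// (which already contains only synced, connected peers) and keep those whose
// advertised earliest available slot covers the start of the batch.
fn peers_for_batch<'a>(
    chain_peers: &'a HashSet<PeerId>,
    earliest_available_slot: &'a HashMap<PeerId, Slot>,
    batch_start_slot: Slot,
) -> impl Iterator<Item = &'a PeerId> + 'a {
    chain_peers.iter().filter(move |peer| {
        earliest_available_slot
            .get(*peer)
            .map_or(true, |earliest| *earliest <= batch_start_slot)
    })
}

fn main() {
    let chain_peers = HashSet::from([10, 11]);
    let earliest = HashMap::from([(10, 5), (11, 100)]);
    // Peer 10 covers slot 32, peer 11 does not; a peer missing from the map
    // would be assumed to cover it, mirroring the PR's fallback behaviour.
    let good: Vec<_> = peers_for_batch(&chain_peers, &earliest, 32).collect();
    assert_eq!(good, vec![&10u64]);
}
```

Peers missing from the map are treated as covering the batch, mirroring the PR's fallback when no earliest_available_slot info is available.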
Issue Addressed
Which issue # does this PR address?
Closes #7604
Proposed Changes
Improvements to range sync including:
1. Contain column requests only to peers that are part of the SyncingChain.
2. Attribute the fault to the correct peer and downscore them if they don't return the data columns for the request.
3. Improve sync performance by retrying only the failed columns from other peers instead of failing the entire batch.
4. Use the earliest_available_slot to make requests to peers that claim to have the epoch. Note: if no earliest_available_slot info is available, fall back to the previous logic, i.e. assume the peer has everything backfilled up to the WS checkpoint/DA boundary.
Tested this on fusaka-devnet-2 with a full node and supernode; the recovery logic seems to work well.
Also tested this a little on mainnet.
Need to do more testing and possibly add some unit tests.