Conversation


@pawanjay176 pawanjay176 commented Jul 10, 2025

Issue Addressed

Which issue # does this PR address?

Closes #7604

Proposed Changes

Improvements to range sync including:

  1. Contain column requests only to peers that are part of the SyncingChain
  2. Attribute the fault to the correct peer and downscore them if they don't return the data columns for the request
  3. Improve sync performance by retrying only the failed columns from other peers instead of failing the entire batch
  4. Use the earliest_available_slot to make requests to peers that claim to have the epoch. Note: if no earliest_available_slot info is available, fall back to the previous logic, i.e. assume the peer has everything backfilled up to the WS checkpoint/DA boundary (see the sketch after this list)
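
For illustration, a minimal self-contained sketch of the point-4 selection logic (the types and helper names here are stand-ins, not the actual Lighthouse code):

use std::collections::HashMap;

type Slot = u64;
type Epoch = u64;
type PeerId = u64; // stand-in for the libp2p PeerId

struct PeerSyncInfo {
    // Earliest slot the peer claims to have data for, if it advertised one.
    earliest_available_slot: Option<Slot>,
}

// Keep only peers that can serve a batch starting at `epoch`; peers that advertise
// nothing are assumed to be backfilled to the WS checkpoint / DA boundary, as before.
fn peers_for_epoch(
    peers: &HashMap<PeerId, PeerSyncInfo>,
    epoch: Epoch,
    slots_per_epoch: u64,
) -> impl Iterator<Item = &PeerId> {
    let batch_start_slot = epoch * slots_per_epoch;
    peers.iter().filter_map(move |(peer_id, info)| match info.earliest_available_slot {
        Some(earliest) if earliest <= batch_start_slot => Some(peer_id),
        Some(_) => None,       // peer has pruned past the batch start
        None => Some(peer_id), // no info: fall back to the previous assumption
    })
}

fn main() {
    let peers = HashMap::from([
        (1, PeerSyncInfo { earliest_available_slot: Some(0) }),
        (2, PeerSyncInfo { earliest_available_slot: Some(10_000) }),
        (3, PeerSyncInfo { earliest_available_slot: None }),
    ]);
    // Batch for epoch 100 (slot 3200): peers 1 and 3 qualify, peer 2 does not.
    println!("{:?}", peers_for_epoch(&peers, 100, 32).collect::<Vec<_>>());
}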

Tested this on fusaka-devnet-2 with a full node and a supernode, and the recovery logic seems to work well.
Also tested this a little on mainnet.

Need to do more testing and possibly add some unit tests.

@pawanjay176 pawanjay176 requested a review from jxs as a code owner July 10, 2025 22:41
@jimmygchen jimmygchen added the syncing, das (Data Availability Sampling) and fulu (Required for the upcoming Fulu hard fork) labels Jul 11, 2025
@jimmygchen
Member

I wonder if we can remove this check now that we have column retries?

// don't send batch requests until we have peers on sampling subnets
// TODO(das): this is a workaround to avoid sending out excessive block requests because
// block and data column requests are currently coupled. This can be removed once we find a
// way to decouple the requests and do retries individually, see issue #6258.
if !self.good_peers_on_sampling_subnets(self.to_be_downloaded, network) {
    debug!("Waiting for peers to be available on custody column subnets");
    return None;
}

) -> impl Iterator<Item = &'a PeerId> {
    self.peers
        .iter()
        .filter(move |(peer_id, info)| {

Member

We no longer use the return value here; maybe we could optimise this function with any instead of filter, and return a bool instead?

Member Author

I didn't get this. You mean something like

fn is_good_range_sync_custody_subnet_peer(&self, peer_id: PeerId, subnet: DataColumnSubnetId) -> bool

and pass all peers from the peer list and just filter them at the caller?

Member

This function is only used here to determine whether there are any peers on a particular custody subnet; we don't really use the return value other than checking that the count > 0, so perhaps we could change this function to has_good_range_sync_peer_on_custody_subnet and short-circuit it (using any) without iterating through all peers (128 times for supernodes):

let peers_on_all_custody_subnets = network
    .network_globals()
    .sampling_subnets()
    .iter()
    .all(|subnet_id| {
        let peer_count = network
            .network_globals()
            .peers
            .read()
            .good_range_sync_custody_subnet_peer(*subnet_id, &self.peers)
            .count();
        peer_count > 0
    });
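
For illustration, a standalone version of the suggested any-based short-circuit with stand-in types (not the actual Lighthouse peer DB):

use std::collections::{HashMap, HashSet};

type PeerId = u64;
type DataColumnSubnetId = u64;

struct PeerDb {
    // Subnets each connected peer custodies (illustrative stand-in for the peer store).
    custody: HashMap<PeerId, HashSet<DataColumnSubnetId>>,
}

impl PeerDb {
    // Returns as soon as one suitable peer is found instead of counting every match,
    // which matters for supernodes sampling all 128 column subnets.
    fn has_good_range_sync_peer_on_custody_subnet(
        &self,
        subnet: DataColumnSubnetId,
        chain_peers: &HashSet<PeerId>,
    ) -> bool {
        self.custody
            .iter()
            .any(|(peer, subnets)| chain_peers.contains(peer) && subnets.contains(&subnet))
    }
}

fn main() {
    let db = PeerDb {
        custody: HashMap::from([(1, HashSet::from([0, 1, 2])), (2, HashSet::from([3]))]),
    };
    let chain_peers = HashSet::from([1]);
    let covered = [0u64, 1, 2]
        .iter()
        .all(|subnet| db.has_good_range_sync_peer_on_custody_subnet(*subnet, &chain_peers));
    println!("all sampled subnets covered: {covered}");
}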

Member

(Also, this is likely a micro-optimisation; I wouldn’t be too worried about it for this PR)

@jimmygchen jimmygchen left a comment

Nice, the fixes look simple enough and seem to work really well on devnet.
My comments are mostly minor - let me know what you think.

@jimmygchen jimmygchen added the waiting-on-author (The reviewer has suggested changes and awaits their implementation.) label Jul 11, 2025
jimmygchen added a commit that referenced this pull request Jul 11, 2025
Squashed commit of the following:

commit aa63589
Author: Jimmy Chen <jchen.tc@gmail.com>
Date:   Fri Jul 11 15:22:48 2025 +1000

    Fix fmt and range sync tests.

commit eb5fa36
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Thu Jul 10 15:31:15 2025 -0700

    lint

commit 3182646
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Wed Jul 9 19:57:12 2025 -0700

    More cleanup

commit c098796
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Wed Jul 9 19:40:50 2025 -0700

    Downscore unique peers

commit 32722eb
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Wed Jul 9 18:54:05 2025 -0700

    Cleanup

commit b257039
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Wed Jul 9 13:08:03 2025 -0700

    Add a log

commit 9c8d868
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Wed Jul 9 13:04:37 2025 -0700

    Remove debugging logs

commit 6dbead8
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Wed Jul 9 12:16:12 2025 -0700

    Remove unsafe optimisations

commit e08b260
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Tue Jul 8 18:59:12 2025 -0700

    Fix issues

commit b14bb9a
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Fri Jul 4 15:19:21 2025 -0700

    Request batches from peers that have the earliest available slot

commit a7e88fc
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Fri Jul 4 01:04:44 2025 -0700

    Fix some issues

commit 3fd5a44
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Thu Jul 3 19:25:53 2025 -0700

    compiles

commit 8f1cfb9
Author: Pawan Dhananjay <pawandhananjay@gmail.com>
Date:   Thu Jul 3 16:11:04 2025 -0700

    Initial commit

@jimmygchen jimmygchen left a comment

Looking really good and able to sync on devnet 👍
Happy to address backfill issues in a separate PR.

@jimmygchen jimmygchen added the ready-for-merge (This PR is ready to merge.) label and removed the waiting-on-author (The reviewer has suggested changes and awaits their implementation.) label Jul 12, 2025
@mergify mergify bot merged commit 90ff643 into sigp:unstable Jul 12, 2025
36 of 37 checks passed

@dapplion dapplion left a comment

Sorry I now realized I had a pending review in the other PR you closed. The main concern I have is the lack of a retry counter, where you can retry to infinity

    .cloned()
    .collect::<HashSet<_>>();

match network.retry_columns_by_range(

Collaborator

This retry logic only applies to forward sync and not backfill sync?

Member Author

Yeah, I started working on it but it got merged before I could complete it lol. Will open another PR.

///
/// This is useful to penalize the peer at a later point if it returned data columns that
/// did not match with the verified block.
pub peer: PeerId,

Collaborator

Adding the peer here is more for the convenience of not having to pass around another variable that maps DataColumnsByRangeRequestId -> PeerId?

Member Author

Pretty much yeah.

// For now, we always assume that the block peer is right.
// This is potentially dangerous as we can get isolated on a chain with a
// malicious block peer.
// TODO: fix this by checking the proposer signature before downloading columns.

Collaborator

Even if we check the signature it doesn't defend against malicious proposers. The long-term solution is tree-sync.

Member Author

True, but it's still good to have protection against malicious peers at least.

Collaborator

Unless you want to check BLS signatures on the sync thread, that would require you to send a beacon processor Work job. I suspect it will be a lot of code and complexity for little gain.

let responsible_peers = column_to_peer.iter().map(|c| (*c.0, *c.1)).collect();
return Err(CouplingError {
    msg: format!("No columns for block {block_root:?} with data"),
    column_and_peer: Some((responsible_peers, PeerAction::LowToleranceError)),

Collaborator

LowToleranceError is very harsh; we can disconnect everyone very quickly in case of bugs.

Member

Yeah I think this might be the case we're seeing on devnet-2.

Member Author

Why though? Not sending an error if you don't have something is malicious behaviour imo. Right now, it's a bug in the majority of the clients that needs to be fixed.

if !naughty_peers.is_empty() {
    return Err(CouplingError {
        msg: format!("Peers did not return column for block_root {block_root:?} {naughty_peers:?}"),
        column_and_peer: Some((naughty_peers, PeerAction::LowToleranceError)),

Collaborator

Same as above

debug!(
    ?failed_columns,
    "Retrying only failed column requests from other peers"
);

Collaborator

You should log the components_by_range ID here
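
For example, something along these lines, assuming the tracing-style debug! macro used above and an illustrative components_by_range_id variable in scope:

use tracing::debug;

fn log_retry(components_by_range_id: u64, failed_columns: &[u64]) {
    // Including the parent request id lets retries be correlated with the original
    // components_by_range request in the logs.
    debug!(
        components_by_range_id,
        ?failed_columns,
        "Retrying only failed column requests from other peers"
    );
}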

/// the batch.
pub fn retry_columns_by_range(
    &mut self,
    request_id: Id,

Collaborator

I tried never to pass a bare Id around to prevent mismatches and instead use the typed *RequestId types. Since there are so many Ids floating around, a mismatch will cause this function to do nothing.

Member Author

Okay I'll look into this. Tbh, I am pretty confused by all the different id types, but this is helpful feedback.
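
For reference, a minimal sketch of the typed-id pattern being suggested (illustrative definitions; the real Lighthouse request-id types carry more context): a newtype per request kind turns a mixed-up id into a compile error instead of a silent no-op.

type Id = u32;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct DataColumnsByRangeRequestId(Id);

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct BlocksByRangeRequestId(Id);

fn retry_columns_by_range(request_id: DataColumnsByRangeRequestId) {
    // Only a DataColumnsByRangeRequestId can be passed here; a bare Id or a
    // BlocksByRangeRequestId at the call site would fail to compile.
    println!("retrying data column request {request_id:?}");
}

fn main() {
    retry_columns_by_range(DataColumnsByRangeRequestId(42));
}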

if blocks_result.is_ok() {
    // remove the entry only if it coupled successfully with
    // no errors
    entry.remove();

Collaborator

Is there a retry counter somewhere? A peer should not be able to cause a request to hang here forever in an infinite retry loop.

Member Author

In this version, I am assuming that there are some good peers that will eventually move the chain forward or all peers will get kicked and the chain will get removed.
Can add a retry counter in another PR.

Collaborator

Not all errors result in a downscore, so it's possible to have infinite loops for chains with a single peer.
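
A minimal sketch of the kind of cap being described (constant and names are illustrative, not the actual implementation): stop retrying once a batch has exhausted a fixed budget, so a single-peer chain whose failures never trigger a downscore cannot loop forever.

const MAX_COLUMN_RETRIES: u32 = 5; // illustrative bound

struct BatchRetryState {
    attempts: u32,
}

impl BatchRetryState {
    // Returns false once the retry budget is spent, at which point the caller would
    // fail the batch and let the usual chain removal/penalty logic take over.
    fn try_again(&mut self) -> bool {
        self.attempts += 1;
        self.attempts <= MAX_COLUMN_RETRIES
    }
}

fn main() {
    let mut state = BatchRetryState { attempts: 0 };
    while state.try_again() {
        println!("retrying failed columns (attempt {})", state.attempts);
    }
    println!("giving up after {} attempts", state.attempts - 1);
}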

let mut failed_peers = HashSet::new();
for (column, peer) in column_and_peer {
    failed_columns.insert(*column);
    failed_peers.insert(*peer);

Collaborator

Are these failed peers persisted between retries? Or do only the failed peers of retry N apply to retry N+1?

Member Author

No they aren't. I tried to keep it simple here, but will try to add more retrying logic that holds some state in a subsequent PR.
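
For what it's worth, a rough sketch of what that persistent state might look like, with stand-in types (not the actual sync code): each retry adds to an exclusion set kept for the lifetime of the request, rather than only excluding the peers that failed the most recent attempt.

use std::collections::HashSet;

type PeerId = u64;

#[derive(Default)]
struct ColumnRetryState {
    // Peers that have failed this request on any previous attempt.
    failed_peers: HashSet<PeerId>,
}

impl ColumnRetryState {
    fn record_failures(&mut self, peers: impl IntoIterator<Item = PeerId>) {
        self.failed_peers.extend(peers);
    }

    // Chain peers that have not failed this request yet and are still worth trying.
    fn eligible<'a>(&'a self, chain_peers: &'a HashSet<PeerId>) -> impl Iterator<Item = &'a PeerId> {
        chain_peers.iter().filter(move |p| !self.failed_peers.contains(*p))
    }
}

fn main() {
    let chain_peers = HashSet::from([1, 2, 3]);
    let mut state = ColumnRetryState::default();
    state.record_failures([2]); // retry 1: peer 2 failed
    state.record_failures([3]); // retry 2: peer 3 failed
    // Both 2 and 3 stay excluded, not just the most recent failure.
    println!("{:?}", state.eligible(&chain_peers).collect::<Vec<_>>());
}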

    .peers
    .read()
-   .synced_peers()
+   .synced_peers_for_epoch(batch_id, &self.peers)

Collaborator

Instead, you can iterate self.peers here and filter by peers that have epoch batch_id. All peers in the batch are synced connected peers

Member Author

Ohh yeah, good point, I'll make the changes.

@jimmygchen
Member

Sorry I now realized I had a pending review in the other PR you closed. The main concern I have is the lack of a retry counter, where you can retry to infinity

Thanks for the review @dapplion. I've created an issue to keep track of these so we don't lose them: #7744.

danielrachi1 pushed a commit to danielrachi1/lighthouse that referenced this pull request Jul 18, 2025
mergify bot pushed a commit that referenced this pull request Aug 12, 2025
Partly addresses #7744


  Implement similar peer sync attribution like in #7733 for backfill sync.