[Merged by Bors] - fix deneb sync bug #4869
realbigsean wants to merge 2 commits into sigp:unstable
Conversation
1. Add commitments to logs and update the `Display` implementation of `KzgCommitment` so it is truncated, similar to block roots (a sketch of such an impl follows the example output below).
I've been finding it difficult to debug scenarios involving multiple blobs for the same `(index, block_root)`. Logging the commitment will help with this: we can match it against what exists in the block.
Example output:
```
Oct 20 21:13:36.700 DEBG Successfully verified gossip blob commitment: 0xa3c1…1cd8, index: 0, root: 0xf31e…f9de, slot: 154568
Oct 20 21:13:36.785 DEBG Successfully verified gossip block commitments: [0xa3c1…1cd8, 0x8655…02ff, 0x8d6a…955a, 0x84ac…3a1b, 0x9752…629b, 0xb9fc…20fb], root: 0xf31eeb732702e429e89057b15e1c0c631e8452e09e03cb1924353f536ef4f9de, slot: 154568, graffiti: teku/besu, service: beacon
```
Example output in a block with no blobs (this will show up pre-deneb):
```
426734:Oct 20 21:15:24.113 DEBG Successfully verified gossip block, commitments: [], root: 0x619db1360ba0e8d44ae2a0f2450ebca47e167191feecffcfac0e8d7b6c39623c, slot: 154577, graffiti: teku/nethermind, service: beacon, module: beacon_chain::beacon_chain:2765
```
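For context, here is a minimal sketch of what such a truncated `Display` implementation can look like; the tuple-struct wrapper around a 48-byte array is an assumption for illustration, not Lighthouse's exact type definition:

```rust
use std::fmt;

// Hypothetical stand-in for the real type; a KZG commitment is 48 bytes.
pub struct KzgCommitment(pub [u8; 48]);

impl fmt::Display for KzgCommitment {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let b = &self.0;
        // First two bytes, ellipsis, last two bytes: "0xa3c1…1cd8",
        // mirroring how block roots are truncated in the logs above.
        write!(f, "0x{:02x}{:02x}…{:02x}{:02x}", b[0], b[1], b[46], b[47])
    }
}

fn main() {
    let mut bytes = [0u8; 48];
    bytes[0] = 0xa3;
    bytes[1] = 0xc1;
    bytes[46] = 0x1c;
    bytes[47] = 0xd8;
    println!("{}", KzgCommitment(bytes)); // prints: 0xa3c1…1cd8
}
```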
2. Remove `strum::IntoStaticStr` from `AvailabilityCheckError`, because `IntoStaticStr` ends up dropping the information inside the enum, so the kzg commitments in this error are dropped, making it more difficult to debug (an illustration follows below):
```
AvailabilityCheckError::KzgCommitmentMismatch {
    blob_commitment: KzgCommitment,
    block_commitment: KzgCommitment,
},
```
which is output as just `AvailabilityCheckError`.
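To make the difference concrete, a simplified sketch (the enum is pared down to one variant and the commitment fields are plain strings for illustration):

```rust
// Simplified sketch; the real enum has many more variants.
#[derive(Debug)]
pub enum AvailabilityCheckError {
    KzgCommitmentMismatch {
        blob_commitment: String,  // stand-in for KzgCommitment
        block_commitment: String, // stand-in for KzgCommitment
    },
}

fn main() {
    let err = AvailabilityCheckError::KzgCommitmentMismatch {
        blob_commitment: "0xa3c1…1cd8".into(),
        block_commitment: "0x8655…02ff".into(),
    };
    // With `strum::IntoStaticStr` derived, converting the error to a
    // `&'static str` produces only a static name and drops both fields.
    // Logging the `Debug` representation keeps them:
    println!("{err:?}");
    // KzgCommitmentMismatch { blob_commitment: "0xa3c1…1cd8", block_commitment: "0x8655…02ff" }
}
```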
3. Some additional misc sync logs I found useful in debugging #4869
4. This downgrades "Block returned for single block lookup not present" to debug, because I don't think we can fix the scenario that causes it unless we can cancel in-flight RPC requests.
Co-authored-by: realbigsean <seananderson33@gmail.com>
ethDreamer left a comment
LGTM! Not that I understand sync all that well, but this seems to make sense. Lighthouse was continuously stalling yesterday when syncing devnet-10, and then pawan pointed me to this PR. Unfortunately devnet-10 was shut down by the time I tried this PR this morning, but lighthouse does seem to sync fine on devnet-11. That devnet is much newer, though, so we might have to wait a bit before we can hit this bug again on devnet-11 and retry.
bors r+
## Issue Addressed

I observed our forward sync on devnet 9 would stall when we hit this log:

```
250425:Oct 19 00:54:17.133 WARN Blocks and blobs request for range received invalid data, error: KzgCommitmentMismatch, batch_id: 4338, peer_id: 16Uiu2HAmHbmkEQFDrJfNuy1aYyAfHkNUwSD9FN7EVAqGJ8YTF9Mh, service: sync, module: network::sync::manager:1036
```

## Proposed Changes

`range_sync_block_and_blob_response` [here](https://github.com/sigp/lighthouse/blob/1cb02a13a53d0e603ad5920c03832e5779c3df61/beacon_node/network/src/sync/manager.rs#L1013) removes the request from the sync manager. Later, however, if there's an error, `inject_error` [here](https://github.com/sigp/lighthouse/blob/1cb02a13a53d0e603ad5920c03832e5779c3df61/beacon_node/network/src/sync/manager.rs#L1055) expects the request to exist so we can handle retry logic. So this PR just re-inserts the request (without any accumulated blobs or blocks) when we hit an error here. The issue is unique to block+blob sync because the error here is only possible from mismatches between blocks and blobs after we've downloaded both; there's no equivalent error in block sync.

Co-authored-by: realbigsean <seananderson33@gmail.com>
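A hedged sketch of the shape of the fix; the types and the `on_range_response` helper here are simplified stand-ins for illustration, not the actual `manager.rs` code:

```rust
use std::collections::HashMap;

// Simplified stand-in for the sync manager's per-request state.
struct BlocksAndBlobsRequestInfo; // accumulated blocks and blobs, elided

struct SyncManager {
    range_requests: HashMap<u64, BlocksAndBlobsRequestInfo>,
}

impl SyncManager {
    // Called when a coupled blocks+blobs batch completes or fails coupling.
    fn on_range_response(&mut self, id: u64, coupling_failed: bool) {
        // Taking the entry out removes the request from the manager...
        if self.range_requests.remove(&id).is_some() && coupling_failed {
            // ...so on a coupling error (e.g. KzgCommitmentMismatch),
            // re-insert a fresh, empty request. Without this, the later
            // `inject_error` lookup finds nothing and the batch stalls.
            self.range_requests.insert(id, BlocksAndBlobsRequestInfo);
        }
    }
}

fn main() {
    let mut mgr = SyncManager { range_requests: HashMap::new() };
    mgr.range_requests.insert(4338, BlocksAndBlobsRequestInfo);
    mgr.on_range_response(4338, true);
    assert!(mgr.range_requests.contains_key(&4338)); // request survives for retry
}
```

The re-inserted request is deliberately empty: the accumulated blocks and blobs were already judged inconsistent, so only the bookkeeping that retry logic depends on needs to survive.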
Build failed:

bors retry
Pull request successfully merged into unstable. Build succeeded!