[Merged by Bors] - fix deneb sync bug #4869
realbigsean wants to merge 2 commits into sigp:unstable
Conversation
1. Add commitments to logs and update the `Display` implementation of `KzgCommitment` so it is truncated, similar to block roots (a sketch of such an impl follows the example output below).
I've been finding it difficult to debug scenarios involving multiple blobs for the same `(index, block_root)`. Logging the commitment will help with this: we can match it against what exists in the block.
Example output:
```
Oct 20 21:13:36.700 DEBG Successfully verified gossip blob commitment: 0xa3c1…1cd8, index: 0, root: 0xf31e…f9de, slot: 154568
Oct 20 21:13:36.785 DEBG Successfully verified gossip block commitments: [0xa3c1…1cd8, 0x8655…02ff, 0x8d6a…955a, 0x84ac…3a1b, 0x9752…629b, 0xb9fc…20fb], root: 0xf31eeb732702e429e89057b15e1c0c631e8452e09e03cb1924353f536ef4f9de, slot: 154568, graffiti: teku/besu, service: beacon
```
Example output in a block with no blobs (this will show up pre-deneb):
```
426734:Oct 20 21:15:24.113 DEBG Successfully verified gossip block, commitments: [], root: 0x619db1360ba0e8d44ae2a0f2450ebca47e167191feecffcfac0e8d7b6c39623c, slot: 154577, graffiti: teku/nethermind, service: beacon, module: beacon_chain::beacon_chain:2765
```
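For context, here is a minimal sketch of what such a truncated `Display` implementation can look like; the tuple-struct wrapper around a 48-byte array is an assumption for illustration, not Lighthouse's exact type definition:

```rust
use std::fmt;

// Hypothetical stand-in for the real type; a KZG commitment is 48 bytes.
pub struct KzgCommitment(pub [u8; 48]);

impl fmt::Display for KzgCommitment {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let b = &self.0;
        // First two bytes, ellipsis, last two bytes: "0xa3c1…1cd8",
        // mirroring how block roots are truncated in the logs above.
        write!(f, "0x{:02x}{:02x}…{:02x}{:02x}", b[0], b[1], b[46], b[47])
    }
}

fn main() {
    let mut bytes = [0u8; 48];
    bytes[0] = 0xa3;
    bytes[1] = 0xc1;
    bytes[46] = 0x1c;
    bytes[47] = 0xd8;
    println!("{}", KzgCommitment(bytes)); // prints: 0xa3c1…1cd8
}
```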
2. Remove `strum::IntoStaticStr` from `AvailabilityCheckError`, because `IntoStaticStr` ends up dropping the information inside the enum, so the kzg commitments in this error are dropped, making it more difficult to debug (an illustration follows below):
```
AvailabilityCheckError::KzgCommitmentMismatch {
    blob_commitment: KzgCommitment,
    block_commitment: KzgCommitment,
},
```
which is output as just `AvailabilityCheckError`.
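To make the difference concrete, a simplified sketch (the enum is pared down to one variant and the commitment fields are plain strings for illustration):

```rust
// Simplified sketch; the real enum has many more variants.
#[derive(Debug)]
pub enum AvailabilityCheckError {
    KzgCommitmentMismatch {
        blob_commitment: String,  // stand-in for KzgCommitment
        block_commitment: String, // stand-in for KzgCommitment
    },
}

fn main() {
    let err = AvailabilityCheckError::KzgCommitmentMismatch {
        blob_commitment: "0xa3c1…1cd8".into(),
        block_commitment: "0x8655…02ff".into(),
    };
    // With `strum::IntoStaticStr` derived, converting the error to a
    // `&'static str` produces only a static name and drops both fields.
    // Logging the `Debug` representation keeps them:
    println!("{err:?}");
    // KzgCommitmentMismatch { blob_commitment: "0xa3c1…1cd8", block_commitment: "0x8655…02ff" }
}
```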
3. Some additional misc sync logs I found useful in debugging #4869
4. This downgrades "Block returned for single block lookup not present" to debug, because I don't think we can fix the scenario that causes it unless we can cancel in-flight RPC requests.
Co-authored-by: realbigsean <seananderson33@gmail.com>
ethDreamer left a comment
LGTM! Not that I understand sync all that well, but this seems to make sense. Lighthouse was continuously stalling yesterday when syncing devnet-10, and then pawan pointed me to this PR. Unfortunately devnet-10 was shut down by the time I tried this PR this morning, but lighthouse does seem to sync fine on devnet-11. That devnet is much newer, though, so we might have to wait a bit before we can hit this bug again on devnet-11 and retry.
bors r+
## Issue Addressed

I observed our forward sync on devnet 9 would stall when we hit this log:

```
250425:Oct 19 00:54:17.133 WARN Blocks and blobs request for range received invalid data, error: KzgCommitmentMismatch, batch_id: 4338, peer_id: 16Uiu2HAmHbmkEQFDrJfNuy1aYyAfHkNUwSD9FN7EVAqGJ8YTF9Mh, service: sync, module: network::sync::manager:1036
```

## Proposed Changes

`range_sync_block_and_blob_response` [here](https://github.com/sigp/lighthouse/blob/1cb02a13a53d0e603ad5920c03832e5779c3df61/beacon_node/network/src/sync/manager.rs#L1013) removes the request from the sync manager. Later, however, if there's an error, `inject_error` [here](https://github.com/sigp/lighthouse/blob/1cb02a13a53d0e603ad5920c03832e5779c3df61/beacon_node/network/src/sync/manager.rs#L1055) expects the request to exist so we can handle retry logic. So this PR just re-inserts the request (without any accumulated blobs or blocks) when we hit an error here. The issue is unique to block+blob sync because the error here is only possible from mismatches between blocks and blobs after we've downloaded both; there's no equivalent error in block sync.

Co-authored-by: realbigsean <seananderson33@gmail.com>
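A hedged sketch of the shape of the fix; the types and the `on_range_response` helper here are simplified stand-ins for illustration, not the actual `manager.rs` code:

```rust
use std::collections::HashMap;

// Simplified stand-in for the sync manager's per-request state.
struct BlocksAndBlobsRequestInfo; // accumulated blocks and blobs, elided

struct SyncManager {
    range_requests: HashMap<u64, BlocksAndBlobsRequestInfo>,
}

impl SyncManager {
    // Called when a coupled blocks+blobs batch completes or fails coupling.
    fn on_range_response(&mut self, id: u64, coupling_failed: bool) {
        // Taking the entry out removes the request from the manager...
        if self.range_requests.remove(&id).is_some() && coupling_failed {
            // ...so on a coupling error (e.g. KzgCommitmentMismatch),
            // re-insert a fresh, empty request. Without this, the later
            // `inject_error` lookup finds nothing and the batch stalls.
            self.range_requests.insert(id, BlocksAndBlobsRequestInfo);
        }
    }
}

fn main() {
    let mut mgr = SyncManager { range_requests: HashMap::new() };
    mgr.range_requests.insert(4338, BlocksAndBlobsRequestInfo);
    mgr.on_range_response(4338, true);
    assert!(mgr.range_requests.contains_key(&4338)); // request survives for retry
}
```

The re-inserted request is deliberately empty: the accumulated blocks and blobs were already judged inconsistent, so only the bookkeeping that retry logic depends on needs to survive.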
Build failed:

bors retry
Pull request successfully merged into unstable. Build succeeded!