[Merged by Bors] - Handle processing results of non faulty batches #3439
Closed
divagant-martian wants to merge 7 commits into sigp:unstable from
Conversation
…batch when an invalid batch is found
Member
Doing some testing with this PR.
pawanjay176 approved these changes on Aug 9, 2022
pawanjay176 (Member) left a comment
LGTM! I have tested this with pre-merge and post-merge scenarios and it is working as expected 🎉
    },
    /// The batch processing failed. It carries whether the processing imported any block.
-   Failed {
+   FaultyFailure {
Member
bors r+
bors bot pushed a commit that referenced this pull request on Aug 12, 2022
bors bot pushed a commit that referenced this pull request on Aug 24, 2022
## Issue Addressed

#3032

## Proposed Changes

Pause sync when the ee is offline. Changes include three main parts:

- Online/offline notification system
- Pause sync
- Resume sync

#### Online/offline notification system

- The engine state is now guarded behind a new struct `State` that ensures every change is correctly notified. Notifications are only sent if the state changes. The new `State` is behind a `RwLock` (as before) as the synchronization mechanism.
- The actual notification channel is a [tokio::sync::watch](https://docs.rs/tokio/latest/tokio/sync/watch/index.html), which ensures only the last value is kept in the receiver channel. This way we don't need to worry about message order, etc.
- Sync waits for state changes concurrently with normal messages.

#### Pause Sync

Sync has four components; pausing is done differently in each:

- **Block lookups**: Disabled while in this state. We drop current requests and don't search for new blocks. Block lookups are infrequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Parent lookups**: Disabled while in this state. We drop current requests and don't search for new parents. Parent lookups are even less frequent and I don't think it's worth the extra logic of keeping these and delaying processing. If we later see that this is required, we can add it.
- **Range**: Chains don't send batches for processing to the beacon processor. This is easily done by guarding the channel to the beacon processor and giving it access only if the ee is responsive. I find this the simplest and most powerful approach, since we don't need to deal with new sync states, and chain segments that are added while the ee is offline will follow the same logic without needing to synchronize a shared state among them. Another advantage of passive pause vs active pause is that we can still keep track of actively advertised chain segments, so that on resume we don't need to re-evaluate all our peers.
- **Backfill**: Not affected by ee states; we don't pause.

#### Resume Sync

- **Block lookups**: Enabled again.
- **Parent lookups**: Enabled again.
- **Range**: Active resume. Since the only thing pausing really does for range is stop sending batches for processing, resume makes all chains that are holding ready-for-processing batches send them.
- **Backfill**: Not affected by ee states; no need to resume.

## Additional Info

**QUESTION**: Originally I made this notify and change on synced state, but @pawanjay176, in talks with @paulhauner, concluded we only need to check online/offline states. The upcheck function mentions extra checks to keep a very up-to-date sync status to aid the networking stack. However, the only need the networking stack would have is this one. I added a TODO to review whether the extra check can be removed.

Next gen of #3094

Will work best with #3439

Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com>
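As a rough illustration of the notification mechanism that commit message describes, the sketch below wires an engine-state change through a `tokio::sync::watch` channel and only sends when the state actually changes. This is not Lighthouse's real code: the `EngineState` and `EngineStateNotifier` names are made up for the example.

```rust
// Minimal sketch of the watch-channel notification idea (assumed names, not
// Lighthouse's actual types). Requires tokio with "sync", "macros" and "rt" features.
use tokio::sync::watch;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum EngineState {
    Online,
    Offline,
}

struct EngineStateNotifier {
    sender: watch::Sender<EngineState>,
}

impl EngineStateNotifier {
    fn new(initial: EngineState) -> (Self, watch::Receiver<EngineState>) {
        let (sender, receiver) = watch::channel(initial);
        (Self { sender }, receiver)
    }

    /// Only notify receivers if the state actually changed, mirroring the
    /// "notifications are only sent if the state changes" behaviour above.
    fn update(&self, new_state: EngineState) {
        self.sender.send_if_modified(|state| {
            if *state != new_state {
                *state = new_state;
                true
            } else {
                false
            }
        });
    }
}

#[tokio::main]
async fn main() {
    let (notifier, mut rx) = EngineStateNotifier::new(EngineState::Offline);

    // Sync side: wait for a state change (in the real design this happens
    // concurrently with handling normal sync messages).
    let waiter = tokio::spawn(async move {
        if rx.changed().await.is_ok() {
            println!("engine state changed to {:?}", *rx.borrow());
        }
    });

    // Engine side: a repeated identical update produces no extra notification.
    notifier.update(EngineState::Offline);
    notifier.update(EngineState::Online);

    waiter.await.unwrap();
}
```

Because a watch channel only retains the latest value, a receiver that wakes up late still sees the current state rather than a backlog of stale transitions, which is what makes message ordering a non-issue here.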
Issue Addressed
Solves #3390
After checking some logs @pawanjay176 got, we concluded that this happened because we blacklisted a chain after trying it "too much". In all occurrences, "too much" seems to mean we got too many download failures. This happened very slowly, precisely because a batch is allowed to stay alive for a very long time once penalties stop being counted while the ee is offline. So the error was not that the batch failed because of offline ee errors, but that we blacklisted a chain because of download errors, which we can't pin on the chain, only on the peer. This PR fixes that.
Proposed Changes
Adds a missing piece of logic so that if a chain fails for errors that can't be attributed to objectively bad behavior from the peer, it is not blacklisted. The issue at hand occurred when new peers arrived claiming a head that had been wrongfully blacklisted, even though the original peers participating in the chain were not penalized.
Another notable change is that we need to consider a batch invalid if it processed correctly but its next non-empty batch fails processing. Now that a batch can fail processing in non-faulty ways, there is no need to mark previous batches as invalid in those cases.
Improves some logging as well.
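To make the faulty vs. non-faulty distinction concrete, here is a minimal illustrative sketch. It is not Lighthouse's exact code: apart from the `FaultyFailure` name and its doc comment, which appear in the review diff above, the variant shapes and field names are assumptions.

```rust
/// Illustrative sketch of the batch-result handling described above
/// (assumed shapes, not Lighthouse's real types).
enum BatchProcessResult {
    /// The batch was processed and imported without issues.
    Success { was_non_empty: bool },
    /// Processing failed and the fault lies with the batch (and the peers that sent it).
    /// It carries whether the processing imported any block.
    FaultyFailure { imported_blocks: bool },
    /// Processing failed for reasons that can't be blamed on the batch,
    /// e.g. the execution engine being offline.
    NonFaultyFailure,
}

fn on_batch_processed(result: BatchProcessResult) {
    match result {
        BatchProcessResult::Success { was_non_empty } => {
            // Keep advancing the chain; empty batches carry no blocks to import.
            let _ = was_non_empty;
        }
        BatchProcessResult::FaultyFailure { imported_blocks } => {
            // Only here do we penalize peers, mark earlier batches for re-download,
            // and count the failure towards blacklisting the chain.
            let _ = imported_blocks;
        }
        BatchProcessResult::NonFaultyFailure => {
            // Retry later: no peer penalties, no invalidation of earlier batches,
            // and no progress towards blacklisting the chain.
        }
    }
}

fn main() {
    on_batch_processed(BatchProcessResult::NonFaultyFailure);
    on_batch_processed(BatchProcessResult::FaultyFailure { imported_blocks: false });
}
```

The key point is that only the faulty arm feeds into peer penalties, batch invalidation, and the failure counts that can blacklist a chain.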
Additional Info
We should do this regardless of pausing sync when the ee is offline/unsynced. This is because I think it's almost impossible to ensure a processing result will arrive in a predictable order relative to a synced notification from the ee. Doing this handles what I think are inevitable data races when we actually pause sync.
This also fixes a return value that reports which batch failed, which had caused us some confusion when checking the logs.