Check da_checker before doing a block lookup request#5681
Check da_checker before doing a block lookup request#5681mergify[bot] merged 3 commits intosigp:unstablefrom
Conversation
Squashed commit of the following: commit 4a23356 Author: dapplion <35266934+dapplion@users.noreply.github.com> Date: Wed May 1 20:23:56 2024 +0900 Ensure consistent handling of lookup result commit 2a314ee Author: dapplion <35266934+dapplion@users.noreply.github.com> Date: Wed May 1 18:30:50 2024 +0900 Check da_checker before doing a block lookup request
| request: BlocksByRootSingleRequest, | ||
| block_root: Hash256, | ||
| ) -> Result<bool, &'static str> { | ||
| if self.chain.data_availability_checker.has_block(&block_root) { |
There was a problem hiding this comment.
don't we want to use the reqresp_pre_import_cache? Otherwise we'll miss blocks that we're in the process of verifying execution
There was a problem hiding this comment.
the check in stable is essentially this because data_availability_checker.has_block() used to check the processing cache
There was a problem hiding this comment.
lion agreed offline, so I've pushed the change here: 5db3e1b
|
@mergify queue |
✅ The pull request has been merged automaticallyDetailsThe pull request has been merged automatically at 01e4e35 |
* Fix execution integration tests (#5647) * update waiting status * revert to old nethermind version * Add electra presets to beacon API (#5630) * add presets to API * add extra fields to config spec in beacon API * remove unused * add mainnet presets for gnosis and fix minimal preset default values * Rename `Merge` to `Bellatrix` (#5601) * Rename Merge to Bellatrix * Remove tree-hash-cache which got readded from the rebase * Deterministic block generation for tests (#5654) * Deterministic block generation for tests * Electra other containers (#5652) * add new fields to execution payload and header * beacon state changes * partial beacon state * safe arith in upgrade to electra * initialize balances cache in interop genesis state * Revert "initialize balances cache in interop genesis state" This reverts commit c60b522. * always initialize balances cache if necessary in electra upgrade * build cache earlier * fix block test * per fork NUM_FIELDS_POW2 * Merge branch 'unstable' of https://github.com/sigp/lighthouse into electra-other-containers * fix lints * get fields based on post state, as is spec'd * fix type and move cache build * Add more electra helpers (#5653) * Add new helpers * Fix some stuff * Fix compilation errors * lint * Address review * Ignore gossip blob already imported (#5656) * Ignore gossip blob already imported * Beta compiler fix (#5659) * fix beta compiler compilation * remove unused import * Revert "remove unused import" This reverts commit 0bef36b. * Revert "fix beta compiler compilation" This reverts commit 23152cf. * rename ununsed fields * allow dead code on some error variants * remove unused blob download queue * add back debug to backfill error * more allow dead code on errors * fix(validator_client): raise soft fd limit (#4796) * fix(validator_client): raise soft fd limit * Merge branch 'unstable' of https://github.com/sigp/lighthouse into rkrasiuk/raise-vc-fdlimit * cargo lock * Merge branch 'unstable' of https://github.com/sigp/lighthouse into rkrasiuk/raise-vc-fdlimit * Proposer and attester slashing sse events (#5327) * default vc to block v3 endpoint and deprecate block-v3 flag * Merge branch 'unstable' of https://github.com/sigp/lighthouse into unstable * add proposer and attester event variants * add TOOOs * add tests, event triggers * Merge branch 'unstable' of https://github.com/sigp/lighthouse into proposer-and-attester-slashing-sse-events * revert * revert * remove double event tracking * Merge branch 'unstable' into proposer-and-attester-slashing-sse-events * remove todo, fix test * resolve merge conflicts * Merge branch 'proposer-and-attester-slashing-sse-events' of https://github.com/eserilev/lighthouse into proposer-and-attester-slashing-sse-events * leftover debugging * Merge branch 'unstable' of https://github.com/sigp/lighthouse into proposer-and-attester-slashing-sse-events * Merge branch 'unstable' of https://github.com/sigp/lighthouse into proposer-and-attester-slashing-sse-events * pin macos release runner to `macos-13` (#5665) * pin macos release runner to `macos-13` * Update .github/workflows/release.yml * Remove snapshot cache related code (#5661) * Remove snapshot cache and other references. * Fix default state cache size in docs * Remove cache miss comment entirely * Add state cache CLI tests * Uncomment self_hosted_runner after PR Merge #5137 (#5291) * Uncomment self_hosted_runner after PR Merge #5137 * Merge branch 'unstable' into fix_todo * Merge branch 'unstable' of https://github.com/sigp/lighthouse into fix_todo * Only `portable` builds (binaries) (#5615) * release workflow: portable builds by default * Delete outdated comment * Merge branch 'unstable' into portable-builds-binaries # Conflicts: # .github/workflows/release.yml * Merge parent and current sync lookups (#5655) * Drop lookup type trait for a simple arg * Drop reconstructed for processing * Send parent blocks one by one * Merge current and parent lookups * Merge current and parent lookups clean up todos * Merge current and parent lookups tests * Merge remote-tracking branch 'origin/unstable' into sync-merged-lookup * Merge branch 'unstable' of https://github.com/sigp/lighthouse into sync-merged-lookup * fix compile after merge * #5655 pr review (#26) * fix compile after merge * remove todos, fix typos etc * fix compile * stable rng * delete TODO and unfilled out test * make download result a struct * enums instead of bools as params * fix comment * Various fixes * Track ignored child components * Track dropped lookup reason as metric * fix test * add comment describing behavior of avail check error * update ordering * delete spammy log (#5672) * delete spammy log * Ensure block only range requests don't fail on download (#5675) * ensure pruned blobs don't fail on download * Typo * Improve ENR updates (#5483) * Improve ENR updates * forever fmt * Appease my old friend clippy * Merge network unstable * Check da_checker before doing a block lookup request (#5681) * Check da_checker before doing a block lookup request * Ensure consistent handling of lookup result * use req resp pre import cache rather than da checker * Update Cargo.lock (#5670) * update rust-yamux * update Cargo.lock * Merge branch 'unstable' of github.com:jxs/lighthouse into update-cargo * Merge branch 'unstable' of github.com:sigp/lighthouse into update-cargo * update to new libp2p versions * Add metric for current epoch total balance (#5688) * Add metric for current epoch total balance --------- Co-authored-by: realbigsean <sean@sigmaprime.io> Co-authored-by: Mac L <mjladson@pm.me> Co-authored-by: Lion - dapplion <35266934+dapplion@users.noreply.github.com> Co-authored-by: Pawan Dhananjay <pawandhananjay@gmail.com> Co-authored-by: Roman Krasiuk <rokrassyuk@gmail.com> Co-authored-by: Eitan Seri-Levi <eserilev@ucsc.edu> Co-authored-by: antondlr <anton@delaruelle.net> Co-authored-by: Jimmy Chen <jchen.tc@gmail.com> Co-authored-by: Ærvin <arvin.morawej@gmail.com> Co-authored-by: Age Manning <Age@AgeManning.com> Co-authored-by: João Oliveira <hello@jxs.pt> Co-authored-by: Michael Sproul <michael@sigmaprime.io>
* Check da_checker before doing a block lookup request * Ensure consistent handling of lookup result * use req resp pre import cache rather than da checker
| if self | ||
| .chain | ||
| .reqresp_pre_import_cache | ||
| .read() | ||
| .contains_key(&block_root) | ||
| { | ||
| return Ok(false); | ||
| } |
There was a problem hiding this comment.
I think this is the cause of the recent de-sync issues.
Consider the following sequence of events:
- Block arrives on gossip, we add it to the
reqresp_pre_import_cache. - We send the block's payload to the EL for verification.
- We receive blobs on gossip and add them to the DA cache.
- The EL times out. Block processing is aborted as a result.
- Subsequent attempts to look up the block and re-process it are blocked by the entry in the
reqresp_pre_import_cache. - We stall and are unable to reprocess the block until range sync and/or a blocks by range request is made in spite of the check above.
I've confirmed that this happened for block 0xfc4a1afe39d84a410c61f213cf15cb405fdd15b96ca54f0c826d4054671b1275 at slot 1574856 on Holesky. We get the block and all the blobs, then the EL times out and everything gets stuck:
May 04 05:31:20.803 DEBG Failed to verify execution payload error: ExecutionPayloadError(RequestFailed(EngineError(Api { error: HttpClient(url: http://127.0.0.1:8551/, kind: timeout, detail: operation timed out) })))
There was a problem hiding this comment.
Actually scratch that, @jimmygchen pointed out that we remove from the reqresp_pre_import_cache on failure here:
lighthouse/beacon_node/beacon_chain/src/beacon_chain.rs
Lines 2945 to 2950 in 63fad7e
Issue Addressed
Testing of v5.2.0 shows many more
rpc_blockthan usual. This is due this sequence of events:UnknownBlockHashFromAttestationis sent at least once every block (this is likely a bug and should be fixed separately)In current stable the single lookup checks against the da_checker if the block is known before sending a request, but after
this check was dropped by mistake
Proposed Changes
SyncNetworkContext::block_lookup_requestand skip the download if the block is already knownHowever, now we need to handle the case where a lookup goes directly from created -> completed. This takes a bit of refactoring, and to handle this gracefully and safely for all cases I have introduced the
LookupResultwith amust_usedirective.