Improve beacon node failover in validator client [tracking issue]

## Description

I think the beacon node failover feature is in need of some love, particularly now that we are post-merge and it is complicated by the addition of the execution node.

There are several issues:

- Beacon nodes continue to self-report their status as synced when the execution node goes offline. This means they will be used by validator clients to produce sub-optimal attestations.
- Beacon nodes self-report their status as synced for [_eight epochs_](https://github.com/sigp/lighthouse/blob/01e84b71f524968f5b940fbd2fa31d29408b6581/beacon_node/http_api/src/lib.rs#L70-L75) after their internal sync state switches to syncing. Again, this results in sub-optimal attestations.
- Beacon nodes return nasty 500 internal server errors when the head is optimistic, e.g.

```
CRIT Error during attestation routine        slot: 4747474, committee_index: 11, error: "All endpoints failed http://localhost:5052/ => RequestFailed("Failed to produce attestation data: ServerMessage(ErrorMessage { code: 500, message: \"UNHANDLED_ERROR: HeadBlockNotFullyVerified { beacon_block_root: 0xbef1f36d129ca41de0a2da31962f5cb5025262f8813c6be2448524fc75be9947, execution_status: Optimistic(0x93219f971e74377cb48ba8050f1f193390d0313afd23c8cc4b1e8a119fb32fa1) }\", stacktraces: [] })")"
```

This error is probably handled _most_ gracefully by a VC with redundant beacon nodes, as it should fail over to the next BN (need to double-check this; the log above is from a node without failover BNs).
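To make the expected failover behaviour concrete, here is a minimal sketch of "try each BN in order, return the first success". It is not Lighthouse's actual implementation: `first_success` and its signature are hypothetical, and the request is modelled as a plain closure rather than a real HTTP call.

```rust
/// Tries `request` against each beacon node URL in order.
/// Returns the first successful response, or every (url, error) pair if all
/// nodes fail (compare the "All endpoints failed" log above).
fn first_success<T, E, F>(urls: &[&str], mut request: F) -> Result<T, Vec<(String, E)>>
where
    F: FnMut(&str) -> Result<T, E>,
{
    let mut errors = Vec::new();
    for &url in urls {
        match request(url) {
            // Stop at the first node that answers successfully.
            Ok(value) => return Ok(value),
            // Remember the failure and fall through to the next node.
            Err(e) => errors.push((url.to_string(), e)),
        }
    }
    Err(errors)
}
```

With this shape, a 500 from the first BN (such as `HeadBlockNotFullyVerified`) only costs one failed request, provided a later BN in the list is healthy.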

## Option 1: Fail Fast

One way to address all of these issues would be to add a flag to the beacon node like `--fail-fast` which makes it report its status as unsynced more readily:

- If the execution node is offline, report BN status as unsynced. Extension of https://github.com/sigp/lighthouse/pull/3428.
- If the internal sync state is anything but fully synced, report sync status as unsynced (ignore `SYNC_TOLERANCE_EPOCHS`).
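The two rules above could be sketched roughly as follows. This is an illustration only: `SyncState`, `NodeHealth` and `reports_synced` are hypothetical names, the tolerance check is a simplification of what the beacon node really does, and the constant's value is assumed from the "eight epochs" mentioned in the description.

```rust
/// Hypothetical simplification of the beacon node's internal sync state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SyncState {
    Synced,
    Syncing,
}

/// Illustrative inputs to the "am I synced?" decision.
struct NodeHealth {
    sync_state: SyncState,
    /// How far the head is behind the wall-clock slot, in epochs.
    epochs_behind: u64,
    /// Whether the execution node is currently reachable.
    el_online: bool,
}

/// Assumed value, taken from the "eight epochs" in the description.
const SYNC_TOLERANCE_EPOCHS: u64 = 8;

/// Returns the status the beacon node would self-report.
fn reports_synced(health: &NodeHealth, fail_fast: bool) -> bool {
    if fail_fast {
        // Fail fast: any EL outage or non-synced state means "unsynced",
        // and the tolerance window is ignored.
        health.el_online && health.sync_state == SyncState::Synced
    } else {
        // Simplified current behaviour: tolerate being a few epochs behind
        // and do not consider the execution node at all.
        health.epochs_behind <= SYNC_TOLERANCE_EPOCHS
    }
}
```

For example, a node that is two epochs behind and still syncing reports synced without the flag, but unsynced with it.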

The downside of this approach is that it adds configuration complexity: most beacon nodes used for redundancy should be configured with `--fail-fast`, but _at least one_ beacon node per cluster should _not be_, in order to not cripple liveness in case of sync issues (which is what `SYNC_TOLERANCE_EPOCHS` is trying to guard against).

The `--fail-fast` solution may also still not fail fast enough for some users: Lighthouse will usually consider itself synced for a few epochs (2?) without any blocks, during which time attestations will _still_ miss.

## Option 2: Quality Control

Rather than changing the behaviour of the beacon node, we could change the behaviour of the validator client to make it smarter about which beacon nodes to try first. Instead of using a binary "synced" or "not synced" differentiation, the VC could use other data to make its choice, e.g. sync distance, optimistic status, execution layer online/offline status.

The complication with this approach is that we need to work within the confines of the standard APIs provided by beacon nodes (which, for example, don't include EL status), and we need to avoid opening ourselves to attacks where an attacker tricks us into following their chain by proposing blocks closer to the current wall clock slot than the canonical chain.
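As a rough sketch of what such a ranking might look like: the VC collects a health snapshot per BN and sorts candidates by it before trying them. Everything here is hypothetical (`Candidate`, `distance_tier`, `rank_key`, `order_candidates`, the thresholds, and the order of the criteria). EL status is modelled as optional because it may not be available through the standard API. Bucketing sync distance into coarse tiers is one conceivable way to avoid preferring a node merely because its head is a slot or two closer to the wall clock; it is not claimed to fully address the attack described above.

```rust
/// Hypothetical per-node health snapshot gathered by the validator client.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Candidate {
    url: String,
    /// Slots between the node's head and the wall-clock slot.
    sync_distance: u64,
    /// Whether the node's head is only optimistically verified.
    is_optimistic: bool,
    /// `None` when the node's API does not expose EL status.
    el_offline: Option<bool>,
}

/// Assumed threshold (in slots) for treating a node as "close to head".
const SMALL_DISTANCE_SLOTS: u64 = 8;

/// Buckets sync distance into coarse tiers so that small head differences
/// do not affect the ordering.
fn distance_tier(sync_distance: u64) -> u8 {
    if sync_distance <= SMALL_DISTANCE_SLOTS {
        0
    } else if sync_distance <= 4 * SMALL_DISTANCE_SLOTS {
        1
    } else {
        2
    }
}

/// Lower keys are better. Unknown EL status is treated as online.
fn rank_key(c: &Candidate) -> (u8, bool, bool) {
    (
        distance_tier(c.sync_distance),
        c.is_optimistic,
        c.el_offline.unwrap_or(false),
    )
}

/// Orders candidates best-first. The sort is stable, so nodes with equal
/// keys keep the order in which the user configured them.
fn order_candidates(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    candidates.sort_by_key(rank_key);
    candidates
}
```

Because ties keep their configured order, a user's primary BN is still tried first whenever it is as healthy as the fallbacks.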

I think I have a preference for option 2 currently.
